This project investigates the validity of using two-dimensional (2D) tissue sections to accurately represent the three-dimensional (3D) spatial statistical properties of cellular organization within tissues. A common practice in biology is to analyze thin tissue slices, but this can lead to biased interpretations of cell distribution patterns.
The primary aim of this pipeline is twofold:
- To quantitatively compare spatial metrics (primarily the Pair Correlation Function) derived from true 3D spatial data of cells with those obtained from their 2D projections (slices).
- To explore and potentially elaborate methods to understand and mitigate the biases introduced by 2D sampling.
The Pair Correlation Function, often denoted as
Interpretation:
-
$g(r) > 1$ : Indicates that cells are more likely to be found at distance$r$ from each other than by chance, suggesting clustering or aggregation at that length scale. -
$g(r) < 1$ : Indicates that cells are less likely to be found at distance$r$ , suggesting dispersion or regularity (e.g., cells avoiding each other) at that scale. -
$g(r) = 1$ : Indicates that cells are randomly distributed with respect to each other at distance$r$ .
By calculating
The analysis is performed through a series of interconnected scripts, primarily written in Python and R. The general workflow is as follows:
Data Directory Structure (Illustrative):
data/init/
: Should contain the initial raw image data (e.g., TIFF stacks for markers and masks).data/
: Stores processed data files generated by the pipeline (e.g.,cell_coordinates.csv
,cell_intensities.csv
,expression_annotated_corrected.csv
). Specific subdirectories (e.g.,data/nat_blood/
) are used for different datasets.code/
: Contains all the Python and R scripts for the analysis.images/
: Stores output plots and figures. Specific subdirectories (e.g.,images/Nat_blood/
) are used for different datasets.temp/
: May be used for intermediate statistical results.
Core Workflow Scripts:
-
code/extract.py
(Python)- Purpose: Extracts initial cell data from raw 3D image files (e.g., multi-channel TIFF stacks and segmentation masks).
- Inputs: Image files (e.g.,
../data/init/full_mask_final_segmentation_hwatershed_bg500_90%.tif
and marker-specific image stacks). - Processing:
- Reads image masks to identify individual cells and their boundaries.
- Reads corresponding marker intensity images.
- Calculates 3D centroid coordinates (x, y, z) and area for each cell.
- Calculates mean marker intensities for each cell across all channels.
- Outputs:
../data/<dataset_name>/cell_coordinates.csv
: Contains columns likearea
,x
,y
,z
.../data/<dataset_name>/cell_intensities.csv
: Contains columns representing different markers, with values being mean intensities for each cell.
-
code/correct.r
(R)- Purpose: Corrects raw cell marker intensities for spectral spillover between detection channels.
- Inputs:
../data/<dataset_name>/cell_intensities.csv
(fromextract.py
).../data/<dataset_name>/compensationMatrix.csv
: A matrix defining the spillover coefficients between channels.../data/<dataset_name>/model_panel.csv
(or similar): A panel file mapping raw channel names/metal tags to cleaned marker names.
- Processing:
- Applies the inverse of the compensation matrix to the intensity data.
- Renames marker columns based on the panel file.
- Optionally filters out specific markers (e.g., DNA intercalators like Hoechst, or unused channels).
- Generates heatmaps (
../images/<dataset_name>/heat_unc.png
,../images/<dataset_name>/heat_cor.png
) showing marker correlations before and after compensation.
- Outputs:
../data/<dataset_name>/expression_annotated_corrected.csv
: The compensated and cleaned expression matrix.
-
code/cluster.py
(Python)- Purpose: Performs unsupervised clustering of cells based on their corrected marker expression profiles and annotates these clusters.
- Inputs:
../data/<dataset_name>/expression_annotated_corrected.csv
(fromcorrect.r
).../data/<dataset_name>/cell_coordinates.csv
(for spatial visualization).
- Processing:
- Uses the
scanpy
library. - Scales expression data.
- Performs Principal Component Analysis (PCA).
- Builds a neighborhood graph and applies Leiden clustering.
- Visualizes clusters using t-SNE (
../images/<dataset_name>/tsne_unlabeled.png
,../images/<dataset_name>/tsne_stacked.png
). - Identifies marker genes for clusters (
../images/<dataset_name>/ranked_genes_dotplot.png
,../images/<dataset_name>/ranked_genes_violin.png
). - Allows for manual annotation of clusters to cell types based on marker expression (requires user to define
marker_to_cell
dictionary in the script). - Generates 3D spatial plots of cell clusters and annotated cell types (
../images/<dataset_name>/cell_clusters_3d.png
,../images/<dataset_name>/cell_types_3d.png
).
- Uses the
- Outputs:
../data/<dataset_name>/adata.h5ad
: A Scanpy AnnData object containing the processed and annotated data.- Updates
../data/<dataset_name>/cell_coordinates.csv
with new columns forcluster
andcell
(annotated cell type). - Various diagnostic plots.
-
code/choose.py
(Python - as described in original README, not provided for detailed review)- Purpose: To select specific cell clusters (obtained from
cluster.py
) for downstream spatial analysis. - Inputs: Likely
../data/<dataset_name>/cell_coordinates.csv
(with cluster information). - Processing: Filters the coordinate data to retain only cells belonging to the chosen cluster(s).
- Outputs: A filtered version of cell coordinates, possibly
../data/<dataset_name>/cell_coordinates_selected.csv
.
- Purpose: To select specific cell clusters (obtained from
-
code/prespace.r
(R)- Purpose: This script serves as a preparatory step for the main spatial analysis. It performs an initial exploration of spatial characteristics, including calculating the 3D PCF, PCFs for multiple random 2D slices, and Clark-Evans (CE) aggregation indices. A key function is to help determine an optimal maximum radius (
r_max
) for PCF calculations, which can then be used inspace.r
. It can also identify specific 2D slices based on cell count criteria. - Inputs:
- Path to the cell coordinates CSV file (e.g.,
../data/<dataset_name>/cell_coordinates_selected.csv
). - Optional:
n_range
(min and max number of cells) to select slices for specific analysis or visualization.
- Path to the cell coordinates CSV file (e.g.,
- Processing:
- Sources functions from
code/module.r
andcode/visual.r
. - Calculates the 3D PCF for the input cell data.
- Generates a specified number of random 2D slices by rotating the 3D data and uses
pc_for_slice
(frommodule.r
) to compute their 2D PCFs and CE indices. - Collects PCF data, cell counts, and CE indices for all slices.
- If
n_range
is provided, it identifies rotation angles for slices meeting the cell count criteria and can visualize a representative slice usingangle_analysis
(frommodule.r
). - Generates a cumulative PCF plot comparing the 3D PCF to the distribution of 2D PCFs using
cumul_pcf
(fromvisual.r
). - Prints a suggested "Optimal max_r" to the console.
- Sources functions from
- Outputs:
- Console output: Suggested
r_max
value. - Plots: Cumulative PCF plot, image of a selected slice (if
n_range
is used). - Data:
ce_tbl.csv
(Clark-Evans indices),pcf_naive.csv
(basic PCF characteristics) stored in../data/naive/<dataset_folder>/<cell_type>/
.
- Console output: Suggested
- Purpose: This script serves as a preparatory step for the main spatial analysis. It performs an initial exploration of spatial characteristics, including calculating the 3D PCF, PCFs for multiple random 2D slices, and Clark-Evans (CE) aggregation indices. A key function is to help determine an optimal maximum radius (
-
Spatial Slicing and PCF Analysis (R:
code/space.r
,code/module.r
,code/visual.r
)code/module.r
: This is a library of core functions for spatial statistics.- Key Functions:
rotation()
: Creates 3D rotation matrices.slice()
: Extracts cells within a 2D slice of defined thickness from a (potentially rotated) 3D point cloud.pc_for_slice()
: Takes a 3D cell coordinate matrix, applies rotation, extracts a slice, computes the 2D PCF for that slice usingspatstat
, and calculates the Clark and Evans aggregation index.ce_idx()
: Computes the Clark and Evans index for 2D or 3D point sets.angle_analysis()
: Selects and visualizes a slice based on specific rotation angles.
- Key Functions:
code/visual.r
(R - details inferred, not fully reviewed): Contains functions for visualizing spatial statistics results.- Key Functions (used by
space.r
/prespace.r
):mean_pcf()
: Plots the true 3D PCF along with the mean and individual PCFs from 2D slices.cumul_pcf()
: Plots the 3D PCF against a cumulative distribution of 2D PCFs.
- Key Functions (used by
code/space.r
(R):- Purpose: This is the main script for detailed PCF comparison. It calculates the 3D PCF and then PCFs for numerous random 2D slices, comparing them visually.
- Inputs:
- Path to the cell coordinates CSV file (e.g.,
../data/<dataset_name>/cell_coordinates_selected.csv
). r_max
: Maximum radius for PCF calculation (value can be guided byprespace.r
output).
- Path to the cell coordinates CSV file (e.g.,
- Processing:
- Sources functions from
code/module.r
andcode/visual.r
. - Calculates the true 3D PCF for the input cell data.
- Generates a large number of random 2D slices by rotating the 3D data.
- For each slice, computes the 2D PCF using
pc_for_slice
frommodule.r
. - Uses
mean_pcf
(fromvisual.r
) to generate a plot comparing the 3D PCF to the ensemble of 2D PCFs.
- Sources functions from
- Outputs:
- Plots comparing 3D and 2D PCFs (e.g.,
../images/<dataset_name>/mean_pcf_<cell_type>.png
). - PCF data and cell counts for 3D and 2D slices saved to
../data/pcf/<dataset_folder>/<cell_type>/
.
- Plots comparing 3D and 2D PCFs (e.g.,
Supporting and Specialized Scripts:
-
code/align.py
(Python) &code/module.py
(Python):- Purpose: These scripts are designed for aligning a series of 2D tissue slices to reconstruct a 3D volume if the input data is in that format (e.g., serial sections).
module.py
(Python) provides alignment functions (affine andSTalign
-based LDDMM) used byalign.py
. - Note: If starting with an intact 3D image volume (as processed by
extract.py
), this step might be used for specific sub-volume registration or might not be strictly part of the main PCF comparison pipeline.
- Purpose: These scripts are designed for aligning a series of 2D tissue slices to reconstruct a 3D volume if the input data is in that format (e.g., serial sections).
-
code/mimic.r
(R):- Purpose: Fits mathematical models (e.g., Gamma distribution-based, damped oscillation) to empirical PCF curves. This can help in parameterizing PCF shapes for comparison or interpretation.
-
code/evaluate.r
(R) &code/heat.r
(R):- Purpose: These scripts are used for downstream statistical evaluation and visualization, particularly for comparing results (e.g., p-values from tests comparing 2D vs. 3D metrics) across different datasets, cell types, or conditions.
evaluate.r
: Generates bar plots of p-values.heat.r
: Generates heatmaps of p-values.- Inputs: Typically tables of p-values or other statistical results (e.g.,
../temp/pval_tbl.csv
).
Prerequisites:
- Python Environment:
- Required libraries:
pandas
,matplotlib
,pathlib
,scanpy
,plotly
,skimage
,numpy
,torch
,STalign
(foralign.py
). - It's recommended to use a virtual environment (e.g., conda or venv).
scanpy
installation might require specific attention to its dependencies.STalign
might need to be installed from its source if not available via pip/conda.Kaleido
is needed byplotly
for static image export.
- Required libraries:
- R Environment:
- Required libraries:
pheatmap
,tidyverse
,spatstat
,dplyr
,FNN
,geometry
,minpack.lm
,ggplot2
,this.path
. - Ensure R is compiled with Cairo support for better graphics output if using
cairo_pdf
.
- Required libraries:
General Running Order (Conceptual):
A typical analysis run for a new dataset would involve the following sequence. Note: Script parameters (like input file paths, dataset names, specific marker lists) will need to be adjusted within each script or passed as arguments if the scripts are modified to accept them.
-
Prepare Input Data:
- Place raw 3D image stacks (markers) and segmentation mask (e.g., TIFF files) into a dataset-specific subdirectory under
data/init/
(e.g.,data/init/my_dataset/
). - Ensure you have a compensation matrix (
compensationMatrix.csv
) and a panel file (model_panel.csv
or similar) relevant to your markers, and place them indata/<dataset_name>/
.
- Place raw 3D image stacks (markers) and segmentation mask (e.g., TIFF files) into a dataset-specific subdirectory under
-
Run
code/extract.py
:- Modify
extract.py
to point to your raw image and mask files. - Execute:
python code/extract.py
- This will generate
cell_coordinates.csv
andcell_intensities.csv
in the appropriatedata/<dataset_name>/
directory.
- Modify
-
Run
code/correct.r
:- Modify
correct.r
to point to the correctcell_intensities.csv
,compensationMatrix.csv
, andpanel.csv
for your dataset. - Also, adjust output image paths if necessary.
- Execute in R environment:
Rscript code/correct.r
- This will generate
expression_annotated_corrected.csv
and correlation heatmaps.
- Modify
-
Run
code/cluster.py
:- Modify
cluster.py
to point to the correctexpression_annotated_corrected.csv
andcell_coordinates.csv
. - Crucially, you will need to define the
marker_to_cell
dictionary based on theranked_genes_dotplot.png
andtsne_stacked.png
outputs from an initial run of the clustering part of the script (comment out the annotation part first, run, inspect plots, then fill in the dictionary and uncomment the annotation part). - Adjust output paths.
- Execute:
python code/cluster.py
- This generates the
adata.h5ad
file, updatescell_coordinates.csv
with cluster/cell type info, and produces various plots.
- Modify
-
Run
code/choose.py
(if needed):- Modify and run this script if you need to select specific cell populations for the PCF analysis.
- Execute:
python code/choose.py
(assuming it's adapted for command-line use or run interactively).
-
Run
code/prespace.r
(Recommended Prerequisite forspace.r
):- Modify
prespace.r
to point to the correct (potentially filtered)cell_coordinates.csv
. - You can optionally provide
n_range
arguments if you want to select slices with a specific number of cells. - Execute in R environment:
Rscript code/prespace.r path/to/your/coordinates.csv [min_cells] [max_cells]
- Note the "Optimal max_r" printed to the console. This will be an input for
space.r
. - This step generates exploratory plots and data regarding slice characteristics.
- Modify
-
Run
code/space.r
(Main Spatial Analysis):- Modify
space.r
to point to the correctcell_coordinates.csv
. - Pass the path to the coordinates file and the
r_max
value (obtained fromprespace.r
or chosen otherwise) as command-line arguments. - Execute in R environment:
Rscript code/space.r path/to/your/coordinates.csv your_rmax_value
- This performs the detailed PCF comparisons between 3D and multiple 2D slices and generates the primary output plots and PCF data.
- Modify
-
Optional: Run
code/mimic.r
,code/evaluate.r
,code/heat.r
:- If you want to model PCF curves or perform further aggregate statistical analysis and visualization on results (e.g., from
../data/naive/
or../data/pcf/
directories, or custom statistical outputs), adapt and run these scripts.
- If you want to model PCF curves or perform further aggregate statistical analysis and visualization on results (e.g., from
Important Considerations:
- Paths: Most scripts use relative paths. Ensure you run them from the correct working directory (typically the root of the project or the
code/
directory, depending on how paths are set up within each script). The use ofthis.path::here()
in some R scripts helps in setting the working directory to the script's location. - Dataset Specificity: Many file paths and some parameters (e.g.,
marker_to_cell
incluster.py
) are hardcoded or specific to example datasets (like "Nat_blood", "Cell_crc"). You MUST adapt these for your own data. - Computational Resources: Processing large 3D images, performing alignments, and running clustering on many cells can be computationally intensive and require significant memory and time.
- Interactivity: Some steps, especially cell type annotation in
cluster.py
, benefit from interactive execution and inspection of intermediate plots.
The data/
directory is organized to hold various datasets, each typically in its own subdirectory (e.g., data/nat_blood/
, data/sci_brain/
). Common files found within these subdirectories include:
cell_coordinates.csv
: Stores the 3D spatial coordinates (x, y, z), cell area, and after processing bycluster.py
, cluster assignments and annotated cell types.cell_intensities.csv
: Raw mean intensities for each marker for every cell.expression_annotated_corrected.csv
: Compensated and cleaned marker expression values for each cell.compensationMatrix.csv
: Spillover matrix used for intensity correction.model_panel.csv
(or similar): Maps instrument channel names/metal tags to biological marker names.cell_coordinates_selected.csv
: A subset ofcell_coordinates.csv
, often after selecting specific cell types/clusters viachoose.py
.adata.h5ad
: Scanpy object storing processed expression data, PCA, t-SNE, and annotations for a dataset.
The images/
directory mirrors this structure for output plots.
- Valentina Gandin, Jun Kim, Liang-Zhong Yang, Yumin Lian, Takashi Kawase, Amy Hu, Konrad Rokicki, Greg Fleishman, Paul Tillberg, Alejandro Aguilera Castrejon, Carsen Stringer, Stephan Preibisch, and Zhe J. Liu. (2025). Deep-tissue transcriptomics and subcellular imaging at high spatial resolution. Science, 388(6744), eadq2084.
- Kuett, Laura, Catena, Raúl, Özcan, Alaz, Plüss, Alex, Ali, H. R. Sa’d, M. Al, Alon, S. Aparicio, S. Battistoni, G. Balasubramanian, S. Becker, R. Bodenmiller, B. Boyden, E. S. Bressan, D. Bruna, A. Burger, Marcel, Caldas, C. Callari, M. Cannell, I. G. Casbolt, H. Chornay, N. Cui, Y. Dariush, A. Dinh, K. Emenari, A. Eyal-Lubling, Y. Fan, J. Fatemi, A. Fisher, E. González-Solares, E. A. González-Fernández, C. Goodwin, D. Greenwood, W. Grimaldi, F. Hannon, G. J. Harris, S. Jauset, C. Joyce, J. A. Karagiannis, E. D. Kovačević, T. Kuett, L. Kunes, R. Yoldaş, A. Küpcü, Lai, D. Laks, E. Lee, H. Lee, M. Lerda, G. Li, Y. McPherson, A. Millar, N. Mulvey, C. M. Nugent, I. O’Flanagan, C. H. Paez-Ribes, M. Pearsall, I. Qosaj, F. Roth, A. J. Rueda, O. M. Ruiz, T. Sawicka, K. Sepúlveda, L. A. Shah, S. P. Shea, A. Sinha, A. Smith, A. Tavaré, S. Tietscher, S. Vázquez-García, I. Vogl, S. L.Walton, N. A. Wassie, A. T. Watson, S. S. Weselak, J. Wild, S. A. Williams, E. Windhager, J. Xia, C. Zheng, P. Zhuang, X. Schraml, Peter, Moch, Holger, de Souza, Natalie, and Bodenmiller, Bernd. (2021). Three-dimensional imaging mass cytometry for highly multiplexed molecular and cellular mapping of tissues and the tumor microenvironment. Nature Cancer, 3(1), 122–133.
- Lin, J. R., Wang, S., Coy, S., Chen, Y. A., Yapp, C., Tyler, M., Nariya, M. K., Heiser, C. N., Lau, K. S., Santagata, S., & Sorger, P. K. (2023). Multiplexed 3D atlas of state transitions and immune interaction in colorectal cancer. Cell, 186(2), 363–381.e19.