Skip to content

gregothebyteknight/pattern

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Project: Analyzing Spatial Statistics of Cellular Distributions in 3D Tissues and 2D Sections

Objective

This project investigates the validity of using two-dimensional (2D) tissue sections to accurately represent the three-dimensional (3D) spatial statistical properties of cellular organization within tissues. A common practice in biology is to analyze thin tissue slices, but this can lead to biased interpretations of cell distribution patterns.

The primary aim of this pipeline is twofold:

  1. To quantitatively compare spatial metrics (primarily the Pair Correlation Function) derived from true 3D spatial data of cells with those obtained from their 2D projections (slices).
  2. To explore and potentially elaborate methods to understand and mitigate the biases introduced by 2D sampling.

Theoretical Background: The Pair Correlation Function (PCF)

The Pair Correlation Function, often denoted as $g(r)$, is a fundamental tool in spatial statistics. It quantifies the probability of finding a pair of objects (in this context, cells) separated by a specific distance $r$, relative to what would be expected under a completely random (spatially uniform, e.g., Poisson) distribution.

Interpretation:

  • $g(r) > 1$: Indicates that cells are more likely to be found at distance $r$ from each other than by chance, suggesting clustering or aggregation at that length scale.
  • $g(r) < 1$: Indicates that cells are less likely to be found at distance $r$, suggesting dispersion or regularity (e.g., cells avoiding each other) at that scale.
  • $g(r) = 1$: Indicates that cells are randomly distributed with respect to each other at distance $r$.

By calculating $g(r)$ for a full 3D cell dataset and comparing it to $g(r)$ values obtained from multiple 2D slices taken from that same dataset, this project assesses how well 2D representations capture the true underlying 3D spatial architecture.

Analysis Pipeline and Code Structure

The analysis is performed through a series of interconnected scripts, primarily written in Python and R. The general workflow is as follows:

Data Directory Structure (Illustrative):

  • data/init/: Should contain the initial raw image data (e.g., TIFF stacks for markers and masks).
  • data/: Stores processed data files generated by the pipeline (e.g., cell_coordinates.csv, cell_intensities.csv, expression_annotated_corrected.csv). Specific subdirectories (e.g., data/nat_blood/) are used for different datasets.
  • code/: Contains all the Python and R scripts for the analysis.
  • images/: Stores output plots and figures. Specific subdirectories (e.g., images/Nat_blood/) are used for different datasets.
  • temp/: May be used for intermediate statistical results.

Core Workflow Scripts:

  1. code/extract.py (Python)

    • Purpose: Extracts initial cell data from raw 3D image files (e.g., multi-channel TIFF stacks and segmentation masks).
    • Inputs: Image files (e.g., ../data/init/full_mask_final_segmentation_hwatershed_bg500_90%.tif and marker-specific image stacks).
    • Processing:
      • Reads image masks to identify individual cells and their boundaries.
      • Reads corresponding marker intensity images.
      • Calculates 3D centroid coordinates (x, y, z) and area for each cell.
      • Calculates mean marker intensities for each cell across all channels.
    • Outputs:
      • ../data/<dataset_name>/cell_coordinates.csv: Contains columns like area, x, y, z.
      • ../data/<dataset_name>/cell_intensities.csv: Contains columns representing different markers, with values being mean intensities for each cell.
  2. code/correct.r (R)

    • Purpose: Corrects raw cell marker intensities for spectral spillover between detection channels.
    • Inputs:
      • ../data/<dataset_name>/cell_intensities.csv (from extract.py).
      • ../data/<dataset_name>/compensationMatrix.csv: A matrix defining the spillover coefficients between channels.
      • ../data/<dataset_name>/model_panel.csv (or similar): A panel file mapping raw channel names/metal tags to cleaned marker names.
    • Processing:
      • Applies the inverse of the compensation matrix to the intensity data.
      • Renames marker columns based on the panel file.
      • Optionally filters out specific markers (e.g., DNA intercalators like Hoechst, or unused channels).
      • Generates heatmaps (../images/<dataset_name>/heat_unc.png, ../images/<dataset_name>/heat_cor.png) showing marker correlations before and after compensation.
    • Outputs:
      • ../data/<dataset_name>/expression_annotated_corrected.csv: The compensated and cleaned expression matrix.
  3. code/cluster.py (Python)

    • Purpose: Performs unsupervised clustering of cells based on their corrected marker expression profiles and annotates these clusters.
    • Inputs:
      • ../data/<dataset_name>/expression_annotated_corrected.csv (from correct.r).
      • ../data/<dataset_name>/cell_coordinates.csv (for spatial visualization).
    • Processing:
      • Uses the scanpy library.
      • Scales expression data.
      • Performs Principal Component Analysis (PCA).
      • Builds a neighborhood graph and applies Leiden clustering.
      • Visualizes clusters using t-SNE (../images/<dataset_name>/tsne_unlabeled.png, ../images/<dataset_name>/tsne_stacked.png).
      • Identifies marker genes for clusters (../images/<dataset_name>/ranked_genes_dotplot.png, ../images/<dataset_name>/ranked_genes_violin.png).
      • Allows for manual annotation of clusters to cell types based on marker expression (requires user to define marker_to_cell dictionary in the script).
      • Generates 3D spatial plots of cell clusters and annotated cell types (../images/<dataset_name>/cell_clusters_3d.png, ../images/<dataset_name>/cell_types_3d.png).
    • Outputs:
      • ../data/<dataset_name>/adata.h5ad: A Scanpy AnnData object containing the processed and annotated data.
      • Updates ../data/<dataset_name>/cell_coordinates.csv with new columns for cluster and cell (annotated cell type).
      • Various diagnostic plots.
  4. code/choose.py (Python - as described in original README, not provided for detailed review)

    • Purpose: To select specific cell clusters (obtained from cluster.py) for downstream spatial analysis.
    • Inputs: Likely ../data/<dataset_name>/cell_coordinates.csv (with cluster information).
    • Processing: Filters the coordinate data to retain only cells belonging to the chosen cluster(s).
    • Outputs: A filtered version of cell coordinates, possibly ../data/<dataset_name>/cell_coordinates_selected.csv.
  5. code/prespace.r (R)

    • Purpose: This script serves as a preparatory step for the main spatial analysis. It performs an initial exploration of spatial characteristics, including calculating the 3D PCF, PCFs for multiple random 2D slices, and Clark-Evans (CE) aggregation indices. A key function is to help determine an optimal maximum radius (r_max) for PCF calculations, which can then be used in space.r. It can also identify specific 2D slices based on cell count criteria.
    • Inputs:
      • Path to the cell coordinates CSV file (e.g., ../data/<dataset_name>/cell_coordinates_selected.csv).
      • Optional: n_range (min and max number of cells) to select slices for specific analysis or visualization.
    • Processing:
      • Sources functions from code/module.r and code/visual.r.
      • Calculates the 3D PCF for the input cell data.
      • Generates a specified number of random 2D slices by rotating the 3D data and uses pc_for_slice (from module.r) to compute their 2D PCFs and CE indices.
      • Collects PCF data, cell counts, and CE indices for all slices.
      • If n_range is provided, it identifies rotation angles for slices meeting the cell count criteria and can visualize a representative slice using angle_analysis (from module.r).
      • Generates a cumulative PCF plot comparing the 3D PCF to the distribution of 2D PCFs using cumul_pcf (from visual.r).
      • Prints a suggested "Optimal max_r" to the console.
    • Outputs:
      • Console output: Suggested r_max value.
      • Plots: Cumulative PCF plot, image of a selected slice (if n_range is used).
      • Data: ce_tbl.csv (Clark-Evans indices), pcf_naive.csv (basic PCF characteristics) stored in ../data/naive/<dataset_folder>/<cell_type>/.
  6. Spatial Slicing and PCF Analysis (R: code/space.r, code/module.r, code/visual.r)

    • code/module.r: This is a library of core functions for spatial statistics.
      • Key Functions:
        • rotation(): Creates 3D rotation matrices.
        • slice(): Extracts cells within a 2D slice of defined thickness from a (potentially rotated) 3D point cloud.
        • pc_for_slice(): Takes a 3D cell coordinate matrix, applies rotation, extracts a slice, computes the 2D PCF for that slice using spatstat, and calculates the Clark and Evans aggregation index.
        • ce_idx(): Computes the Clark and Evans index for 2D or 3D point sets.
        • angle_analysis(): Selects and visualizes a slice based on specific rotation angles.
    • code/visual.r (R - details inferred, not fully reviewed): Contains functions for visualizing spatial statistics results.
      • Key Functions (used by space.r/prespace.r):
        • mean_pcf(): Plots the true 3D PCF along with the mean and individual PCFs from 2D slices.
        • cumul_pcf(): Plots the 3D PCF against a cumulative distribution of 2D PCFs.
    • code/space.r (R):
      • Purpose: This is the main script for detailed PCF comparison. It calculates the 3D PCF and then PCFs for numerous random 2D slices, comparing them visually.
      • Inputs:
        • Path to the cell coordinates CSV file (e.g., ../data/<dataset_name>/cell_coordinates_selected.csv).
        • r_max: Maximum radius for PCF calculation (value can be guided by prespace.r output).
      • Processing:
        • Sources functions from code/module.r and code/visual.r.
        • Calculates the true 3D PCF for the input cell data.
        • Generates a large number of random 2D slices by rotating the 3D data.
        • For each slice, computes the 2D PCF using pc_for_slice from module.r.
        • Uses mean_pcf (from visual.r) to generate a plot comparing the 3D PCF to the ensemble of 2D PCFs.
      • Outputs:
        • Plots comparing 3D and 2D PCFs (e.g., ../images/<dataset_name>/mean_pcf_<cell_type>.png).
        • PCF data and cell counts for 3D and 2D slices saved to ../data/pcf/<dataset_folder>/<cell_type>/.

Supporting and Specialized Scripts:

  • code/align.py (Python) & code/module.py (Python):

    • Purpose: These scripts are designed for aligning a series of 2D tissue slices to reconstruct a 3D volume if the input data is in that format (e.g., serial sections). module.py (Python) provides alignment functions (affine and STalign-based LDDMM) used by align.py.
    • Note: If starting with an intact 3D image volume (as processed by extract.py), this step might be used for specific sub-volume registration or might not be strictly part of the main PCF comparison pipeline.
  • code/mimic.r (R):

    • Purpose: Fits mathematical models (e.g., Gamma distribution-based, damped oscillation) to empirical PCF curves. This can help in parameterizing PCF shapes for comparison or interpretation.
  • code/evaluate.r (R) & code/heat.r (R):

    • Purpose: These scripts are used for downstream statistical evaluation and visualization, particularly for comparing results (e.g., p-values from tests comparing 2D vs. 3D metrics) across different datasets, cell types, or conditions.
    • evaluate.r: Generates bar plots of p-values.
    • heat.r: Generates heatmaps of p-values.
    • Inputs: Typically tables of p-values or other statistical results (e.g., ../temp/pval_tbl.csv).

How to Use

Prerequisites:

  • Python Environment:
    • Required libraries: pandas, matplotlib, pathlib, scanpy, plotly, skimage, numpy, torch, STalign (for align.py).
    • It's recommended to use a virtual environment (e.g., conda or venv).
    • scanpy installation might require specific attention to its dependencies.
    • STalign might need to be installed from its source if not available via pip/conda.
    • Kaleido is needed by plotly for static image export.
  • R Environment:
    • Required libraries: pheatmap, tidyverse, spatstat, dplyr, FNN, geometry, minpack.lm, ggplot2, this.path.
    • Ensure R is compiled with Cairo support for better graphics output if using cairo_pdf.

General Running Order (Conceptual):

A typical analysis run for a new dataset would involve the following sequence. Note: Script parameters (like input file paths, dataset names, specific marker lists) will need to be adjusted within each script or passed as arguments if the scripts are modified to accept them.

  1. Prepare Input Data:

    • Place raw 3D image stacks (markers) and segmentation mask (e.g., TIFF files) into a dataset-specific subdirectory under data/init/ (e.g., data/init/my_dataset/).
    • Ensure you have a compensation matrix (compensationMatrix.csv) and a panel file (model_panel.csv or similar) relevant to your markers, and place them in data/<dataset_name>/.
  2. Run code/extract.py:

    • Modify extract.py to point to your raw image and mask files.
    • Execute: python code/extract.py
    • This will generate cell_coordinates.csv and cell_intensities.csv in the appropriate data/<dataset_name>/ directory.
  3. Run code/correct.r:

    • Modify correct.r to point to the correct cell_intensities.csv, compensationMatrix.csv, and panel.csv for your dataset.
    • Also, adjust output image paths if necessary.
    • Execute in R environment: Rscript code/correct.r
    • This will generate expression_annotated_corrected.csv and correlation heatmaps.
  4. Run code/cluster.py:

    • Modify cluster.py to point to the correct expression_annotated_corrected.csv and cell_coordinates.csv.
    • Crucially, you will need to define the marker_to_cell dictionary based on the ranked_genes_dotplot.png and tsne_stacked.png outputs from an initial run of the clustering part of the script (comment out the annotation part first, run, inspect plots, then fill in the dictionary and uncomment the annotation part).
    • Adjust output paths.
    • Execute: python code/cluster.py
    • This generates the adata.h5ad file, updates cell_coordinates.csv with cluster/cell type info, and produces various plots.
  5. Run code/choose.py (if needed):

    • Modify and run this script if you need to select specific cell populations for the PCF analysis.
    • Execute: python code/choose.py (assuming it's adapted for command-line use or run interactively).
  6. Run code/prespace.r (Recommended Prerequisite for space.r):

    • Modify prespace.r to point to the correct (potentially filtered) cell_coordinates.csv.
    • You can optionally provide n_range arguments if you want to select slices with a specific number of cells.
    • Execute in R environment: Rscript code/prespace.r path/to/your/coordinates.csv [min_cells] [max_cells]
    • Note the "Optimal max_r" printed to the console. This will be an input for space.r.
    • This step generates exploratory plots and data regarding slice characteristics.
  7. Run code/space.r (Main Spatial Analysis):

    • Modify space.r to point to the correct cell_coordinates.csv.
    • Pass the path to the coordinates file and the r_max value (obtained from prespace.r or chosen otherwise) as command-line arguments.
    • Execute in R environment: Rscript code/space.r path/to/your/coordinates.csv your_rmax_value
    • This performs the detailed PCF comparisons between 3D and multiple 2D slices and generates the primary output plots and PCF data.
  8. Optional: Run code/mimic.r, code/evaluate.r, code/heat.r:

    • If you want to model PCF curves or perform further aggregate statistical analysis and visualization on results (e.g., from ../data/naive/ or ../data/pcf/ directories, or custom statistical outputs), adapt and run these scripts.

Important Considerations:

  • Paths: Most scripts use relative paths. Ensure you run them from the correct working directory (typically the root of the project or the code/ directory, depending on how paths are set up within each script). The use of this.path::here() in some R scripts helps in setting the working directory to the script's location.
  • Dataset Specificity: Many file paths and some parameters (e.g., marker_to_cell in cluster.py) are hardcoded or specific to example datasets (like "Nat_blood", "Cell_crc"). You MUST adapt these for your own data.
  • Computational Resources: Processing large 3D images, performing alignments, and running clustering on many cells can be computationally intensive and require significant memory and time.
  • Interactivity: Some steps, especially cell type annotation in cluster.py, benefit from interactive execution and inspection of intermediate plots.

Data Description

The data/ directory is organized to hold various datasets, each typically in its own subdirectory (e.g., data/nat_blood/, data/sci_brain/). Common files found within these subdirectories include:

  • cell_coordinates.csv: Stores the 3D spatial coordinates (x, y, z), cell area, and after processing by cluster.py, cluster assignments and annotated cell types.
  • cell_intensities.csv: Raw mean intensities for each marker for every cell.
  • expression_annotated_corrected.csv: Compensated and cleaned marker expression values for each cell.
  • compensationMatrix.csv: Spillover matrix used for intensity correction.
  • model_panel.csv (or similar): Maps instrument channel names/metal tags to biological marker names.
  • cell_coordinates_selected.csv: A subset of cell_coordinates.csv, often after selecting specific cell types/clusters via choose.py.
  • adata.h5ad: Scanpy object storing processed expression data, PCA, t-SNE, and annotations for a dataset.

The images/ directory mirrors this structure for output plots.

Datasets used

  1. Valentina Gandin, Jun Kim, Liang-Zhong Yang, Yumin Lian, Takashi Kawase, Amy Hu, Konrad Rokicki, Greg Fleishman, Paul Tillberg, Alejandro Aguilera Castrejon, Carsen Stringer, Stephan Preibisch, and Zhe J. Liu. (2025). Deep-tissue transcriptomics and subcellular imaging at high spatial resolution. Science, 388(6744), eadq2084.
  2. Kuett, Laura, Catena, Raúl, Özcan, Alaz, Plüss, Alex, Ali, H. R. Sa’d, M. Al, Alon, S. Aparicio, S. Battistoni, G. Balasubramanian, S. Becker, R. Bodenmiller, B. Boyden, E. S. Bressan, D. Bruna, A. Burger, Marcel, Caldas, C. Callari, M. Cannell, I. G. Casbolt, H. Chornay, N. Cui, Y. Dariush, A. Dinh, K. Emenari, A. Eyal-Lubling, Y. Fan, J. Fatemi, A. Fisher, E. González-Solares, E. A. González-Fernández, C. Goodwin, D. Greenwood, W. Grimaldi, F. Hannon, G. J. Harris, S. Jauset, C. Joyce, J. A. Karagiannis, E. D. Kovačević, T. Kuett, L. Kunes, R. Yoldaş, A. Küpcü, Lai, D. Laks, E. Lee, H. Lee, M. Lerda, G. Li, Y. McPherson, A. Millar, N. Mulvey, C. M. Nugent, I. O’Flanagan, C. H. Paez-Ribes, M. Pearsall, I. Qosaj, F. Roth, A. J. Rueda, O. M. Ruiz, T. Sawicka, K. Sepúlveda, L. A. Shah, S. P. Shea, A. Sinha, A. Smith, A. Tavaré, S. Tietscher, S. Vázquez-García, I. Vogl, S. L.Walton, N. A. Wassie, A. T. Watson, S. S. Weselak, J. Wild, S. A. Williams, E. Windhager, J. Xia, C. Zheng, P. Zhuang, X. Schraml, Peter, Moch, Holger, de Souza, Natalie, and Bodenmiller, Bernd. (2021). Three-dimensional imaging mass cytometry for highly multiplexed molecular and cellular mapping of tissues and the tumor microenvironment. Nature Cancer, 3(1), 122–133.
  3. Lin, J. R., Wang, S., Coy, S., Chen, Y. A., Yapp, C., Tyler, M., Nariya, M. K., Heiser, C. N., Lau, K. S., Santagata, S., & Sorger, P. K. (2023). Multiplexed 3D atlas of state transitions and immune interaction in colorectal cancer. Cell, 186(2), 363–381.e19.

About

Second-order spatial analysis for multiplex imaging data

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •