Danger
Please access this document in its canonical location as the currently accessed page may not be rendered correctly: Analysis and visualization of spatial transcriptomics data
Analysis and visualization of spatial transcriptomics data#
Author: Giovanni Palla
.. note::
For up-to-date analysis tutorials, kindly check out SquidPy tutorials.
This tutorial demonstrates how to work with spatial transcriptomics data within Scanpy.
We focus on 10x Genomics Visium data, and provide an example for MERFISH.
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sc.logging.print_versions()
sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3
-----
anndata 0.11.0.dev78+g64ab900
scanpy 1.10.0rc2.dev6+g14555ba4.d20240226
-----
PIL 10.2.0
anyio NA
appnope 0.1.3
arrow 1.3.0
asciitree NA
asttokens NA
attr 23.2.0
attrs 23.2.0
babel 2.14.0
certifi 2023.11.17
cffi 1.16.0
charset_normalizer 3.3.2
cloudpickle 3.0.0
comm 0.2.1
cycler 0.12.1
cython_runtime NA
dask 2024.1.0
dateutil 2.8.2
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
executing 2.0.1
fasteners 0.19
fastjsonschema NA
fqdn NA
h5py 3.10.0
idna 3.6
igraph 0.10.8
ipykernel 6.28.0
ipywidgets 8.1.1
isoduration NA
jedi 0.19.1
jinja2 3.1.3
joblib 1.3.2
json5 NA
jsonpointer 2.4
jsonschema 4.20.0
jsonschema_specifications NA
jupyter_events 0.9.0
jupyter_server 2.12.4
jupyterlab_server 2.25.2
kiwisolver 1.4.5
legacy_api_wrap NA
leidenalg 0.10.1
llvmlite 0.41.1
markupsafe 2.1.3
matplotlib 3.8.2
mpl_toolkits NA
msgpack 1.0.7
natsort 8.4.0
nbformat 5.9.2
numba 0.58.1
numcodecs 0.12.1
numpy 1.26.3
overrides NA
packaging 23.2
pandas 2.2.0
parso 0.8.3
patsy 0.5.6
pexpect 4.9.0
platformdirs 4.1.0
prometheus_client NA
prompt_toolkit 3.0.43
psutil 5.9.7
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.17.2
pyparsing 3.1.1
pythonjsonlogger NA
pytz 2023.3.post1
referencing NA
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rpds NA
scipy 1.11.4
seaborn 0.13.1
send2trash NA
session_info 1.0.0
setuptools 68.2.2
setuptools_scm NA
sitecustomize NA
six 1.16.0
sklearn 1.3.2
sniffio 1.3.0
sparse 0.15.1
stack_data 0.6.3
statsmodels 0.14.1
tblib 3.0.0
texttable 1.7.0
threadpoolctl 3.2.0
tlz 0.12.0
toolz 0.12.0
tornado 6.4
traitlets 5.14.1
typing_extensions NA
uri_template NA
urllib3 2.1.0
wcwidth 0.2.13
webcolors 1.13
websocket 1.7.0
yaml 6.0.1
zarr 2.16.1
zipp NA
zmq 25.1.2
-----
IPython 8.20.0
jupyter_client 8.6.0
jupyter_core 5.7.1
jupyterlab 4.0.10
notebook 7.0.6
-----
Python 3.11.6 (main, Nov 2 2023, 04:39:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
macOS-13.6.1-arm64-arm-64bit
-----
Session information updated at 2024-02-26 17:28
Reading the data#
We will use a Visium spatial transcriptomics dataset of the human lymphnode, which is publicly available from the 10x genomics website: link.
The function datasets.visium_sge() downloads the dataset from 10x Genomics and returns an AnnData
object that contains counts, images and spatial coordinates. We will calculate standards QC metrics with pp.calculate_qc_metrics and percentage of mitochondrial read counts per sample.
When using your own Visium data, use sc.read_visium() function to import it.
adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata.var_names_make_unique()
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
reading /Users/ilangold/Projects/Theis/scanpy-tutorials/spatial/data/V1_Human_Lymph_Node/filtered_feature_bc_matrix.h5
(0:00:01)
This is how the adata structure looks like for Visium data
adata
AnnData object with n_obs × n_vars = 4035 × 36601
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
uns: 'spatial'
obsm: 'spatial'
QC and preprocessing#
We perform some basic filtering of spots based on total counts and expressed genes
fig, axs = plt.subplots(1, 4, figsize=(15, 4))
sns.histplot(adata.obs["total_counts"], kde=False, ax=axs[0])
sns.histplot(
adata.obs["total_counts"][adata.obs["total_counts"] < 10000],
kde=False,
bins=40,
ax=axs[1],
)
sns.histplot(adata.obs["n_genes_by_counts"], kde=False, bins=60, ax=axs[2])
sns.histplot(
adata.obs["n_genes_by_counts"][adata.obs["n_genes_by_counts"] < 4000],
kde=False,
bins=60,
ax=axs[3],
)
sc.pp.filter_cells(adata, min_counts=5000)
sc.pp.filter_cells(adata, max_counts=35000)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
print(f"#cells after MT filter: {adata.n_obs}")
sc.pp.filter_genes(adata, min_cells=10)
filtered out 44 cells that have less than 5000 counts
filtered out 130 cells that have more than 35000 counts
#cells after MT filter: 3861
filtered out 16916 genes that are detected in less than 10 cells
We proceed to normalize Visium counts data with the built-in normalize_total
method from Scanpy, and detect highly-variable genes (for later). Note that there are alternatives for normalization (see discussion in [Luecken19], and more recent alternatives such as SCTransform or GLM-PCA).
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)
normalizing counts per cell
finished (0:00:00)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
Manifold embedding and clustering based on transcriptional similarity#
To embed and cluster the manifold encoded by transcriptional similarity, we proceed as in the standard clustering tutorial.
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(
adata, key_added="clusters", flavor="igraph", directed=False, n_iterations=2
)
computing PCA
with n_comps=50
finished (0:00:30)
computing neighbors
using 'X_pca' with n_pcs = 50
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:05)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:12)
running Leiden clustering
finished: found 10 clusters and added
'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
We plot some covariates to check if there is any particular structure in the UMAP associated with total counts and detected genes.
Visualization in spatial coordinates#
Let us now take a look at how total_counts
and n_genes_by_counts
behave in spatial coordinates. We will overlay the circular spots on top of the Hematoxylin and eosin stain (H&E) image provided, using the function sc.pl.spatial.
plt.rcParams["figure.figsize"] = (8, 8)
sc.pl.spatial(adata, img_key="hires", color=["total_counts", "n_genes_by_counts"])
The function sc.pl.spatial accepts 4 additional parameters:
img_key
: key where the img is stored in theadata.uns
elementcrop_coord
: coordinates to use for cropping (left, right, top, bottom)alpha_img
: alpha value for the transcparency of the imagebw
: flag to convert the image into gray scale
Furthermore, in sc.pl.spatial, the size
parameter changes its behaviour: it becomes a scaling factor for the spot sizes.
Before, we performed clustering in gene expression space, and visualized the results with UMAP. By visualizing clustered samples in spatial dimensions, we can gain insights into tissue organization and, potentially, into inter-cellular communication.
Spots belonging to the same cluster in gene expression space often co-occur in spatial dimensions. For instance, spots belonging to cluster 5 are often surrounded by spots belonging to cluster 0.
We can zoom in specific regions of interests to gain qualitative insights. Furthermore, by changing the alpha values of the spots, we can visualize better the underlying tissue morphology from the H&E image.
Cluster marker genes#
Let us further inspect cluster 5, which occurs in small groups of spots across the image.
Compute marker genes and plot a heatmap with expression levels of its top 10 marker genes across clusters.
sc.tl.rank_genes_groups(adata, "clusters", method="t-test")
sc.pl.rank_genes_groups_heatmap(adata, groups="9", n_genes=10, groupby="clusters")
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_clusters']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 3
We see that CR2 recapitulates the spatial structure.
MERFISH example#
In case you have spatial data generated with FISH-based techniques, just read the cordinate table and assign it to the adata.obsm
element.
Let’s take a look at the example from Xia et al. 2019.
First, we need to download the coordinate and counts data from the original publication: coordinates to ./data/pnas.1912459116.sd15.csv and counts to ./data/pnas.1912459116.sd12.csv
# If needed:
# %pip install openpyxl
coordinates = pd.read_excel("./data/pnas.1912459116.sd15.xlsx", index_col=0)
counts = sc.read_csv("./data/pnas.1912459116.sd12.csv").transpose()
adata_merfish = counts[coordinates.index, :].copy()
adata_merfish.obsm["spatial"] = coordinates.to_numpy()
We will perform standard preprocessing and dimensionality reduction.
sc.pp.normalize_per_cell(adata_merfish, counts_per_cell_after=1e6)
sc.pp.log1p(adata_merfish)
sc.pp.pca(adata_merfish, n_comps=15)
sc.pp.neighbors(adata_merfish)
sc.tl.umap(adata_merfish)
sc.tl.leiden(
adata_merfish,
key_added="clusters",
resolution=0.5,
n_iterations=2,
flavor="igraph",
directed=False,
)
normalizing by total count per cell
finished (0:00:00): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
computing PCA
with n_comps=15
finished (0:00:08)
computing neighbors
using 'X_pca' with n_pcs = 15
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:01)
running Leiden clustering
finished: found 6 clusters and added
'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
The experiment consisted in measuring gene expression counts from a single cell type (cultured U2-OS cells). Clusters consist of cell states at different stages of the cell cycle. We don’t expect to see specific structure in spatial dimensions given the experimental setup.
We can visualize the clusters obtained from running Leiden in UMAP space and spatial coordinates like this.
adata_merfish
AnnData object with n_obs × n_vars = 645 × 12903
obs: 'n_counts', 'clusters'
uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden'
obsm: 'spatial', 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
sc.pl.umap(adata_merfish, color="clusters")
sc.pl.embedding(adata_merfish, basis="spatial", color="clusters")
We hope you found the tutorial useful!
Report back to us which features/external tools you would like to see in Scanpy.
We are extending Scanpy and AnnData to support other spatial data types, such as Imaging Mass Cytometry and extend data structure to support spatial graphs and additional features. Stay tuned!