Analysis and visualization of spatial transcriptomics data¶

This tutorial demonstrates how to work with spatial transcriptomics data with Scanpy.

The analysis pipeline largely recapitulates the standard Scanpy workflow and builds upon our best practices tutorial.

Here, we show how to use Scanpy to analyse spatial data using our custom spatial visualization function and an external tool. In this tutorial we focus on 10x genomics Visium spatial transcriptomics data.

This tutorial was generated using the spatial branch of scanpy using the spatialDE package. To run the tutorial, please run the following commands to use the correct branch and install the dependency

pip install git+https://github.com/theislab/scanpy.git@spatial
pip install spatialde

Loading libraries¶

[1]:

import scanpy as sc
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
import seaborn as sb

import SpatialDE

plt.rcParams['figure.figsize']=(8,8)

%load_ext autoreload
%autoreload 2

Reading the data¶

We will use a Visium spatial transcriptomics dataset of the human lymphnode, which is publicly available from the 10x genomics website: link

The function datasets.visium_sge() downloads the dataset from 10x genomics and returns an AnnData object that contains counts, images and spatial coordinates. We will calculate standards QC metrics with pp.calculate_qc_metrics and percentage of mitochondrial read counts per sample.

When using your own Visium data, use Scanpy’s `read_visium() <https://icb-scanpy.readthedocs-hosted.com/en/latest/api/scanpy.read_visium.html>`__ function to import it.

[2]:

adata = sc.datasets.visium_sge('V1_Human_Lymph_Node')
adata.var_names_make_unique()

sc.pp.calculate_qc_metrics(adata, inplace=True)
adata.var['mt'] = [gene.startswith('MT-') for gene in adata.var_names]
adata.obs['mt_frac'] = adata[:, adata.var['mt']].X.sum(1).A.squeeze()/adata.obs['total_counts']

QC and preprocessing¶

Furthermore, we will perform some basic filtering of spots based on total counts and expressed genes

[3]:

fig, axs = plt.subplots(1,4, figsize=(15,4))
fig.suptitle('Covariates for filtering')
sb.distplot(adata.obs['total_counts'], kde=False, ax = axs[0])
sb.distplot(adata.obs['total_counts'][adata.obs['total_counts']<10000], kde=False, bins=40, ax = axs[1])
sb.distplot(adata.obs['n_genes_by_counts'], kde=False, bins=60, ax = axs[2])
sb.distplot(adata.obs['n_genes_by_counts'][adata.obs['n_genes_by_counts']<4000], kde=False, bins=60, ax = axs[3])

[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a2c0c9150>

_images/analysis-visualization-spatial_7_1.png

[4]:

sc.pp.filter_cells(adata, min_counts = 5000)
print(f'Number of cells after min count filter: {adata.n_obs}')
sc.pp.filter_cells(adata, max_counts = 35000)
print(f'Number of cells after max count filter: {adata.n_obs}')
adata = adata[adata.obs['mt_frac'] < 0.2]
print(f'Number of cells after MT filter: {adata.n_obs}')
sc.pp.filter_cells(adata, min_genes = 3000)
print(f'Number of cells after gene filter: {adata.n_obs}')
sc.pp.filter_genes(adata, min_cells=10)
print(f'Number of genes after cell filter: {adata.n_vars}')

Number of cells after min count filter: 3988
Number of cells after max count filter: 3962
Number of cells after MT filter: 3962
Number of cells after gene filter: 3895
Number of genes after cell filter: 18457

We proceed to normalize Visium counts data with the built-in normalize_total method from Scanpy, and detect highly-variable genes (for later). Note that there are more sensible alternatives for normalization (see discussion in sc-tutorial paper and more recent alternatives such as SCTransform or GLM-PCA).

[5]:

sc.pp.normalize_total(adata, inplace = True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=2000, inplace=True)

Dimensionality reduction and clustering¶

We can then proceed to run dimensionality reduction and clustering. We will plot some covariates to check if there is any particular structure in UMAP space associated with total counts and detected genes.

[6]:

sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')
sc.pp.neighbors(adata)

sc.tl.umap(adata)
sc.tl.louvain(adata, key_added='clusters')

[7]:

sc.pl.umap(adata, color=['total_counts', 'n_genes_by_counts'])
sc.pl.umap(adata, color='clusters', palette=sc.pl.palettes.default_20)

_images/analysis-visualization-spatial_13_0.png

_images/analysis-visualization-spatial_13_1.png

Visualization in spatial dimensions¶

We visualized two covariates (total counts per spot and number of genes by counts) in UMAP space. We can have a look at how the covariates are projected onto the spatial coordinates. We will overlay the circular spots on top of the Hematoxylin and eosin stain (H&E) image provided, using the function pl.spatial.

[8]:

sc.pl.spatial(adata, img_key = "hires",color=['total_counts', 'n_genes_by_counts'])

_images/analysis-visualization-spatial_15_0.png

the function pl.spatial accepts 4 additional parameters: * img_key str: key where the img is stored in the adata.uns element * crop_coord tuple: coordinates to use for cropping (left, right, top, bottom) * alpha_img float: alpha value for the transcparency of the image * bw bool: flag to convert the image into gray scale

furthermore, in pl.spatial the size parameter changes its behaviour: it becomes a scaling factor for the spot sizes.

Before, we performed clustering in gene expression space, and visualized the results with UMAP. By visualizing clustered samples in spatial dimensions, we can gain insights into tissue organization and inter-cellular communication.

[9]:

sc.pl.spatial(adata, img_key = "hires", color="clusters", size=1.5)

_images/analysis-visualization-spatial_17_0.png

From a visual perspective, spots belonging to the same cluster in gene expression space seems to co-occur in spatial dimensions. For instance, spots belonging to cluster 4 seem to always be surrounded by spots belonging to cluster 5.

We can zoom in specific regions of interests to gain qualitative insights. Furthermore, by changing the alpha values of the spots, we can visualize better the underlying tissue morphology from the H&E image.

[10]:

sc.pl.spatial(adata, img_key = "hires", color="clusters", groups = ["4","5"], crop_coord = [1200,1700,1900,1000], alpha = .5, size = 1.3)

_images/analysis-visualization-spatial_19_0.png

Identification of marker genes and spatially variable genes¶

After clustering, we can proceed to identify cluster marker genes.

[11]:

sc.tl.rank_genes_groups(adata, "clusters", inplace = True)
sc.pl.rank_genes_groups_heatmap(adata, groups = "4", groupby = "clusters", show = False)

WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 4

_images/analysis-visualization-spatial_21_1.png

Given the spatial organization of cluster 4, which seemed to occur in small groups of spots across the image, we have plotted a heatmap with expression levels of its top 10 marker genes across clusters.

We can have a look at one specific gene and notice that it already show some interesting structural expression patterns in the image.

[12]:

sc.pl.spatial(adata, img_key = "hires", color="SERPINE2")

_images/analysis-visualization-spatial_23_0.png

Spatial transcriptomics allows researchers to investigate how gene expression trends varies in space, thus identifying spatial patterns of gene expression. For this purpose, we use SpatialDE (paper - code), a Gaussian process-based statistical framework that aims to identify spatially variable genes.

First, we convert normalized counts and coordinates to pandas dataframe, needed for inputs to spatialDE. Then we can run the method.

[13]:

counts = pd.DataFrame(adata.X.todense(), columns=adata.var_names, index=adata.obs_names)
coord = pd.DataFrame(adata.obsm["X_spatial"], columns=["x_coord", "y_coord"], index = adata.obs_names)

results = SpatialDE.run(coord, counts)

We concatenate the results with the variable table adata.var

[14]:

results.index = results["g"]
adata.var = pd.concat([adata.var, results.loc[adata.var.index.values,:]], axis = 1)

Then we can inspect significant genes that varies in space and visualize them with sc.pl.spatial function.

[15]:

results.sort_values("qval").head(10)

[15]:

	FSV	M	g	l	max_delta	max_ll	max_mu_hat	max_s2_t_hat	model	n	s2_FSV	s2_logdelta	time	BIC	max_ll_null	LLR	pval	qval
g
ITGAX	0.170702	4	ITGAX	248.801360	4.832369	-3302.354218	0.978878	0.095954	SE	3895	0.000005	0.000257	0.002626	6637.778231	-3430.115338	127.761120	0.0	0.0
CTGF	0.208433	4	CTGF	248.801360	3.777540	-3601.962867	1.059363	0.131464	SE	3895	0.000004	0.000154	0.002593	7236.995531	-3820.898121	218.935254	0.0	0.0
RPS12	0.578740	4	RPS12	248.801360	0.724028	1282.631794	4.262369	0.896279	SE	3895	0.000002	0.000044	0.002194	-2532.193792	132.889696	1149.742098	0.0	0.0
PIK3IP1	0.165093	4	PIK3IP1	474.169746	4.960988	-3121.508186	1.247603	0.075819	SE	3895	0.000013	0.000647	0.002727	6276.086168	-3327.150590	205.642404	0.0	0.0
CSF2RB	0.118427	4	CSF2RB	474.169746	7.302389	-3455.655137	1.032037	0.058573	SE	3895	0.000017	0.001423	0.002386	6944.380070	-3549.412548	93.757410	0.0	0.0
RAC2	0.188012	4	RAC2	474.169746	4.236631	-1862.354983	2.130281	0.095923	SE	3895	0.000017	0.000701	0.002674	3757.779762	-2074.893341	212.538358	0.0	0.0
DDX17	0.080435	4	DDX17	474.169746	11.214828	-1099.271702	2.480615	0.084385	SE	3895	0.000009	0.001430	0.007070	2231.613199	-1177.162770	77.891068	0.0	0.0
EZR	0.237460	4	EZR	248.801360	3.194199	-2561.906858	1.833875	0.199101	SE	3895	0.000004	0.000124	0.002783	5156.883512	-2825.270277	263.363418	0.0	0.0
CTLA4	0.114898	4	CTLA4	474.169746	7.556782	-2927.151343	0.548249	0.036851	SE	3895	0.000013	0.001080	0.002408	5887.372483	-3057.224784	130.073441	0.0	0.0
ACTB	0.459669	4	ACTB	248.801360	1.169239	1190.410976	4.432608	0.932020	SE	3895	0.000002	0.000047	0.002367	-2347.752157	332.192634	858.218343	0.0	0.0

[16]:

sc.pl.spatial(adata,img_key = "hires",color = ["CTLA4", "EZR"], alpha = 0.7)

_images/analysis-visualization-spatial_30_0.png

Recently, other tools have been proposed for the identification of spatially variable genes, such as: * SPARK paper - code * trendsceek paper - code * HMRF paper - code

MERFISH example¶

In case you have spatial data generated with FISH-based techniques, just read the cordinate table and assign it to the adata.obsm element.

Let’s see an example from Xia et al. 2019.

First, we need to download the coordinate and counts data from the original publication. We will download the files in the current directory…

[18]:

import urllib.request
url_coord = 'https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/15/pnas.1912459116.sd15.xlsx'
filename_coord = 'pnas.1912459116.sd15.xlsx'
urllib.request.urlretrieve(url_coord, filename_coord)

url_counts = 'https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/12/pnas.1912459116.sd12.csv'
filename_counts = 'pnas.1912459116.sd12.csv'
urllib.request.urlretrieve(url_counts, filename_counts)

[18]:

('pnas.1912459116.sd12.csv', <http.client.HTTPMessage at 0x1a3a7f8850>)

..and read the data in a AnnData object.

[19]:

coordinates = pd.read_excel("./pnas.1912459116.sd15.xlsx", index_col = 0)
counts = sc.read_csv("./pnas.1912459116.sd12.csv").transpose()

adata_merfish = counts[coordinates.index,:]
adata_merfish.obsm["X_spatial"] = coordinates.to_numpy()

We will perform standard preprocessing and dimensionality reduction.

[20]:

sc.pp.normalize_per_cell(adata_merfish, counts_per_cell_after=1e6)
sc.pp.log1p(adata_merfish)

sc.pp.pca(adata_merfish, n_comps=15)
sc.pp.neighbors(adata_merfish)
sc.tl.louvain(adata_merfish, key_added='groups', resolution=0.5)

The experiment consisted in measuring gene expression counts from a single cell type (cultured U2-OS cells). Clusters consist of cell states at different stages of the cell cycle. We don’t expect to see specific structure in spatial dimensions given the experimental setup.

We can visualize the clusters obtained from running the Louvain algorithm in tSNE space…

[21]:

sc.tl.tsne(adata_merfish, perplexity = 100,  n_jobs=12)
sc.pl.tsne(adata_merfish, color = "groups")

_images/analysis-visualization-spatial_39_0.png

… and spatial coordinates.

[22]:

sc.pl.spatial(adata_merfish, color = "groups")

_images/analysis-visualization-spatial_41_0.png

Conclusion and future plans¶

This is the end of the tutorial, we hope you’ll find it useful and report back to us which features/external tools you would like to see in Scanpy. We are extending Scanpy and AnnData to support other spatial data types, such as Imaging Mass Cytometry and extend data structure to support spatial graphs and additional features. Stay tuned!