Scanpy tutorials

See this page for more context.

Preprocessing and clustering 3k PBMCs

Translator: Alex Wolf

In May 2017, this started out as a demonstration that Scanpy would allow to reproduce most of Seurat’s (Satija et al., 2015) guided clustering tutorial. We gratefully acknowledge the authors of Seurat for the tutorial. In the meanwhile, we have added and removed a few pieces.

The data consists in 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics (here from this webpage). On a unix system, you can uncomment and run the following to download and unpack the data

[1]:
# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz

Note

Download the notebook by clicking on the Edit on GitHub button. On GitHub, you can download using the Raw button via right-click and Save Link As. Alternatively, download the whole scanpy-tutorial repository.

Note

In Jupyter notebooks and lab, you can see the documentation for a python function by hitting SHIFT + TAB. Hit it twice to expand the view.

[2]:
import numpy as np
import pandas as pd
import scanpy as sc
[3]:
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80)
scanpy==1.4.5 anndata==0.6.22.post1 umap==0.3.8 numpy==1.16.3 scipy==1.3.0 pandas==0.23.4 scikit-learn==0.22 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1
[4]:
results_file = './write/pbmc3k.h5ad'  # the file that will store the analysis results
[5]:
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True)                                # write a cache file for faster subsequent reading
... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
[6]:
adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
[7]:
adata
[7]:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

Preprocessing

Show those genes that yield the highest fraction of counts in each single cells, across all cells.

[8]:
sc.pl.highest_expr_genes(adata, n_top=20, )
normalizing counts per cell
    finished (0:00:00)
_images/pbmc3k_14_1.png

Basic filtering.

[9]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells

Let us assemple some information about mitochondrial genes, which are important for quality control.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):

High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

Note you can also use the function pp.calculate_qc_metrics to compute the fraction of mitochondrial genes and additional measures.

[10]:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

A violin plot of the computed quality measures.

[11]:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)
_images/pbmc3k_21_0.png

Remove cells that have too many mitochondrial genes expressed or too many total counts.

[12]:
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
_images/pbmc3k_23_0.png
_images/pbmc3k_23_1.png
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 2700 × 13714
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells'

Actually do the filtering.

[14]:
adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.05, :]

Total-count normalize (library-size correct) the data matrix \(\mathbf{X}\) to 10,000 reads per cell, so that counts become comparable among cells.

[15]:
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
    finished (0:00:01)

Logarithmize the data.

[16]:
sc.pp.log1p(adata)

Set the .raw attribute of AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.

While many people consider the normalized data matrix as the “relevant data” for visualization and differential testing, some would prefer to store the unnormalized data in .raw.

[17]:
adata.raw = adata

Note

If you don’t proceed below with correcting the data with sc.pp.regress_out and scaling it via sc.pp.scale, you can also get away without using .raw at all.

The result of the following highly-variable-genes detection is stored as an annotation in .var.highly_variable and auto-detected by PCA and hence, sc.pp.neighbors and subsequent manifold/graph tools. In that case, the step actually do the filtering below is unnecessary, too.

Identify highly-variable genes.

[18]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
[19]:
sc.pl.highly_variable_genes(adata)
_images/pbmc3k_36_0.png

Actually do the filtering.

[20]:
adata = adata[:, adata.var.highly_variable]

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.

[21]:
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
regressing out ['n_counts', 'percent_mito']
    sparse input is densified and may lead to high memory use
    finished (0:00:04)

Scale each gene to unit variance. Clip values exceeding standard deviation 10.

[22]:
sc.pp.scale(adata, max_value=10)

Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

[23]:
sc.tl.pca(adata, svd_solver='arpack')
computing PCA with n_comps = 50
    on highly variable genes
    finished (0:00:00)

We can make a scatter plot in the PCA coordinates, but we will not use that later on.

[24]:
sc.pl.pca(adata, color='CST3')
_images/pbmc3k_47_0.png

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.louvain() or tSNE sc.tl.tsne(). In our experience, often, a rough estimate of the number of PCs does fine.

[25]:
sc.pl.pca_variance_ratio(adata, log=True)
_images/pbmc3k_49_0.png

Save the result.

[26]:
adata.write(results_file)
[27]:
adata
[27]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing the neighborhood graph

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of reproducing Seurat’s results, let’s take the following values.

[28]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished: added to `.uns['neighbors']`
    'distances', distances for each pair of neighbors
    'connectivities', weighted adjacency matrix (0:00:03)

Embedding the neighborhood graph

We advertise embedding the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preservers trajectories. In some ocassions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:

tl.paga(adata)
pl.paga(adata, plot=False)  # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')
[29]:
sc.tl.umap(adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:04)
[30]:
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
_images/pbmc3k_59_0.png

As we set the .raw attribute of adata, the previous plots showed the “raw” (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you don’t want to use .raw.

[31]:
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
_images/pbmc3k_61_0.png

Clustering the neighborhood graph

As Seurat and many others, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

[32]:
sc.tl.leiden(adata)
running Leiden clustering
    finished: found 8 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

Plot the clusters, which agree quite well with the result of Seurat.

[33]:
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
_images/pbmc3k_66_0.png

Save the result.

[34]:
adata.write(results_file)

Finding marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

[35]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
_images/pbmc3k_71_1.png
[36]:
sc.settings.verbosity = 2  # reduce the verbosity

The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy.

[37]:
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished (0:00:02)
_images/pbmc3k_74_1.png

Save the result.

[38]:
adata.write(results_file)

As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. (2014) has more details.

[39]:
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished (0:00:05)
_images/pbmc3k_78_1.png

With the exceptions of IL7R, which is only found by the t-test and FCER1A, which is only found by the other two appraoches, all marker genes are recovered in all approaches.

Louvain Group Markers Cell Type
0 IL7R CD4 T cells
1 CD14, LYZ CD14+ Monocytes
2 MS4A1 B cells
3 CD8A CD8 T cells
4 GNLY, NKG7 NK cells
5 FCGR3A, MS4A7 FCGR3A+ Monocytes
6 FCER1A, CST3 Dendritic Cells
7 PPBP Megakaryocytes

Let us also define a list of marker genes for later reference.

[40]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

Reload the object that has been save with the Wilcoxon Rank-Sum test result.

[41]:
adata = sc.read(results_file)

Show the 10 top ranked genes per cluster 0, 1, …, 7 in a dataframe.

[42]:
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
[42]:
0 1 2 3 4 5 6 7
0 RPS12 LYZ CD74 CCL5 NKG7 LST1 HLA-DPA1 PF4
1 LDHB S100A9 CD79A NKG7 GZMB FCER1G HLA-DPB1 GNG11
2 RPS25 S100A8 HLA-DRA B2M GNLY AIF1 HLA-DRA SDPR
3 RPS27 TYROBP CD79B CST7 CTSW COTL1 HLA-DRB1 PPBP
4 RPS6 FTL HLA-DPB1 IL32 PRF1 FCGR3A CD74 NRGN

Get a table with the scores and groups.

[43]:
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)
[43]:
0_n 0_p 1_n 1_p 2_n 2_p 3_n 3_p 4_n 4_p 5_n 5_p 6_n 6_p 7_n 7_p
0 RPS12 4.806932e-219 LYZ 1.007060e-252 CD74 3.043536e-182 CCL5 1.340647e-119 NKG7 4.689070e-95 LST1 5.650219e-85 HLA-DPA1 5.422417e-21 PF4 4.722886e-10
1 LDHB 5.645430e-216 S100A9 3.664292e-248 CD79A 6.860832e-170 NKG7 5.020256e-98 GZMB 2.381363e-89 FCER1G 1.697236e-81 HLA-DPB1 7.591860e-21 GNG11 4.733899e-10
2 RPS25 3.355677e-195 S100A8 9.457377e-239 HLA-DRA 8.398068e-166 B2M 3.800068e-82 GNLY 9.322195e-87 AIF1 1.377723e-79 HLA-DRA 1.306768e-19 SDPR 4.733899e-10
3 RPS27 3.322679e-185 TYROBP 2.209430e-224 CD79B 1.171444e-153 CST7 5.664351e-78 CTSW 1.035081e-85 COTL1 9.684016e-78 HLA-DRB1 1.865104e-19 PPBP 4.744938e-10
4 RPS6 2.400370e-183 FTL 3.910903e-219 HLA-DPB1 6.167786e-148 IL32 6.704290e-73 PRF1 3.364126e-85 FCGR3A 2.516161e-76 CD74 5.853161e-19 NRGN 4.800511e-10

Compare to a single cluster.

[44]:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
ranking genes
    finished (0:00:01)
_images/pbmc3k_89_1.png

If we want a more detailed view for a certain group, use sc.pl.rank_genes_groups_violin.

[45]:
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
_images/pbmc3k_91_0.png

Reload the object that computed differential expression by comparing to the rest of the groups.

[46]:
adata = sc.read(results_file)
[47]:
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
_images/pbmc3k_94_0.png

If you want to compare a certain gene across groups, use the following.

[48]:
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
_images/pbmc3k_96_0.png

Actually mark the cell types.

[49]:
new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
[50]:
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
WARNING: saving figure to file figures/umap.pdf
_images/pbmc3k_99_1.png

Now that we annotated the cell types, let us visualize the marker genes.

[51]:
ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
_images/pbmc3k_101_0.png

There is also a very compact violin plot.

[52]:
ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)
_images/pbmc3k_103_0.png

During the course of this analysis, the AnnData accumlated the following annotations.

[53]:
adata
[53]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
[54]:
adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.

If you want to share this file with people who merely want to use it for visualization, a simple way to reduce the file size is by removing the dense scaled and corrected data matrix. The file still contains the raw data used in the visualizations.

[55]:
adata.X = None
adata.write('./write/pbmc3k_withoutX.h5ad', compression='gzip')

If you want to export to “csv”, you have the following options:

[56]:
# Export single fields of the annotation of observations
# adata.obs[['n_counts', 'louvain_groups']].to_csv(
#     './write/pbmc3k_corrected_louvain_groups.csv')

# Export single columns of the multidimensional annotation
# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
#     './write/pbmc3k_corrected_X_pca.csv')

# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
# adata.write_csvs(results_file[:-5], )

Analyzing CITE-seq data

Author: Isaac Virshup

[1]:
import scanpy as sc
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
[2]:
sc.logging.print_versions()
sc.set_figure_params(frameon=False, figsize=(4, 4))
scanpy==1.4.5.2.dev45+g890bc1e.d20200319 anndata==0.7.2.dev24+g669dd44 umap==0.4.0rc1 numpy==1.18.1 scipy==1.4.1 pandas==1.0.1 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1

Reading

[3]:
# !mkdir -p data
# !wget http://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3_nextgem/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5 -O data/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5
[4]:
datafile = "data/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5"
[5]:
pbmc = sc.read_10x_h5(datafile, gex_only=False)
pbmc.var_names_make_unique()
pbmc.layers["counts"] = pbmc.X.copy()
sc.pp.filter_genes(pbmc, min_counts=1)
pbmc
[5]:
AnnData object with n_obs × n_vars = 5527 × 21453
    var: 'gene_ids', 'feature_types', 'genome', 'n_counts'
    layers: 'counts'
[6]:
pbmc.var["feature_types"].value_counts()
[6]:
Gene Expression     21421
Antibody Capture       32
Name: feature_types, dtype: int64

Preprocessing

For ease of preprocessing, we’ll split the data into seperate protein and RNA AnnData objects:

[7]:
protein = pbmc[:, pbmc.var["feature_types"] == "Antibody Capture"].copy()
rna = pbmc[:, pbmc.var["feature_types"] == "Gene Expression"].copy()
[8]:
rna.shape
[8]:
(5527, 21421)
[9]:
pbmc.shape
[9]:
(5527, 21453)

Protein

First we’ll take a look at the antibody counts.

We still need to explain the function here. I’m happy if we add it to the first tutorial, too (I know you did it already at some point, but I didn’t want to let go of the simpler naming scheme back then; now I’d be happy to transition.)

[10]:
protein.var["control"] = protein.var_names.str.contains("control")
sc.pp.calculate_qc_metrics(
    protein,
    percent_top=(5, 10, 15),
    var_type="antibodies",
    qc_vars=("control",),
    inplace=True,
)

We can look check out the qc metrics for our data:

TODO: I would like to include some justification for the change in normalization. It definitley has a much different distribution than transcripts. I think this could be shown through the qc plots, but it’s a huge pain to move around these matplotlib plots. This might be more appropriate for the in-depth guide though.

Agreed! But, having an explanation of the two images below is also a good first step. Also nice to contrast it with the corresponding RNA distributions.

[11]:
sns.jointplot("log1p_total_counts", "n_antibodies_by_counts", protein.obs, kind="hex", norm=mpl.colors.LogNorm())
sns.jointplot("log1p_total_counts", "log1p_total_counts_control", protein.obs, kind="hex", norm=mpl.colors.LogNorm())
[11]:
<seaborn.axisgrid.JointGrid at 0x1410d7f90>
_images/cite-seq_pbmc5k_19_1.png
_images/cite-seq_pbmc5k_19_2.png
[12]:
protein.layers["counts"] = protein.X.copy()

Discuss that this here is a different normalization.

[13]:
sc.pp.normalize_geometric(protein)
sc.pp.log1p(protein)
[14]:
sc.pp.pca(protein, n_comps=20)
sc.pp.neighbors(protein, n_neighbors=30)  # why can't we just work with the default neighbors?
sc.tl.leiden(protein, key_added="protein_leiden")
[15]:
# TODO: remove
protein.obsp["protein_connectivities"] = protein.obsp["connectivities"].copy()
sc.tl.umap(protein)
sc.pl.umap(protein, color="protein_leiden", size=10)
_images/cite-seq_pbmc5k_24_0.png
[16]:
protein
[16]:
AnnData object with n_obs × n_vars = 5527 × 32
    obs: 'n_antibodies_by_counts', 'log1p_n_antibodies_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_5_antibodies', 'pct_counts_in_top_10_antibodies', 'pct_counts_in_top_15_antibodies', 'total_counts_control', 'log1p_total_counts_control', 'pct_counts_control', 'protein_leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_counts', 'control', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'log1p', 'pca', 'neighbors', 'leiden', 'umap', 'protein_leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'distances', 'connectivities', 'protein_connectivities'

RNA

Now we’ll process our RNA in the typical way:

[17]:
# sc.pp.filter_genes(rna, min_counts=1)

rna.var["mito"] = rna.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(rna, qc_vars=["mito"], inplace=True)

Yes, would be great to the see the QC plots, here, too. Are cells with low counts etc. weird only on the RNA level or also on the protein level? It’d be nice to see the QC plots side-by-side between RNA and protein, I’d say.

[18]:
rna.layers["counts"] = rna.X.copy()
[19]:
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
[20]:
sc.pp.pca(rna)
sc.pp.neighbors(rna, n_neighbors=30)   # why can't we just work with the default neighbors?
sc.tl.umap(rna)
sc.tl.leiden(rna, key_added="rna_leiden")

Visualization

An alternative would be to recombine the data into the pbmc object here.

[21]:
rna.obsm["protein"] = protein.to_df()
rna.obsm["protein_umap"] = protein.obsm["X_umap"]
rna.obs["protein_leiden"] = protein.obs["protein_leiden"]
rna.obsp["rna_connectivities"] = rna.obsp["connectivities"].copy()
rna.obsp["protein_connectivities"] = protein.obsp["protein_connectivities"]
[22]:
sc.tl.umap(rna)
[23]:
sc.pl.umap(rna, color=["rna_leiden", "protein_leiden"], size=10)
sc.pl.embedding(rna, basis="protein_umap", color=["rna_leiden", "protein_leiden"], size=10)
_images/cite-seq_pbmc5k_37_0.png
_images/cite-seq_pbmc5k_37_1.png

The clusterings disagree quite a bit. It looks like neither modality is giving a full accounting of the data, so now we’ll see what we can learn by combining the modalities.

Plotting values between modalities

  • TODO: Get selectors like this working in scanpy.

Imports and helper functions

[24]:
import altair as alt
from functools import partial

alt.renderers.enable("png")
alt.data_transformers.disable_max_rows()
[24]:
DataTransformerRegistry.enable('default')
[25]:
def embedding_chart(df: pd.DataFrame, coord_pat: str, *, size=5) -> alt.Chart:
    """Make schema for coordinates, like sc.pl.embedding."""
    x, y = df.columns[df.columns.str.contains(coord_pat)]
    return (
        alt.Chart(plotdf, height=300, width=300)
        .mark_circle(size=size)
        .encode(
            x=alt.X(x, axis=None),
            y=alt.Y(y, axis=None),
        )
    )

def umap_chart(df: pd.DataFrame, **kwargs) -> alt.Chart:
    """Like sc.pl.umap, but just the coordinates."""
    return embedding_chart(df, "umap", **kwargs)

def encode_color(c: alt.Chart, col: str, *, qdomain=(0, 1), scheme: str = "lightgreyred") -> alt.Chart:
    """Add colors to an embedding plot schema."""
    base = c.properties(title=col)
    if pd.api.types.is_categorical(c.data[col]):
        return base.encode(color=col)
    else:
        return base.encode(
            color=alt.Color(
                col,
                scale=alt.Scale(
                    scheme=scheme,
                    clamp=True,
                    domain=list(c.data[col].quantile(qdomain)),
                    nice=True,
                )
            )
        )
[26]:
plotdf = sc.get.obs_df(
    rna,
    obsm_keys=[("X_umap", i) for i in range(2)] + [("protein", i) for i in rna.obsm["protein"].columns]
)
(
    alt.concat(
        *map(partial(encode_color, umap_chart(plotdf), qdomain=(0, .95)), plotdf.columns[5:10]),
        columns=2
    )
    .resolve_scale(color='independent')
    .configure_axis(grid=False)
)
[26]:
_images/cite-seq_pbmc5k_43_0.png

Clustering

Here, we have a few approaches for clustering. Both which take into account both modalities of the data. First, we can use both connectivity graphs generated from each assay.

[27]:
sc.tl.leiden_multiplex(rna, ["rna_connectivities", "protein_connectivities"])  # Adds key "leiden_multiplex" by default

Alternativley, we can try and combine these representations into a joint graph.

[28]:
def join_graphs_max(g1: "sparse.spmatrix", g2: "sparse.spmatrix"):
    """Take the maximum edge value from each graph."""
    out = g1.copy()
    mask = g1 < g2
    out[mask] = g2[mask]

    return out
[29]:
rna.obsp["connectivities"] = join_graphs_max(rna.obsp["rna_connectivities"], rna.obsp["protein_connectivities"])
[30]:
sc.tl.leiden(rna, key_added="joint_leiden")
[31]:
sc.tl.umap(rna)

Now we’ve got a clustering which better represents both modalities in the data:

[32]:
sc.pl.umap(rna, color=["leiden_multiplex", "joint_leiden"], size=5)
_images/cite-seq_pbmc5k_53_0.png
[33]:
sns.heatmap(pd.crosstab(rna.obs["leiden_multiplex"], rna.obs["rna_leiden"], normalize="index"))
[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1364e29d0>
_images/cite-seq_pbmc5k_54_1.png
[34]:
sns.heatmap(pd.crosstab(rna.obs["leiden_multiplex"], rna.obs["protein_leiden"], normalize="index"))
[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x13714f890>
_images/cite-seq_pbmc5k_55_1.png

Gather data

[36]:
pbmc.X[:, (pbmc.var["feature_types"] == "Gene Expression").values] = rna.X
[37]:
pbmc.X[:, (pbmc.var["feature_types"] == "Antibody Capture").values] = protein.X
[38]:
pbmc.obsm.update(rna.obsm)
[39]:
pbmc.obs[rna.obs.columns] = rna.obs
[40]:
pbmc
[40]:
AnnData object with n_obs × n_vars = 5527 × 21453
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mito', 'log1p_total_counts_mito', 'pct_counts_mito', 'rna_leiden', 'protein_leiden', 'leiden_multiplex', 'joint_leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_counts'
    obsm: 'X_pca', 'X_umap', 'protein', 'protein_umap'
    layers: 'counts'

Labelling

[43]:
sc.pl.umap(pbmc, color="leiden_multiplex", legend_loc="on data")
_images/cite-seq_pbmc5k_63_0.png
[80]:
sc.tl.leiden(pbmc, resolution=0.3, key_added="highlevel")  # I had to patch scanpy for this, but sergei's PR should make this work.
[82]:
highlevel_labels = {
    "0": "T-Cells",
    "1": "Monocytes",
    "2": "NK Cells",
    "3": "B-Cells",
}
pbmc.obs["highlevel"] = pbmc.obs["highlevel"].map(highlevel_labels)
[83]:
sc.pl.umap(pbmc, color="highlevel")
_images/cite-seq_pbmc5k_66_0.png
[203]:
joint_mapping = {
    "1": "Classical monocyte",
    "6": "Dendritic cells",
    "7": "Non-classical monocyte",
    "4": "NK-cells",
    "5": "B-cells",
    "3": "CD8+ Effector T-cells", # "CD8+, CCR7-, "CCL5+"
    "0": "CD4+ Naive T-Cells", # CD4+, CCR7+, CCL5-,
    "2": "CD4+ Memory T-Cells", # CD4+, CCR7-, CCL5+
    # I think 8 may be doublets since it's all the monocyte markers and all the t cell markers
}

pbmc.obs["lowlevel"] = pbmc.obs["joint_leiden"].map(joint_mapping)

sc.pl.umap(pbmc, color="lowlevel")
_images/cite-seq_pbmc5k_67_0.png
[201]:
markers = {
    "CD4": ["CD4_TotalSeqB", "CD4"],
    "CD8": ["CD8a_TotalSeqB", "CD8A", "CD8B"],
    "B-cell": ["CD19_TotalSeqB"],
    "Monocytes": ["CD86_TotalSeqB", "CD11b_TotalSeqB", "CD14", "CD14_TotalSeqB", "FCGR3A", "CD14_TotalSeqB"],
    "NK Cells": ["FCGR3A", "NKG7"],
    "T-Cell Activation": ["CCL5", "CD45RO_TotalSeqB", "CCR7", "CD45RA_TotalSeqB"],
}
[202]:
sc.pl.dotplot(  # Should we use z-scores for the hue in these?
    pbmc,
    markers,
    groupby="lowlevel",
)
_images/cite-seq_pbmc5k_69_0.png
[202]:
GridSpec(2, 5, height_ratios=[0.5, 10], width_ratios=[6.3, 0, 0.2, 0.5, 0.25])
[ ]:

Trajectory inference for hematopoiesis in mouse

Reconstructing myeloid and erythroid differentiation for data of Paul et al. (2015).

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
[2]:
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
results_file = './write/paul15.h5ad'
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3))  # low dpi (dots per inch) yields small inline figures
scanpy==1.4.6.dev1+g1404638 anndata==0.7rc2.dev9+g5928e64 umap==0.3.8 numpy==1.16.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1
[3]:
adata = sc.datasets.paul15()
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
[4]:
adata
[4]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters'
    uns: 'iroot'

Let us work with a higher precision than the default ‘float32’ to ensure exactly the same results across different computational platforms.

[5]:
adata.X = adata.X.astype('float64')  # this is not required and results will be comparable without it

Preprocessing and Visualization

Apply a simple preprocessing recipe.

[6]:
sc.pp.recipe_zheng17(adata)
running recipe zheng17
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
    finished (0:00:00)
[7]:
sc.tl.pca(adata, svd_solver='arpack')
computing PCA with n_comps = 50
    finished (0:00:00)
[8]:
sc.pp.neighbors(adata, n_neighbors=4, n_pcs=20)
sc.tl.draw_graph(adata)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    'distances', distances for each pair of neighbors
    'connectivities', weighted adjacency matrix (0:00:03)
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:10)
[9]:
sc.pl.draw_graph(adata, color='paul15_clusters', legend_loc='on data')
_images/paga-paul15_13_0.png

This looks like a mess.

Optional: Denoising the graph

To denoise the graph, we represent it in diffusion map space (and not in PCA space). Computing distances within a few diffusion components amounts to denoising the graph - we just take a few of the first spectral components. It’s very similar to denoising a data matrix using PCA. The approach has been used in a couple of papers, see e.g. Schiebinger et al. (2017) or Tabaka et al. (2018). It’s also related to the principles behind MAGIC Dijk et al. (2018).

Note

This is not a necessary step, neither for PAGA, nor clustering, nor pseudotime estimation. You might just as well go ahead with a non-denoised graph. In many situations (also here), this will give you very decent results.

[10]:
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
    finished (0:00:00)
    eigenvalues of transition matrix
    [1.         1.         0.9989277  0.99671    0.99430376 0.98939794
     0.9883687  0.98731077 0.98398703 0.983007   0.9790806  0.9762548
     0.9744365  0.9729161  0.9652972 ]
    finished: added
    'X_diffmap', diffmap coordinates (adata.obsm)
    'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
computing neighbors
    finished: added to `.uns['neighbors']`
    'distances', distances for each pair of neighbors
    'connectivities', weighted adjacency matrix (0:00:00)
[11]:
sc.tl.draw_graph(adata)
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:10)
[12]:
sc.pl.draw_graph(adata, color='paul15_clusters', legend_loc='on data')
_images/paga-paul15_19_0.png

This still looks messy, but in a different way: a lot of the branches are overplotted.

Clustering and PAGA

Note that today, we’d use sc.tl.leiden - here, we use sc.tl.louvain the sake of reproducing the paper results.

[13]:
sc.tl.louvain(adata, resolution=1.0)
running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished: found 25 clusters and added
    'louvain', the cluster labels (adata.obs, categorical) (0:00:00)

Annotate the clusters using marker genes.

cell type marker
HSCs Procr
Erythroids Gata1, Klf1, Epor, Gypa, Hba-a2, Hba-a1, Spi1
Neutrophils Elane, Cebpe, Ctsg, Mpo, Gfi1
Monocytes Irf8, Csf1r, Ctsg, Mpo
Megakaryocytes Itga2b (encodes protein CD41), Pbx1, Sdpr, Vwf
Basophils Mcpt8, Prss34
B cells Cd19, Vpreb2, Cd79a
Mast cells Cma1, Gzmb, CD117/C-Kit
Mast cells & Basophils Ms4a2, Fcer1a, Cpa3, CD203c (human)

For simple, coarse-grained visualization, compute the PAGA graph, a coarse-grained and simplified (abstracted) graph. Non-significant edges in the coarse- grained graph are thresholded away.

[14]:
sc.tl.paga(adata, groups='louvain')
running PAGA
    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
[15]:
sc.pl.paga(adata, color=['louvain', 'Hba-a2', 'Elane', 'Irf8'])
--> added 'pos', the PAGA positions (adata.uns['paga'])
_images/paga-paul15_27_1.png
[16]:
sc.pl.paga(adata, color=['louvain', 'Itga2b', 'Prss34', 'Cma1'])
--> added 'pos', the PAGA positions (adata.uns['paga'])
_images/paga-paul15_28_1.png

Actually annotate the clusters — note that Cma1 is a Mast cell marker and only appears in a small fraction of the cells in the progenitor/stem cell cluster 8, see the single-cell resolved plot below.

[17]:
adata.obs['louvain'].cat.categories
[17]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24'],
      dtype='object')
[18]:
adata.obs['louvain_anno'] = adata.obs['louvain']
[19]:
adata.obs['louvain_anno'].cat.categories = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10/Ery', '11', '12',
       '13', '14', '15', '16/Stem', '17', '18', '19/Neu', '20/Mk', '21', '22/Baso', '23', '24/Mo']

Let’s use the annotated clusters for PAGA.

[20]:
sc.tl.paga(adata, groups='louvain_anno')
running PAGA
    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
[21]:
sc.pl.paga(adata, threshold=0.03, show=False)
--> added 'pos', the PAGA positions (adata.uns['paga'])
[21]:
<matplotlib.axes._axes.Axes at 0x13ea01ef0>
_images/paga-paul15_35_2.png

Recomputing the embedding using PAGA-initialization

The following is just as well possible for a UMAP.

[22]:
sc.tl.draw_graph(adata, init_pos='paga')
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:10)

Now we can see all marker genes also at single-cell resolution in a meaningful layout.

[23]:
sc.pl.draw_graph(adata, color=['louvain_anno', 'Itga2b', 'Prss34', 'Cma1'], legend_loc='on data')
_images/paga-paul15_40_0.png

Choose the colors of the clusters a bit more consistently.

[24]:
pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200)
pl.show()
_images/paga-paul15_42_0.png
[25]:
zeileis_colors = np.array(sc.pl.palettes.zeileis_28)
new_colors = np.array(adata.uns['louvain_anno_colors'])
[26]:
new_colors[[16]] = zeileis_colors[[12]]  # Stem colors / green
new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]]  # Ery colors / red
new_colors[[20, 8]] = zeileis_colors[[17, 16]]  # Mk early Ery colors / yellow
new_colors[[4, 0]] = zeileis_colors[[2, 8]]  # lymph progenitors / grey
new_colors[[22]] = zeileis_colors[[18]]  # Baso / turquoise
new_colors[[19, 14, 2]] = zeileis_colors[[6, 6, 6]]  # Neu / light blue
new_colors[[24, 9, 1, 11]] = zeileis_colors[[0, 0, 0, 0]]  # Mo / dark blue
new_colors[[21, 23]] = zeileis_colors[[25, 25]]  # outliers / grey
[27]:
adata.uns['louvain_anno_colors'] = new_colors

And add some white space to some cluster names. The layout shown here differs from the one in the paper, which can be found here. These differences, however, are only cosmetic. We had to change the layout as we moved from a randomized PCA and float32 to float64 precision.

[28]:
sc.pl.paga_compare(
    adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
    legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)
--> added 'pos', the PAGA positions (adata.uns['paga'])
WARNING: saving figure to file figures/paga_compare.pdf
_images/paga-paul15_47_1.png

Reconstructing gene changes along PAGA paths for a given set of genes

Choose a root cell for diffusion pseudotime.

[29]:
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno']  == '16/Stem')[0]
[30]:
sc.tl.dpt(adata)
computing Diffusion Pseudotime using n_dcs=10
    finished: added
    'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)

Select some of the marker gene names.

[31]:
gene_names = ['Gata2', 'Gata1', 'Klf1', 'Epor', 'Hba-a2',  # erythroid
              'Elane', 'Cebpe', 'Gfi1',                    # neutrophil
              'Irf8', 'Csf1r', 'Ctsg']                     # monocyte

Use the full raw data for visualization.

[32]:
adata_raw = sc.datasets.paul15()
sc.pp.log1p(adata_raw)
sc.pp.scale(adata_raw)
adata.raw = adata_raw
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
[33]:
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'], legend_loc='on data')
_images/paga-paul15_56_0.png
[34]:
paths = [('erythrocytes', [16, 12, 7, 13, 18, 6, 5, 10]),
         ('neutrophils', [16, 0, 4, 2, 14, 19]),
         ('monocytes', [16, 0, 4, 11, 1, 9, 24])]
[35]:
adata.obs['distance'] = adata.obs['dpt_pseudotime']
[36]:
adata.obs['clusters'] = adata.obs['louvain_anno']  # just a cosmetic change
[37]:
adata.uns['clusters_colors'] = adata.uns['louvain_anno_colors']
[38]:
!mkdir write
mkdir: write: File exists
[39]:
_, axs = pl.subplots(ncols=3, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
    _, data = sc.pl.paga_path(
        adata, path, gene_names,
        show_node_names=False,
        ax=axs[ipath],
        ytick_fontsize=12,
        left_margin=0.15,
        n_avg=50,
        annotations=['distance'],
        show_yticks=True if ipath==0 else False,
        show_colorbar=False,
        color_map='Greys',
        groups_key='clusters',
        color_maps_annotations={'distance': 'viridis'},
        title='{} path'.format(descr),
        return_data=True,
        show=False)
    data.to_csv('./write/paga_path_{}.csv'.format(descr))
pl.savefig('./figures/paga_path_paul15.pdf')
pl.show()
_images/paga-paul15_62_0.png

Visualizing marker genes

[1]:
import scanpy as sc
import pandas as pd
from matplotlib import rcParams
[2]:
sc.set_figure_params(dpi=80, color_map='viridis')
sc.settings.verbosity = 2
sc.logging.print_versions()
scanpy==1.4.3+21.g63fbe58 anndata==0.6.19 umap==0.3.8 numpy==1.15.4 scipy==1.2.1 pandas==0.24.2 scikit-learn==0.20.3 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

Data was obtained from 10x PBMC 68k dataset (https://support.10xgenomics.com/single-cell-gene-expression/datasets). The dataset was filtered and a sample of 700 cells and 765 highly variable genes was kept.

For this data, PCA and UMAP are already computed. Also, louvain clustering and cell cycle detection are present in pbmc.obs.

[3]:
pbmc = sc.datasets.pbmc68k_reduced()
[4]:
pbmc
[4]:
AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

To modify the default figure size, use rcParams.

[5]:
rcParams['figure.figsize'] = 4, 4
[6]:
sc.pl.umap(pbmc, color=['bulk_labels'], s=50)
_images/visualizing-marker-genes_8_0.png

Define list of marker genes from literature.

[7]:
marker_genes = ['CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ',  'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'FCER1A', 'CST3']

Stacked violins

Plot marker genes per cluster using stacked violin plots.

[8]:
ax = sc.pl.stacked_violin(pbmc, marker_genes, groupby='bulk_labels',
                         var_group_positions=[(7, 8)], var_group_labels=['NK'])
_images/visualizing-marker-genes_13_0.png

Same as before but swapping the axes and with dendrogram (notice that the categories are reordered).

[9]:
ax = sc.pl.stacked_violin(pbmc, marker_genes, groupby='bulk_labels', swap_axes=True,
                         var_group_positions=[(7, 8)], var_group_labels=['NK'], dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_bulk_labels). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: NK
_images/visualizing-marker-genes_15_1.png

Dot plots

The dotplot visualization provides a compact way of showing per group, the fraction of cells expressing a gene (dot size) and the mean expression of the gene in those cell (color scale).

The use of the dotplot is only meaningful when the counts matrix contains zeros representing no gene counts. dotplot visualization does not work for scaled or corrected matrices in which cero counts had been replaced by other values.

The marker genes list can be a list or a dictionary. If marker genes List is a dictionary, then plot shows the marker genes grouped and labelled

[10]:
marker_genes_dict = {'B-cell': ['CD79A', 'MS4A1'],
                     'T-cell': 'CD3D',
                     'T-cell CD8+': ['CD8A', 'CD8B'],
                     'NK': ['GNLY', 'NKG7'],
                     'Myeloid': ['CST3', 'LYZ'],
                     'Monocytes': ['FCGR3A'],
                     'Dendritic': ['FCER1A']}
[11]:
# use marker genes as dict to group them
ax = sc.pl.dotplot(pbmc, marker_genes_dict, groupby='bulk_labels')
_images/visualizing-marker-genes_19_0.png

To show some of the options of dot plot, here we add:

  • dendrogram=True show dendrogram and reorder group by categories based on dendrogram order
  • dot_max=0.5 plot largest dot as 50% or more cells expressing the gene
  • dot_min=0.3 plot smallest dot as 30% or less cells expressing the gene
  • standard_scale=’var’ normalize the mean gene expression values between 0 and 1
[12]:
ax = sc.pl.dotplot(pbmc, marker_genes, groupby='bulk_labels', dendrogram=True, dot_max=0.5, dot_min=0.3, standard_scale='var')
_images/visualizing-marker-genes_21_0.png

In the next plot we added:

  • smallest_dot=40 To increase the size of the smallest dot
  • color_map=’Blues’ To change the colormap palette
  • figsize=(8,5) To change the default figure size
[13]:
ax = sc.pl.dotplot(pbmc, marker_genes_dict, groupby='bulk_labels', dendrogram=True,
                   standard_scale='var', smallest_dot=40, color_map='Blues', figsize=(8,5))
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: B-cell, T-cell, T-cell CD8+, etc.
_images/visualizing-marker-genes_23_1.png

Here we add grouping labels for genes using var_group_positions and var_group_labels

[14]:
ax = sc.pl.dotplot(pbmc, marker_genes, groupby='louvain',
              var_group_positions=[(0,1), (11, 12)],
              var_group_labels=['B cells', 'dendritic'],
              figsize=(12,4), var_group_rotation=0, dendrogram='dendrogram_louvain')
WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_louvain']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cells, dendritic
_images/visualizing-marker-genes_25_1.png

Matrix plots

The matrixplot shows the mean expression of a gene in a group by category as a heatmap. In contrast to dotplot, the matrix plot can be used with corrected and/or scaled counts. By default raw counts are used.

[15]:
gs = sc.pl.matrixplot(pbmc, marker_genes_dict, groupby='bulk_labels')
_images/visualizing-marker-genes_28_0.png

Next we add a dendrogram and scale the gene expression values between 0 and 1 using standar_scale=var

[16]:
gs = sc.pl.matrixplot(pbmc, marker_genes_dict, groupby='bulk_labels', dendrogram=True, standard_scale='var')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: B-cell, T-cell, T-cell CD8+, etc.
_images/visualizing-marker-genes_30_1.png
[17]:
# to use the 'non-raw' data we select marker genes present in this data.
marker_genes_2 = [x for x in marker_genes if x in pbmc.var_names]

In the next figure we use: * use_raw=False to use the scaled values (after sc.pp.scale) stored in pbmc.X * vmin=-3, vmax=3, cmap=’bwr’ to select the max and min for the diverging color map ‘bwr’ * swap_axes=True to show genes in rows * figsize=(5,6) to modify the defult figure size

[18]:
gs = sc.pl.matrixplot(pbmc, marker_genes_2, groupby='bulk_labels', dendrogram=True,
                      use_raw=False, vmin=-3, vmax=3, cmap='bwr',  swap_axes=True, figsize=(5,6))
_images/visualizing-marker-genes_33_0.png

Heatmaps

Heatmaps do not collapse cells as in previous plots. Instead, each cells is shown in a row (or columm if swap_axes=True). The groupby information can be added and is shown using the same color code found for sc.pl.umap or any other scatter plot.

[19]:
ax = sc.pl.heatmap(pbmc,marker_genes_dict, groupby='louvain')
_images/visualizing-marker-genes_36_0.png

Same as before but using use_raw=False to use the scaled data stored in pbmc.X. Some genes are highlighted using var_groups_positions and var_group_labels and the figure size is adjusted.

[20]:
ax = sc.pl.heatmap(pbmc, marker_genes, groupby='louvain', figsize=(5, 8),
              var_group_positions=[(0,1), (11, 12)], use_raw=False, vmin=-3, vmax=3, cmap='bwr',
              var_group_labels=['B cells', 'dendritic'], var_group_rotation=0, dendrogram='dendrogram_louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cells, dendritic
_images/visualizing-marker-genes_38_1.png

Tracksplots

The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height.

[21]:
# Track plot data is better visualized using the non-log counts
import numpy as np
ad = pbmc.copy()
ad.raw.X.data = np.exp(ad.raw.X.data)
[22]:
ax = sc.pl.tracksplot(ad,marker_genes, groupby='louvain')
_images/visualizing-marker-genes_42_0.png

Visualization of marker genes

Gene markers are computed using the bulk_labels categories and the logreg method

[23]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sc.tl.rank_genes_groups(pbmc, groupby='bulk_labels', method='logreg')
ranking genes
    finished (0:00:00.21)

visualize marker genes using panels

[24]:
rcParams['figure.figsize'] = 4,4
rcParams['axes.grid'] = True
sc.pl.rank_genes_groups(pbmc)
_images/visualizing-marker-genes_47_0.png

Visualize marker genes using dotplot

[25]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=4)
_images/visualizing-marker-genes_49_0.png

Dotplot focusing only on two groups (the groups option is also available for violin, heatmap and matrix plots).

[26]:
axs = sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=15, groups=['Dendritic', 'CD19+ B'])
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: Dendritic, CD19+ B
_images/visualizing-marker-genes_51_1.png

Dotplot showing the marker genes but with respect to ‘louvain’ clusters (in contrast to using the bulk_labels categories as before).

[27]:
axs = sc.pl.rank_genes_groups_dotplot(pbmc, groupby='louvain', n_genes=4, dendrogram='dendrogram_louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
_images/visualizing-marker-genes_53_1.png

Visualize marker genes using matrixplot

For the following plot the raw gene expression is scaled and the color map is changed from the default to ‘Blues’

[28]:
axs = sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=3, standard_scale='var', cmap='Blues')
_images/visualizing-marker-genes_55_0.png

Same as before but using the scaled data and setting a divergent color map

[29]:
axs = sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=3, use_raw=False, vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_57_0.png

Visualize marker genes using stacked violing plots

[30]:
# instead of pbmc we use the 'ad' object (created earlier) in which the raw matrix is exp(pbmc.raw.matrix). This
# highlights better the differences between the markers.
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3)
_images/visualizing-marker-genes_59_0.png
[31]:
# setting row_palette='slateblue' makes all violin plots of the same color
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3, row_palette='slateblue')
_images/visualizing-marker-genes_60_0.png

Same as previous but with axes swapped.

[32]:
# width is used to set the violin plot width. Here, after setting figsize wider than default,
# the `width` arguments helps to keep the violin plots thin.
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3, swap_axes=True, figsize=(6, 10), width=0.4)
_images/visualizing-marker-genes_62_0.png

Visualize marker genes using heatmap

[33]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=3, standard_scale='var')
_images/visualizing-marker-genes_64_0.png
[34]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=3, use_raw=False, swap_axes=True, vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_65_0.png

Showing 10 genes per category, turning the gene labels off and swapping the axes. Notice that when the image is swaped, a color code for the categories appear instead of the ‘brackets’.

[35]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=10, use_raw=False, swap_axes=True, show_gene_labels=False,
                                vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_67_0.png

Visualize marker genes using tracksplot

[36]:
sc.pl.rank_genes_groups_tracksplot(ad, n_genes=3)
_images/visualizing-marker-genes_69_0.png
[37]:
## Filtering of marker genes

The tools sc.tl.filter_rank_genes_groups can be used to select markers that fulfill certain criteria, for example whose fold change is at least 2 with respect to other categories and that are expresssed on 50% of the category cells.

[38]:
# define a new category called 'broad_type' which collapses all T-cells into one group
t_cell = ['CD4+/CD25 T Reg', 'CD4+/CD45RA+/CD25- Naive T', 'CD4+/CD45RO+ Memory','CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic']
pbmc.obs['broad_type'] = pd.Categorical(pbmc.obs.bulk_labels.apply(lambda x: x if x not in t_cell  else 'T-cell'))
[39]:
# find gene markers in the 'broad_type' group.
sc.tl.rank_genes_groups(pbmc, 'broad_type', method='wilcoxon', n_genes=200)
ranking genes
    finished (0:00:00.06)
[40]:
sc.tl.filter_rank_genes_groups(pbmc)
Filtering genes using: min_in_group_fraction: 0.25 min_fold_change: 2, max_out_group_fraction: 0.5
[41]:
rcParams['figure.figsize'] = 4,4
rcParams['axes.grid'] = True
sc.pl.rank_genes_groups(pbmc, key='rank_genes_groups_filtered', ncols=3)
_images/visualizing-marker-genes_75_0.png

All filtered genes are set to nan.

[42]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=200, key='rank_genes_groups_filtered',
                                swap_axes=True, use_raw=False, vmax=3, vmin=-3, cmap='bwr', dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_broad_type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_broad_type']`
WARNING: Gene labels are not shown when more than 50 genes are visualized. To show gene labels set `show_gene_labels=True`
_images/visualizing-marker-genes_77_1.png

Notice that n_genes=200 is set, but only the maximum number of filtered genes found is actually plotted. For example, for CD34+ only few genes are shown.

Apply more stringent marker gene filters.

[43]:
sc.tl.filter_rank_genes_groups(pbmc,
                               min_in_group_fraction=0.5,
                               max_out_group_fraction=0.3,
                               min_fold_change=1.5)
Filtering genes using: min_in_group_fraction: 0.5 min_fold_change: 1.5, max_out_group_fraction: 0.3

Without filtering.

[44]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=6)
_images/visualizing-marker-genes_82_0.png

With filtering.

[45]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=6, key='rank_genes_groups_filtered')
_images/visualizing-marker-genes_84_0.png
[46]:
sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=6, key='rank_genes_groups_filtered', standard_scale='var', cmap='Blues')
_images/visualizing-marker-genes_85_0.png
[47]:
# Track plot data and violin is better visualized using the non-log counts
ad = pbmc.copy()
ad.raw.X.data = np.exp(ad.raw.X.data)
[48]:
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=6, key='rank_genes_groups_filtered', standard_scale='var')
_images/visualizing-marker-genes_87_0.png
[49]:
sc.pl.rank_genes_groups_tracksplot(ad, n_genes=6, key='rank_genes_groups_filtered')
_images/visualizing-marker-genes_88_0.png

Compare CD4+/CD25 T Reg markers vs. rest using violin

[50]:
pbmc.obs.bulk_labels.cat.categories[0]
[50]:
'CD4+/CD25 T Reg'
[51]:
sc.pl.rank_genes_groups_violin(pbmc,  n_genes=5, jitter=False)
_images/visualizing-marker-genes_91_0.png
_images/visualizing-marker-genes_91_1.png
_images/visualizing-marker-genes_91_2.png
_images/visualizing-marker-genes_91_3.png
_images/visualizing-marker-genes_91_4.png
_images/visualizing-marker-genes_91_5.png

Dendrogram options

Hierarchical clusterings for categorical observations can also be visualized independently using sc.pl.dendrogram

[52]:
ax = sc.pl.dendrogram(pbmc, 'bulk_labels')
_images/visualizing-marker-genes_93_0.png

The previous image uses a dendrogram that was already computed.

We can re-run the dendrogram function to use other parameters

[53]:
# compute hiearchical clustering based on the
# given `var_names` from the raw matrix
sc.tl.dendrogram(pbmc, 'bulk_labels', var_names=marker_genes, use_raw=True)
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
[54]:
rcParams['figure.figsize'] = 4,8
ax = sc.pl.dendrogram(pbmc, 'bulk_labels', orientation='left')
_images/visualizing-marker-genes_96_0.png

Plot correlation

Together with the dendrogram it is possible to plot the correlation (by default ‘pearson’) of the categories.

[55]:
# compute hiearchical clustering based on the
# given `var_names` from the raw matrix
sc.tl.dendrogram(pbmc, 'bulk_labels', n_pcs=30)
    using 'X_pca' with n_pcs = 30
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
[56]:
ax = sc.pl.correlation_matrix(pbmc, 'bulk_labels')
(0.0, 10.0)
_images/visualizing-marker-genes_100_1.png

Integrating data using ingest and BBKNN

The following tutorial describes a simple PCA-based method for integrating data we call ingest and compares it with BBKNN [Polanski19]. BBKNN integrates well with the Scanpy workflow and is accessible through the bbknn function.

The ingest function assumes an annotated reference dataset that captures the biological variability of interest. The rational is to fit a model on the reference data and use it to project new data. For the time being, this model is a PCA combined with a neighbor lookup search tree, for which we use UMAP’s implementation [McInnes18]. Similar PCA-based integrations have been used before, for instance, in [Weinreb18].

  • As ingest is simple and the procedure clear, the workflow is transparent and fast.
  • Like BBKNN, ingest leaves the data matrix itself invariant.
  • Unlike BBKNN, ingest solves the label mapping problem (like scmap) and maintains an embedding that might have desired properties like specific clusters or trajectories.

We refer to this asymmetric dataset integration as ingesting annotations from an annotated reference adata_ref into an adata that still lacks this annotation. It is different from learning a joint representation that integrates datasets in a symmetric way as BBKNN, Scanorma, Conos, CCA (e.g. in Seurat) or a conditional VAE (e.g. in scVI, trVAE) would do, but comparable to the initiall MNN implementation in scran. Take a look at tools in the external API or at the ecoystem page to get a start with other tools.

[1]:
import scanpy as sc
import pandas as pd
import seaborn as sns
[2]:
sc.settings.verbosity = 1             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3))
scanpy==1.4.5.dev225+gcf4cdab anndata==0.7rc2.dev9+g5928e64 umap==0.3.8 numpy==1.16.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1

PBMCs

We consider an annotated reference dataset adata_ref and a dataset for which you want to query labels and embeddings adata.

[3]:
adata_ref = sc.datasets.pbmc3k_processed()  # this is an earlier version of the dataset from the pbmc3k tutorial
adata = sc.datasets.pbmc68k_reduced()

To use sc.tl.ingest, the datasets need to be defined on the same variables.

[4]:
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]

The model and graph (here PCA, neighbors, UMAP) trained on the reference data will explain the biological variation observed within it.

[5]:
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)

The manifold still looks essentially the same as in the clustering tutorial.

[6]:
sc.pl.umap(adata_ref, color='louvain')
_images/integrating-data-using-ingest_12_0.png

Mapping PBMCs using ingest

Let’s map labels and embeddings from adata_ref to adata based on a chosen representation. Here, we use adata_ref.obsm['X_pca'] to map cluster labels and the UMAP coordinates.

[7]:
sc.tl.ingest(adata, adata_ref, obs='louvain')
[8]:
adata.uns['louvain_colors'] = adata_ref.uns['louvain_colors']  # fix colors
[9]:
sc.pl.umap(adata, color=['louvain', 'bulk_labels'], wspace=0.5)
_images/integrating-data-using-ingest_17_0.png

By comparing the ‘bulk_labels’ annotation with ‘louvain’, we see that the data has been reasonably mapped, only the annotation of dendritic cells seems ambiguous and might have been ambiiguous in adata already.

[10]:
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
[11]:
adata_concat.obs.louvain = adata_concat.obs.louvain.astype('category')
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True)  # fix category ordering
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors']  # fix category colors
[12]:
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
_images/integrating-data-using-ingest_21_0.png

While there seems to be some batch-effect in the monocytes and dendritic cell clusters, the new data is otherwise mapped relatively homogeneously.

The megakaryoctes are only present in adata_ref and no cells from adata map onto them. If interchanging reference data and query data, Megakaryocytes do not appear as a separate cluster anymore. This is an extreme case as the reference data is very small; but one should always question if the reference data contain enough biological variation to meaningfully accomodate query data.

Using BBKNN

[13]:
sc.tl.pca(adata_concat)
[14]:
%%time
sc.external.pp.bbknn(adata_concat, batch_key='batch')  # running bbknn 1.3.6
CPU times: user 3.96 s, sys: 520 ms, total: 4.48 s
Wall time: 3.67 s
[15]:
sc.tl.umap(adata_concat)
[16]:
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
_images/integrating-data-using-ingest_27_0.png

Also BBKNN doesn’t maintain the Megakaryocytes cluster. However, it seems to mix cells more homogeneously.

Pancreas

The following data has been used in the scGen paper [Lotfollahi19], has been used here, was curated here and can be downloaded from here (the BBKNN paper).

It contains data for human pancreas from 4 different studies (Segerstolpe16, Baron16, Wang16, Muraro16), which have been used in the seminal papers on single-cell dataset integration (Butler18, Haghverdi18) and many times ever since.

[17]:
# note that this collection of batches is already intersected on the genes
adata_all = sc.read('data/pancreas.h5ad', backup_url='https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1')
[18]:
adata_all.shape
[18]:
(14693, 2448)

Inspect the cell types observed in these studies.

[19]:
counts = adata_all.obs.celltype.value_counts()
counts
[19]:
alpha                     4214
beta                      3354
ductal                    1804
acinar                    1368
not applicable            1154
delta                      917
gamma                      571
endothelial                289
activated_stellate         284
dropped                    178
quiescent_stellate         173
mesenchymal                 80
macrophage                  55
PSC                         54
unclassified endocrine      41
co-expression               39
mast                        32
epsilon                     28
mesenchyme                  27
schwann                     13
t_cell                       7
MHC class II                 5
unclear                      4
unclassified                 2
Name: celltype, dtype: int64

To simplify visualization, let’s remove the 5 minority classes.

[20]:
minority_classes = counts.index[-5:].tolist()        # get the minority classes
adata_all = adata_all[                               # actually subset
    ~adata_all.obs.celltype.isin(minority_classes)]
adata_all.obs.celltype.cat.reorder_categories(       # reorder according to abundance
    counts.index[:-5].tolist(), inplace=True)

Seeing the batch effect

[21]:
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)

We observe a batch effect.

[22]:
sc.pl.umap(adata_all, color=['batch', 'celltype'], palette=sc.pl.palettes.vega_20_scanpy)
_images/integrating-data-using-ingest_40_0.png

BBKNN

It can be well-resolved using BBKNN [Polanski19].

[23]:
%%time
sc.external.pp.bbknn(adata_all, batch_key='batch')
CPU times: user 2.11 s, sys: 1.53 s, total: 3.63 s
Wall time: 2.28 s
[24]:
sc.tl.umap(adata_all)
[25]:
sc.pl.umap(adata_all, color=['batch', 'celltype'])
_images/integrating-data-using-ingest_45_0.png

If one prefers to work more iteratively starting from one reference dataset, one can use ingest.

Mapping onto a reference batch using ingest

Choose one reference batch for training the model and setting up the neighborhood graph (here, a PCA) and separate out all other batches.

As before, the model trained on the reference batch will explain the biological variation observed within it.

[26]:
adata_ref = adata_all[adata_all.obs.batch == '0']

Compute the PCA, neighbors and UMAP on the reference data.

[27]:
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)

The reference batch contains 12 of the 19 cell types across all batches.

[28]:
sc.pl.umap(adata_ref, color='celltype')
_images/integrating-data-using-ingest_53_0.png

Iteratively map labels (such as ‘celltype’) and embeddings (such as ‘X_pca’ and ‘X_umap’) from the reference data onto the query batches.

[29]:
adatas = [adata_all[adata_all.obs.batch == i].copy() for i in ['1', '2', '3']]
[30]:
sc.settings.verbosity = 2  # a bit more logging
for iadata, adata in enumerate(adatas):
    print(f'... integrating batch {iadata+1}')
    adata.obs['celltype_orig'] = adata.obs.celltype  # save the original cell type
    sc.tl.ingest(adata, adata_ref, obs='celltype')
... integrating batch 1
running ingest
    finished (0:00:06)
... integrating batch 2
running ingest
    finished (0:00:04)
... integrating batch 3
running ingest
    finished (0:00:01)

Each of the query batches now carries annotation that has been contextualized with adata_ref. By concatenating, we can view it together.

[31]:
adata_concat = adata_ref.concatenate(adatas)
[32]:
adata_concat.obs.celltype = adata_concat.obs.celltype.astype('category')
adata_concat.obs.celltype.cat.reorder_categories(adata_ref.obs.celltype.cat.categories, inplace=True)  # fix category ordering
adata_concat.uns['celltype_colors'] = adata_ref.uns['celltype_colors']  # fix category coloring
[33]:
sc.pl.umap(adata_concat, color=['batch', 'celltype'])
_images/integrating-data-using-ingest_60_0.png

Compared to the BBKNN result, this is maintained clusters in a much more pronounced fashion. If one already observed a desired continuous structure (as in the hematopoietic datasets, for instance), ingest allows to easily maintain this structure.

Evaluating consistency

Let us subset the data to the query batches.

[34]:
adata_query = adata_concat[adata_concat.obs.batch.isin(['1', '2', '3'])]

The following plot is a bit hard to read, hence, move on to confusion matrices below.

[35]:
sc.pl.umap(
    adata_query, color=['batch', 'celltype', 'celltype_orig'], wspace=0.4)
_images/integrating-data-using-ingest_66_0.png
Cell types conserved across batches

Let us first focus on cell types that are conserved with the reference, to simplify reading of the confusion matrix.

[36]:
obs_query = adata_query.obs
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_orig.cat.categories)  # intersected categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_orig.isin(conserved_categories)]  # intersect categories
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True)  # fix category ordering
[37]:
pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_orig)
[37]:
celltype_orig alpha beta ductal acinar delta gamma endothelial mast
celltype
alpha 1819 3 7 0 1 20 0 5
beta 49 803 3 1 10 26 0 0
ductal 8 5 693 263 0 0 0 0
acinar 1 3 2 145 0 3 0 0
delta 5 4 0 0 305 73 0 0
gamma 1 5 0 0 0 194 0 0
endothelial 2 0 0 0 0 0 36 0
mast 0 0 1 0 0 0 0 2

Overall, the conserved cell types are also mapped as expected. The main exception are some acinar cells in the original annotation that appear as acinar cells. However, already the reference data is observed to feature a cluster of both acinar and ductal cells, which explains the discrepancy, and indicates a potential inconsistency in the initial annotation.

All cell types

Let us now move on to look at all cell types.

[38]:
pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_orig)
[38]:
celltype_orig PSC acinar alpha beta co-expression delta dropped ductal endothelial epsilon gamma mast mesenchymal mesenchyme not applicable unclassified endocrine
celltype
alpha 0 0 1819 3 2 1 33 7 0 3 20 5 0 0 305 11
beta 0 1 49 803 36 10 42 3 0 1 26 0 0 1 522 24
ductal 0 263 8 5 0 0 41 693 0 0 0 0 1 0 106 1
acinar 0 145 1 3 0 0 25 2 0 0 3 0 0 0 86 0
delta 0 0 5 4 1 305 12 0 0 5 73 0 0 0 94 5
gamma 0 0 1 5 0 0 2 0 0 1 194 0 0 0 14 0
endothelial 1 0 2 0 0 0 7 0 36 0 0 0 0 6 7 0
activated_stellate 49 1 1 4 0 0 11 8 0 0 0 0 79 20 17 0
quiescent_stellate 4 0 1 1 0 0 5 1 1 0 0 0 0 0 1 0
macrophage 0 0 1 1 0 0 0 12 0 0 0 0 0 0 1 0
mast 0 0 0 0 0 0 0 1 0 0 0 2 0 0 1 0

We observe that PSC (pancreatic stellate cells) cells are in fact just inconsistently annotated and correctly mapped on ‘activated_stellate’ cells.

Also, it’s nice to see that ‘mesenchyme’ and ‘mesenchymal’ cells both map onto the same category. However, that category is again ‘activated_stellate’ and likely incorrect.

Visualizing distributions across batches

Often, batches correspond to experiments that one wants to compare. Scanpy offers to convenient visualization possibilities for this.

  1. a density plot
  2. a partial visualization of a subset of categories/groups in an emnbedding
Density plot
[39]:
sc.tl.embedding_density(adata_concat, groupby='batch')
computing density on 'umap'
[40]:
sc.pl.embedding_density(adata_concat, groupby='batch')
_images/integrating-data-using-ingest_80_0.png
Partial visualizaton of a subset of groups in embedding
[41]:
for batch in ['1', '2', '3']:
    sc.pl.umap(adata_concat, color='batch', groups=[batch])
_images/integrating-data-using-ingest_82_0.png
_images/integrating-data-using-ingest_82_1.png
_images/integrating-data-using-ingest_82_2.png

Analysis and visualization of spatial transcriptomics data

This tutorial demonstrates how to work with spatial transcriptomics data with Scanpy.

The analysis pipeline largely recapitulates the standard Scanpy workflow and builds upon our best practices tutorial.
Here, we show how to use Scanpy to analyse spatial data using our custom spatial visualization function and an external tool. In this tutorial we focus on 10x genomics Visium spatial transcriptomics data.

This tutorial was generated using the spatial branch of scanpy using the spatialDE package. To run the tutorial, please run the following commands to use the correct branch and install the dependency

pip install git+https://github.com/theislab/scanpy.git@spatial
pip install spatialde

Loading libraries

[1]:
import scanpy as sc
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
import seaborn as sb

import SpatialDE

plt.rcParams['figure.figsize']=(8,8)

%load_ext autoreload
%autoreload 2

Reading the data

We will use a Visium spatial transcriptomics dataset of the human lymphnode, which is publicly available from the 10x genomics website: link

The function datasets.visium_sge() downloads the dataset from 10x genomics and returns an AnnData object that contains counts, images and spatial coordinates. We will calculate standards QC metrics with pp.calculate_qc_metrics and percentage of mitochondrial read counts per sample.

When using your own Visium data, use Scanpy’s `read_visium() <https://icb-scanpy.readthedocs-hosted.com/en/latest/api/scanpy.read_visium.html>`__ function to import it.

[2]:
adata = sc.datasets.visium_sge('V1_Human_Lymph_Node')
adata.var_names_make_unique()

sc.pp.calculate_qc_metrics(adata, inplace=True)
adata.var['mt'] = [gene.startswith('MT-') for gene in adata.var_names]
adata.obs['mt_frac'] = adata[:, adata.var['mt']].X.sum(1).A.squeeze()/adata.obs['total_counts']

QC and preprocessing

Furthermore, we will perform some basic filtering of spots based on total counts and expressed genes

[3]:
fig, axs = plt.subplots(1,4, figsize=(15,4))
fig.suptitle('Covariates for filtering')
sb.distplot(adata.obs['total_counts'], kde=False, ax = axs[0])
sb.distplot(adata.obs['total_counts'][adata.obs['total_counts']<10000], kde=False, bins=40, ax = axs[1])
sb.distplot(adata.obs['n_genes_by_counts'], kde=False, bins=60, ax = axs[2])
sb.distplot(adata.obs['n_genes_by_counts'][adata.obs['n_genes_by_counts']<4000], kde=False, bins=60, ax = axs[3])
[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2c0c9150>
_images/analysis-visualization-spatial_7_1.png
[4]:
sc.pp.filter_cells(adata, min_counts = 5000)
print(f'Number of cells after min count filter: {adata.n_obs}')
sc.pp.filter_cells(adata, max_counts = 35000)
print(f'Number of cells after max count filter: {adata.n_obs}')
adata = adata[adata.obs['mt_frac'] < 0.2]
print(f'Number of cells after MT filter: {adata.n_obs}')
sc.pp.filter_cells(adata, min_genes = 3000)
print(f'Number of cells after gene filter: {adata.n_obs}')
sc.pp.filter_genes(adata, min_cells=10)
print(f'Number of genes after cell filter: {adata.n_vars}')
Number of cells after min count filter: 3988
Number of cells after max count filter: 3962
Number of cells after MT filter: 3962
Number of cells after gene filter: 3895
Number of genes after cell filter: 18457

We proceed to normalize Visium counts data with the built-in normalize_total method from Scanpy, and detect highly-variable genes (for later). Note that there are more sensible alternatives for normalization (see discussion in sc-tutorial paper and more recent alternatives such as SCTransform or GLM-PCA).

[5]:
sc.pp.normalize_total(adata, inplace = True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=2000, inplace=True)

Dimensionality reduction and clustering

We can then proceed to run dimensionality reduction and clustering. We will plot some covariates to check if there is any particular structure in UMAP space associated with total counts and detected genes.

[6]:
sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')
sc.pp.neighbors(adata)

sc.tl.umap(adata)
sc.tl.louvain(adata, key_added='clusters')
[7]:
sc.pl.umap(adata, color=['total_counts', 'n_genes_by_counts'])
sc.pl.umap(adata, color='clusters', palette=sc.pl.palettes.default_20)
_images/analysis-visualization-spatial_13_0.png
_images/analysis-visualization-spatial_13_1.png

Visualization in spatial dimensions

We visualized two covariates (total counts per spot and number of genes by counts) in UMAP space. We can have a look at how the covariates are projected onto the spatial coordinates. We will overlay the circular spots on top of the Hematoxylin and eosin stain (H&E) image provided, using the function pl.spatial.

[8]:
sc.pl.spatial(adata, img_key = "hires",color=['total_counts', 'n_genes_by_counts'])
_images/analysis-visualization-spatial_15_0.png

the function pl.spatial accepts 4 additional parameters: * img_key str: key where the img is stored in the adata.uns element * crop_coord tuple: coordinates to use for cropping (left, right, top, bottom) * alpha_img float: alpha value for the transcparency of the image * bw bool: flag to convert the image into gray scale

furthermore, in pl.spatial the size parameter changes its behaviour: it becomes a scaling factor for the spot sizes.

Before, we performed clustering in gene expression space, and visualized the results with UMAP. By visualizing clustered samples in spatial dimensions, we can gain insights into tissue organization and inter-cellular communication.

[9]:
sc.pl.spatial(adata, img_key = "hires", color="clusters", size=1.5)
_images/analysis-visualization-spatial_17_0.png

From a visual perspective, spots belonging to the same cluster in gene expression space seems to co-occur in spatial dimensions. For instance, spots belonging to cluster 4 seem to always be surrounded by spots belonging to cluster 5.

We can zoom in specific regions of interests to gain qualitative insights. Furthermore, by changing the alpha values of the spots, we can visualize better the underlying tissue morphology from the H&E image.

[10]:
sc.pl.spatial(adata, img_key = "hires", color="clusters", groups = ["4","5"], crop_coord = [1200,1700,1900,1000], alpha = .5, size = 1.3)
_images/analysis-visualization-spatial_19_0.png

Identification of marker genes and spatially variable genes

After clustering, we can proceed to identify cluster marker genes.

[11]:
sc.tl.rank_genes_groups(adata, "clusters", inplace = True)
sc.pl.rank_genes_groups_heatmap(adata, groups = "4", groupby = "clusters", show = False)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 4
_images/analysis-visualization-spatial_21_1.png

Given the spatial organization of cluster 4, which seemed to occur in small groups of spots across the image, we have plotted a heatmap with expression levels of its top 10 marker genes across clusters.

We can have a look at one specific gene and notice that it already show some interesting structural expression patterns in the image.

[12]:
sc.pl.spatial(adata, img_key = "hires", color="SERPINE2")
_images/analysis-visualization-spatial_23_0.png

Spatial transcriptomics allows researchers to investigate how gene expression trends varies in space, thus identifying spatial patterns of gene expression. For this purpose, we use SpatialDE (paper - code), a Gaussian process-based statistical framework that aims to identify spatially variable genes.

First, we convert normalized counts and coordinates to pandas dataframe, needed for inputs to spatialDE. Then we can run the method.

[13]:
counts = pd.DataFrame(adata.X.todense(), columns=adata.var_names, index=adata.obs_names)
coord = pd.DataFrame(adata.obsm["X_spatial"], columns=["x_coord", "y_coord"], index = adata.obs_names)

results = SpatialDE.run(coord, counts)

We concatenate the results with the variable table adata.var

[14]:
results.index = results["g"]
adata.var = pd.concat([adata.var, results.loc[adata.var.index.values,:]], axis = 1)

Then we can inspect significant genes that varies in space and visualize them with sc.pl.spatial function.

[15]:
results.sort_values("qval").head(10)
[15]:
FSV M g l max_delta max_ll max_mu_hat max_s2_t_hat model n s2_FSV s2_logdelta time BIC max_ll_null LLR pval qval
g
ITGAX 0.170702 4 ITGAX 248.801360 4.832369 -3302.354218 0.978878 0.095954 SE 3895 0.000005 0.000257 0.002626 6637.778231 -3430.115338 127.761120 0.0 0.0
CTGF 0.208433 4 CTGF 248.801360 3.777540 -3601.962867 1.059363 0.131464 SE 3895 0.000004 0.000154 0.002593 7236.995531 -3820.898121 218.935254 0.0 0.0
RPS12 0.578740 4 RPS12 248.801360 0.724028 1282.631794 4.262369 0.896279 SE 3895 0.000002 0.000044 0.002194 -2532.193792 132.889696 1149.742098 0.0 0.0
PIK3IP1 0.165093 4 PIK3IP1 474.169746 4.960988 -3121.508186 1.247603 0.075819 SE 3895 0.000013 0.000647 0.002727 6276.086168 -3327.150590 205.642404 0.0 0.0
CSF2RB 0.118427 4 CSF2RB 474.169746 7.302389 -3455.655137 1.032037 0.058573 SE 3895 0.000017 0.001423 0.002386 6944.380070 -3549.412548 93.757410 0.0 0.0
RAC2 0.188012 4 RAC2 474.169746 4.236631 -1862.354983 2.130281 0.095923 SE 3895 0.000017 0.000701 0.002674 3757.779762 -2074.893341 212.538358 0.0 0.0
DDX17 0.080435 4 DDX17 474.169746 11.214828 -1099.271702 2.480615 0.084385 SE 3895 0.000009 0.001430 0.007070 2231.613199 -1177.162770 77.891068 0.0 0.0
EZR 0.237460 4 EZR 248.801360 3.194199 -2561.906858 1.833875 0.199101 SE 3895 0.000004 0.000124 0.002783 5156.883512 -2825.270277 263.363418 0.0 0.0
CTLA4 0.114898 4 CTLA4 474.169746 7.556782 -2927.151343 0.548249 0.036851 SE 3895 0.000013 0.001080 0.002408 5887.372483 -3057.224784 130.073441 0.0 0.0
ACTB 0.459669 4 ACTB 248.801360 1.169239 1190.410976 4.432608 0.932020 SE 3895 0.000002 0.000047 0.002367 -2347.752157 332.192634 858.218343 0.0 0.0
[16]:
sc.pl.spatial(adata,img_key = "hires",color = ["CTLA4", "EZR"], alpha = 0.7)
_images/analysis-visualization-spatial_30_0.png

Recently, other tools have been proposed for the identification of spatially variable genes, such as: * SPARK paper - code * trendsceek paper - code * HMRF paper - code

MERFISH example

In case you have spatial data generated with FISH-based techniques, just read the cordinate table and assign it to the adata.obsm element.

Let’s see an example from Xia et al. 2019.

First, we need to download the coordinate and counts data from the original publication. We will download the files in the current directory…

[18]:
import urllib.request
url_coord = 'https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/15/pnas.1912459116.sd15.xlsx'
filename_coord = 'pnas.1912459116.sd15.xlsx'
urllib.request.urlretrieve(url_coord, filename_coord)

url_counts = 'https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/12/pnas.1912459116.sd12.csv'
filename_counts = 'pnas.1912459116.sd12.csv'
urllib.request.urlretrieve(url_counts, filename_counts)
[18]:
('pnas.1912459116.sd12.csv', <http.client.HTTPMessage at 0x1a3a7f8850>)

..and read the data in a AnnData object.

[19]:
coordinates = pd.read_excel("./pnas.1912459116.sd15.xlsx", index_col = 0)
counts = sc.read_csv("./pnas.1912459116.sd12.csv").transpose()

adata_merfish = counts[coordinates.index,:]
adata_merfish.obsm["X_spatial"] = coordinates.to_numpy()

We will perform standard preprocessing and dimensionality reduction.

[20]:
sc.pp.normalize_per_cell(adata_merfish, counts_per_cell_after=1e6)
sc.pp.log1p(adata_merfish)

sc.pp.pca(adata_merfish, n_comps=15)
sc.pp.neighbors(adata_merfish)
sc.tl.louvain(adata_merfish, key_added='groups', resolution=0.5)

The experiment consisted in measuring gene expression counts from a single cell type (cultured U2-OS cells). Clusters consist of cell states at different stages of the cell cycle. We don’t expect to see specific structure in spatial dimensions given the experimental setup.

We can visualize the clusters obtained from running the Louvain algorithm in tSNE space…

[21]:
sc.tl.tsne(adata_merfish, perplexity = 100,  n_jobs=12)
sc.pl.tsne(adata_merfish, color = "groups")
_images/analysis-visualization-spatial_39_0.png

… and spatial coordinates.

[22]:
sc.pl.spatial(adata_merfish, color = "groups")
_images/analysis-visualization-spatial_41_0.png

Conclusion and future plans

This is the end of the tutorial, we hope you’ll find it useful and report back to us which features/external tools you would like to see in Scanpy. We are extending Scanpy and AnnData to support other spatial data types, such as Imaging Mass Cytometry and extend data structure to support spatial graphs and additional features. Stay tuned!