Core plotting functions

Author: Fidel Ramírez

This tutorial explores the visualization possibilities of scanpy and is divided into three sections:

  • Scatter plots for embeddings (eg. UMAP, t-SNE)
  • Identification of clusters using known marker genes
  • Visualization of differentially expressed genes

In this tutorial, we will use a dataset from 10x containing 68k cells from PBMC. Scanpy, includes in its distribution a reduced sample of this dataset consisting of only 700 cells and 765 highly variable genes. This dataset has been already preprocessed and UMAP computed.

In this tutorial, we will also use the following literature markers:

  • B-cell: CD79A, MS4A1
  • T-cell: CD3D
  • NK: GNLY, NKG7
  • Myeloid: CST3, LYZ
  • Monocytes: FCGR3A
  • Dendritic: FCER1A

Scatter plots for embeddings

With scanpy, scatter plots for tSNE, UMAP and several other embeddings are readily available using the sc.pl.tsne, sc.pl.umap etc. functions. See here the list of options.

Those functions access the data stored in adata.obsm. For example sc.pl.umap uses the information stored in adata.obsm['X_umap']. For more flexibility, any key stored in adata.obsm can be used with the generic function sc.pl.embedding.

[2]:
import scanpy as sc
import pandas as pd
from matplotlib import rcParams
[4]:
sc.set_figure_params(dpi=100, color_map = 'viridis_r')
sc.settings.verbosity = 1
sc.logging.print_versions()
scanpy==1.4.7.dev137+gb910afc.d20200709 anndata==0.7.1 umap==0.4.2 numpy==1.18.4 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.21.2 statsmodels==0.11.1 python-igraph==0.8.2 louvain==0.7.0 leidenalg==0.7.0

Load pbmc dataset

[6]:
pbmc = sc.datasets.pbmc68k_reduced()
[7]:
# inspect pbmc contents
pbmc
[7]:
AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

Visualization of gene expression and other variables

For the scatter plots, the value to plot is given as the color argument. This can be any gene or any column in .obs, where .obs is a DataFrame containing the annotations per observation/cell, see AnnData for more information.

[8]:
# rcParams is used for the figure size, in this case 4x4
rcParams['figure.figsize'] = 4, 4
sc.pl.umap(pbmc, color='CD79A')
../_images/plotting_core_12_0.png

Multiple values can be given to color. In the following example we will plot 4 genes: ‘CD79A’, ‘MS4A1’, ‘CD3D’, and ‘FCER1A’ to get an idea on where those marker genes are being expressed.

Also, we will plot two other values: n_counts which is the number of UMI counts per cell (stored in .obs), and bulk_labels which is a categorical value containing the original labelling of the cells from 10X.

The number of plots per row is controlled using the ncols parameter. The maximum value plotted can be adjusted using vmax (similarly vmin can be used for the minimum value). In this case we use p99, which means to use as max value the 99 percentile. The max value can be a number or a list of numbers if the vmax wants to be set for multiple plots individually.

Also, we are using frameon=False to remove the boxes around the plots and s=50 to set the dot size.

[9]:
rcParams['figure.figsize'] = 3,3
sc.pl.umap(pbmc, color=['CD79A', 'MS4A1', 'CD3D', 'FCER1A', 'n_counts', 'bulk_labels'], s=50, frameon=False, ncols=3, vmax='p99')
../_images/plotting_core_14_0.png

In this plot we can see the groups of cells that express the marker genes and the agreement with the original cell labels.

The functions for scatterplots have many options that allow fine tuning of the images. For example, we can look at the clustering as follows:

[8]:
# compute clusters using the leiden method and store the results with the name `clusters`
sc.tl.leiden(pbmc, key_added='clusters', resolution=0.5)
[9]:
rcParams['figure.figsize'] = 5, 5
sc.pl.umap(pbmc, color='clusters', add_outline=True, legend_loc='on data',
           legend_fontsize=12, legend_fontoutline=2,frameon=False,
           title='clustering of cells', palette='Set1')
../_images/plotting_core_17_0.png

Identification of clusters based on known marker genes

Frequently, clusters need to be labelled using well known marker genes. Using scatter plots we can see the expression of a gene and perhaps associate it with a cluster. Here, we will show other visual ways to associate marker genes to clusters using dotplots, violin plots, heatmaps and something that we call ‘tracksplot’. All of these visualizations summarize the same information, expression split by cluster, and the selection of the best results is left to the investigator do decide.

First, we set up a dictionary with the marker genes, as this will allow scanpy to automatically label the groups of genes:

[10]:
marker_genes_dict = {'NK': ['GNLY', 'NKG7'],
                     'T-cell': ['CD3D'],
                     'B-cell': ['CD79A', 'MS4A1'],
                     'Monocytes': ['FCGR3A'],
                     'Dendritic': ['FCER1A']}

dotplot

A quick way to check the expression of these genes per cluster is to using a dotplot. This type of plot summarizes two types of information: the color represents the mean expression within each of the categories (in this case in each cluster) and the dot size indicates the fraction of cells in the categories expressing a gene.

Also, it is also useful to add a dendrogram to the graph to bring together similar clusters. The hierarchical clustering is computed automatically using the correlation of the PCA components between the clusters.

[11]:
sc.pl.dotplot(pbmc, marker_genes_dict, 'clusters', dendrogram=True)
../_images/plotting_core_22_0.png

Using this plot, we can see that clusters 5 and 9 correspond to B-cells, clusters 2 and 4 are T-cells etc. This information can be used to manually annotate the cells as follows:

[12]:
# create a dictionary to map cluster to annotation label
cluster2annotation = {
     '0': 'Monocyte',
     '1': 'Dendritic',
     '2': 'T-cell',
     '3': 'NK',
     '4': 'T-cell',
     '5': 'B-cell',
     '6': 'Monocytes',
     '7': 'Dendritic',
     '8': 'other',
     '9': 'B-cell',
     '10': 'other',
     '11': 'Dendritic'
}

# add a new `.obs` column called `cell type` by mapping clusters to annotation using pandas `map` function
pbmc.obs['cell type'] = pbmc.obs['clusters'].map(cluster2annotation).astype('category')
[13]:
sc.pl.dotplot(pbmc, marker_genes_dict, 'cell type', dendrogram=True)
../_images/plotting_core_25_0.png
[14]:
sc.pl.umap(pbmc, color='cell type', legend_loc='on data',
           frameon=False, legend_fontsize=10)
../_images/plotting_core_26_0.png

violin plot

A different way to explore the markers is with violin plots. Here we can see the expression of CD79A in clusters 5 and 8, and MS4A1 in cluster 5.Compared to a dotplot, the violin plot gives us and idea of the distribution of gene expression values across cells.

[15]:
rcParams['figure.figsize'] = 4.5,3
sc.pl.violin(pbmc, ['CD79A', 'MS4A1'], groupby='clusters' )
../_images/plotting_core_29_0.png

Note Violin plots can also be used to plot any numerical value stored in .obs. For example, here violin plots are used to compare the number of genes and the percentage of mitochondrial genes between the different clusters.

[16]:
sc.pl.violin(pbmc, ['n_genes', 'percent_mito'], groupby='clusters', stripplot=False)  # use stripplot=False to remove the internal dots
../_images/plotting_core_31_0.png

stacked-violin plot

To simultaneously look at the violin plots for all marker genes we use sc.pl.stacked_violin. As previously, a dendrogram was added to group similar clusters

[17]:
ax = sc.pl.stacked_violin(pbmc, marker_genes_dict, groupby='clusters', swap_axes=False, dendrogram=True)
../_images/plotting_core_34_0.png

matrixplot

A simple way to visualize the expression of genes is with a matrix plot. This is a heatmap of the mean expression values per gene grouped by categories. This type plot basically shows the same information as the color in the dotplots.

Here, scale the expression of the genes from 0 to 1, being the maximum mean expression and 0 the minimum.

[18]:
sc.pl.matrixplot(pbmc, marker_genes_dict, 'clusters', dendrogram=True, cmap='Blues', standard_scale='var', colorbar_title='column scaled\nexpression')
../_images/plotting_core_37_0.png

Other useful option is to normalize the gene expression using sc.pp.scale. Here we store this information under the scale layer. Afterwards we adjust the plot min and max and use a diverging color map (in this case RdBu_r where _r means reversed).

[19]:
# scale and store results in layer
pbmc.layers['scaled'] = sc.pp.scale(pbmc, copy=True).X
[20]:
sc.pl.matrixplot(pbmc, marker_genes_dict, 'clusters', dendrogram=True,
                 colorbar_title='mean z-score', layer='scaled', vmin=-2, vmax=2, cmap='RdBu_r')
../_images/plotting_core_40_0.png

Combining plots in subplots

An axis can be passed to a plot to combine multiple outputs as in the following example

[21]:
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,4), gridspec_kw={'wspace':0.8})
ax1_dict = sc.pl.dotplot(pbmc, marker_genes_dict, groupby='bulk_labels', ax=ax1, show=False)
ax2_dict = sc.pl.stacked_violin(pbmc, marker_genes_dict, groupby='bulk_labels', ax=ax2, show=False)
ax3_dict = sc.pl.matrixplot(pbmc, marker_genes_dict, groupby='bulk_labels', ax=ax3, show=False, cmap='viridis')
../_images/plotting_core_43_0.png

Heatmaps

Heatmaps do not collapse cells as in previous plots. Instead, each cells is shown in a row (or column if swap_axes=True). The groupby information can be added and is shown using the same color code found for sc.pl.umap or any other embedding.

[22]:
ax = sc.pl.heatmap(pbmc, marker_genes_dict, groupby='clusters', cmap='viridis', dendrogram=True)
../_images/plotting_core_46_0.png

The heatmap can also be plotted on scaled data. In the next image, similar to the previus matrixplot the min and max had been adjusted and a divergent color map is used.

[23]:
ax = sc.pl.heatmap(pbmc, marker_genes_dict, groupby='clusters', layer='scaled', vmin=-2, vmax=2, cmap='RdBu_r', dendrogram=True, swap_axes=True)
../_images/plotting_core_48_0.png

Tracksplot

The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height.

[24]:
ax = sc.pl.tracksplot(pbmc,marker_genes_dict, groupby='louvain', dendrogram=True)
../_images/plotting_core_51_0.png

Visualization of marker genes

Instead of characterizing clusters by known gene markers as previously, we can identify genes that are differentially expressed in the clusters or groups.

[25]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

To identify differentially expressed genes we run sc.tl.rankg_genes_groups. This function will take each group of cells and compare the distribution of each gene in a group against the distribution in all other cells not in the group. Here, we will use the original cell labels given by 10x to identify marker genes for those cell types.

[26]:
# added `n_genes` to store information in all genes. This is needed if we want to plot log fold changes or pvalues
sc.tl.rank_genes_groups(pbmc, groupby='clusters', n_genes=pbmc.shape[1], method='wilcoxon')

Visualize marker genes using dotplot

The dotplot visualization is useful to get an overview of the genes that show differential expression. To make the resulting image more compact we will use n_genes=4 to show only the top 4 scoring genes.

[27]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=4)
../_images/plotting_core_58_0.png

In order to get a better representation we can plot log fold changes instead of gene expression. Also, we want to focus on genes that have a log fold change >= 3 between the cell type expression and the rest of cells.

In this case we set values_to_plot='logfoldchanges' and min_logfoldchange=3.

Because log fold change is a divergent scale we also adjust the min and max to be plotted and use a divergent color map. Notice in the following plot that is rather difficult to distinguish between T-cell populations.

[28]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=4, values_to_plot='logfoldchanges', min_logfoldchange=3, vmax=7, vmin=-7, cmap='bwr')
../_images/plotting_core_60_0.png

Focusing on particular groups

Next, we use a dotplot focusing only on two groups (the groups option is also available for violin, heatmap and matrix plots). Here, we set n_genes=30 as in this case it will show all the genes that have a min_logfoldchange=4 up to 30.

[29]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=30, values_to_plot='logfoldchanges', min_logfoldchange=4, vmax=7, vmin=-7, cmap='bwr', groups=['1', '5'])
../_images/plotting_core_63_0.png

Visualize marker genes using matrixplot

For the following plot the we use the previously computed ‘scaled’ values (stored in layer scaled) and use a divergent color map.

[30]:
sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=3, use_raw=False, vmin=-3, vmax=3, cmap='bwr', layer='scaled')
../_images/plotting_core_65_0.png

Visualize marker genes using stacked violin plots

[31]:
sc.pl.rank_genes_groups_stacked_violin(pbmc, n_genes=3, cmap='viridis_r')
../_images/plotting_core_67_0.png

Visualize marker genes using heatmap

[32]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=3, use_raw=False, swap_axes=True, vmin=-3, vmax=3, cmap='bwr', layer='scaled', figsize=(10,5), show=False)
[32]:
{'heatmap_ax': <matplotlib.axes._subplots.AxesSubplot at 0x7f672d40bc88>,
 'groupby_ax': <matplotlib.axes._subplots.AxesSubplot at 0x7f672cf9e240>,
 'dendrogram_ax': <matplotlib.axes._subplots.AxesSubplot at 0x7f672cf2e630>,
 'gene_groups_ax': <matplotlib.axes._subplots.AxesSubplot at 0x7f672cf082e8>}
../_images/plotting_core_69_1.png

Showing 10 genes per category, turning the gene labels off and swapping the axes. Notice that when the image is swapped, a color code for the categories appear instead of the ‘brackets’.

[33]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=10, use_raw=False, swap_axes=True, show_gene_labels=False,
                                vmin=-3, vmax=3, cmap='bwr')
../_images/plotting_core_71_0.png

Visualize marker genes using tracksplot

[34]:
sc.pl.rank_genes_groups_tracksplot(pbmc, n_genes=3)
../_images/plotting_core_73_0.png

Comparison of marker genes using split violin plots

In scanpy, is very easy to compare marker genes using split violin plots for all groups at once.

[35]:
rcParams['figure.figsize'] = 9,1.5
sc.pl.rank_genes_groups_violin(pbmc, n_genes=20, jitter=False)
../_images/plotting_core_75_0.png
../_images/plotting_core_75_1.png
../_images/plotting_core_75_2.png
../_images/plotting_core_75_3.png
../_images/plotting_core_75_4.png
../_images/plotting_core_75_5.png
../_images/plotting_core_75_6.png
../_images/plotting_core_75_7.png
../_images/plotting_core_75_8.png

Dendrogram options

Most of the visualizations can arrange the categories using a dendrogram. However, the dendrogram can also be plotted independently as follows:

[36]:
# compute hierachical clustering based using PCAs (several options available to compute the distance matrix and linkage method are available).
sc.tl.dendrogram(pbmc, 'bulk_labels')
[37]:
ax = sc.pl.dendrogram(pbmc, 'bulk_labels')
../_images/plotting_core_78_0.png

Plot correlation

Together with the dendrogram it is possible to plot the correlation (by default ‘pearson’) of the categories.

[38]:
ax = sc.pl.correlation_matrix(pbmc, 'bulk_labels', figsize=(5,3.5))
../_images/plotting_core_81_0.png