Visualizing marker genes

[1]:
import scanpy as sc
import pandas as pd
from matplotlib import rcParams
[2]:
sc.set_figure_params(dpi=80, color_map='viridis')
sc.settings.verbosity = 2
sc.logging.print_versions()
scanpy==1.4.3+21.g63fbe58 anndata==0.6.19 umap==0.3.8 numpy==1.15.4 scipy==1.2.1 pandas==0.24.2 scikit-learn==0.20.3 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

Data was obtained from 10x PBMC 68k dataset (https://support.10xgenomics.com/single-cell-gene-expression/datasets). The dataset was filtered and a sample of 700 cells and 765 highly variable genes was kept.

For this data, PCA and UMAP are already computed. Also, louvain clustering and cell cycle detection are present in pbmc.obs.

[3]:
pbmc = sc.datasets.pbmc68k_reduced()
[4]:
pbmc
[4]:
AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

To modify the default figure size, use rcParams.

[5]:
rcParams['figure.figsize'] = 4, 4
[6]:
sc.pl.umap(pbmc, color=['bulk_labels'], s=50)
_images/visualizing-marker-genes_8_0.png

Define list of marker genes from literature.

[7]:
marker_genes = ['CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ',  'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'FCER1A', 'CST3']

Stacked violins

Plot marker genes per cluster using stacked violin plots.

[8]:
ax = sc.pl.stacked_violin(pbmc, marker_genes, groupby='bulk_labels',
                         var_group_positions=[(7, 8)], var_group_labels=['NK'])
_images/visualizing-marker-genes_13_0.png

Same as before but swapping the axes and with dendrogram (notice that the categories are reordered).

[9]:
ax = sc.pl.stacked_violin(pbmc, marker_genes, groupby='bulk_labels', swap_axes=True,
                         var_group_positions=[(7, 8)], var_group_labels=['NK'], dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_bulk_labels). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: NK
_images/visualizing-marker-genes_15_1.png

Dot plots

The dotplot visualization provides a compact way of showing per group, the fraction of cells expressing a gene (dot size) and the mean expression of the gene in those cell (color scale).

The use of the dotplot is only meaningful when the counts matrix contains zeros representing no gene counts. dotplot visualization does not work for scaled or corrected matrices in which cero counts had been replaced by other values.

The marker genes list can be a list or a dictionary. If marker genes List is a dictionary, then plot shows the marker genes grouped and labelled

[10]:
marker_genes_dict = {'B-cell': ['CD79A', 'MS4A1'],
                     'T-cell': 'CD3D',
                     'T-cell CD8+': ['CD8A', 'CD8B'],
                     'NK': ['GNLY', 'NKG7'],
                     'Myeloid': ['CST3', 'LYZ'],
                     'Monocytes': ['FCGR3A'],
                     'Dendritic': ['FCER1A']}
[11]:
# use marker genes as dict to group them
ax = sc.pl.dotplot(pbmc, marker_genes_dict, groupby='bulk_labels')
_images/visualizing-marker-genes_19_0.png

To show some of the options of dot plot, here we add:

  • dendrogram=True show dendrogram and reorder group by categories based on dendrogram order
  • dot_max=0.5 plot largest dot as 50% or more cells expressing the gene
  • dot_min=0.3 plot smallest dot as 30% or less cells expressing the gene
  • standard_scale=’var’ normalize the mean gene expression values between 0 and 1
[12]:
ax = sc.pl.dotplot(pbmc, marker_genes, groupby='bulk_labels', dendrogram=True, dot_max=0.5, dot_min=0.3, standard_scale='var')
_images/visualizing-marker-genes_21_0.png

In the next plot we added:

  • smallest_dot=40 To increase the size of the smallest dot
  • color_map=’Blues’ To change the colormap palette
  • figsize=(8,5) To change the default figure size
[13]:
ax = sc.pl.dotplot(pbmc, marker_genes_dict, groupby='bulk_labels', dendrogram=True,
                   standard_scale='var', smallest_dot=40, color_map='Blues', figsize=(8,5))
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: B-cell, T-cell, T-cell CD8+, etc.
_images/visualizing-marker-genes_23_1.png

Here we add grouping labels for genes using var_group_positions and var_group_labels

[14]:
ax = sc.pl.dotplot(pbmc, marker_genes, groupby='louvain',
              var_group_positions=[(0,1), (11, 12)],
              var_group_labels=['B cells', 'dendritic'],
              figsize=(12,4), var_group_rotation=0, dendrogram='dendrogram_louvain')
WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_louvain']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cells, dendritic
_images/visualizing-marker-genes_25_1.png

Matrix plots

The matrixplot shows the mean expression of a gene in a group by category as a heatmap. In contrast to dotplot, the matrix plot can be used with corrected and/or scaled counts. By default raw counts are used.

[15]:
gs = sc.pl.matrixplot(pbmc, marker_genes_dict, groupby='bulk_labels')
_images/visualizing-marker-genes_28_0.png

Next we add a dendrogram and scale the gene expression values between 0 and 1 using standar_scale=var

[16]:
gs = sc.pl.matrixplot(pbmc, marker_genes_dict, groupby='bulk_labels', dendrogram=True, standard_scale='var')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: B-cell, T-cell, T-cell CD8+, etc.
_images/visualizing-marker-genes_30_1.png
[17]:
# to use the 'non-raw' data we select marker genes present in this data.
marker_genes_2 = [x for x in marker_genes if x in pbmc.var_names]

In the next figure we use: * use_raw=False to use the scaled values (after sc.pp.scale) stored in pbmc.X * vmin=-3, vmax=3, cmap=’bwr’ to select the max and min for the diverging color map ‘bwr’ * swap_axes=True to show genes in rows * figsize=(5,6) to modify the defult figure size

[18]:
gs = sc.pl.matrixplot(pbmc, marker_genes_2, groupby='bulk_labels', dendrogram=True,
                      use_raw=False, vmin=-3, vmax=3, cmap='bwr',  swap_axes=True, figsize=(5,6))
_images/visualizing-marker-genes_33_0.png

Heatmaps

Heatmaps do not collapse cells as in previous plots. Instead, each cells is shown in a row (or columm if swap_axes=True). The groupby information can be added and is shown using the same color code found for sc.pl.umap or any other scatter plot.

[19]:
ax = sc.pl.heatmap(pbmc,marker_genes_dict, groupby='louvain')
_images/visualizing-marker-genes_36_0.png

Same as before but using use_raw=False to use the scaled data stored in pbmc.X. Some genes are highlighted using var_groups_positions and var_group_labels and the figure size is adjusted.

[20]:
ax = sc.pl.heatmap(pbmc, marker_genes, groupby='louvain', figsize=(5, 8),
              var_group_positions=[(0,1), (11, 12)], use_raw=False, vmin=-3, vmax=3, cmap='bwr',
              var_group_labels=['B cells', 'dendritic'], var_group_rotation=0, dendrogram='dendrogram_louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cells, dendritic
_images/visualizing-marker-genes_38_1.png

Tracksplots

The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height.

[21]:
# Track plot data is better visualized using the non-log counts
import numpy as np
ad = pbmc.copy()
ad.raw.X.data = np.exp(ad.raw.X.data)
[22]:
ax = sc.pl.tracksplot(ad,marker_genes, groupby='louvain')
_images/visualizing-marker-genes_42_0.png

Visualization of marker genes

Gene markers are computed using the bulk_labels categories and the logreg method

[23]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sc.tl.rank_genes_groups(pbmc, groupby='bulk_labels', method='logreg')
ranking genes
    finished (0:00:00.21)

visualize marker genes using panels

[24]:
rcParams['figure.figsize'] = 4,4
rcParams['axes.grid'] = True
sc.pl.rank_genes_groups(pbmc)
_images/visualizing-marker-genes_47_0.png

Visualize marker genes using dotplot

[25]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=4)
_images/visualizing-marker-genes_49_0.png

Dotplot focusing only on two groups (the groups option is also available for violin, heatmap and matrix plots).

[26]:
axs = sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=15, groups=['Dendritic', 'CD19+ B'])
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
var_group_labels: Dendritic, CD19+ B
_images/visualizing-marker-genes_51_1.png

Dotplot showing the marker genes but with respect to ‘louvain’ clusters (in contrast to using the bulk_labels categories as before).

[27]:
axs = sc.pl.rank_genes_groups_dotplot(pbmc, groupby='louvain', n_genes=4, dendrogram='dendrogram_louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, etc.
_images/visualizing-marker-genes_53_1.png

Visualize marker genes using matrixplot

For the following plot the raw gene expression is scaled and the color map is changed from the default to ‘Blues’

[28]:
axs = sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=3, standard_scale='var', cmap='Blues')
_images/visualizing-marker-genes_55_0.png

Same as before but using the scaled data and setting a divergent color map

[29]:
axs = sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=3, use_raw=False, vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_57_0.png

Visualize marker genes using stacked violing plots

[30]:
# instead of pbmc we use the 'ad' object (created earlier) in which the raw matrix is exp(pbmc.raw.matrix). This
# highlights better the differences between the markers.
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3)
_images/visualizing-marker-genes_59_0.png
[31]:
# setting row_palette='slateblue' makes all violin plots of the same color
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3, row_palette='slateblue')
_images/visualizing-marker-genes_60_0.png

Same as previous but with axes swapped.

[32]:
# width is used to set the violin plot width. Here, after setting figsize wider than default,
# the `width` arguments helps to keep the violin plots thin.
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=3, swap_axes=True, figsize=(6, 10), width=0.4)
_images/visualizing-marker-genes_62_0.png

Visualize marker genes using heatmap

[33]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=3, standard_scale='var')
_images/visualizing-marker-genes_64_0.png
[34]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=3, use_raw=False, swap_axes=True, vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_65_0.png

Showing 10 genes per category, turning the gene labels off and swapping the axes. Notice that when the image is swaped, a color code for the categories appear instead of the ‘brackets’.

[35]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=10, use_raw=False, swap_axes=True, show_gene_labels=False,
                                vmin=-3, vmax=3, cmap='bwr')
_images/visualizing-marker-genes_67_0.png

Visualize marker genes using tracksplot

[36]:
sc.pl.rank_genes_groups_tracksplot(ad, n_genes=3)
_images/visualizing-marker-genes_69_0.png
[37]:
## Filtering of marker genes

The tools sc.tl.filter_rank_genes_groups can be used to select markers that fulfill certain criteria, for example whose fold change is at least 2 with respect to other categories and that are expresssed on 50% of the category cells.

[38]:
# define a new category called 'broad_type' which collapses all T-cells into one group
t_cell = ['CD4+/CD25 T Reg', 'CD4+/CD45RA+/CD25- Naive T', 'CD4+/CD45RO+ Memory','CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic']
pbmc.obs['broad_type'] = pd.Categorical(pbmc.obs.bulk_labels.apply(lambda x: x if x not in t_cell  else 'T-cell'))
[39]:
# find gene markers in the 'broad_type' group.
sc.tl.rank_genes_groups(pbmc, 'broad_type', method='wilcoxon', n_genes=200)
ranking genes
    finished (0:00:00.06)
[40]:
sc.tl.filter_rank_genes_groups(pbmc)
Filtering genes using: min_in_group_fraction: 0.25 min_fold_change: 2, max_out_group_fraction: 0.5
[41]:
rcParams['figure.figsize'] = 4,4
rcParams['axes.grid'] = True
sc.pl.rank_genes_groups(pbmc, key='rank_genes_groups_filtered', ncols=3)
_images/visualizing-marker-genes_75_0.png

All filtered genes are set to nan.

[42]:
sc.pl.rank_genes_groups_heatmap(pbmc, n_genes=200, key='rank_genes_groups_filtered',
                                swap_axes=True, use_raw=False, vmax=3, vmin=-3, cmap='bwr', dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_broad_type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_broad_type']`
WARNING: Gene labels are not shown when more than 50 genes are visualized. To show gene labels set `show_gene_labels=True`
_images/visualizing-marker-genes_77_1.png

Notice that n_genes=200 is set, but only the maximum number of filtered genes found is actually plotted. For example, for CD34+ only few genes are shown.

Apply more stringent marker gene filters.

[43]:
sc.tl.filter_rank_genes_groups(pbmc,
                               min_in_group_fraction=0.5,
                               max_out_group_fraction=0.3,
                               min_fold_change=1.5)
Filtering genes using: min_in_group_fraction: 0.5 min_fold_change: 1.5, max_out_group_fraction: 0.3

Without filtering.

[44]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=6)
_images/visualizing-marker-genes_82_0.png

With filtering.

[45]:
sc.pl.rank_genes_groups_dotplot(pbmc, n_genes=6, key='rank_genes_groups_filtered')
_images/visualizing-marker-genes_84_0.png
[46]:
sc.pl.rank_genes_groups_matrixplot(pbmc, n_genes=6, key='rank_genes_groups_filtered', standard_scale='var', cmap='Blues')
_images/visualizing-marker-genes_85_0.png
[47]:
# Track plot data and violin is better visualized using the non-log counts
ad = pbmc.copy()
ad.raw.X.data = np.exp(ad.raw.X.data)
[48]:
sc.pl.rank_genes_groups_stacked_violin(ad, n_genes=6, key='rank_genes_groups_filtered', standard_scale='var')
_images/visualizing-marker-genes_87_0.png
[49]:
sc.pl.rank_genes_groups_tracksplot(ad, n_genes=6, key='rank_genes_groups_filtered')
_images/visualizing-marker-genes_88_0.png

Compare CD4+/CD25 T Reg markers vs. rest using violin

[50]:
pbmc.obs.bulk_labels.cat.categories[0]
[50]:
'CD4+/CD25 T Reg'
[51]:
sc.pl.rank_genes_groups_violin(pbmc,  n_genes=5, jitter=False)
_images/visualizing-marker-genes_91_0.png
_images/visualizing-marker-genes_91_1.png
_images/visualizing-marker-genes_91_2.png
_images/visualizing-marker-genes_91_3.png
_images/visualizing-marker-genes_91_4.png
_images/visualizing-marker-genes_91_5.png

Dendrogram options

Hierarchical clusterings for categorical observations can also be visualized independently using sc.pl.dendrogram

[52]:
ax = sc.pl.dendrogram(pbmc, 'bulk_labels')
_images/visualizing-marker-genes_93_0.png

The previous image uses a dendrogram that was already computed.

We can re-run the dendrogram function to use other parameters

[53]:
# compute hiearchical clustering based on the
# given `var_names` from the raw matrix
sc.tl.dendrogram(pbmc, 'bulk_labels', var_names=marker_genes, use_raw=True)
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
[54]:
rcParams['figure.figsize'] = 4,8
ax = sc.pl.dendrogram(pbmc, 'bulk_labels', orientation='left')
_images/visualizing-marker-genes_96_0.png

Plot correlation

Together with the dendrogram it is possible to plot the correlation (by default ‘pearson’) of the categories.

[55]:
# compute hiearchical clustering based on the
# given `var_names` from the raw matrix
sc.tl.dendrogram(pbmc, 'bulk_labels', n_pcs=30)
    using 'X_pca' with n_pcs = 30
Storing dendrogram info using `.uns['dendrogram_bulk_labels']`
[56]:
ax = sc.pl.correlation_matrix(pbmc, 'bulk_labels')
(0.0, 10.0)
_images/visualizing-marker-genes_100_1.png