Analyzing CITE-seq data

Author: Isaac Virshup

[1]:
import scanpy as sc
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
[2]:
%load_ext autoreload
%autoreload 2
[3]:
sc.logging.print_versions()
sc.set_figure_params(frameon=False, figsize=(4, 4))
scanpy==1.4.5.2.dev48+g5c81890 anndata==0.7.2.dev24+g669dd44 umap==0.4.0rc1 numpy==1.16.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1

Reading

[4]:
# !mkdir -p data
# !wget http://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3_nextgem/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5 -O data/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5
[5]:
datafile = "data/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5"
[6]:
pbmc = sc.read_10x_h5(datafile, gex_only=False)
pbmc.var_names_make_unique()
pbmc.layers["counts"] = pbmc.X.copy()

Inspect the full object.

[7]:
pbmc
[7]:
AnnData object with n_obs × n_vars = 5527 × 33570
    var: 'gene_ids', 'feature_types', 'genome'
    layers: 'counts'

Inspect the number of variables for the two different modalities.

[8]:
pbmc.var["feature_types"].value_counts()
[8]:
Gene Expression     33538
Antibody Capture       32
Name: feature_types, dtype: int64

Preprocessing

For ease of preprocessing, we’ll split the data into seperate protein and RNA AnnData objects:

[9]:
protein = pbmc[:, pbmc.var["feature_types"] == "Antibody Capture"].copy()
rna = pbmc[:, pbmc.var["feature_types"] == "Gene Expression"].copy()

Protein

First we’ll take a look at the antibody counts.

We still need to explain the function here. I’m happy if we add it to the first tutorial, too (I know you did it already at some point, but I didn’t want to let go of the simpler naming scheme back then; now I’d be happy to transition.)

Inspect the proteins stored. Some of them are merge control measurements.

Most of them are protein markers associated with T cell activity. CD8 marks cytotoxicity, PD-1 and TIGIT mark dysfunctional T cell states.

[10]:
print(protein.var_names.tolist())
['CD3_TotalSeqB', 'CD4_TotalSeqB', 'CD8a_TotalSeqB', 'CD11b_TotalSeqB', 'CD14_TotalSeqB', 'CD15_TotalSeqB', 'CD16_TotalSeqB', 'CD19_TotalSeqB', 'CD20_TotalSeqB', 'CD25_TotalSeqB', 'CD27_TotalSeqB', 'CD28_TotalSeqB', 'CD34_TotalSeqB', 'CD45RA_TotalSeqB', 'CD45RO_TotalSeqB', 'CD56_TotalSeqB', 'CD62L_TotalSeqB', 'CD69_TotalSeqB', 'CD80_TotalSeqB', 'CD86_TotalSeqB', 'CD127_TotalSeqB', 'CD137_TotalSeqB', 'CD197_TotalSeqB', 'CD274_TotalSeqB', 'CD278_TotalSeqB', 'CD335_TotalSeqB', 'PD-1_TotalSeqB', 'HLA-DR_TotalSeqB', 'TIGIT_TotalSeqB', 'IgG1_control_TotalSeqB', 'IgG2a_control_TotalSeqB', 'IgG2b_control_TotalSeqB']

Let’s compute some qc metrics.

[11]:
protein.var["control"] = protein.var_names.str.contains("control")
sc.pp.calculate_qc_metrics(
    protein,
    percent_top=(5, 10, 15),
    var_type="antibodies",
    qc_vars=("control",),
    inplace=True,
)

We can look check out the qc metrics for our data:

TODO: I would like to include some justification for the change in normalization. It definitley has a much different distribution than transcripts. I think this could be shown through the qc plots, but it’s a huge pain to move around these matplotlib plots. This might be more appropriate for the in-depth guide though.

Agreed! But, having an explanation of the two images below is also a good first step. Also nice to contrast it with the corresponding RNA distributions.

[12]:
sns.jointplot("log1p_total_counts", "n_antibodies_by_counts", protein.obs, kind="hex", norm=mpl.colors.LogNorm())
sns.jointplot("log1p_total_counts", "log1p_total_counts_control", protein.obs, kind="hex", norm=mpl.colors.LogNorm())
[12]:
<seaborn.axisgrid.JointGrid at 0x13572ccc0>
../_images/cite-seq_pbmc5k_alex_24_1.png
../_images/cite-seq_pbmc5k_alex_24_2.png
[13]:
protein.layers["counts"] = protein.X.copy()

Discuss that this here is a different normalization.

[14]:
sc.pp.normalize_geometric(protein)
sc.pp.log1p(protein)
[15]:
sc.pp.pca(protein, n_comps=20) # we just have 32 proteins, so a low numnber of PCs is appropriate to denoise this
sc.pp.neighbors(protein, n_neighbors=30)

RNA

Now we’ll process our RNA in the typical way:

[17]:
sc.pp.filter_genes(rna, min_counts=1)
rna.var["mito"] = rna.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(rna, qc_vars=["mito"], inplace=True)

Yes, would be great to the see the QC plots, here, too. Are cells with low counts etc. weird only on the RNA level or also on the protein level? It’d be nice to see the QC plots side-by-side between RNA and protein, I’d say.

[18]:
rna.layers["counts"] = rna.X.copy()
[19]:
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
[20]:
sc.pp.pca(rna)
sc.pp.neighbors(rna, n_neighbors=30)   # again, using higher than default numbers of neighbors stabilizes the layout

Joint manifold and clustering

[25]:
def average_graphs(graphs, weights=None):
    """Take the maximum edge value from each graph.

    graphs : iterable of sparse matrices
    weights : iterable of floats with same length as graphs
    """
    if weights is None:
        weights = [1 for g in graphs]
    if len(graphs) != len(weights):
        raise ValueError('graphs and weights need to have the same lenghts')
    out = weights[0] * graphs[0].copy()
    for i in range(1, len(graphs)):
        out += weights[i] * graphs[i]
    out /= sum(weights)
    return out
[34]:
joint = rna.copy()
[35]:
joint.obsm['protein'] = protein.to_df()
[36]:
joint.uns['neighbors'] = {}
joint.uns['neighbors']['connectivities'] = average_graphs([rna.uns['neighbors']['connectivities'], protein.uns['neighbors']['connectivities']])

There are almost no exactly shared neighbors.

[37]:
print(rna.uns['neighbors']['connectivities'].nnz)
print(protein.uns['neighbors']['connectivities'].nnz)
print(joint.uns['neighbors']['connectivities'].nnz)
245592
238364
458524
[39]:
sc.tl.umap(joint)
sc.tl.leiden(joint, key_added='joint')
sc.tl.umap(joint)
sc.pl.umap(joint, color='joint', size=10)
../_images/cite-seq_pbmc5k_alex_43_0.png

Comparing with an analysis of separate objects

[53]:
sc.tl.leiden(protein, key_added="protein")
sc.tl.umap(protein)
sc.tl.leiden(rna, key_added="rna")
sc.tl.umap(rna)

Map onto joint.

[46]:
joint.obs['protein'] = protein.obs['protein']
joint.obs['rna'] = rna.obs['rna']
[57]:
rna.obs['protein'] = protein.obs['protein']
rna.obs['joint'] = joint.obs['joint']
[60]:
protein.obs['rna'] = rna.obs['rna']
protein.obs['joint'] = joint.obs['joint']
[63]:
print('joint umap')
sc.pl.umap(joint, color=['joint', 'rna', 'protein'], size=10)
print('rna umap')
sc.pl.umap(rna, color=['joint', 'rna', 'protein'], size=10)
print('protein umap')
sc.pl.umap(protein, color=['joint', 'rna', 'protein'], size=10)
joint umap
../_images/cite-seq_pbmc5k_alex_50_1.png
rna umap
../_images/cite-seq_pbmc5k_alex_50_3.png
protein umap
../_images/cite-seq_pbmc5k_alex_50_5.png

Plot some interesting markers, here.

[56]:
# sc.pl.umap(joint, color=['CD8'], size=10)

Visualization

An alternative would be to recombine the data into the pbmc object here.

[19]:
rna.obsm["protein"] = protein.to_df()
rna.obsm["protein_umap"] = protein.obsm["X_umap"]
rna.obs["protein_leiden"] = protein.obs["protein_leiden"]
rna.obsp["rna_connectivities"] = rna.obsp["connectivities"].copy()
rna.obsp["protein_connectivities"] = protein.obsp["protein_connectivities"]
[20]:
sc.tl.umap(rna)
[21]:
sc.pl.umap(rna, color=["rna_leiden", "protein_leiden"], size=10)
sc.pl.embedding(rna, basis="protein_umap", color=["rna_leiden", "protein_leiden"], size=10)
../_images/cite-seq_pbmc5k_alex_57_0.png
../_images/cite-seq_pbmc5k_alex_57_1.png

The clusterings disagree quite a bit. It looks like neither modality is giving a full accounting of the data, so now we’ll see what we can learn by combining the modalities.

Plotting values between modalities

  • TODO: Get selectors like this working in scanpy.

Imports and helper functions

[22]:
import altair as alt
from functools import partial

alt.renderers.enable("png")
alt.data_transformers.disable_max_rows()
[22]:
DataTransformerRegistry.enable('default')
[23]:
def embedding_chart(df: pd.DataFrame, coord_pat: str, *, size=5) -> alt.Chart:
    """Make schema for coordinates, like sc.pl.embedding."""
    x, y = df.columns[df.columns.str.contains(coord_pat)]
    return (
        alt.Chart(plotdf, height=300, width=300)
        .mark_circle(size=size)
        .encode(
            x=alt.X(x, axis=None),
            y=alt.Y(y, axis=None),
        )
    )

def umap_chart(df: pd.DataFrame, **kwargs) -> alt.Chart:
    """Like sc.pl.umap, but just the coordinates."""
    return embedding_chart(df, "umap", **kwargs)

def encode_color(c: alt.Chart, col: str, *, qdomain=(0, 1), scheme: str = "lightgreyred") -> alt.Chart:
    """Add colors to an embedding plot schema."""
    base = c.properties(title=col)
    if pd.api.types.is_categorical(c.data[col]):
        return base.encode(color=col)
    else:
        return base.encode(
            color=alt.Color(
                col,
                scale=alt.Scale(
                    scheme=scheme,
                    clamp=True,
                    domain=list(c.data[col].quantile(qdomain)),
                    nice=True,
                )
            )
        )
[24]:
plotdf = sc.get.obs_df(
    rna,
    obsm_keys=[("X_umap", i) for i in range(2)] + [("protein", i) for i in rna.obsm["protein"].columns]
)
(
    alt.concat(
        *map(partial(encode_color, umap_chart(plotdf), qdomain=(0, .95)), plotdf.columns[5:10]),
        columns=2
    )
    .resolve_scale(color='independent')
    .configure_axis(grid=False)
)
[24]:
../_images/cite-seq_pbmc5k_alex_63_0.png

Clustering

Here, we have a few approaches for clustering. Both which take into account both modalities of the data. First, we can use both connectivity graphs generated from each assay.

[25]:
sc.tl.leiden_multiplex(rna, ["rna_connectivities", "protein_connectivities"])  # Adds key "leiden_multiplex" by default

Alternativley, we can try and combine these representations into a joint graph.

[26]:
def join_graphs_max(g1: "sparse.spmatrix", g2: "sparse.spmatrix"):
    """Take the maximum edge value from each graph."""
    out = g1.copy()
    mask = g1 < g2
    out[mask] = g2[mask]

    return out
[27]:
rna.obsp["connectivities"] = join_graphs_max(rna.obsp["rna_connectivities"], rna.obsp["protein_connectivities"])
[28]:
sc.tl.leiden(rna, key_added="joint_leiden")
[29]:
sc.tl.umap(rna)

Now we’ve got a clustering which better represents both modalities in the data:

[30]:
sc.pl.umap(rna, color=["leiden_multiplex", "joint_leiden"], size=5)
../_images/cite-seq_pbmc5k_alex_73_0.png
[31]:
sns.heatmap(pd.crosstab(rna.obs["leiden_multiplex"], rna.obs["rna_leiden"], normalize="index"))
[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x13dd1f710>
../_images/cite-seq_pbmc5k_alex_74_1.png
[32]:
sns.heatmap(pd.crosstab(rna.obs["leiden_multiplex"], rna.obs["protein_leiden"], normalize="index"))
[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x15daeafd0>
../_images/cite-seq_pbmc5k_alex_75_1.png
[ ]: