SpatialClustering_Stagate_DLPFC

This tutorial demonstrates how to identify spatial domains on 10x Visium data using SODB and STAGATE based on pyG (PyTorch Geometric) framework.

A reference paper can be found at https://www.nature.com/articles/s41467-022-29439-6.

This tutorial refers to the following tutorial at https://stagate.readthedocs.io/en/latest/T1_DLPFC.html. At the same time, the way of loadding data is modified by using SODB.

Preparation

[1]:
# Use the Python warnings module to filter and ignore any warnings that may occur in the program after this point.
import warnings
warnings.filterwarnings("ignore")
[2]:
# import several Python libraries commonly used in data analysis and visualization.
# pandas (imported as pd) is a library for data manipulation and analysis.
# numpy (imported as np) is a library for numerical computing with arrays.
# scanpy (imported as sc) is a library for single-cell RNA sequencing analysis.
# matplotlib.pyplot (imported as plt) is a library for data visualization.
# os is a library for interacting with the operating system, such as reading or writing files.
# sys is a library for interacting with the Python interpreter.
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
# Import the adjusted_rand_score function from the sklearn.metrics.cluster module of the scikit-learn library.
from sklearn.metrics.cluster import adjusted_rand_score
[4]:
# Import a STAGATE_pyG module
import STAGATE_pyG
[5]:
# The location of R (used for the mclust clustering)
# os.environ['R_HOME'] = '/home/<usr_name>/anaconda3/envs/<environment_name>/lib/R/'
# os.environ['R_USER'] = '/home/<usr_name>/anaconda3/envs/<environment_name>/lib/R/rpy2'
os.environ['R_HOME'] = '/home/linsenlin/anaconda3/envs/stagate38/lib/R/'
os.environ['R_USER'] = '/home/linsenlin/anaconda3/envs/stagate38/lib/R/rpy2'
[7]:
# Import the pysodb library
# pysodb is a Python package that provides a set of tools for working with SODB (Store On-Demand Block) databases.
# SODB is a format used to store data in memory-mapped files for efficient access and querying.
# This library allows users to interact with SODB files using Python.
import pysodb
[ ]:
# Initialization
sodb = pysodb.SODB()
[8]:
# Define the name of the dataset_name and experiment_name
dataset_name = 'maynard2021trans'
experiment_name = '151676'
# Load a specific experiment
# It takes two arguments: the name of the dataset and the name of the experiment to load.
# Two arguments are available at https://gene.ai.tencent.com/SpatialOmics/.
adata = sodb.load_experiment(dataset_name,experiment_name)
download experiment[151676] in dataset[maynard2021trans]
100%|██████████| 121M/121M [01:03<00:00, 2.00MB/s]
load experiment[151676] in dataset[maynard2021trans]
[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 3460 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'Region', 'leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial', 'spatial_neighbors', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    varm: 'PCs'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
[10]:
# Make the index unique by appending a number string to each duplicate index element
adata.var_names_make_unique()
[11]:
# Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
WARNING: adata.X seems to be already log-transformed.
[12]:
# read the annotation
Ann_df = pd.read_csv('/home/linsenlin/protocols/sc/stagate/STAGATE_pyG/data/151676_truth.txt', sep='\t', header=None, index_col=0)
Ann_df.columns = ['Ground Truth']
[15]:
# Add a ['Ground Truth'] column from the 'Ground Truth' column in the 'Ann_df'.
adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
[17]:
# generate a spatial plot using the image specified in the 'hires' key and coloring the plot based on the 'Ground Truth' column.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires", color=['Ground Truth'])
../_images/Test_the_original_data_SpatialClustering_Stagate_DLPFC_16_0.png

Constructing the spatial network

[18]:
# Use "STAGATE_pyG.Cal_Spatial_Net" to calculate a spatial graph with a radius cutoff of 150.
STAGATE_pyG.Cal_Spatial_Net(adata, rad_cutoff=150)
# Use "STAGATE_pyG.Stats_Spatial_Net" to summarize cells and edges information.
STAGATE_pyG.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 20052 edges, 3460 cells.
5.7954 neighbors per cell on average.
../_images/Test_the_original_data_SpatialClustering_Stagate_DLPFC_18_1.png

Running STAGATE

[19]:
# Train a STAGATE model
adata = STAGATE_pyG.train_STAGATE(adata)
Size of Input:  (3460, 3000)
100%|██████████| 1000/1000 [00:06<00:00, 144.71it/s]

Spatial Clustering

[ ]:
# Calculate the nearest neighbors in the 'STAGATE' representation and computes the UMAP embedding.
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[20]:
# Use Mclust_R to cluster cells in the 'STAGATE' representation into 7 clusters.
adata = STAGATE_pyG.mclust_R(adata, used_obsm='STAGATE', num_cluster=7)
R[write to console]:                    __           __
   ____ ___  _____/ /_  _______/ /_
  / __ `__ \/ ___/ / / / / ___/ __/
 / / / / / / /__/ / /_/ (__  ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/   version 6.0.0
Type 'citation("mclust")' for citing this R package in publications.

fitting ...
  |======================================================================| 100%
[21]:
# Compute the adjusted rand index (ARI) between the 'mclust' and the 'Ground Truth'.
obs_df = adata.obs.dropna()
ARI = adjusted_rand_score(obs_df['mclust'], obs_df['Ground Truth'])
print('Adjusted rand index = %.2f' %ARI)
Adjusted rand index = 0.60
[22]:
# generate a plot of the UMAP embedding colored by both a mclust and a ground truth.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])
../_images/Test_the_original_data_SpatialClustering_Stagate_DLPFC_25_0.png
[23]:
# Visualize the result using mclust and Ground Truth.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])
../_images/Test_the_original_data_SpatialClustering_Stagate_DLPFC_26_0.png

Spatial trajectory inference (PAGA)

[24]:
"""
used_adata = adata[adata.obs['Ground Truth']!='nan']
used_adata
"""
[24]:
"\nused_adata = adata[adata.obs['Ground Truth']!='nan']\nused_adata\n"
[25]:
# Exclude any cells with missing values in the 'Ground Truth' column of the observation metadata.
used_adata = adata[pd.notna(adata.obs['Ground Truth'])]
used_adata
[25]:
View of AnnData object with n_obs × n_vars = 3431 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'Region', 'leiden', 'Ground Truth', 'mclust'
    var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_rank', 'variances', 'variances_norm'
    uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial', 'spatial_neighbors', 'umap', 'Ground Truth_colors', 'Spatial_Net', 'mclust_colors'
    obsm: 'X_pca', 'X_umap', 'spatial', 'STAGATE'
    varm: 'PCs'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
[26]:
# Use PAGA to infer differentiation trajectories.
sc.tl.paga(used_adata, groups="Ground Truth")
[27]:
# Compare partition-based graph abstraction (PAGA) results.
plt.rcParams["figure.figsize"] = (4,3)
sc.pl.paga_compare(used_adata, legend_fontsize=10, frameon=False, size=20,
                   title=experiment_name+'_STGATE', legend_fontoutline=2, show=False)
[27]:
[<Axes: xlabel='UMAP1', ylabel='UMAP2'>, <Axes: >]
../_images/Test_the_original_data_SpatialClustering_Stagate_DLPFC_31_1.png