Protocols - Streamline development of spatial omics analysis tools using SODB!
Installation
Installation for Pysodb
This tutorial demonstrates how to install the pysodb package in a conda environment.
Installing software
1. The first step is to install Anaconda and Visual Studio Code in advance.
Reference tutorials can be found at https://docs.anaconda.com/anaconda/install/index.html and https://docs.anaconda.com/anaconda/user-guide/tasks/integration/vscode/
2. Launch Visual Studio Code and open a terminal window.
Henceforth, various packages and modules will be installed via the command line.
Installing pysodb
3. Select the installation path and open it
[ ]:
cd <path>
4. Clone the pysodb code
[ ]:
git clone https://github.com/TencentAILabHealthcare/pysodb.git
5. Open the pysodb directory
[ ]:
cd pysodb
6. Create a conda environment
[ ]:
conda env create -n <environment_name> --file pysodb.yml
7. Activate the conda environment
To use the conda environment in the terminal, run the following command to activate it:
[ ]:
conda activate <environment_name>
8. Install the pysodb package from source
[ ]:
python setup.py install
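Alternatively (an assumption, not part of the original protocol), a standard pip source install from the repository root should work as well:
[ ]:
pip install .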
9. Install any missing dependency or third-party package with pip, replacing <third-party package> with the package name
[ ]:
pip install <third-party package>
Usage
The next steps demonstrate usage of the pysodb package via Jupyter.
In Jupyter, click "Select Kernel" and choose the Python environment created above.
Import the pysodb package
[1]:
import pysodb
Initialization
[2]:
sodb = pysodb.SODB()
Get the list of datasets
[3]:
dataset_list = sodb.list_dataset()
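As a quick, optional check (a minimal sketch, assuming the call above succeeded), inspect the returned list:
[ ]:
# how many datasets are available, and the first few names
print(len(dataset_list))
print(dataset_list[:5])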
Get the list of datasets with a specific category.
Available categories: ["Spatial Transcriptomics", "Spatial Proteomics", "Spatial Metabolomics", "Spatial Genomics", "Spatial MultiOmics"]
Here, take "Spatial Transcriptomics" as an example:
[4]:
dataset_list = sodb.list_dataset_by_category("Spatial Transcriptomics")
dataset_list
[4]:
['maynard2021trans',
'codeluppi2018spatial',
'xia2022the',
'backdahl2021spatial',
'eng2019transcriptome',
'berglund2018spatial',
'Sanchez2021A',
'thrane2018spatially',
'Dhainaut2022Spatial',
'Buzzi2022Spatial',
'Gouin2021An',
'Wang2018Three_1k',
'wang2021easi',
'lohoff2021integration',
'chen2020spatial',
'wang2022high',
'Sun2022Excitatory',
'Garcia2021Mapping',
'ji2020multimodal',
'Dixon2022Spatially',
'Zeng2023Integrative',
'asp2019a',
'seqFISH_VISp',
'Wang2018three',
'rodriques2019slide',
'chen2021decoding',
'stickels2020highly',
'liu2022spatiotemporal',
'Alon2021Expansion',
'Allen2022Molecular_lps',
'chen2022spatiotemporal_compre_20',
'carlberg2019exploring',
'zhang2021spatially',
'Marshall2022High_human',
'Vickovic2019high_update',
'scispace',
'hunter2021spatially',
'Kadur2022Human',
'fawkner2021spatiotemporal',
'stahl2016visualization',
'ortiz2020molecular',
'Vickovic2019high',
'Biermann2022Dissecting',
'DARTFISH',
'Marshall2022High_mouse',
'Allen2022Molecular_aging',
'Merfish_Visp',
'Barkley2022Cancer',
'gracia2021genome',
'mantri2021spatiotemporal',
'moffitt2018molecular',
'Visium_Allen',
'Wu2022spatial',
'chen2022spatiotemporal',
'kvastad2021the',
'asp2017spatial',
'wei2022single',
'Pascual2021Dietary',
'Fang2022Conservation',
'Navarro2020Spatial',
'Joglekar2021A',
'Konieczny2022Interleukin',
'guilliams2022spatial',
'Misra2021Characterizing',
'bergenstrahle2021super',
'Tower2021Spatial',
'he2020integrating',
'Juntaro2022MEK',
'Booeshaghi2021Isoform',
'Zhang2023Amolecularly_rawcount',
'Shi2022Spatial',
'parigi2022the',
'Fu2021Unsupervised',
'Kleshchevnikov2022Cell2location',
'xia2019spatial',
'maniatis2019spatiotemporal',
'Melo2021Integrating',
'Shah2016InSitu',
'Sun2021Integrating',
'chen2021dissecting',
'Ratz2022Clonal',
'hildebrandt2021spatial',
'Borm2022Scalable',
'moncada2020integrating',
'Lebrigand2022The',
'10x']
Load a specific experiment
Loading a specific experiment takes two arguments, dataset_name and experiment_name.
Both arguments are available at https://gene.ai.tencent.com/SpatialOmics/.
[5]:
adata = sodb.load_experiment('hunter2021spatially','sample_B')
adata
load experiment[sample_B] in dataset[hunter2021spatially]
[5]:
AnnData object with n_obs × n_vars = 2179 × 32268
obs: 'col_0', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial', 'spatial_pixel', 'spatial_real'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
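Since the returned AnnData carries 'spatial' coordinates in obsm and a precomputed 'leiden' clustering in obs (see the fields above), a quick sanity-check plot can be drawn with scanpy (a minimal sketch, assuming scanpy is installed in the same environment):
[ ]:
import scanpy as sc
# color spots at their spatial coordinates by the precomputed leiden clusters
sc.pl.embedding(adata, basis='spatial', color='leiden')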
Load a specific dataset
[6]:
adataset = sodb.load_dataset('hunter2021spatially')
adataset
load experiment[sample_A] in dataset[hunter2021spatially]
load experiment[sample_C] in dataset[hunter2021spatially]
load experiment[sample_B] in dataset[hunter2021spatially]
[6]:
{'sample_A': AnnData object with n_obs × n_vars = 2425 × 32268
obs: 'col_0', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial', 'spatial_pixel', 'spatial_real'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
'sample_C': AnnData object with n_obs × n_vars = 2677 × 32268
obs: 'col_0', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial', 'spatial_pixel', 'spatial_real'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
'sample_B': AnnData object with n_obs × n_vars = 2179 × 32268
obs: 'col_0', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial', 'spatial_pixel', 'spatial_real'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'}
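The result is a plain Python dict keyed by experiment name, so it can be iterated as usual (a minimal sketch):
[ ]:
for name, ad in adataset.items():
    print(name, ad.n_obs, ad.n_vars)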
Installation for Pysodb and Stagate
This tutorial demonstrates how to install pysodb alongside other methods.
Using STAGATE as an example, pysodb is installed into STAGATE's environment.
Reference tutorials can be found at https://github.com/QIFEIDKN/STAGATE_pyG and https://github.com/TencentAILabHealthcare/pysodb.
Installing software and tools
1. The first step is to install Anaconda, Visual Studio Code and CUDA in advance.
Reference tutorials on installing Anaconda and Visual Studio Code can be found at https://docs.anaconda.com/anaconda/install/index.html and https://docs.anaconda.com/anaconda/user-guide/tasks/integration/vscode/.
Since STAGATE_pyG is based on the pyG (PyTorch Geometric) framework, it requires a GPU and the CUDA toolkit. A tutorial is available at https://developer.nvidia.com/cuda-downloads.
2. Launch Visual Studio Code and open a terminal window.
Henceforth, various packages and modules will be installed via the command line.
Installing STAGATE_pyG
3. Select the installation path and open it
[ ]:
cd <path>
4. Clone the STAGATE_pyG code
[ ]:
git clone https://github.com/QIFEIDKN/STAGATE_pyG
5. Open the STAGATE_pyG directory
[ ]:
cd STAGATE_pyG
6. Create a conda environment
[ ]:
conda create -n <environment_name> python=3.8
7. Activate the conda environment
To use the conda environment in the terminal, run the following command to activate it:
[ ]:
conda activate <environment_name>
8. Install torch with CUDA
e.g. pip install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html
[ ]:
pip install torch==<torch_version>+<cuda_version> -f https://download.pytorch.org/whl/<cuda_version>/torch_stable.html
9. Install PyTorch Geometric
Install PyTorch Geometric according to your PyTorch version, CUDA version, and operating system. Select the package and copy the install command to run in your terminal: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
e.g. pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
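The general form, mirroring the torch step above (substitute your own torch and CUDA versions):
[ ]:
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f https://data.pyg.org/whl/torch-<torch_version>+<cuda_version>.html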
10. Install other python packages
[ ]:
pip install -r requirement.txt
11. Install any missing dependency or third-party package with pip, replacing <third-party package> with the package name
[ ]:
pip install <third-party package>
Installing the mclust library in the STAGATE_pyG conda environment
STAGATE uses mclust, an R library, to identify spatial domains.
The next steps show how to install the mclust package using R and rpy2 for Python in a conda environment.
Keep the environment active by running "conda activate <environment_name>".
12. Install R and rpy2 in the conda environment
[ ]:
conda install -c r rpy2
Run the following command in the terminal to set the R_HOME environment variable to the location of the R installation inside the conda environment.
[ ]:
export R_HOME=/home/<user_name>/anaconda3/envs/<environment_name>/lib/R
Run the following command in the terminal to set the R_LIBS_USER environment variable to the library path where R installs packages within this conda environment.
[ ]:
export R_LIBS_USER=/home/<user_name>/anaconda3/envs/<environment_name>/lib/R/library
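Note that a plain export only lasts for the current shell session. Optionally (an assumption, not part of the original protocol), conda can store these variables with the environment so they are set on every activation:
[ ]:
conda env config vars set R_HOME=/home/<user_name>/anaconda3/envs/<environment_name>/lib/R
conda env config vars set R_LIBS_USER=/home/<user_name>/anaconda3/envs/<environment_name>/lib/R/library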
13. Enter the Python interpreter to install the mclust library
Run "python" in the terminal to enter the Python interpreter, then install the mclust library from there.
[ ]:
python
Run the following commands in sequence:
[ ]:
import rpy2.robjects.packages as r
utils = r.importr("utils")
package_name = "mclust"
utils.install_packages(package_name)
Then, select a CRAN mirror when prompted and mclust will be installed.
Finally, exit the Python interpreter.
[ ]:
exit()
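Alternatively, to avoid the interactive mirror prompt altogether, the CRAN mirror can be chosen programmatically before installing (a sketch using the standard R utils function exposed through rpy2):
[ ]:
utils.chooseCRANmirror(ind=1)  # select the first CRAN mirror in the list
utils.install_packages("mclust")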
Installing pysodb
Keep the conda environment active.
14. Clone the pysodb code
[ ]:
git clone https://github.com/TencentAILabHealthcare/pysodb.git
15. Open the pysodb directory
[ ]:
cd pysodb
16. Install the pysodb package from source
[ ]:
python setup.py install
17. Install any missing dependency or third-party package with pip, replacing <third-party package> with the package name
[ ]:
pip install <third-party package>
[5]:
print('finish!')
finish!
Test the original data
Alignment_Paste_DLPFC
This tutorial demonstrates alignment on DLPFC data using SODB and PASTE.
A reference paper can be found at https://www.nature.com/articles/s41592-022-01459-6.
This tutorial refers to the following tutorial at https://github.com/raphael-group/paste_reproducibility/blob/main/notebooks/DLPFC_pairwise.ipynb, with the data-loading step modified to use SODB.
[1]:
# imports various libraries and packages for data analysis and visualization.
# math: provides mathematical functions such as logarithms, trigonometric functions, etc.
# pandas: used for data manipulation and analysis.
# numpy: used for numerical computing, including mathematical operations on arrays and matrices.
# scipy: used for scientific computing, including functions for optimization, linear algebra, statistics, and signal processing.
# seaborn: used for statistical data visualization, providing high-level interfaces for creating informative and attractive visualizations.
# matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python.
# matplotlib.patches: provides classes for creating graphical objects such as rectangles, circles, and polygons.
# style: a module within matplotlib that allows users to customize the style of plots.
# time: provides time-related functions, such as measuring execution time and converting between time formats.
# scanpy: a Python package for single-cell gene expression analysis, including preprocessing, clustering, and differential expression analysis.
# sklearn: a machine learning library with various tools for classification, regression, clustering, and dimensionality reduction.
# networkx: a Python package for creating, manipulating, and studying complex networks.
# ot: a Python package for optimal transport (OT) computations, including OT-based algorithms for data analysis and visualization.
import math
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import style
import matplotlib
import time
import scanpy as sc
import sklearn
import networkx as nx
import ot
/tmp/ipykernel_92630/1186083132.py:19: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
style.use('seaborn-white')
[ ]:
# import the paste package
import paste as pst
[ ]:
"""
#%load_ext autoreload
#%autoreload 2
#style.use('seaborn-dark')
style.use('seaborn-white')
#报错ModuleNotFoundError: No module named 'ot'
#需要pip install POT
"""
[3]:
# create a list containing 12 samples
sample_list = ["151507", "151508", "151509","151510", "151669", "151670","151671", "151672", "151673","151674", "151675", "151676"]
[ ]:
# Import the pysodb library
# pysodb is a Python package that provides an interface to SODB (Spatial Omics DataBase).
# SODB is a web-based platform hosting curated spatial omics datasets.
# This library allows users to query and download those datasets as AnnData objects.
import pysodb
[ ]:
# Initialization
sodb = pysodb.SODB()
[4]:
# Get the list of datasets with a specific category.
# categories ["Spatial Transcriptomics", "Spatial Proteomics", "Spatial Metabolomics", "Spatial Genomics", "Spatial MultiOmics"]
sodb.list_dataset_by_category('Spatial Transcriptomics')
[4]:
['moffitt2018molecular',
'chen2022spatiotemporal',
'Vickovic2019high',
'eng2019transcriptome',
'carlberg2019exploring',
'chen2022spatiotemporal_compre_20',
'Navarro2020Spatial',
'gracia2021genome',
'zhang2021spatially',
'xia2019spatial',
'mantri2021spatiotemporal',
'Merfish_Visp',
'he2020integrating',
'Shah2016InSitu',
'maynard2021trans',
'DARTFISH',
'fawkner2021spatiotemporal',
'ortiz2020molecular',
'Ratz2022Clonal',
'10x',
'Juntaro2022MEK',
'wei2022single',
'Lebrigand2022The',
'Tower2021Spatial',
'liu2022spatiotemporal',
'scispace',
'bergenstrahle2021super',
'Marshall2022High_human',
'Gouin2021An',
'wang2022high',
'backdahl2021spatial',
'Wang2018Three_1k',
'Dixon2022Spatially',
'hildebrandt2021spatial',
'parigi2022the',
'Alon2021Expansion',
'Pascual2021Dietary',
'Sun2021Integrating',
'Buzzi2022Spatial',
'asp2017spatial',
'Sanchez2021A',
'Fang2022Conservation',
'rodriques2019slide',
'seqFISH_VISp',
'moncada2020integrating',
'Biermann2022Dissecting',
'Kleshchevnikov2022Cell2location',
'Garcia2021Mapping',
'chen2021decoding',
'kvastad2021the',
'stickels2020highly',
'guilliams2022spatial',
'Joglekar2021A',
'wang2021easi',
'xia2022the',
'Misra2021Characterizing',
'Marshall2022High_mouse',
'maniatis2019spatiotemporal',
'Barkley2022Cancer',
'thrane2018spatially',
'Sun2022Excitatory',
'lohoff2021integration',
'Dhainaut2022Spatial',
'Borm2022Scalable',
'Wang2018three',
'Vickovic2019high_update',
'chen2021dissecting',
'Konieczny2022Interleukin',
'asp2019a',
'Kadur2022Human',
'ji2020multimodal',
'stahl2016visualization',
'berglund2018spatial',
'Visium_Allen',
'hunter2021spatially',
'chen2020spatial',
'Wu2022spatial',
'Melo2021Integrating',
'codeluppi2018spatial']
[5]:
# Load all experiments of the 'maynard2021trans' dataset
adata_list = sodb.load_dataset('maynard2021trans')
load experiment[151508] in dataset[maynard2021trans]
load experiment[151671] in dataset[maynard2021trans]
load experiment[151507] in dataset[maynard2021trans]
load experiment[151674] in dataset[maynard2021trans]
load experiment[151670] in dataset[maynard2021trans]
load experiment[151669] in dataset[maynard2021trans]
load experiment[151676] in dataset[maynard2021trans]
load experiment[151675] in dataset[maynard2021trans]
load experiment[151509] in dataset[maynard2021trans]
load experiment[151673] in dataset[maynard2021trans]
load experiment[151672] in dataset[maynard2021trans]
load experiment[151510] in dataset[maynard2021trans]
[6]:
# creates a new dictionary called "adatas" by removing observations with missing values in the "Region" column from each dataset in the original dictionary "adata_list".
adatas = {}
for key in adata_list.keys():
    a = adata_list[key]
    a = a[np.logical_not(a.obs['Region'].isna())]
    adatas[key] = a
[7]:
# defines groups of samples based on IDs
sample_groups = [["151507", "151508", "151509","151510"],[ "151669", "151670","151671", "151672"],[ "151673","151674", "151675", "151676"]]
# creates a list of lists called layer_groups
# where each sub-list contains the AnnData objects for each sample in the corresponding group from sample_groups.
layer_groups = [[adatas[sample_groups[j][i]] for i in range(len(sample_groups[j]))] for j in range(len(sample_groups))]
# create a dictionary that maps the layer number to a color from the default Seaborn color palette.
# The layer number is represented as a string in the format "Layer{layer_number}"
layer_to_color_map = {'Layer{0}'.format(i+1):sns.color_palette()[i] for i in range(6)}
# adds an additional key "WM" to the layer_to_color_map dictionary
layer_to_color_map['WM'] = sns.color_palette()[6]
[65]:
# Visualize the different slices of the DLPFC brain region mapped by layer_groups
slice_map = {0:'A',1:'B',2:'C',3:'D'}
fig, axs = plt.subplots(3, 4,figsize=(15,11.5))
for j in range(len(layer_groups)):
    axs[j,0].text(-0.1, 0.5, 'Sample '+slice_map[j],fontsize=12,rotation='vertical',transform = axs[j,0].transAxes,verticalalignment='center')
    for i in range(len(layer_groups[j])):
        adata = adatas[sample_list[j*4+i]]
        colors = list(adata.obs['Region'].astype('str').map(layer_to_color_map))
        colors = [(r, g, b) for r, g, b in colors]
        axs[j,i].scatter(layer_groups[j][i].obsm['spatial'][:,0],layer_groups[j][i].obsm['spatial'][:,1],linewidth=0,s=20, marker=".", color=colors)
        axs[j,i].set_title('Slice '+ slice_map[i],size=12)
        axs[j,i].invert_yaxis()
        axs[j,i].axis('off')
        if i<3:
            s = '300$\mu$m' if i==1 else "10$\mu$m"
            delta = 0.05 if i==1 else 0
            axs[j,i].annotate('',xy=(1-delta, 0.5), xytext=(1.2+delta, 0.5),xycoords=axs[j,i].transAxes,textcoords=axs[j,i].transAxes,arrowprops=dict(arrowstyle='<->',lw=1))
            axs[j,0].text(1.1, 0.55, s,fontsize=9,transform = axs[j,i].transAxes,horizontalalignment='center')
    axs[j,3].legend(handles=[mpatches.Patch(color=layer_to_color_map[adata.obs['Region'].cat.categories[i]], label=adata.obs['Region'].cat.categories[i]) for i in range(len(adata.obs['Region'].cat.categories))],fontsize=10,title='Cortex layer',title_fontsize=12,bbox_to_anchor=(1, 1))
plt.savefig('figures/DLPFC_before.pdf',bbox_inches='tight',transparent=True)

[9]:
# calculates the maximum achievable accuracy when matching two sets of labels
def max_accuracy(labels1,labels2):
    w = min(1/len(labels1),1/len(labels2))
    cats = set(pd.unique(labels1)).union(set(pd.unique(labels2)))
    return sum([w * min(sum(labels1==c),sum(labels2==c)) for c in cats])

# calculates the mapping accuracy between two sets of labels given an optimal transport (OT) plan pi
def mapping_accuracy(labels1,labels2,pi):
    mapping_dict = {'Layer1':1, 'Layer2':2, 'Layer3':3, 'Layer4':4, 'Layer5':5, 'Layer6':6, 'WM':7}
    return np.sum(pi*(scipy.spatial.distance_matrix(np.matrix(labels1.map(mapping_dict)).T,np.matrix(labels2.map(mapping_dict)).T)==0))

import itertools

# calculates the optimal transport plan pi between two sets of labels using the OT algorithm
def max_accuracy_mapping(labels1,labels2):
    n1,n2=len(labels1),len(labels2)
    mapping_dict = {'Layer1':1, 'Layer2':2, 'Layer3':3, 'Layer4':4, 'Layer5':5, 'Layer6':6, 'WM':7}
    dist = np.array(scipy.spatial.distance_matrix(np.matrix(labels1.map(mapping_dict)).T,np.matrix(labels2.map(mapping_dict)).T)!=0,dtype=float)
    pi = ot.emd(np.ones(n1)/n1, np.ones(n2)/n2, dist)
    return pi
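A toy example (hypothetical labels, not from the original notebook) illustrating what these helpers compute:
[ ]:
# two hypothetical label vectors over four spots each
labels_a = pd.Series(['Layer1', 'Layer1', 'Layer2', 'WM'])
labels_b = pd.Series(['Layer1', 'Layer2', 'Layer2', 'WM'])
print(max_accuracy(labels_a, labels_b))            # 0.75: at most 3 of the 4 spots can be matched by label
pi_toy = max_accuracy_mapping(labels_a, labels_b)  # OT plan attaining that maximum
print(mapping_accuracy(labels_a, labels_b, pi_toy))  # 0.75 as well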
[10]:
alpha = 0.1
res_df = pd.DataFrame(columns=['Sample','Pair','Kind','Time','Accuracy'])
pis = [[None for i in range(len(layer_groups[j])-1)] for j in range(len(layer_groups))]
for j in range(len(layer_groups)):
    for i in range(len(layer_groups[j])-1):
        pi0 = pst.match_spots_using_spatial_heuristic(layer_groups[j][i].obsm['spatial'],layer_groups[j][i+1].obsm['spatial'],use_ot=True)
        start = time.time()
        pis[j][i] = pst.pairwise_align(layer_groups[j][i], layer_groups[j][i+1],alpha=alpha,G_init=pi0,norm=True,verbose=False)
        tt = time.time()-start
        acc = mapping_accuracy(layer_groups[j][i].obs['Region'],layer_groups[j][i+1].obs['Region'],pis[j][i])
        print(j,i,'Accuracy',acc,'time',tt)
        # np.savetxt('../data/DLPFC/saved_results/init_{0}_{1}_{2}.gz'.format(j,i,'ot'), pis[j][i], delimiter=',')
        res_df.loc[len(res_df)] = [j,i,'PASTE',tt,acc]
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
0 0 Accuracy 0.815543482357772 time 333.4515166282654
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/ot/lp/__init__.py:343: UserWarning: numItermax reached before optimality. Try to increase numItermax.
result_code_string = check_result(result_code)
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
0 1 Accuracy 0.22047228891676612 time 52.269108057022095
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/ot/lp/__init__.py:343: UserWarning: numItermax reached before optimality. Try to increase numItermax.
result_code_string = check_result(result_code)
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
0 2 Accuracy 0.8713495745166362 time 32.520468950271606
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
1 0 Accuracy 0.9151285966713762 time 598.9820036888123
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/ot/lp/__init__.py:343: UserWarning: numItermax reached before optimality. Try to increase numItermax.
result_code_string = check_result(result_code)
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
1 1 Accuracy 0.5920829519638594 time 30.46793842315674
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
1 2 Accuracy 0.861938266075075 time 152.26987481117249
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
2 0 Accuracy 0.8589111598101083 time 178.80665159225464
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
2 1 Accuracy 0.827013263737229 time 208.154137134552
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
2 2 Accuracy 0.8283188989963737 time 259.1206352710724
[11]:
res_df
[11]:
| | Sample | Pair | Kind | Time | Accuracy |
|---|---|---|---|---|---|
| 0 | 0 | 0 | PASTE | 333.451517 | 0.815543 |
| 1 | 0 | 1 | PASTE | 52.269108 | 0.220472 |
| 2 | 0 | 2 | PASTE | 32.520469 | 0.871350 |
| 3 | 1 | 0 | PASTE | 598.982004 | 0.915129 |
| 4 | 1 | 1 | PASTE | 30.467938 | 0.592083 |
| 5 | 1 | 2 | PASTE | 152.269875 | 0.861938 |
| 6 | 2 | 0 | PASTE | 178.806652 | 0.858911 |
| 7 | 2 | 1 | PASTE | 208.154137 | 0.827013 |
| 8 | 2 | 2 | PASTE | 259.120635 | 0.828319 |
[12]:
paste_layer_groups = [pst.stack_slices_pairwise(layer_groups[j], pis[j]) for j in range(len(layer_groups)) ]
[67]:
def plot_slices_overlap(groups, adatas, sample_list, layer_to_color_map, save=None):
    marker_list = ['.','*','x','+']
    for j in range(len(groups)):
        plt.figure(figsize=(10,10))
        for i in range(len(groups[j])):
            adata = adatas[sample_list[j*4+i]]
            colors = list(adata.obs['Region'].astype('str').map(layer_to_color_map))
            plt.scatter(groups[j][i].obsm['spatial'][:,0],groups[j][i].obsm['spatial'][:,1],linewidth=1,s=80, marker=marker_list[i],color=colors,alpha=0.7)
        plt.legend(handles=[mpatches.Patch(color=layer_to_color_map[adata.obs['Region'].cat.categories[i]], label=adata.obs['Region'].cat.categories[i]) for i in range(len(adata.obs['Region'].cat.categories))],fontsize=10,title='Cortex layer',title_fontsize=15,bbox_to_anchor=(1, 1))
        plt.gca().invert_yaxis()
        plt.axis('off')
        if save is None:
            plt.show()
        else:
            plt.savefig(f'{save}_{j}.pdf',bbox_inches='tight',transparent=True)
[68]:
# Plot Stacking of Four slices without alignment
plot_slices_overlap(layer_groups, adatas, sample_list, layer_to_color_map,save='figures/DLPFC_before')



[69]:
# Plot Stacking of Four slices with PASTE alignment
plot_slices_overlap(paste_layer_groups, adatas, sample_list, layer_to_color_map,save='figures/DLPFC_after')



[64]:
dataset_name = 'dlpfc'
for i in range(len(layer_groups)):
    rsta_list = []
    for j in range(len(layer_groups[i])):
        a = layer_groups[i][j]
        spatial = a.obsm['spatial']
        # stack the slices along z, 100 units apart
        spatial_z = np.ones(shape=(spatial.shape[0],1))*j*100
        spatial_3d = np.hstack([spatial,spatial_z])
        a.obsm['spatial_3d'] = spatial_3d
        rsta_list.append(a)
    a_concat = rsta_list[0].concatenate(rsta_list[1:])
    a_concat.write_h5ad(f'{dataset_name}_sample{i}_3d.h5ad')
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
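To double-check one of the exported files (a minimal sketch; the file name follows the pattern used above):
[ ]:
b = sc.read_h5ad('dlpfc_sample0_3d.h5ad')
print(b.obsm['spatial_3d'].shape)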
[63]:
dataset_name = 'dlpfc'
for i in range(len(paste_layer_groups)):
    rsta_list = []
    for j in range(len(paste_layer_groups[i])):
        a = paste_layer_groups[i][j]
        spatial = a.obsm['spatial']
        spatial_z = np.ones(shape=(spatial.shape[0],1))*j*100
        spatial_3d = np.hstack([spatial,spatial_z])
        a.obsm['spatial_3d'] = spatial_3d
        rsta_list.append(a)
    a_concat = rsta_list[0].concatenate(rsta_list[1:])
    # note: this overwrites the files written above with the PASTE-aligned coordinates
    a_concat.write_h5ad(f'{dataset_name}_sample{i}_3d.h5ad')
Deconvolution_Tangram_Visium_Brain
This tutorial demonstrates deconvolution on 10x Visium brain data using SODB and TANGRAM.
A reference paper can be found at https://www.nature.com/articles/s41592-021-01264-7.
This tutorial refers to the following tutorial at https://squidpy.readthedocs.io/en/stable/external_tutorials/tutorial_tangram.html, with the data-loading step modified to use SODB.
[ ]:
# import several Python libraries, including:
# scanpy: a Python package for single-cell RNA sequencing analysis.
# squidpy: a Python package for spatial transcriptomics analysis.
# numpy: a Python package for scientific computing with arrays.
# pandas: a Python package for data manipulation and analysis.
# anndata: a Python package for handling annotated data objects in genomics.
# pathlib: a Python module for working with file system paths.
# matplotlib: a Python plotting library.
# skimage: a Python package for image processing.
import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
import anndata as ad
from anndata import AnnData
import pathlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import skimage
[ ]:
# import tangram for spatial deconvolution
import tangram as tg
[85]:
# print a header message, and the version of the squidpy and tangram packages
sc.logging.print_header()
print(f"squidpy=={sq.__version__}")
print(f"tangram=={tg.__version__}")
scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.22.4 scipy==1.9.3 pandas==1.5.1 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.2 pynndescent==0.5.8
squidpy==1.2.3
tangram==1.0.3
[86]:
## load the reference single cell dataset
# the input sc data has been normalized and log-transformed
adata_sc = sc.read_h5ad('data/Visium/sc_mouse_cortex.h5ad')
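If the local H5AD files are not available, equivalent objects can likely be fetched with squidpy's bundled dataset loaders, as in the upstream tangram tutorial (an assumption about data equivalence, not part of this protocol):
[ ]:
# Assumption: these loaders provide the same reference and spatial data
# used by the upstream squidpy/tangram tutorial.
# adata_sc = sq.datasets.sc_mouse_cortex()
# adata_st = sq.datasets.visium_fluo_adata_crop()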
[63]:
# prints out the metadata of adata_sc
adata_sc
[63]:
AnnData object with n_obs × n_vars = 21697 × 36826
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
[64]:
# visualize a UMAP projection colored by cell_subclass
sc.pl.umap(
    adata_sc, color="cell_subclass", size=10, frameon=False, show=False
)
[64]:
<AxesSubplot: title={'center': 'cell_subclass'}, xlabel='UMAP1', ylabel='UMAP2'>

[87]:
## load the low-resolution spatial data
# the input st data has been normalized and log-transformed
adata_st = sc.read_h5ad('data/Visium/visium_fluo_crop.h5ad')
[90]:
# create a spatial scatter plot colored by cluster label
sq.pl.spatial_scatter(adata_st,color='cluster')

[66]:
# visualize the embedding based on 'spatial' with points colored by the 'cluster' label
sc.pl.embedding(adata_st,basis='spatial',color='cluster')

[67]:
# select the subset of spots whose 'cluster' label is "Cortex_1" through "Cortex_4"
# (i ranges from 1 to 4)
# and create a copy of the resulting subset
adata_st = adata_st[
    adata_st.obs.cluster.isin([f"Cortex_{i}" for i in np.arange(1, 5)])
].copy()
[68]:
# visualize the embedding based on 'spatial' with points colored by the filtered 'cluster' label
sc.pl.embedding(adata_st,basis='spatial',color='cluster')

[69]:
# perform differential gene expression analysis across 'cell_subclasses' in 'adata_sc'
sc.tl.rank_genes_groups(adata_sc, groupby="cell_subclass", use_raw=False)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
[70]:
# creates a Pandas DataFrame called "markers_df" by extracting the top 100 differentially expressed genes from 'adata_sc'
markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).iloc[0:100, :]
# creates a NumPy array called "genes_sc" by extracting the unique values from the "value" column of a melted version of "markers_df"
genes_sc = np.unique(markers_df.melt().value.values)
# extract the gene names from "adata_st"
genes_st = adata_st.var_names.values
# creates a Python list called "genes"
# containing the intersection of genes identified as differentially expressed in "genes_sc" and genes detected in "genes_st"
genes = list(set(genes_sc).intersection(set(genes_st)))
# the length of "genes"
len(genes)
[70]:
1281
[71]:
# use Tangram's pp_adatas to prepare "adata_sc" and "adata_st" for mapping, based on the shared set of genes identified by the intersection of "genes_sc" and "genes_st".
tg.pp_adatas(adata_sc, adata_st, genes=genes)
INFO:root:1280 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:14785 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
[72]:
# use the map_cells_to_space function from tangram to map cells from "adata_sc" onto "adata_st".
# The mapping uses "cells" mode, which assigns each cell from adata_sc to a location within the spatial transcriptomics space based on its gene expression profile.
ad_map = tg.map_cells_to_space(
    adata_sc,
    adata_st,
    mode="cells",
    # target_count=adata_st.obs.cell_count.sum(),
    # density_prior=np.array(adata_st.obs.cell_count) / adata_st.obs.cell_count.sum(),
    num_epochs=1000,
    device="cpu",
)
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 1280 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.
Score: 0.613, KL reg: 0.001
Score: 0.733, KL reg: 0.000
Score: 0.736, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.738, KL reg: 0.000
Score: 0.738, KL reg: 0.000
INFO:root:Saving results..
[73]:
ad_map
[73]:
AnnData object with n_obs × n_vars = 21697 × 324
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
var: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
uns: 'train_genes_df', 'training_history'
[74]:
# project "Cell_subclass" annotations from a single-cell RNA sequencing (scRNA-seq) dataset onto a spatial transcriptomics dataset,
# based on a previously computed cell-to-space mapping
tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.
[75]:
# print adata_st.obsm['tangram_ct_pred']
adata_st.obsm['tangram_ct_pred']
[75]:
| | Pvalb | L4 | Vip | L2/3 IT | Lamp5 | NP | Sst | L5 IT | Oligo | L6 CT | ... | L5 PT | Astro | L6b | Endo | Peri | Meis2 | Macrophage | CR | VLMC | SMC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAATGGCATGTCTTGT-1 | 6.703794 | 1.249403 | 5.271374 | 0.207670 | 5.762299 | 3.027590 | 4.872548 | 3.667596 | 0.327661 | 7.175315 | ... | 8.914245 | 2.584236 | 0.511923 | 0.394293 | 0.000059 | 0.000051 | 1.245422 | 0.034912 | 0.172818 | 0.655464 |
| AACAACTGGTAGTTGC-1 | 4.580706 | 0.000714 | 14.317341 | 0.001151 | 5.769284 | 3.693370 | 4.657304 | 9.034256 | 0.426140 | 1.089851 | ... | 7.453603 | 1.877510 | 1.313003 | 0.582352 | 0.104631 | 0.000457 | 0.579031 | 0.064955 | 0.000128 | 0.231964 |
| AACAGGAAATCGAATA-1 | 5.260801 | 0.113564 | 7.159147 | 0.001269 | 5.232882 | 0.334691 | 8.427688 | 5.535397 | 0.432518 | 14.767076 | ... | 1.919125 | 1.978246 | 0.583333 | 0.574374 | 0.224131 | 1.042355 | 0.643977 | 0.044565 | 0.000154 | 0.262665 |
| AACCCAGAGACGGAGA-1 | 8.074331 | 4.424110 | 5.061642 | 4.457622 | 8.327649 | 0.000341 | 7.674059 | 11.550767 | 0.628734 | 2.142216 | ... | 0.001076 | 2.988949 | 0.000374 | 0.600461 | 0.000043 | 0.234064 | 0.902490 | 0.000056 | 0.582658 | 0.552895 |
| AACCGTTGTGTTTGCT-1 | 9.431644 | 5.692990 | 4.099012 | 0.780499 | 5.406463 | 1.000299 | 7.600714 | 13.673264 | 1.531166 | 0.000495 | ... | 2.683809 | 1.088530 | 0.686432 | 1.533100 | 0.073415 | 0.001884 | 0.212199 | 0.051355 | 0.000080 | 0.611694 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTGGATTGGGTACCAC-1 | 6.984557 | 1.388598 | 13.702862 | 1.659602 | 2.685283 | 1.234148 | 12.399532 | 9.730983 | 0.475787 | 0.001387 | ... | 6.636489 | 2.490689 | 0.000400 | 0.498017 | 0.015851 | 0.000065 | 0.214483 | 0.000162 | 0.259347 | 0.759218 |
| TTGGCTCGCATGAGAC-1 | 3.597446 | 3.890185 | 5.089306 | 11.309047 | 8.250157 | 0.000501 | 11.919669 | 8.804816 | 0.191763 | 0.546082 | ... | 0.000580 | 0.883828 | 0.074400 | 0.383416 | 0.026436 | 0.030055 | 0.284356 | 0.000174 | 0.290548 | 0.372900 |
| TTGTATCACACAGAAT-1 | 3.834474 | 0.001383 | 5.894804 | 0.001749 | 6.786698 | 4.672627 | 7.654390 | 9.202905 | 0.704625 | 6.031383 | ... | 3.797775 | 0.782928 | 0.773477 | 0.392378 | 0.073810 | 0.000265 | 0.178472 | 0.037704 | 0.233385 | 0.208704 |
| TTGTGGCCCTGACAGT-1 | 7.564358 | 3.797840 | 9.279658 | 0.000907 | 1.633314 | 0.873424 | 5.427505 | 4.280141 | 0.832505 | 7.299751 | ... | 3.114690 | 2.046871 | 0.544789 | 0.689781 | 0.061167 | 0.000040 | 0.691132 | 0.044765 | 0.366478 | 0.202596 |
| TTGTTAGCAAATTCGA-1 | 3.780981 | 20.787467 | 5.825070 | 1.782815 | 1.879261 | 0.825495 | 5.716938 | 13.947906 | 0.657532 | 0.000990 | ... | 0.514469 | 1.320148 | 0.000393 | 0.534422 | 0.038299 | 0.000858 | 0.144029 | 0.050987 | 0.238178 | 0.165719 |

324 rows × 23 columns
[77]:
# append the predicted cell-type scores computed by tangram during the cell-to-space mapping step to 'adata_st.obs'
adata_st.obs = pd.concat([adata_st.obs, adata_st.obsm["tangram_ct_pred"]], axis=1)
# create a spatial scatter plot showing the distribution of different cell types
sq.pl.spatial_scatter(
    adata_st,
    color=["L2/3 IT", "L4", "L5 IT", "L5 PT", "L6 CT", "L6 IT", "L6b"],
)

SpatialClustering_Stagate_DLPFC
This tutorial demonstrates how to identify spatial domains on 10x Visium data using SODB and STAGATE, based on the pyG (PyTorch Geometric) framework.
A reference paper can be found at https://www.nature.com/articles/s41467-022-29439-6.
This tutorial refers to the following tutorial at https://stagate.readthedocs.io/en/latest/T1_DLPFC.html, with the data-loading step modified to use SODB.
Preparation
[1]:
# Use the Python warnings module to filter and ignore any warnings that may occur in the program after this point.
import warnings
warnings.filterwarnings("ignore")
[2]:
# import several Python libraries commonly used in data analysis and visualization.
# pandas (imported as pd) is a library for data manipulation and analysis.
# numpy (imported as np) is a library for numerical computing with arrays.
# scanpy (imported as sc) is a library for single-cell RNA sequencing analysis.
# matplotlib.pyplot (imported as plt) is a library for data visualization.
# os is a library for interacting with the operating system, such as reading or writing files.
# sys is a library for interacting with the Python interpreter.
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
# Import the adjusted_rand_score function from the sklearn.metrics.cluster module of the scikit-learn library.
from sklearn.metrics.cluster import adjusted_rand_score
[4]:
# Import a STAGATE_pyG module
import STAGATE_pyG
[5]:
# The location of R (used for the mclust clustering)
# os.environ['R_HOME'] = '/home/<usr_name>/anaconda3/envs/<environment_name>/lib/R/'
# os.environ['R_USER'] = '/home/<usr_name>/anaconda3/envs/<environment_name>/lib/R/rpy2'
os.environ['R_HOME'] = '/home/linsenlin/anaconda3/envs/stagate38/lib/R/'
os.environ['R_USER'] = '/home/linsenlin/anaconda3/envs/stagate38/lib/R/rpy2'
[7]:
# Import the pysodb library
# pysodb is a Python package that provides an interface to SODB (Spatial Omics DataBase).
# SODB is a web-based platform hosting curated spatial omics datasets.
# This library allows users to query and download those datasets as AnnData objects.
import pysodb
[ ]:
# Initialization
sodb = pysodb.SODB()
[8]:
# Define the dataset_name and experiment_name
dataset_name = 'maynard2021trans'
experiment_name = '151676'
# Load a specific experiment
# It takes two arguments: the name of the dataset and the name of the experiment to load.
# Two arguments are available at https://gene.ai.tencent.com/SpatialOmics/.
adata = sodb.load_experiment(dataset_name,experiment_name)
download experiment[151676] in dataset[maynard2021trans]
100%|██████████| 121M/121M [01:03<00:00, 2.00MB/s]
load experiment[151676] in dataset[maynard2021trans]
[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 3460 × 33538
obs: 'in_tissue', 'array_row', 'array_col', 'Region', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
[10]:
# Make the index unique by appending a number string to each duplicate index element
adata.var_names_make_unique()
[11]:
# Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
WARNING: adata.X seems to be already log-transformed.
[12]:
# read the annotation
Ann_df = pd.read_csv('/home/linsenlin/protocols/sc/stagate/STAGATE_pyG/data/151676_truth.txt', sep='\t', header=None, index_col=0)
Ann_df.columns = ['Ground Truth']
[15]:
# Add a 'Ground Truth' column to adata.obs, taken from the 'Ground Truth' column of 'Ann_df'.
adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
[17]:
# generate a spatial plot using the image specified in the 'hires' key and coloring the plot based on the 'Ground Truth' column.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires", color=['Ground Truth'])

Constructing the spatial network
[18]:
# Use "STAGATE_pyG.Cal_Spatial_Net" to calculate a spatial graph with a radius cutoff of 150.
STAGATE_pyG.Cal_Spatial_Net(adata, rad_cutoff=150)
# Use "STAGATE_pyG.Stats_Spatial_Net" to summarize cells and edges information.
STAGATE_pyG.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 20052 edges, 3460 cells.
5.7954 neighbors per cell on average.

Running STAGATE
[19]:
# Train a STAGATE model
adata = STAGATE_pyG.train_STAGATE(adata)
Size of Input: (3460, 3000)
100%|██████████| 1000/1000 [00:06<00:00, 144.71it/s]
Spatial Clustering
[ ]:
# Compute the nearest neighbors in the 'STAGATE' representation and compute the UMAP embedding.
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[20]:
# Use Mclust_R to cluster cells in the 'STAGATE' representation into 7 clusters.
adata = STAGATE_pyG.mclust_R(adata, used_obsm='STAGATE', num_cluster=7)
R[write to console]: __ __
____ ___ _____/ /_ _______/ /_
/ __ `__ \/ ___/ / / / / ___/ __/
/ / / / / / /__/ / /_/ (__ ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/ version 6.0.0
Type 'citation("mclust")' for citing this R package in publications.
fitting ...
|======================================================================| 100%
[21]:
# Compute the adjusted rand index (ARI) between the 'mclust' and the 'Ground Truth'.
obs_df = adata.obs.dropna()
ARI = adjusted_rand_score(obs_df['mclust'], obs_df['Ground Truth'])
print('Adjusted rand index = %.2f' %ARI)
Adjusted rand index = 0.60
[22]:
# generate a plot of the UMAP embedding colored by both the mclust clusters and the ground truth.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])

[23]:
# Visualize the result using mclust and Ground Truth.
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])

Spatial trajectory inference (PAGA)
[24]:
"""
used_adata = adata[adata.obs['Ground Truth']!='nan']
used_adata
"""
[24]:
"\nused_adata = adata[adata.obs['Ground Truth']!='nan']\nused_adata\n"
[25]:
# Exclude any cells with missing values in the 'Ground Truth' column of the observation metadata.
used_adata = adata[pd.notna(adata.obs['Ground Truth'])]
used_adata
[25]:
View of AnnData object with n_obs × n_vars = 3431 × 33538
obs: 'in_tissue', 'array_row', 'array_col', 'Region', 'leiden', 'Ground Truth', 'mclust'
var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_rank', 'variances', 'variances_norm'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial', 'spatial_neighbors', 'umap', 'Ground Truth_colors', 'Spatial_Net', 'mclust_colors'
obsm: 'X_pca', 'X_umap', 'spatial', 'STAGATE'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
[26]:
# Use PAGA to infer differentiation trajectories.
sc.tl.paga(used_adata, groups="Ground Truth")
[27]:
# Compare partition-based graph abstraction (PAGA) results.
plt.rcParams["figure.figsize"] = (4,3)
sc.pl.paga_compare(used_adata, legend_fontsize=10, frameon=False, size=20,
                   title=experiment_name+'_STAGATE', legend_fontoutline=2, show=False)
[27]:
[<Axes: xlabel='UMAP1', ylabel='UMAP2'>, <Axes: >]

[ ]:
# Reference: https://github.com/almaan/sepal/blob/master/examples/melanoma.ipynb
# paper: https://academic.oup.com/bioinformatics/article/37/17/2644/6168120
SpatiallyVariableGene_Sepal_MOB
This tutorial demonstrates how to find spatially variable genes on MOB data using SODB and Sepal.
A reference paper can be found at https://academic.oup.com/bioinformatics/article/37/17/2644/6168120.
This tutorial refers to the following tutorial at https://github.com/almaan/sepal/blob/master/examples/melanoma.ipynb, with the data-loading step modified to use SODB.
[1]:
# Imports
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sepal
import sepal.datasets as d
import sepal.models as m
import sepal.utils as ut
import sepal.family as family
import sepal.enrich as fea
# If any of these imports fail because a package is missing, pip install it.
# Possible error: ImportError: cannot import name 'GProfiler' from 'gprofiler' (/home/yzy/anaconda3/envs/pysodb/lib/python3.8/site-packages/gprofiler/__init__.py)
# In ~/anaconda3/envs/pysodb/lib/python3.8/site-packages/sepal-1.0.0-py3.8.egg/sepal/enrich.py, change
# from gprofiler import GProfiler
# to
# from gprofiler import gprofiler as GProfiler
# See: https://github.com/vals/python-gprofiler/issues/9
[19]:
# Import the pysodb library
import pysodb
[20]:
# Initialization
sodb = pysodb.SODB()
[23]:
# Define the dataset_name and experiment_name
dataset_name = 'stahl2016visualization'
experiment_name = 'Rep4_MOB_trans'
# Load a specific experiment
# It takes two arguments: the name of the dataset and the name of the experiment to load.
# Two arguments are available at https://gene.ai.tencent.com/SpatialOmics/.
adata = sodb.load_experiment(dataset_name,experiment_name)
load experiment[Rep4_MOB_trans] in dataset[stahl2016visualization]
[24]:
# save the AnnData object to an H5AD file
adata.write_h5ad('MOB_pysodb.h5ad')
[ ]:
# load the raw data using the RawData class
raw_data = d.RawData('MOB_pysodb.h5ad')
[ ]:
# filter out genes observed in fewer than 5 spots and/or with fewer than 10 total counts
raw_data.cnt = ut.filter_genes(raw_data.cnt,
                               min_expr=10,
                               min_occur=5)
[ ]:
# wrap the raw data in the ST1K class, a CountData subclass for ST 1k-array-based data
data = m.ST1K(raw_data,
              eps = 0.1)
[37]:
data.cnt.shape
[37]:
(264, 10869)
[38]:
# propagate normalizes the count data and then propagates it in time, measuring each gene's diffusion time.
# set scale = True to perform min-max scaling of the diffusion times
times = m.propagate(data,
                    normalize = True,
                    scale = True)
[INFO] : Using 80 workers
[INFO] : Saturated Spots : 199
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10869/10869 [00:20<00:00, 525.17it/s]
[39]:
# inspect the top 20 highest ranked ones after computing diffusion times for all the profiles
n_top = 20
# get names of profiles
sorted_indices = np.argsort(times.values.flatten())
# argsort sorts from low to high, so reverse the order
sorted_indices = sorted_indices[::-1]
sorted_profiles = times.index.values[sorted_indices]
# extract top profiles
top_profiles = sorted_profiles[0:n_top]
# display top profiles and their times
times.loc[top_profiles,:]
[39]:
| | average |
|---|---|
| Rbfox1 | 1.000000 |
| Gpsm1 | 0.832367 |
| Prkca | 0.806763 |
| Penk | 0.796135 |
| Tyro3 | 0.786957 |
| Rbfox3 | 0.763285 |
| Pcp4 | 0.733333 |
| Cacng3 | 0.733333 |
| Omp | 0.732850 |
| Kcnh3 | 0.723188 |
| Grin1 | 0.694686 |
| Nrgn | 0.671981 |
| Agap2 | 0.658937 |
| S100a5 | 0.654589 |
| Tshz1 | 0.647343 |
| Cpne4 | 0.644444 |
| Map2k1 | 0.643961 |
| Camk4 | 0.642512 |
| Gria3 | 0.623188 |
| Sez6 | 0.602899 |
[40]:
# inspect these visually using the "plot_profiles" function
pltargs = dict(s = 60,
               cmap = "magma",
               edgecolor = 'none',
               marker = 'o',
               )
# plot the profiles
fig,ax = ut.plot_profiles(cnt = data.cnt.loc[:,top_profiles],
                          crd = data.real_crd,
                          rank_values = times.loc[top_profiles,:].values.flatten(),
                          pltargs = pltargs,
                          )
plt.show()

[41]:
# As a subsequent step, sort these profiles into pattern families
# Use the top 300 profiles to build the families
n_build = 300
# Use the "get_families" function to group the profiles
family_labels,repr_patterns = family.get_families(data.cnt.loc[:,sorted_profiles].values,
                                                  n_base = n_build,
                                                  n_sort = n_build,
                                                  threshold = 0.80,
                                                  )
# family_labels is an array of family indices, one for each profile.
# Convert it to a data frame to make it easier to work with
families = pd.DataFrame(family_labels,
                        index = sorted_profiles[0:n_build],
                        columns = ['family'],
                        )
[INFO] : Using 4 eigenpatterns
[INFO] : Identified 4 families
/home/yzy/anaconda3/envs/pysodb/lib/python3.8/site-packages/sklearn/cluster/_agglomerative.py:983: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead
warnings.warn(
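Before plotting, it can help to know how many profiles fall into each family. An illustrative way is to tabulate the labels with pandas:
[ ]:
# count the number of member profiles per family
print(families['family'].value_counts())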
[42]:
# As displayed, 4 families were found.
# To get an idea of what type of spatial patterns each family consists of,
# plot their respective representative motifs
# use the "plot_representative" function for this purpose
fig, ax = family.plot_representative(motifs=repr_patterns,
crd = data.real_crd,
ncols = 3,
pltargs=pltargs)

[43]:
# the motifs differ to some extent
# inspect all the members of each family
# use the "plot_family" function for this
family_plots= family.plot_families(data.cnt.loc[:,sorted_profiles[0:n_build]].values,
genes = sorted_profiles[0:n_build],
crd = data.real_crd,
labels = family_labels,
ncols = 10,
pltargs = pltargs,
side_size = 300,
)

Test the new data
Alignment_Paste_BaristaSeq
[1]:
import math
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import style
import matplotlib
import time
import scanpy as sc
import sklearn
import networkx as nx
import ot
import paste as pst
#%load_ext autoreload
#%autoreload 2
# style.use('seaborn-dark')
style.use('seaborn-white')
# If you see ModuleNotFoundError: No module named 'ot',
# install POT: pip install POT
/tmp/ipykernel_93165/1186083132.py:19: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
style.use('seaborn-white')
[2]:
import pysodb
sodb = pysodb.SODB()
[3]:
# Load the two datasets (BaristaSeq and STARmap)
adata_list_baristaseq = sodb.load_dataset('Sun2021Integrating')
adata_starmap = sodb.load_dataset('Dataset11_MS_raw')['Dataset11']
load experiment[Slice_1] in dataset[Sun2021Integrating]
load experiment[Slice_3] in dataset[Sun2021Integrating]
load experiment[Slice_2] in dataset[Sun2021Integrating]
load experiment[Dataset11] in dataset[Dataset11_MS_raw]
[4]:
adata_list_starmap = {}
for si in adata_starmap.obs['slice_id'].cat.categories:
a = adata_starmap[adata_starmap.obs['slice_id']==si]
a.obs['layer'] = a.obs['gt'].astype('str')
a.obs['layer'] = a.obs['layer'].astype('category')
adata_list_starmap[a.obs['slice_id'][0]] = a
/tmp/ipykernel_93165/2236284691.py:4: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
a.obs['layer'] = a.obs['gt'].astype('str')
(The same warning is emitted once per slice; it is shown only once here.)
[63]:
import numpy as np
import random
def rotate_translate(matrix):
# Create rotation matrix
theta = random.uniform(0, 2 * np.pi)
rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
# Calculate translation bounds
max_coords = np.max(matrix, axis=0)
min_coords = np.min(matrix, axis=0)
translation_bounds = 0.5 * (max_coords - min_coords)
# Generate random translation vector within bounds
translation_vector = np.array([random.uniform(-translation_bounds[0], translation_bounds[0]),
random.uniform(-translation_bounds[1], translation_bounds[1])])
# Apply rotation and translation
new_matrix = np.dot(matrix, rotation_matrix) + translation_vector
return new_matrix
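Since rotate_translate applies a rigid transform (a rotation followed by a translation), all pairwise distances should be preserved. A minimal sanity check, assuming only NumPy and SciPy, is:
[ ]:
from scipy.spatial.distance import pdist
# random planar coordinates stand in for a slice's obsm['spatial']
coords = np.random.rand(100, 2) * 50
moved = rotate_translate(coords)
# a rigid transform preserves all pairwise distances
assert np.allclose(pdist(coords), pdist(moved))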
[7]:
adata_list = adata_list_baristaseq
adata_list.update(adata_list_starmap)
[8]:
adata_list
[8]:
{'Slice_1': AnnData object with n_obs × n_vars = 3390 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
'Slice_3': AnnData object with n_obs × n_vars = 3545 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
'Slice_2': AnnData object with n_obs × n_vars = 4491 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
'BZ14': AnnData object with n_obs × n_vars = 1088 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances',
'BZ5': AnnData object with n_obs × n_vars = 1049 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances',
'BZ9': AnnData object with n_obs × n_vars = 1053 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances'}
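A quick, illustrative way to summarize the merged dictionary is to print each slice's cell and gene counts:
[ ]:
# print the number of cells and genes per slice
for name, a in adata_list.items():
    print(name, a.n_obs, 'cells x', a.n_vars, 'genes')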
[14]:
sample_list = ["Slice_1", "Slice_2", "Slice_3",'BZ5','BZ9','BZ14']
[64]:
adatas = {}
for key in sample_list:
a = adata_list[key]
a = a[np.logical_not((a.obs['layer']=='VISp') | (a.obs['layer']=='outside_VISp'))]
new_spatial = rotate_translate(a.obsm['spatial'])
a.obsm['spatial'] = new_spatial
adatas[key] = a
[65]:
sample_groups = [["Slice_1", "Slice_2", "Slice_3"],['BZ5','BZ9','BZ14',]]
layer_groups = [[adatas[sample_groups[j][i]] for i in range(len(sample_groups[j]))] for j in range(len(sample_groups))]
cmp = sns.color_palette()
layer_to_color_map = {
'VISp_I':cmp[0],
'VISp_II/III':cmp[1],
'VISp_IV':cmp[2],
'VISp_V':cmp[3],
'VISp_VI':cmp[4],
'VISp_wm':cmp[5],
'1':cmp[6],
'2':cmp[7],
'3':cmp[8],
'4':cmp[9],
}
[77]:
slice_map = {0:'A',1:'B',2:'C'}
fig, axs = plt.subplots(2, 3,figsize=(15,11.5))
for j in range(len(layer_groups)):
axs[j,0].text(-0.1, 0.5, 'Sample '+slice_map[j],fontsize=12,rotation='vertical',transform = axs[j,0].transAxes,verticalalignment='center')
for i in range(len(layer_groups[j])):
adata = adatas[sample_list[j*3+i]]
colors = list(adata.obs['layer'].astype('str').map(layer_to_color_map))
colors = [(r, g, b) for r, g, b in colors]
axs[j,i].scatter(layer_groups[j][i].obsm['spatial'][:,0],layer_groups[j][i].obsm['spatial'][:,1],linewidth=0,s=20, marker=".",
color=colors
)
axs[j,i].set_title('Slice '+ slice_map[i],size=12)
axs[j,i].invert_yaxis()
axs[j,i].axis('off')
axs[j,i].axis('equal')
# if i<3:
# s = ''
# delta = 0.05 if i==1 else 0
# axs[j,i].annotate('',xy=(1-delta, 0.5), xytext=(1.2+delta, 0.5),xycoords=axs[j,i].transAxes,textcoords=axs[j,i].transAxes,arrowprops=dict(arrowstyle='<->',lw=1))
# axs[j,0].text(1.1, 0.55, s,fontsize=9,transform = axs[j,i].transAxes,horizontalalignment='center')
axs[j,2].legend(handles=[mpatches.Patch(color=layer_to_color_map[adata.obs['layer'].cat.categories[i]], label=adata.obs['layer'].cat.categories[i]) for i in range(len(adata.obs['layer'].cat.categories))],fontsize=10,title='Cortex layer',title_fontsize=12,bbox_to_anchor=(1, 1))

[67]:
alpha = 0.1
pis = [[None for i in range(len(layer_groups[j])-1)] for j in range(len(layer_groups))]
for j in range(len(layer_groups)):
for i in range(len(layer_groups[j])-1):
pi0 = pst.match_spots_using_spatial_heuristic(layer_groups[j][i].obsm['spatial'],layer_groups[j][i+1].obsm['spatial'],use_ot=True)
start = time.time()
pis[j][i] = pst.pairwise_align(layer_groups[j][i], layer_groups[j][i+1],alpha=alpha,G_init=pi0,norm=True,verbose=False)
tt = time.time()-start
print(j,i,'time',tt)
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
0 0 time 25.737168788909912
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
0 1 time 35.07442283630371
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
1 0 time 8.054092645645142
Using selected backend cpu. If you want to use gpu, set use_gpu = True.
1 1 time 17.03704857826233
[68]:
paste_layer_groups = [pst.stack_slices_pairwise(layer_groups[j], pis[j]) for j in range(len(layer_groups)) ]
[69]:
paste_layer_groups
[69]:
[[AnnData object with n_obs × n_vars = 1525 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
AnnData object with n_obs × n_vars = 2042 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances',
AnnData object with n_obs × n_vars = 1690 × 79
obs: 'Slice', 'x', 'y', 'Dist to pia', 'Dist to bottom', 'Angle', 'unused-1', 'unused-2', 'x_um', 'y_um', 'depth_um', 'layer', 'leiden'
uns: 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'],
[AnnData object with n_obs × n_vars = 1049 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances',
AnnData object with n_obs × n_vars = 1053 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances',
AnnData object with n_obs × n_vars = 1088 × 166
obs: 'ct', 'gt', 'slice_id', 'batch', 'layer'
uns: 'moranI', 'spatial_neighbors'
obsm: 'spatial'
obsp: 'spatial_connectivities', 'spatial_distances']]
[92]:
def plot_slices_overlap(groups, adatas, sample_list, layer_to_color_map,save=None):
marker_list = ['*','^','+']
for j in range(len(groups)):
plt.figure(figsize=(10,10))
for i in range(len(groups[j])):
adata = adatas[sample_list[j*3+i]]
colors = list(adata.obs['layer'].astype('str').map(layer_to_color_map))
plt.scatter(groups[j][i].obsm['spatial'][:,0],groups[j][i].obsm['spatial'][:,1],linewidth=1,s=80, marker=marker_list[i],color=colors,alpha=0.8)
plt.legend(handles=[mpatches.Patch(color=layer_to_color_map[adata.obs['layer'].cat.categories[i]], label=adata.obs['layer'].cat.categories[i]) for i in range(len(adata.obs['layer'].cat.categories))],fontsize=10,title='Cortex layer',title_fontsize=15,bbox_to_anchor=(1, 1))
plt.gca().invert_yaxis()
plt.axis('off')
plt.axis('equal')
if save is None:
plt.show()
else:
plt.savefig(f'{save}_{j}.pdf',bbox_inches='tight',transparent=True)
[93]:
# Plot the stacking of each group of slices without alignment
plot_slices_overlap(layer_groups, adatas, sample_list, layer_to_color_map,save='figures/new_before')

[95]:
# Plot the stacking of each group of slices with PASTE alignment
plot_slices_overlap(paste_layer_groups, adatas, sample_list, layer_to_color_map,save='figures/new_after')

[ ]:
[74]:
dataset_name = 'new'
for i in range(len(layer_groups)):
rsta_list = []
for j in range(len(layer_groups[i])):
a= layer_groups[i][j]
spatial = a.obsm['spatial']
spatial_z = np.ones(shape=(spatial.shape[0],1))*j*100
spatial_3d = np.hstack([spatial,spatial_z])
a.obsm['spatial_3d'] = spatial_3d
rsta_list.append(a)
a_concat = rsta_list[0].concatenate(rsta_list[1:])
a_concat.write_h5ad(f'{dataset_name}_sample{i}_3d.h5ad')
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
(The same FutureWarning is emitted once per concatenated slice; it is shown only once here.)
[75]:
dataset_name = 'new'
for i in range(len(paste_layer_groups)):
rsta_list = []
for j in range(len(paste_layer_groups[i])):
a= paste_layer_groups[i][j]
spatial = a.obsm['spatial']
spatial_z = np.ones(shape=(spatial.shape[0],1))*j*100
spatial_3d = np.hstack([spatial,spatial_z])
a.obsm['spatial_3d'] = spatial_3d
rsta_list.append(a)
a_concat = rsta_list[0].concatenate(rsta_list[1:])
a_concat.write_h5ad(f'{dataset_name}_sample{i}_3d.h5ad')
/home/yzy/anaconda3/envs/paste/lib/python3.8/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
(The same FutureWarning is emitted once per concatenated slice; it is shown only once here.)
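The exported files can be inspected with a simple 3D scatter plot. This is a minimal sketch (not part of the original workflow), assuming the 'new_sample0_3d.h5ad' file written above and its 'spatial_3d' key:
[ ]:
import scanpy as sc
import matplotlib.pyplot as plt

a3d = sc.read_h5ad('new_sample0_3d.h5ad')
xyz = a3d.obsm['spatial_3d']
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection='3d')
# color by the z offset assigned above to separate the stacked slices
ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], s=2, c=xyz[:, 2])
plt.show()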
Deconvolution_Tangram_ST_PDAC
This tutorial demonstrates deconvolution on ST PDAC data using SODB and TANGRAM.
[ ]:
# import several Python libraries, including:
# scanpy: a Python package for single-cell RNA sequencing analysis.
# squidpy: a Python package for spatial transcriptomics analysis.
# numpy: a Python package for scientific computing with arrays.
# pandas: a Python package for data manipulation and analysis.
# anndata: a Python package for handling annotated data objects in genomics.
# pathlib: a Python module for working with file system paths.
# matplotlib: a Python plotting library.
# skimage: a Python package for image processing.
import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
import anndata as ad
from anndata import AnnData
import pathlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import skimage
[ ]:
# import tangram for spatial deconvolution
import tangram as tg
[86]:
# print a header message, and the version of the squidpy and tangram packages
sc.logging.print_header()
print(f"squidpy=={sq.__version__}")
print(f"tangram=={tg.__version__}")
scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.22.4 scipy==1.9.3 pandas==1.5.1 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.2 pynndescent==0.5.8
squidpy==1.2.3
tangram==1.0.3
[87]:
# Use read_csv to load "pd_sc" and "pd_sc_meta"
pd_sc = pd.read_csv('data/pdac/sc_data.csv')
pd_sc_meta = pd.read_csv('data/pdac/sc_meta.csv')
[88]:
# set index using the 'Unnamed: 0' column for "pd_sc"
pd_sc = pd_sc.set_index('Unnamed: 0')
[89]:
# set index using the 'Cell' column for "pd_sc_meta"
pd_sc_meta = pd_sc_meta.set_index('Cell')
[90]:
# Convert the pandas DataFrame into numpy arrays
sc_genes = np.array(pd_sc.index)
sc_obs = np.array(pd_sc.columns)
# transpose so that rows are cells and columns are genes
sc_X = np.array(pd_sc.values).transpose()
[91]:
# initialize an AnnData object using "sc_X", "sc_genes" and "sc_obs"
adata_sc = ad.AnnData(sc_X)
adata_sc.var_names = sc_genes
adata_sc.obs_names = sc_obs
# assign the cell type labels to "adata_sc.obs['CellType']"
adata_sc.obs['CellType'] = pd_sc_meta['Cell_type'].values
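To verify the label assignment, the cell-type composition of "adata_sc" can be tabulated (an illustrative check):
[ ]:
# count cells per annotated cell type
print(adata_sc.obs['CellType'].value_counts())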
[92]:
# prints out the metadata of adata_sc
adata_sc
[92]:
AnnData object with n_obs × n_vars = 1926 × 19104
obs: 'CellType'
[ ]:
# Import the pysodb library
# pysodb is a Python package that provides a programmatic interface to SODB (Spatial Omics DataBase).
# It allows users to query and download spatial omics datasets from SODB directly in Python.
import pysodb
[93]:
# Initialization
sodb = pysodb.SODB()
[94]:
# Define the dataset_name and experiment_name
dataset_name = 'moncada2020integrating'
experiment_name = 'GSM3036911_spatial_transcriptomics'
# Load a specific experiment
# It takes two arguments: the name of the dataset and the name of the experiment to load.
# Valid values for both arguments are listed at https://gene.ai.tencent.com/SpatialOmics/.
adata_st = sodb.load_experiment(dataset_name,experiment_name)
load experiment[GSM3036911_spatial_transcriptomics] in dataset[moncada2020integrating]
[95]:
# perform differential gene expression analysis across 'CellType' in 'adata_sc'
sc.tl.rank_genes_groups(adata_sc, groupby="CellType", use_raw=False)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
[96]:
# creates a Pandas DataFrame called "markers_df" by extracting the top 100 differentially expressed genes from 'adata_sc'
markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).iloc[0:100, :]
# creates a NumPy array called "genes_sc" by extracting the unique values from the "value" column of a melted version of the "markers_df"
genes_sc = np.unique(markers_df.melt().value.values)
# extract the gene names from "adata_st"
genes_st = adata_st.var_names.values
# create a Python list called "genes"
# containing the intersection of the genes identified as differentially expressed in "genes_sc" and the genes detected in "genes_st".
genes = list(set(genes_sc).intersection(set(genes_st)))
# the length of "genes"
len(genes)
[96]:
1133
[97]:
# Use Tangram to align the gene expression profiles of "adata_sc" and "adata_st" based on the shared set of genes identified by the intersection of "genes_sc" and "genes_st".
tg.pp_adatas(adata_sc, adata_st, genes=genes)
INFO:root:1098 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:13775 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
[98]:
# use the map_cells_to_space function from tangram to map cells from "adata_sc" onto "adata_st".
# The mapping uses "cells" mode, which assigns each cell from adata_sc to a location within the spatial transcriptomics space based on its gene expression profile.
ad_map = tg.map_cells_to_space(
adata_sc,
adata_st,
mode="cells",
# target_count=adata_st.obs.cell_count.sum(),
# density_prior=np.array(adata_st.obs.cell_count) / adata_st.obs.cell_count.sum(),
num_epochs=1000,
device="cpu",
)
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 1098 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.
Score: 0.293, KL reg: 0.030
Score: 0.548, KL reg: 0.001
Score: 0.550, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
Score: 0.551, KL reg: 0.001
INFO:root:Saving results..
[99]:
# project "Cell_subclass" annotations from a single-cell RNA sequencing (scRNA-seq) dataset onto a spatial transcriptomics dataset,
# based on a previously computed cell-to-space mapping
tg.project_cell_annotations(ad_map, adata_st, annotation="CellType")
INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.
[100]:
# creates new columns in "adata_st.obs" that correspond to the values in "adata_st.obsm['tangram_ct_pred']"
for ct in adata_st.obsm['tangram_ct_pred'].columns:
adata_st.obs[ct] = np.array(adata_st.obsm['tangram_ct_pred'][ct].values)
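With one score column per cell type now in adata_st.obs, a common follow-up (illustrative, not part of the original tutorial) is to record the highest-scoring cell type for each spot:
[ ]:
# idxmax over the prediction columns gives the dominant cell type per spot
adata_st.obs['tangram_top_ct'] = adata_st.obsm['tangram_ct_pred'].idxmax(axis=1)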
[101]:
# print adata_st.obsm['tangram_ct_pred']
adata_st.obsm['tangram_ct_pred']
[101]:
Acinar cells | Ductal | Cancer clone A | Cancer clone B | mDCs | Tuft cells | pDCs | Endocrine cells | Endothelial cells | Macrophages | Mast cells | T cells NK cells | Monocytes | RBCs | Fibroblasts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
spots | |||||||||||||||
10x10 | 0.006334 | 4.706014 | 0.117727 | 0.399284 | 0.497135 | 0.013638 | 0.054890 | 0.005253 | 0.094044 | 0.335560 | 0.052560 | 0.180580 | 0.055085 | 0.026505 | 0.036953 |
10x13 | 0.002069 | 4.792149 | 0.280019 | 0.075865 | 0.118639 | 0.065541 | 0.024328 | 0.005411 | 0.016519 | 0.057996 | 0.038706 | 0.095902 | 0.049501 | 0.010803 | 0.005225 |
10x14 | 0.013044 | 4.487615 | 0.054121 | 0.310157 | 0.158647 | 0.017234 | 0.088165 | 0.003886 | 0.034732 | 0.128785 | 0.092184 | 0.057863 | 0.000053 | 0.011522 | 0.003363 |
10x15 | 0.033054 | 3.723423 | 0.066192 | 0.118146 | 0.136094 | 0.064683 | 0.061699 | 0.004048 | 0.052694 | 0.097385 | 0.036767 | 0.059832 | 0.107271 | 0.021100 | 0.003672 |
10x16 | 0.007866 | 3.579839 | 0.033015 | 0.026313 | 0.000054 | 0.088877 | 0.067944 | 0.004586 | 0.007438 | 0.026395 | 0.045274 | 0.282783 | 0.015714 | 0.003617 | 0.008149 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9x29 | 0.003962 | 4.499743 | 0.000111 | 0.424417 | 0.190720 | 0.011316 | 0.017832 | 0.005410 | 0.057422 | 0.028645 | 0.030519 | 0.149691 | 0.000129 | 0.005335 | 0.000061 |
9x30 | 0.003030 | 2.804563 | 0.036787 | 0.482187 | 0.109475 | 0.072179 | 0.017910 | 0.005255 | 0.017748 | 0.058597 | 0.117022 | 0.168436 | 0.080849 | 0.002608 | 0.008962 |
9x31 | 0.008780 | 1.433432 | 0.342950 | 0.297287 | 0.107831 | 0.017091 | 0.072641 | 0.004113 | 0.008162 | 0.380184 | 0.000050 | 0.169367 | 0.021958 | 0.003261 | 0.005781 |
9x32 | 0.055780 | 2.076649 | 0.023450 | 0.000402 | 0.047358 | 0.033798 | 0.000417 | 0.008584 | 0.029910 | 0.075386 | 0.031452 | 0.083817 | 0.020545 | 0.003629 | 0.002936 |
9x33 | 0.009425 | 1.586668 | 0.002164 | 0.133457 | 0.027596 | 0.036530 | 0.015639 | 0.005100 | 0.016363 | 0.029191 | 0.021844 | 0.041210 | 0.029885 | 0.020056 | 0.000254 |
428 rows × 15 columns
[102]:
# create a spatial scatter plot showing the distribution of different cell types
sc.pl.embedding(
adata_st,
basis='spatial',
color=['Acinar cells','Cancer clone A','Cancer clone B','Ductal'],
# color='leiden'
)

SpatialClustering_Stagate_MouseBrain
This tutorial demonstrates how to identify spatial domains by training STAGATE in batches on new EEL FISH mouse brain data, using SODB together with the pyG (PyTorch Geometric) based implementation of STAGATE.
Since the EEL FISH mouse brain dataset contains 127,591 cells, each with 440 genes, this tutorial shows how to download the data with SODB and applies a batch training strategy to handle such large-scale data.
A tutorial of a batch training strategy can be found at https://stagate.readthedocs.io/en/latest/T8_Batch.html
Strategies for dividing subgraphs
Because the spatial network is built from spatial locations, it can be divided directly into subgraphs. Specifically, subgraphs are formed by splitting each spatial coordinate at its quantiles; for example, num_batch_x=3 and num_batch_y=2 yields six subgraphs. (The original tutorial illustrates this division with a figure, omitted here.)
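The sketch below illustrates the general idea of such a quantile-based partition; it is an illustrative assumption, not STAGATE_pyG's actual implementation (STAGATE_pyG.Batch_Data, used later, performs the real division):
[ ]:
import numpy as np
import pandas as pd

def split_by_quantiles(coords, num_batch_x, num_batch_y):
    # bin each spatial coordinate into equal-frequency (quantile) intervals
    x_bin = pd.qcut(coords[:, 0], num_batch_x, labels=False, duplicates='drop')
    y_bin = pd.qcut(coords[:, 1], num_batch_y, labels=False, duplicates='drop')
    # combine the two bin indices into a single subgraph label per cell
    return x_bin * num_batch_y + y_bin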
Preparation
[1]:
# import several Python libraries commonly used in data analysis and visualization.
# pandas (imported as pd) is a library for data manipulation and analysis.
# numpy (imported as np) is a library for numerical computing with arrays.
# scanpy (imported as sc) is a library for single-cell RNA sequencing analysis.
# matplotlib.pyplot (imported as plt) is a library for data visualization.
# os is a library for interacting with the operating system, such as reading or writing files.
# sys is a library for interacting with the Python interpreter.
# tqdm is a library for creating progress bars in Python, which is useful for tracking the progress of long-running loops or operations.
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
from tqdm import tqdm
[2]:
#import PyTorch library
import torch
#import STAGATE_pyG library
import STAGATE_pyG
#import torch.nn.functional module
# "torch.nn.functional" module provides various functions that are commonly used as activations and loss functions in neural networks
import torch.nn.functional as F
/home/linsenlin/anaconda3/envs/stagate_sodb/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[3]:
# Set the PyTorch device to the first available GPU, falling back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
[4]:
# Import the pysodb library
# pysodb is a Python package that provides a programmatic interface to SODB (Spatial Omics DataBase).
# It allows users to query and download spatial omics datasets from SODB directly in Python.
import pysodb
[ ]:
# Initialization
sodb = pysodb.SODB()
[5]:
# Define the dataset_name and experiment_name
dataset_name = 'Borm2022Scalable'
experiment_name = 'mouse_brain'
# Load a specific experiment
# It takes two arguments: the name of the dataset and the name of the experiment to load.
# Valid values for both arguments are listed at https://gene.ai.tencent.com/SpatialOmics/.
adata = sodb.load_experiment(dataset_name,experiment_name)
load experiment[mouse_brain] in dataset[Borm2022Scalable]
[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 127591 × 440
obs: 'Clusters', 'TotalMolecules', 'X', 'X_um', 'Y', 'Y_um', 'leiden'
var: 'GeneTotal'
uns: 'Age', 'Clusters_colors', 'Codebook', 'ColorDict', 'CreationDate', 'Cycles', 'Expansion', 'Expansion_um', 'Experiment', 'ExperimentDate', 'FOVoverlapPercentage', 'GenerationDate', 'Joke', 'LOOM_SPEC_VERSION', 'MaxHammingDist', 'Operator', 'Orientation', 'Probes', 'Protocol', 'Quality', 'RNAfile', 'Removal', 'Sample', 'Segmentation', 'Species', 'Stitching', 'StitchingChannel', 'Strain', 'System', 'Tissue', 'TotalMolecules', 'leiden', 'leiden_colors', 'log1p', 'moranI', 'neighbors', 'pca', 'spatial_neighbors', 'umap'
obsm: 'RGB', 'X_pca', 'X_umap', 'spatial', 'tSNE'
varm: 'PCs'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
[7]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
WARNING: adata.X seems to be already log-transformed.
Dividing subgraphs
[9]:
# Assign spatial coordinates to adata.obs['X_coor'] and adata.obs['Y_coor']
adata.obs['X_coor'] = adata.obsm['spatial'][:,0]
adata.obs['Y_coor'] = adata.obsm['spatial'][:,1]
[11]:
# grid setting
num_batch_x = 4
num_batch_y = 4
[12]:
# The adata is divided into multiple subgraphs based on the coordinate partitioning and consolidated into a batch list
Batch_list = STAGATE_pyG.Batch_Data(adata, num_batch_x=num_batch_x, num_batch_y=num_batch_y,
spatial_key=['X_coor', 'Y_coor'], plot_Stats=True)

Constructing the spatial network
[13]:
# Construct the spatial network for each batch
for temp_adata in Batch_list:
STAGATE_pyG.Cal_Spatial_Net(temp_adata, rad_cutoff=80)
STAGATE_pyG.Stats_Spatial_Net(temp_adata)
------Calculating spatial graph...
The graph contains 495224 edges, 11834 cells.
41.8476 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 545572 edges, 11937 cells.
45.7043 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 193628 edges, 6004 cells.
32.2498 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 53426 edges, 2123 cells.
25.1653 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 305286 edges, 9155 cells.
33.3464 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 198074 edges, 5830 cells.
33.9750 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 287706 edges, 7038 cells.
40.8789 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 347420 edges, 9875 cells.
35.1818 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 169026 edges, 4434 cells.
38.1204 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 257098 edges, 6948 cells.
37.0032 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 348072 edges, 8222 cells.
42.3342 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 563444 edges, 12294 cells.
45.8308 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 258650 edges, 6475 cells.
39.9459 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 367994 edges, 7183 cells.
51.2312 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 831018 edges, 10635 cells.
78.1399 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 614454 edges, 7606 cells.
80.7854 neighbors per cell on average.
(An anndata ImplicitModificationWarning, "Trying to modify attribute `._uns` of view, initializing view as actual.", is emitted once per batch and is omitted here. Stats_Spatial_Net also produces one neighbor-count histogram per batch; the figures are omitted.)
[14]:
# Convert each subgraph in the Batch_list to a tensor used for training through the Transfer_pytorch_Data function, and collect them into a data list
data_list = [STAGATE_pyG.Transfer_pytorch_Data(adata) for adata in Batch_list]
for temp in data_list:
temp.to(device)
[15]:
# Use "STAGATE_pyG.Cal_Spatial_Net" to calculate a spatial graph with a radius cutoff of 80.
STAGATE_pyG.Cal_Spatial_Net(adata, rad_cutoff=80)
------Calculating spatial graph...
The graph contains 6019678 edges, 127591 cells.
47.1795 neighbors per cell on average.
[ ]:
# Convert the entire adata to a tensor used for evaluation
data = STAGATE_pyG.Transfer_pytorch_Data(adata)
DataLoader for batches
[16]:
# create a PyTorch Geometric DataLoader object for loading and processing graph data
from torch_geometric.loader import DataLoader
# creates a PyTorch DataLoader object named "loader" that will iterate through the "data_list"
# batch_size=1 or 2
loader = DataLoader(data_list, batch_size=1, shuffle=True)
Running STAGATE
[17]:
# hyper-parameters
num_epoch = 500
lr=0.001
weight_decay=1e-4
hidden_dims = [512, 30]
[18]:
# Build a STAGATE model
model = STAGATE_pyG.STAGATE(hidden_dims = [data_list[0].x.shape[1]]+hidden_dims).to(device)
# initializes an Adam optimizer with learning rate "lr" and weight decay "weight_decay"
# that will be used to update the parameters of the PyTorch model.
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
[19]:
# Train the model using mini-batch gradient descent with the Adam optimizer.
# For each of "num_epoch" epochs, iterate over the data loader "loader",
# compute the MSE loss between the reconstruction "out" and the input "batch.x",
# clip the gradients to avoid exploding gradients, and update the model parameters with the optimizer's step function.
for epoch in tqdm(range(1, num_epoch+1)):
for batch in loader:
model.train()
optimizer.zero_grad()
z, out = model(batch.x, batch.edge_index)
loss = F.mse_loss(batch.x, out) #F.nll_loss(out[data.train_mask], data.y[data.train_mask])
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
optimizer.step()
100%|██████████| 500/500 [03:07<00:00, 2.67it/s]
[20]:
# Move the full graph (the total network) to the device
data.to(device)
[20]:
Data(x=[127591, 440], edge_index=[2, 6147269])
[21]:
# Set the model in evaluation mode
model.eval()
# Get the hidden representation and output of the model
z, out = model(data.x, data.edge_index)
# Transfer the tensor from GPU to CPU, convert it to a numpy array,
# and store it in adata.obsm['STAGATE']
STAGATE_rep = z.to('cpu').detach().numpy()
adata.obsm['STAGATE'] = STAGATE_rep
Clustering and UMAP
[ ]:
# Calculate the nearest neighbors in the 'STAGATE' representation and computes the UMAP embedding.
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[22]:
# Use Mclust_R to cluster cells in the 'STAGATE' representation into 13 clusters.
adata = STAGATE_pyG.mclust_R(adata, used_obsm='STAGATE', num_cluster=13)
R[write to console]: mclust version 6.0.0
Type 'citation("mclust")' for citing this R package in publications.
fitting ...
|======================================================================| 100%
[25]:
# Display a spatial embedding plot with cell clustering information
plt.rcParams["figure.figsize"] = (5, 5)
sc.pl.embedding(adata, basis="spatial", color="mclust",s=20, show=False)#, legend_loc=False)
plt.title('')
plt.axis('off')
/home/linsenlin/anaconda3/envs/stagate_sodb/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
[25]:
(-611.2128013610841, 15985.261156463623, 72.22414855957027, 7865.117953491211)

[26]:
# Display a UMAP plot colored by the mclust cluster
sc.pl.umap(adata, color='mclust')
/home/linsenlin/anaconda3/envs/stagate_sodb/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(

SpatiallyVariableGene_Sepal_Visium
[1]:
# Imports
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sepal
import sepal.datasets as d
import sepal.models as m
import sepal.utils as ut
import sepal.family as family
import sepal.enrich as fea
# pip install xxx
# ImportError: cannot import name 'GProfiler' from 'gprofiler' (/home/yzy/anaconda3/envs/pysodb/lib/python3.8/site-packages/gprofiler/__init__.py)
# Fix: in ~/anaconda3/envs/pysodb/lib/python3.8/site-packages/sepal-1.0.0-py3.8.egg/sepal/enrich.py, change
# from gprofiler import GProfiler
# to
# from gprofiler import gprofiler as GProfiler
# refer: https://github.com/vals/python-gprofiler/issues/9
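An illustrative way to make this workaround robust in user code (an assumption covering the two gprofiler API variants, not sepal's official fix) is an import fallback:
[ ]:
try:
    from gprofiler import GProfiler  # newer gprofiler-official API
except ImportError:
    from gprofiler import gprofiler as GProfiler  # legacy python-gprofiler API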
[2]:
import pysodb
[3]:
sodb = pysodb.SODB()
[15]:
dataset_name = '10x'
experiment_name = 'V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix'
adata = sodb.load_experiment(dataset_name,experiment_name)
download experiment[V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix] in dataset[10x]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 185M/185M [00:15<00:00, 12.9MB/s]
load experiment[V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix] in dataset[10x]
[16]:
adata.write_h5ad('Visium_pysodb.h5ad')
[17]:
# Previously, a path to a local sample was set, e.g.:
# pth = "/home/yzy/software/sepal/data/real/melanoma.tsv"
# pth = "/home/yzy/software/sepal/data/real/mob.tsv"
# raw_data = d.RawData(pth)
# Here, load the raw data from the h5ad file saved above, using the RawData class
raw_data = d.RawData('Visium_pysodb.h5ad')
# filter genes observed in less than 5 spots
# and/or less than 10 total observations
raw_data.cnt = ut.filter_genes(raw_data.cnt,
min_expr=10,
min_occur=5)
# load into our count data object
# this is a Visium array, hence
# we use the VisiumData class
data = m.VisiumData(raw_data,
eps = 0.1)
[18]:
# We will normalize our count data
# and then propagate it in time, to measure
# the diffusion time.
# We set scale = True to perform
# minmax scaling of the diffusion times
# Running the simulation on 8 cores
# takes about ~2 minutes
times = m.propagate(data,
normalize = True,
scale =True)
[INFO] : Using 80 workers
[INFO] : Saturated Spots : 3095
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16278/16278 [01:21<00:00, 199.59it/s]
[26]:
# Now, once the diffusion times
# have been computed for all the
# profiles, we can inspect the
# top 20 highest ranked ones
n_top = 20
# get names of profiles
sorted_indices = np.argsort(times.values.flatten())
# argsort sorts from low to high, so reverse the order
sorted_indices = sorted_indices[::-1]
sorted_profiles = times.index.values[sorted_indices]
# extract top profiles
top_profiles = sorted_profiles[0:n_top]
# extract tail (lowest ranked) profiles
tail_profiles = sorted_profiles[-n_top:]
# display top profiles and their times
times.loc[top_profiles,:]
[26]:
average | |
---|---|
Calb1 | 1.000000 |
Prkcg | 0.960986 |
Gfap | 0.946612 |
Apod | 0.919918 |
Hpca | 0.919918 |
Sst | 0.915811 |
Car8 | 0.899384 |
Itpka | 0.893224 |
Mgp | 0.891170 |
Itpr1 | 0.887064 |
Dner | 0.874743 |
Hbb-bt | 0.872690 |
Gad1 | 0.862423 |
Igfbp2 | 0.856263 |
Vim | 0.850103 |
Pcp4 | 0.841889 |
Nefm | 0.835729 |
Gria1 | 0.833676 |
Fam107a | 0.825462 |
Hba-a2 | 0.821355 |
[28]:
# Now, let us also inspect these visually;
# for this we can use the "plot_profiles" function.
# The plot specifications use the same syntax
# as the scatter function in matplotlib
pltargs = dict(s = 20,
cmap = "magma",
edgecolor = 'none',
marker = 'o',
)
# plot the profiles
fig,ax = ut.plot_profiles(cnt = data.cnt.loc[:,top_profiles],
crd = data.real_crd,
rank_values = times.loc[top_profiles,:].values.flatten(),
pltargs = pltargs,
)
plt.show()

[29]:
# Inspect the tail (lowest ranked) profiles visually,
# again using the "plot_profiles" function.
# The plot specifications use the same syntax
# as the scatter function in matplotlib
pltargs = dict(s = 20,
cmap = "magma",
edgecolor = 'none',
marker = 'o',
)
# plot the profiles
fig,ax = ut.plot_profiles(cnt = data.cnt.loc[:,tail_profiles],
crd = data.real_crd,
rank_values = times.loc[tail_profiles,:].values.flatten(),
pltargs = pltargs,
)
plt.show()

[32]:
# Let us, as a subsequent step, sort
# these profiles into pattern families
# we will use the top 5000 profiles
# to build our families
n_build = 5000
# use the "get_families" function
# to group the profiles
family_labels,repr_patterns = family.get_families(data.cnt.loc[:,sorted_profiles].values,
n_base = n_build,
n_sort = n_build,
threshold = 0.80,
)
# family_labels is an array of family indices,
# one for each profile. We convert
# this to a data frame to make it easier to work with
families = pd.DataFrame(family_labels,
index = sorted_profiles[0:n_build],
columns = ['family'],
)
[INFO] : Using 2 eigenpatterns
[INFO] : Identified 2 families
/home/yzy/anaconda3/envs/pysodb/lib/python3.8/site-packages/sklearn/cluster/_agglomerative.py:983: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead
warnings.warn(
[ ]:
# As displayed, 2 families were found. To get
# an idea of what type of spatial
# patterns each family consists of,
# we will plot their respective representative motifs
# use the "plot_representative" function
# for this purpose
fig, ax = family.plot_representative(motifs=repr_patterns,
crd = data.real_crd,
ncols = 3,
pltargs=pltargs)

[43]:
# having done so, we see that the motifs differ to some extent
# and we may inspect all the members of each
# family
# use the "plot_family" function for this
family_plots= family.plot_families(data.cnt.loc[:,sorted_profiles[0:n_build]].values,
genes = sorted_profiles[0:n_build],
crd = data.real_crd,
labels = family_labels,
ncols = 10,
pltargs = pltargs,
side_size = 300,
)
