Seurat分组随机选取细胞数实战（随机采样后找亚群DEG） 2022-06-01

3年前发布

Seurat分组随机选取细胞数实战（随机采样后找亚群DEG） 2022-06-01

关键词

随机取样细胞
Downsample cells
分组随机选取细胞

适用背景

之前的博客提到，R语言处理大数据效率较低，耗时长，一种解决方案是可以转用Python语言流程，但如果对Python语言比较陌生，任务又急，那可以采用另一种方案——分组随机取样。
尽管Seurat这个软件包功能极其强大，但是当细胞数达到几十万甚至上百万时，把常规流程跑一遍少则几天，多则几周，实在是极其消耗时间。而且有时吧，只是单纯想测试一下某些参数或者流程是否可用，如果用全数据集来测试实在有点浪费时间，所有可用考虑分组随机选取细胞数进行分析。

主函数

这里封装了一个函数sample_seob，以下是参数解释：

obj Seurat对象
group.by 分组名，默认使用聚类结果seurat_clusters
sp.size 取样大小，也就是对分组里的每一个类别选取的细胞数，例如设置为100，将对cluster 1取100个细胞，cluster 2也取100个细胞，以此类推，如果某个cluster的细胞数不足100个将选取这个cluster的所有细胞。
diet 是否使用 DietSeurat函数对Seurat对象进行瘦身，默认为true，因为如果Seurat对象包含scale.data等信息会很耗内存，瘦身后能减少内存并加快分析速度。
sp.total 选取的总细胞数，为可选项，有时候不知道每个亚群取多少细胞数合适，只想大概取到一定的细胞数，例如1000，就可以用sp.total参数，注意，只有sp.size没有赋值的时候，sp.total参数才会生效，即这两个参数是二选一即可。

sample_seob <- function(obj,group.by="seurat_clusters",sp.size=NULL,diet="true",sp.total=1000) {
all <- obj
if (diet=="true") {
all <- DietSeurat(all)
}
if (is.null(sp.size)) {
nlen <- length(unique(all@meta.data[,group.by]))
sp.size <- ceiling(sp.total/nlen)
}
seob_list <- list()
i <- 1
for (sc in unique(all@meta.data[,group.by])){
cellist <- colnames(all)[which(all@meta.data[,group.by] == sc)]
ob <- subset(all, cells=cellist)
if (length(colnames(ob)) > sp.size) {
ob <- subset(ob,cells=sample(colnames(ob), sp.size))
}
seob_list[[i]] <- ob
i <- i+1
}
all <- Reduce(merge,seob_list)
return(all)
}

脚本示例

数据集采用Seurat内置数据集pbmc_small，有80个细胞，按RNA_snn_res.1分组有3种类型

> pbmc_small
An object of class Seurat
230 features across 80 samples within 1 assay
Active assay: RNA (230 features, 20 variable features)
 2 dimensional reductions calculated: pca, tsne
> unique(pbmc_small$RNA_snn_res.1)
[1] 0 2 1
Levels: 0 1 2

使用sp.size=10，按RNA_snn_res.1分组，每种类型取10个细胞。

> all <- sample_seob(pbmc_small,sp.size=10,group.by='RNA_snn_res.1')
> all
An object of class Seurat
230 features across 30 samples within 1 assay
Active assay: RNA (230 features, 0 variable features)

使用sp.total=50，按RNA_snn_res.1分组，取大约50个细胞。

> all <- sample_seob(pbmc_small,sp.total=50,group.by='RNA_snn_res.1')
> all
An object of class Seurat
230 features across 51 samples within 1 assay
Active assay: RNA (230 features, 0 variable features)

分析实战——分组随机采样后找每个亚群的DEG

在实际的分析中，我发现FindAllMarkers经常跑着跑着就断了，出现以下报错:

Warning: When testing 15 versus all:
        The total size of the 5 globals that need to be exported for the future expression (‘FUN()’) is 2.06 GiB. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The three largest globals are ‘data.use’ (2.06 GiB of class ‘S4’), ‘group.info’ (2.56 MiB of class ‘list’) and ‘FUN’ (9.76 KiB of class ‘function’).

这种情况一般是因为数据集太大了，由运行内存不足导致，这种情况目前找到的解决办法是随机取样后找DEG，结果也比较可靠，运行起来也快很多。以下是示例封装的函数**subset_deg **，部分参数与上面一致，其它参数与FindAllMarkers一致。

subset_deg <- function(obj,group.by="seurat_clusters",sp.size=NULL,output="./",
                       min.pct=0.25,logfc.threshold=0.25,only.pos=F,assays ="RNA",order=F) {
all <- DietSeurat(obj)
if (!is.null(sp.size)) {
seob_list <- list()
i <- 1
for (sc in unique(all@meta.data[,group.by])){
cellist <- colnames(all)[which(all@meta.data[,group.by] == sc)]
ob <- subset(all, cells=cellist)
if (length(colnames(ob)) > sp.size) {
ob <- subset(ob,cells=sample(colnames(ob), sp.size))
}
seob_list[[i]] <- ob
i <- i+1
}
all <- Reduce(merge,seob_list)
}
all_markers <- FindAllMarkers(all, only.pos = only.pos, min.pct = min.pct, logfc.threshold = logfc.threshold, verbose = F,assays = assays,order=order)
write.table(all_markers, paste0(output, "deg_sample",sp.size,".xls"),sep="\t",quote=F)
}

也可以简写成下面的形式：

subset_deg <- function(obj,group.by="seurat_clusters",sp.size=NULL,output="./",
                       min.pct=0.25,logfc.threshold=0.25,only.pos=F,assays ="RNA",order=F,diet="true",sp.total=1000) {
all <- sample_seob(obj, sp.size=sp.size,group.by=group.by,diet=diet,sp.total=sp.total)
all_markers <- FindAllMarkers(all, only.pos = only.pos, min.pct = min.pct, logfc.threshold = logfc.threshold, verbose = F,assays = assays,order=order)
write.table(all_markers, paste0(output, "deg_sample",sp.size,".xls"),sep="\t",quote=F)
}

升级版

因为上面的版本循环里每次都要subset，太麻烦了，最后还要merge一下，比较耗时间，因此修改代码为下面的结果，只需要subset一次，也不用merge，能较快得到结果。

Sample_seob <- function(obj,group.by="seurat_clusters",sp.size=NULL,diet="true",sp.total=1000) {
all <- obj
if (diet=="true") {
all <- DietSeurat(all,dimreducs = c('pca','umap'))
}

if (is.null(sp.size)) {
nlen <- length(unique(all@meta.data[,group.by]))
sp.size <- ceiling(sp.total/nlen)

}
ncellist <- c()
for (sc in unique(all@meta.data[,group.by])){
cellist <- colnames(all)[which(all@meta.data[,group.by] == sc)]
if (length(cellist) > sp.size) {
cellist=sample(cellist, sp.size)
}
ncellist <- c(ncellist,cellist)
}
all <- subset(all,cells=ncellist)
return(all)
}

极简版

突然看到官网有函数是能简便实现这种分组随机取细胞数的，转换Idents后就能根据Idents信息分类别随机选取细胞。

Idents(pbmc) <- "orig.ident"
# Downsample the number of cells per identity class
ob1 <- subset(x = pbmc, downsample = 100)

小结与补充

不多说了，搬砖去了。祝大家六一儿童节快乐！

评分

欢迎为Ta评分