Single cell age in fly development

In this tutorial, we will show you a slightly more complex application of EpiTrace on infering single cell age during fly development. This data is comprised of some 200k+ single cells, collected at defined time after egg laying, and sequenced by scATAC. The data is described in [Calderon et al., 2022].

Remember from the bulk ATAC-seq demo that, the input data for EpiTrace are:

  • a count matrix where rows are genomic loci, and columns are cells, and elements are ATAC-seq read numbers.

  • a data frame or GRanges object corresponds to the genomic loci (we call it “peak sets”)

Also, to run EpiTrace, you also need to choose a reference clock-like loci for age estimation. However, there is no active CpG methylation in Drosophila genome, prohibiting determination of putative clock-like differential methylated loci as we did in human.

Here we will show you how to perform single-cell age estimation from fly scATAC data, by leveraging high-level orthology relationship to map human clock-like regions to fly genome.

You need to download the fly downsampled scATAC data (“fly.all.downsampled_seurat_filtered_processed.rds”) from the WASHU website., and the human-fly orthology table (“dmel_human_orthologs_disease_fb_2022_05.tsv.gz”) from the flybase website, in order to run this demo. We assume you have already put these files in a fly_scATAC dir.

Step 1. Initialize environment

Load EpiTrace as:

library(dplyr)
library(tidyr)
library(readr)
library(GenomicRanges)
library(reshape2)
library(openxlsx)
library(ggplot2)
library(Matrix)
library(EpiTrace)
library(Seurat)
library(SeuratObject)
library(ggtree)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg19)
library(BSgenome.Dmelanogaster.UCSC.dm6)
library(patchwork)
library(ArchR)
library(parallel)
library(ChIPseeker)
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(org.Dm.eg.db)

Step 2. Load data

We load the human clock-like DML using:

data("clock_gr_list")

We then load the fly data (assume you put them in a fly_scATAC dir):

setwd('fly_scATAC') # change this to your dir
datadir = 'fly_scATAC'
fly_dat <- readRDS('fly.all.downsampled_seurat_filtered_processed.rds')

We now load the fly-to-human orthology mapping data (assume you put them in the same fly_scATAC dir):

dmel_mapped_to_human <- readr::read_tsv('dmel_human_orthologs_disease_fb_2022_05.tsv.gz',comment = "## ")
colnames(dmel_mapped_to_human) <- gsub('#','',colnames(dmel_mapped_to_human))

Step 3. Define clock-like loci in fly genome

We first annotate the genomic location for each clockDML using ChIPseeker:

# map human clock in the promoter/TSS region to respective fly homolog
human_clock_hg19 <- plyranges::reduce_ranges(c(clock_gr_list[[1]],clock_gr_list[[2]]))
human_clock_hg19_anno <- ChIPseeker::annotatePeak(human_clock_hg19,TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene,assignGenomicAnnotation = T,annoDb = 'org.Hs.eg.db',tssRegion = c(3000,1000),level='gene')
# extract human annotation
human_clock_hg19_anno_df <- human_clock_hg19_anno@anno %>% as.data.frame
human_clock_hg19_anno_tss_df <- human_clock_hg19_anno_df[abs(human_clock_hg19_anno_df$distanceToTSS)< 100,]
human_clock_hg19_anno_tss_df$SYMBOL -> human_clock_hg19_anno_tss_df$Human_gene_symbol

We then annotate the genomic location for each fly scATAC peak, similarly using ChIPseeker:

# annotate fly ranges
fly_dat_ranges <- data.frame(peaks=rownames(fly_dat)) %>% separate(col=1,into=c('chr','start','end'),remove=F,convert=T) %>% makeGRangesFromDataFrame(keep.extra.columns = T)
fly_dat_ranges_anno <- ChIPseeker::annotatePeak(fly_dat_ranges,TxDb=TxDb.Dmelanogaster.UCSC.dm6.ensGene,assignGenomicAnnotation = T,annoDb = 'org.Dm.eg.db',tssRegion = c(3000,1000),level='gene')
# extract fly annotation
fly_dat_ranges_anno_df <- fly_dat_ranges_anno@anno %>% as.data.frame()

We now use the ortholog table to find “mappable” human ClockDML which is close to TSS of human-fly ortholog genes:

# find mappable (have ortholog) human annotation
human_clock_hg19_anno_tss_df <- left_join(human_clock_hg19_anno_tss_df,dmel_mapped_to_human)
human_clock_hg19_anno_tss_mappable_df <- human_clock_hg19_anno_tss_df[!is.na(human_clock_hg19_anno_tss_df$Dmel_gene_ID),]

Finally, we can define putative clock-like regions in fly, using such orthology:

# define putative clockAcc regions for fly
fly_dat_ranges_anno_df$clock_hit <- ifelse(abs(fly_dat_ranges_anno_df$distanceToTSS)<100 & fly_dat_ranges_anno_df$geneId %in% human_clock_hg19_anno_tss_mappable_df$Dmel_gene_ID, 'clock','non_clock')
fly_dat_ranges_putative_clock <- fly_dat_ranges_anno@anno[fly_dat_ranges_anno_df$clock_hit %in% 'clock',]

Step 4. Estimate cell age with fly clock-like loci

Now we can perform scATAC age estimation on fly genome, as we did for other data, just use a different reference genomic loci (setting the clock_gr option):

## use the fly dataset to perform convergence analysis
EpiTrace::Init_Peakset(fly_dat_ranges) -> init_gr
EpiTrace::Init_Matrix(peakname = fly_dat_ranges$peaks,cellname = colnames(fly_dat),matrix = (GetAssayData(fly_dat))) -> init_mm
# allPeaks clock on fly
EpiTrace::EpiTraceAge_Convergence(peakSet = init_gr,matrix=init_mm,ref_genome='dm6',clock_gr=fly_dat_ranges_putative_clock,iterative_time = 5,min.cutoff = 0,non_standard_clock = T,qualnum = 10,ncore_lim = 40,mean_error_limit = 0.1) -> epitrace_obj_age_estimated

Step 5. Validating the result with the published sample time and cell age derived from neural network

Note that we have exported the metadata from the fly scATAC dataset as a rds:

Shendure_atac_meta <- readRDS('Shendure_atac_meta.rds')
meta_compare <- left_join(epitrace_obj_age_estimated@meta.data,Shendure_atac_meta)
cor.test(meta_compare$lasso_age,meta_compare$NNv1_age)
cor.test(meta_compare$EpiTraceAge_iterative,meta_compare$NNv1_age)
meta_compare %>% dplyr::group_by(NNv1_time.new) %>% dplyr::summarise(meanAge=mean(EpiTraceAge_iterative,na.rm=T)) -> per_timepoint_mean_age
cor.test(as.numeric(per_timepoint_mean_age$NNv1_time.new),per_timepoint_mean_age$meanAge)

This reports:

Pearson's product-moment correlation

data:  as.numeric(per_timepoint_mean_age$NNv1_time.new) and per_timepoint_mean_age$meanAge
t = 11.104, df = 8, p-value = 3.863e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8706589 0.9928825
sample estimates:
      cor
0.9690577

We can also do the correlation between single cell ages:

meta_compare <- na.omit(meta_compare)
cor.test(as.numeric(meta_compare$NNv1_age),meta_compare$EpiTraceAge_iterative)

This reports:

Pearson's product-moment correlation

data:  as.numeric(meta_compare$NNv1_age) and meta_compare$EpiTraceAge_iterative)
t = 286.89, df = 203240, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5337807 0.5399695
sample estimates:
      cor
0.5368824

We can now test the association between mean EpiTrace age of the sample, vs sample development time:

ggplot(per_timepoint_mean_age,aes(x=NNv1_time.new,y=meanAge,group=1)) + geom_point(pch=21,size=5,aes(fill=NNv1_time.new)) + geom_smooth(method='lm',aes(group=1)) + ggpubr::stat_cor(size=6) + scale_fill_viridis_d() + theme_classic() + theme(text=element_text(size=20),axis.text.x = element_text(angle=45,vjust=1,hjust=1))

The result is summarised in this figure:

_images/DMscATAC.svg