Hematopoietic cell expansion potential -------------------------------------- In this demo, we will show you a complex example of using EpiTrace to evaluate a hypothesis. In this published experiment :cite:p:`Lareau2021`, single CD34+ hematopoietic cells were firstly expanded *in vitro* for 8 days, then forced into differentiation by adding in LIF/EPO. Cells were collected and sequenced by scATAC at day 8 (initiating), day 14 (intermediate) and day 20 (late). We are curious about whether the starting cell "age" would be associated with its future expansion potential ("clonal expansion capacity")? In other words, compared to the old cells, would the younger cells be more competitive, and derive more phenotypically mature and functional offsprings? We focused on analyzing the myeloid lineage cells in this experiment. For simplicity and reproducibility, in this demo we omit the data processing steps for raw mtscATAC sequencing data, particularly the clone calling steps. Instead, this result could be reproduced by using the processed mtscATAC data from the authors. It can be downloaded from `this link `_. Here, we assume that you have put it in a folder ``your_mtscATAC_dir``. In the following code, we would show you how to test this hypothesis by: - registering each single cell to a clone, using the somatic genetic mutation on its mitochondrial genome (done by mGATK, information included in the `mtscATAC repo `_.) - estimating single cell age using EpiTrace - calculating the mean cell age of cells in each individual clone at the starting time point - calculating the number of terminally differentiated myeloid cells in each individual clone at each timepoint - calculate the log-expansion ratio of terminally differentiated myeloid cell between timepoints (this stands for the clonal expansion capacity) - test the association between the mean starting cell age and log-expansion ratio of each individual clone Step 1. Initialize environment '''''''''''''''''''''''''''''' Load libraries for this tutorial:: library(dplyr) library(tidyr) library(readr) library(GenomicRanges) library(reshape2) library(openxlsx) library(ggplot2) library(Matrix) library(EpiTrace) library(Seurat) library(SeuratObject) library(patchwork) Step 2. Load data ''''''''''''''''' We load the reference ClockDML using:: data("clock_gr_list") We assume that you have downloaded all the data from the ``mtscATAC_reproducibility`` repo, including the large OSF zip archieve, to ``your_mtscATAC_dir``:: setwd('your_mtscATAC_dir') ## change this to your dir. # load the UMAP data load('mtscATACpaper_reproducibility/figure_invitro_CD34/output/trajectory_inferences.18march2020.rda') df_ery$pseudutime_ery <- df_ery$pseudotime df_my$pseudutime_my <- df_my$pseudotime df_ery$cell <- rownames(df_ery) df_my$cell <- rownames(df_my) df_umap <- left_join(df_my %>% dplyr::select(-pseudotime),df_ery %>% dplyr::select(-pseudotime)) # load the mitochrondria clone data filelist <- list.files('mtscATACpaper_large_data_files/source/mgatk_output',pattern = '^CD34',full.names = T) filelist <- filelist[grepl('Day',filelist)] lapply(filelist,function(x){ readRDS(x) }) -> clone_define_list names(clone_define_list) <- filelist # load the SummarizedExperiment (SE) data filelist <- list.files('mtscATACpaper_large_data_files/intermediate',pattern = '^CD34',full.names = T) filelist <- filelist[grepl('Day',filelist)] lapply(filelist,function(x){ readRDS(x) }) -> se_list names(se_list) <- filelist # load the mitochrondria clone-id match table to single cells readRDS('mtscATACpaper_reproducibility/figure_invitro_CD34/output/cluster-id-500.rds') -> clone_id_500 readRDS('mtscATACpaper_reproducibility/figure_invitro_CD34/output/cluster-id-800.rds') -> clone_id_800 clone_id_500$origin <- 'CD34_500' clone_id_800$origin <- 'CD34_800' clone_id <- rbind(clone_id_800,clone_id_500) colnames(clone_id) <- c('cell','mt_clone','origin') # rename SE lapply(names(se_list),function(x){ id = gsub('.+/','',gsub('.rds$','',x)) se_list[[x]] -> dat1 colnames(dat1) <- paste0(id,'-',colnames(dat1)) dat1 }) -> se_list1 names(se_list1) <- gsub('.+/','',gsub('.rds$','',names(se_list))) SummarizedExperiment::cbind(se_list1[[1]],se_list1[[2]],se_list1[[3]],se_list1[[4]],se_list1[[5]]) -> combined_ranges_se combined_ranges_se@rowRanges -> GR combined_ranges_se@colData %>% as.data.frame(row.names=NULL) -> cell_definition cell_definition$cell <- as.character(rownames(cell_definition)) cell_definition <- left_join(cell_definition,left_join(df_umap,clone_id)) assays(combined_ranges_se)[['counts']] -> mm ## this is the count matrix Step 3. Estimate single cell EpiTrace age ''''''''''''''''''''''''''''''''''''''''' EpiTrace age is estimated on the full dataset. Note: you might want to change the `ncore_lim` param to suit your machine. Also please note that the final result could be slightly stochastic due to the nature of random sampling in algorithm:: initiated_peaks <- Init_Peakset(GR) initiated_mm <- Init_Matrix(cellname = cell_definition$cell,peakname = initiated_peaks$peakId, matrix = mm) plyranges::reduce_ranges(c(clock_gr_list[[1]],clock_gr_list[[2]])) -> input_clock_gr # infer cell age age_obj <- EpiTraceAge_Convergence(initiated_peaks,initiated_mm, celltype = NULL, clock_gr = input_clock_gr, qualnum = 10, Z_cutoff = 3, mean_error_limit = 0.01, iterative_time = 30, parallel = T, ncore_lim = 46, ref_genome = 'hg19', non_standard_clock = T) # annotate cells CD34_res <- age_obj@meta.data %>% as.data.frame() CD34_res <- left_join(CD34_res,cell_definition) CD34_res$cell_day <- gsub('-.+','',gsub('.+Day','',CD34_res$cell)) %>% as.numeric() CD34_res <- arrange(CD34_res,EpiTraceAge_iterative) CD34_res$cell <- factor(CD34_res$cell,levels=CD34_res$cell) CD34_res$mt_clone <- paste0(CD34_res$origin,"_",CD34_res$mt_clone) CD34_res$Group <- factor(CD34_res$Group,levels=c('prog','prog_my','prog_ery','my1','my2','my3','my4','ery2','ery3','ery4','ery5','ery6')) Step 4. Get the myeloid cells ''''''''''''''''''''''''''''' We extract the myeloid-lineage and progenitor cells from `CD34_800` experiment from the full dataset, use the original cell annotation from :cite:p:`Lareau2021`:: my_cell_types_df <- data.frame( cell_type_gross = c('prog','prog','int','terminal','terminal','terminal'), Group = c('prog','prog_my','my1','my2','my3','my4')) CD34_res_my <- left_join( CD34_res %>% dplyr::filter(Group %in% c('prog','prog_my','my1','my2','my3','my4') & origin %in% 'CD34_800'), my_cell_types_df) Step 5. Summarise cell numbers '''''''''''''''''''''''''''''' We simply calculate the cell numbers in each clone, at specific date. Here, for simplicity, we define day 8 as the earlier timepoint and day 14 as the later timepoint. You can change the "14" to "20" to test another scenario, following this code:: clone_number_earlier_timepoint <- CD34_res_my %>% dplyr::filter(cell_day==8) %>% dplyr::group_by(mt_clone) %>% dplyr::summarise(prog_cells_early=sum(cell_type_gross %in% 'prog'), int_cells_early=sum(cell_type_gross %in% 'int'), terminal_cells_early=sum(cell_type_gross %in% 'terminal'), total_cell_number_early=n()) clone_number_late_timepoint <- CD34_res_my %>% dplyr::filter(cell_day==14) %>% dplyr::group_by(mt_clone) %>% dplyr::summarise(prog_cells_late=sum(cell_type_gross %in% 'prog'), int_cells_late=sum(cell_type_gross %in% 'int'), terminal_cells_late=sum(cell_type_gross %in% 'terminal'), total_cell_number_late=n()) clone_number_time_series <- left_join(clone_number_earlier_timepoint, clone_number_late_timepoint) clone_number_time_series[is.na(clone_number_time_series)] <- 0 # if there is no match in the later timepoint, simply set the cell numbers to 0. in next step we will remove them. Step 6. Filter and annotate cell clones ''''''''''''''''''''''''''''''''''''''' It is possible that some clones die out during experiment or were simply not sampled in latter timepoint. To avoid counting these clones, we only select for those clones that really expanded -- here we select clones that have more terminally differentiated cells at latter timepoint compared to starting (day 8). Since cell expansion potential might be related to its phenotype ("cell type") as well as its age, we also annotate clones according to the type of putative proliferating cells in it:: clone_number_time_series <- clone_number_time_series %>% dplyr::mutate(clone_type=ifelse(prog_cells_early>0 & int_cells_early==0,'prog',ifelse(prog_cells_early==0,'int','both')), clone_type_gross=ifelse(clone_type %in% 'prog','prog','int_or_both')) %>% dplyr::filter(terminal_cells_late>terminal_cells_early) %>% na.omit() Step 7. Define earlier time point clonal age '''''''''''''''''''''''''''''''''''''''''''' We directly compute clonal age of a particular clone at the earlier time point:: # define earlier time point clone age early_timepoint_clone_age <- CD34_res_my %>% dplyr::filter(cell_day==8) %>% dplyr::group_by(mt_clone) %>% dplyr::summarise(early_timepoint_clone_age = mean(EpiTraceAge_iterative)) Step 8. Define clonal expansion capacity '''''''''''''''''''''''''''''''''''''''' We define clonal expansion capacity of a particular clone as the log-ratio of increased terminally differentiated cells to the initial population size:: # define clonal expansion capacity clone_expansion_ratio <- clone_number_time_series %>% dplyr::filter(terminal_cells_late>terminal_cells_early) %>% dplyr::group_by(clone_type_gross,mt_clone) %>% dplyr::mutate(expansion_ratio=(terminal_cells_late-terminal_cells_early)/total_cell_number_early) Step 9. Test the association between earlier clonal age and clonal expansion capacity ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' Finally, we combine the age and expansion capacity results for each clone, and test their association:: # match the earlier clonal age with clone expansion ratio for each clone comparison_result_source_data <- left_join(clone_expansion_ratio,early_timepoint_clone_age) # Test the association ggplot(comparison_result_source_data,aes(x=early_timepoint_clone_age,y=expansion_ratio)) + ggpubr::stat_cor(label.x.npc = 0.5,size=6) + scale_y_log10() + geom_smooth(se=F,linetype='dashed',method = 'lm',color='black') + geom_point(aes(fill=clone_type),pch=21,size=6) + scale_fill_manual(values=c('beige','dodgerblue','aquamarine3')) + theme_classic() + theme(text=element_text(size=20)) + xlab('\nMean Prog/Int Cell EpiTrace Age at Day8\n\n\n') + ylab('\n\nLog Expansion Ratio of Terminal Cell at Day20\n') The result is summarised in this figure: .. image:: _static/mtscATAC.svg :width: 600px :align: center