Hematopoietic cell expansion potential

In this demo, we will show you a complex example of using EpiTrace to evaluate a hypothesis. In this published experiment [Lareau et al., 2021], single CD34+ hematopoietic cells were firstly expanded in vitro for 8 days, then forced into differentiation by adding in LIF/EPO. Cells were collected and sequenced by scATAC at day 8 (initiating), day 14 (intermediate) and day 20 (late). We are curious about whether the starting cell “age” would be associated with its future expansion potential (“clonal expansion capacity”)? In other words, compared to the old cells, would the younger cells be more competitive, and derive more phenotypically mature and functional offsprings?

We focused on analyzing the myeloid lineage cells in this experiment. For simplicity and reproducibility, in this demo we omit the data processing steps for raw mtscATAC sequencing data, particularly the clone calling steps. Instead, this result could be reproduced by using the processed mtscATAC data from the authors. It can be downloaded from this link. Here, we assume that you have put it in a folder your_mtscATAC_dir.

In the following code, we would show you how to test this hypothesis by:

  • registering each single cell to a clone, using the somatic genetic mutation on its mitochondrial genome (done by mGATK, information included in the mtscATAC repo.)

  • estimating single cell age using EpiTrace

  • calculating the mean cell age of cells in each individual clone at the starting time point

  • calculating the number of terminally differentiated myeloid cells in each individual clone at each timepoint

  • calculate the log-expansion ratio of terminally differentiated myeloid cell between timepoints (this stands for the clonal expansion capacity)

  • test the association between the mean starting cell age and log-expansion ratio of each individual clone

Step 1. Initialize environment

Load libraries for this tutorial:

library(dplyr)
library(tidyr)
library(readr)
library(GenomicRanges)
library(reshape2)
library(openxlsx)
library(ggplot2)
library(Matrix)
library(EpiTrace)
library(Seurat)
library(SeuratObject)
library(patchwork)

Step 2. Load data

We load the reference ClockDML using:

data("clock_gr_list")

We assume that you have downloaded all the data from the mtscATAC_reproducibility repo, including the large OSF zip archieve, to your_mtscATAC_dir:

setwd('your_mtscATAC_dir') ## change this to your dir.

# load the UMAP data
load('mtscATACpaper_reproducibility/figure_invitro_CD34/output/trajectory_inferences.18march2020.rda')
df_ery$pseudutime_ery <- df_ery$pseudotime
df_my$pseudutime_my <- df_my$pseudotime
df_ery$cell <- rownames(df_ery)
df_my$cell <- rownames(df_my)
df_umap <- left_join(df_my %>% dplyr::select(-pseudotime),df_ery %>% dplyr::select(-pseudotime))

# load the mitochrondria clone data
filelist <- list.files('mtscATACpaper_large_data_files/source/mgatk_output',pattern = '^CD34',full.names = T)
filelist <- filelist[grepl('Day',filelist)]
lapply(filelist,function(x){
  readRDS(x)
}) -> clone_define_list
names(clone_define_list) <- filelist

# load the SummarizedExperiment (SE) data
filelist <- list.files('mtscATACpaper_large_data_files/intermediate',pattern = '^CD34',full.names = T)
filelist <- filelist[grepl('Day',filelist)]
lapply(filelist,function(x){
  readRDS(x)
}) -> se_list
names(se_list) <- filelist

# load the mitochrondria clone-id match table to single cells
readRDS('mtscATACpaper_reproducibility/figure_invitro_CD34/output/cluster-id-500.rds') -> clone_id_500
readRDS('mtscATACpaper_reproducibility/figure_invitro_CD34/output/cluster-id-800.rds') -> clone_id_800
clone_id_500$origin <- 'CD34_500'
clone_id_800$origin <- 'CD34_800'
clone_id <- rbind(clone_id_800,clone_id_500)
colnames(clone_id) <- c('cell','mt_clone','origin')

# rename SE
lapply(names(se_list),function(x){
  id = gsub('.+/','',gsub('.rds$','',x))
  se_list[[x]] -> dat1
  colnames(dat1) <- paste0(id,'-',colnames(dat1))
  dat1
}) -> se_list1
names(se_list1) <- gsub('.+/','',gsub('.rds$','',names(se_list)))

SummarizedExperiment::cbind(se_list1[[1]],se_list1[[2]],se_list1[[3]],se_list1[[4]],se_list1[[5]]) -> combined_ranges_se

combined_ranges_se@rowRanges -> GR
combined_ranges_se@colData %>% as.data.frame(row.names=NULL) -> cell_definition
cell_definition$cell <- as.character(rownames(cell_definition))

cell_definition <- left_join(cell_definition,left_join(df_umap,clone_id))
assays(combined_ranges_se)[['counts']] -> mm ## this is the count matrix

Step 3. Estimate single cell EpiTrace age

EpiTrace age is estimated on the full dataset. Note: you might want to change the ncore_lim param to suit your machine. Also please note that the final result could be slightly stochastic due to the nature of random sampling in algorithm:

initiated_peaks <- Init_Peakset(GR)
initiated_mm <- Init_Matrix(cellname = cell_definition$cell,peakname = initiated_peaks$peakId, matrix = mm)
plyranges::reduce_ranges(c(clock_gr_list[[1]],clock_gr_list[[2]])) -> input_clock_gr
# infer cell age
age_obj <- EpiTraceAge_Convergence(initiated_peaks,initiated_mm,
  celltype = NULL,
  clock_gr = input_clock_gr,
  qualnum = 10,
  Z_cutoff = 3,
  mean_error_limit = 0.01,
  iterative_time = 30,
  parallel = T,
  ncore_lim = 46,
  ref_genome = 'hg19',
  non_standard_clock = T)
# annotate cells
CD34_res <- age_obj@meta.data %>% as.data.frame()
CD34_res <- left_join(CD34_res,cell_definition)
CD34_res$cell_day <- gsub('-.+','',gsub('.+Day','',CD34_res$cell)) %>% as.numeric()
CD34_res <- arrange(CD34_res,EpiTraceAge_iterative)
CD34_res$cell <- factor(CD34_res$cell,levels=CD34_res$cell)
CD34_res$mt_clone <- paste0(CD34_res$origin,"_",CD34_res$mt_clone)
CD34_res$Group <- factor(CD34_res$Group,levels=c('prog','prog_my','prog_ery','my1','my2','my3','my4','ery2','ery3','ery4','ery5','ery6'))

Step 4. Get the myeloid cells

We extract the myeloid-lineage and progenitor cells from CD34_800 experiment from the full dataset, use the original cell annotation from [Lareau et al., 2021]:

my_cell_types_df <- data.frame(
    cell_type_gross = c('prog','prog','int','terminal','terminal','terminal'),
    Group = c('prog','prog_my','my1','my2','my3','my4'))
CD34_res_my <- left_join( CD34_res %>%
                              dplyr::filter(Group %in%  c('prog','prog_my','my1','my2','my3','my4') & origin %in% 'CD34_800'),
                          my_cell_types_df)

Step 5. Summarise cell numbers

We simply calculate the cell numbers in each clone, at specific date. Here, for simplicity, we define day 8 as the earlier timepoint and day 14 as the later timepoint. You can change the “14” to “20” to test another scenario, following this code:

clone_number_earlier_timepoint <- CD34_res_my %>%
    dplyr::filter(cell_day==8) %>%
    dplyr::group_by(mt_clone) %>%
    dplyr::summarise(prog_cells_early=sum(cell_type_gross %in% 'prog'),
                     int_cells_early=sum(cell_type_gross %in% 'int'),
                     terminal_cells_early=sum(cell_type_gross %in% 'terminal'),
                     total_cell_number_early=n())

clone_number_late_timepoint <- CD34_res_my %>%
    dplyr::filter(cell_day==14) %>%
    dplyr::group_by(mt_clone) %>%
    dplyr::summarise(prog_cells_late=sum(cell_type_gross %in% 'prog'),
                     int_cells_late=sum(cell_type_gross %in% 'int'),
                     terminal_cells_late=sum(cell_type_gross %in% 'terminal'),
                     total_cell_number_late=n())

clone_number_time_series <- left_join(clone_number_earlier_timepoint,
                                           clone_number_late_timepoint)
clone_number_time_series[is.na(clone_number_time_series)] <- 0 # if there is no match in the later timepoint, simply set the cell numbers to 0. in next step we will remove them.

Step 6. Filter and annotate cell clones

It is possible that some clones die out during experiment or were simply not sampled in latter timepoint. To avoid counting these clones, we only select for those clones that really expanded – here we select clones that have more terminally differentiated cells at latter timepoint compared to starting (day 8). Since cell expansion potential might be related to its phenotype (“cell type”) as well as its age, we also annotate clones according to the type of putative proliferating cells in it:

clone_number_time_series <- clone_number_time_series %>%
  dplyr::mutate(clone_type=ifelse(prog_cells_early>0 & int_cells_early==0,'prog',ifelse(prog_cells_early==0,'int','both')),
                clone_type_gross=ifelse(clone_type %in% 'prog','prog','int_or_both')) %>%
  dplyr::filter(terminal_cells_late>terminal_cells_early) %>%
  na.omit()

Step 7. Define earlier time point clonal age

We directly compute clonal age of a particular clone at the earlier time point:

# define earlier time point clone age
early_timepoint_clone_age <- CD34_res_my %>%
  dplyr::filter(cell_day==8) %>%
  dplyr::group_by(mt_clone) %>%
  dplyr::summarise(early_timepoint_clone_age = mean(EpiTraceAge_iterative))

Step 8. Define clonal expansion capacity

We define clonal expansion capacity of a particular clone as the log-ratio of increased terminally differentiated cells to the initial population size:

# define clonal expansion capacity
clone_expansion_ratio <- clone_number_time_series %>%
  dplyr::filter(terminal_cells_late>terminal_cells_early) %>%
  dplyr::group_by(clone_type_gross,mt_clone) %>%
  dplyr::mutate(expansion_ratio=(terminal_cells_late-terminal_cells_early)/total_cell_number_early)

Step 9. Test the association between earlier clonal age and clonal expansion capacity

Finally, we combine the age and expansion capacity results for each clone, and test their association:

# match the earlier clonal age with clone expansion ratio for each clone
comparison_result_source_data <- left_join(clone_expansion_ratio,early_timepoint_clone_age)
# Test the association
ggplot(comparison_result_source_data,aes(x=early_timepoint_clone_age,y=expansion_ratio))  + ggpubr::stat_cor(label.x.npc = 0.5,size=6) + scale_y_log10() + geom_smooth(se=F,linetype='dashed',method = 'lm',color='black') + geom_point(aes(fill=clone_type),pch=21,size=6) + scale_fill_manual(values=c('beige','dodgerblue','aquamarine3')) + theme_classic() + theme(text=element_text(size=20)) + xlab('\nMean Prog/Int Cell EpiTrace Age at Day8\n\n\n') + ylab('\n\nLog Expansion Ratio of Terminal Cell at Day20\n')

The result is summarised in this figure:

_images/mtscATAC.svg