Preparing data • greatR

Data requirements

Input data frame required for greatR to register your gene expression profiles is a data frame that contains gene expression time-course data with all replicates. The illustrated diagram below shows the required input_df data frame format of greatR. This data frame must contain:

two different species which users wish to compare (for each gene, at least two replicates are required in all species), and
six data columns:
- locus_name: accession or unique ID of each gene.
- accession: accession or name of the species to compare.
- timepoint: time point of the gene expression data.
- expression_value: desired RNA-seq expression value or measure of the abundance of gene or transcripts which one wishes to compare. This value can be RPM, RPKM, FPKM, TPM, TMM, DESeq, SCnorm, GeTMM, ComBat-Seq and raw reads counts.
- tissue: tissue from which RNA-seq data were extracted.
- group: replicate ID for each expression value.

Using a helper function to prepare input data

greatR package provides a helper function get_expression_of_interest() to combine reference and query (data to register) data frames, if table ID which connect reference and query data frames provided (as illustrated in the figure below).

This package also provides data frames as examples to use this function. Those data frames are expression data of B. rapa (Calderwood et al., 2021), Arabidopsis (Klepikova et al., 2015), and join table or table ID which contains both accession of B. rapa and Arabidopsis. This table ID is important to combine both B. rapa and Arabidopsis data frames.

Load and view example data

# Load the package
library(greatR)
library(dplyr)

To load example data frames provided within the package, users need to run the following procedure:

# Load example data
id_table_rapa_subset <- system.file(
  "extdata/sample_data/id_table_5genes.rds", 
  package = "greatR") %>% 
  readRDS()
klepikova_subset <- system.file(
  "extdata/sample_data/arabidopsis_expression_5genes.rds", 
  package = "greatR") %>% 
  readRDS()
rapa_subset <- system.file(
  "extdata/sample_data/brassica_rapa_expression_5genes.rds", 
  package = "greatR") %>% 
  readRDS()

Let’s take a look on each data frame we have loaded below:

# View reference data
rapa_subset %>%
  utils::head(3) %>%
  knitr::kable()

CDS.model	sample_id	FPKM	accession	tissue	timepoint	dataset	est_counts	group	norm.cpm
BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_1	0.287365	Ro18	apex	11	ro18_rna_seq	12	Ro18-apex-11	0.3968734
BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_2	1.069565	Ro18	apex	11	ro18_rna_seq	47	Ro18-apex-11	1.4147711
BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_3	0.541893	Ro18	apex	11	ro18_rna_seq	23	Ro18-apex-11	0.7423984

# View query data
klepikova_subset %>%
  utils::head(3) %>%
  knitr::kable()

CDS.model	sample_id	FPKM	accession	tissue	timepoint	dataset	est_counts	group	norm.cpm
AT1G69120	ERR_ds_klepikova_SRR1659921	0.299101	Col0	apex	7	ds_klepikova_1	4	Col0-apex-07	0.4667855
AT1G69120	ERR_ds_klepikova_SRR1659922	0.020064	Col0	apex	7	ds_klepikova_2	1	Col0-apex-07	0.0741901
AT1G69120	ERR_ds_klepikova_SRR1660397	0.000000	Col0	apex	8	ds_klepikova_1	0	Col0-apex-08	0.0000000

As we can see, both data frames of expression profiles above contain a column of gene accession (CDS.model). These two columns from each data frame were mapped into a table ID as we can see below: CDS.model for B. rapa gene accessions and locus_name for Arabidopsis gene accessions.

# View table ID
id_table_rapa_subset %>%
  dplyr::select(CDS.model, locus_name, symbol, gene_id) %>% 
  utils::head(3) %>%
  knitr::kable()

CDS.model	locus_name	symbol	gene_id
BRAA01G016140.3C	AT4G24540	AGL24	AT4G24540
BRAA02G018970.3C	AT1G69120	AP1	AT1G69120
BRAA02G043220.3C	AT5G61850	LFY	AT5G61850

Having all of these informations, we can then easily get the required input data.

Combine data

Users can use function get_expression_of_interest() to combine two different table data frames containing expression values that users wish to compare.

all_data <- get_expression_of_interest(
  data_ref = rapa_subset,
  data_to_transform = klepikova_subset,
  id_table = id_table_rapa_subset,
  lookup_col_ref_and_id_table = "CDS.model",
  lookup_col_ref_and_to_transform = "locus_name",
  colnames_wanted = NULL,
  tissue_wanted = "apex",
  gene_of_interest_acc = c("AT1G69120", "AT5G618"),
  sum_exp_data_ref = FALSE,
  accession_data_to_transform = "Col0"
)

Let’s take a look on how the combined data:

# View data
all_data %>%
  dplyr::group_by(accession) %>%
  dplyr::slice(1:3) %>%
  knitr::kable()

locus_name	CDS.model	sample_id	FPKM	accession	tissue	timepoint	dataset	est_counts	group	norm.cpm	id_transform_data
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ds_klepikova_SRR1659921	0.299101	Col0	apex	7	ds_klepikova_1	4	Col0-apex-07	0.4667855	AT1G69120
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ds_klepikova_SRR1659922	0.020064	Col0	apex	7	ds_klepikova_2	1	Col0-apex-07	0.0741901	AT1G69120
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ds_klepikova_SRR1660397	0.000000	Col0	apex	8	ds_klepikova_1	0	Col0-apex-08	0.0000000	AT1G69120
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_1	0.287365	Ro18	apex	11	ro18_rna_seq	12	Ro18-apex-11	0.3968734	AT1G69120
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_2	1.069565	Ro18	apex	11	ro18_rna_seq	47	Ro18-apex-11	1.4147711	AT1G69120
BRAA02G018970.3C	BRAA02G018970.3C	ERR_ro18_rna_seq_v3_R18A1_3	0.541893	Ro18	apex	11	ro18_rna_seq	23	Ro18-apex-11	0.7423984	AT1G69120

After combining reference data and data to transform, users need to adjust it to the required input data, this can be done as follows.

all_data_final <- all_data %>% 
  dplyr::select(
    locus_name, 
    accession, 
    timepoint, 
    expression_value = norm.cpm, 
    tissue, 
    group
  )

all_data_final %>%
  dplyr::group_by(accession) %>%
  dplyr::slice(1:6) %>%
  knitr::kable()

locus_name	accession	timepoint	expression_value	tissue	group
BRAA02G018970.3C	Col0	7	0.4667855	apex	Col0-apex-07
BRAA02G018970.3C	Col0	7	0.0741901	apex	Col0-apex-07
BRAA02G018970.3C	Col0	8	0.0000000	apex	Col0-apex-08
BRAA02G018970.3C	Col0	8	0.0000000	apex	Col0-apex-08
BRAA02G018970.3C	Col0	9	0.3722542	apex	Col0-apex-09
BRAA02G018970.3C	Col0	9	0.0000000	apex	Col0-apex-09
BRAA02G018970.3C	Ro18	11	0.3968734	apex	Ro18-apex-11
BRAA02G018970.3C	Ro18	11	1.4147711	apex	Ro18-apex-11
BRAA02G018970.3C	Ro18	11	0.7423984	apex	Ro18-apex-11
BRAA02G018970.3C	Ro18	29	11.3007002	apex	Ro18-apex-29
BRAA02G018970.3C	Ro18	29	23.2055664	apex	Ro18-apex-29
BRAA02G018970.3C	Ro18	29	22.0307747	apex	Ro18-apex-29