Skip to contents

Introduction

TARDIS offers an easy and straightforward way to automatically calculate area under the peak, max intensity and various quality metrics for targeted chemical compounds in LC-MS data. It makes use of an established retention time correction algorithm from the xcms package and loads MS data as Spectra objects so it’s easily integrated with other tools of the Rformassspectrometry initiative.

See README for installation instructions.

In this vignette, a case study on publicly available biological data is described. We presume that we discovered a list of important metabolite biomarkers for cardiovascular disease (CVD), which we would like to validate in this dataset.

Dataset

We use the MetaboLights dataset MTBLS8735, an example untargeted metabolomics dataset tailored to quantify the small polar metabolome in human plasma samples and aimed to identify differences between individuals suffering from a cardiovascular disease (CVD) and healthy controls (CTR). The subset analyzed here includes data for three CVD patients, three CTR individuals, and four quality control (QC) samples. The QC samples, representing a pooled serum sample from a large cohort, were measured repeatedly throughout the experiment to monitor signal stability.

All samples were analyzed using ultra-high-performance liquid chromatography (UHPLC) (Agilent 1290; Agilent Technologies, Santa Clara, CA, USA) coupled to a Q-TOF mass spectrometer (TripleTOF 5600+; AB Sciex, Foster City, CA, USA). The chromatographic separation was based on hydrophilic interaction liquid chromatography (HILIC) and performed using an Acquity BEH amide, 100 x 2.1 mm column (Waters Corporation, Milford, MA, USA).

The list of internal standards: Alanine 13C315N (0.9 μg/mL), arginine 13C615N4 (1.8 μg/mL), aspartic acid 13C415N (1.3 μg/mL), cystine 13C615N2 (1.2 μg/mL), glutamic acid 13C515N (1.5 μg/mL), glycine 13C215N (0.8 μg/mL), histidine 13C615N3 (1.6 μg/mL), isoleucine 13C615N (1.3 μg/mL), leucine 13C615N (1.3 μg/mL), lysine 13C615N2 (1.5 μg/mL), methionine 13C515N (1.5 μg/mL), phenylalanine 13C915N (1.7 μg/mL), proline 13C515N (1.2 μg/mL), serine 13C315N (1.1 μg/mL), threonine 13C415N (1.2 μg/ mL), tyrosine 13C915N (1.8 μg/mL) and valine 13C515N (1.2 μg/mL).

For more details, we refer to the MetaboLights repository.

Loading data

The first step is creating a data.frame that describes the metabolites of interest. In this case, this are the internal standards and the 6 biomarkers of interest. Following columns at least need to be present for each compound:

  • A compound ID, a unique identifier
  • A compound Name
  • Theoretical or measured m/z
  • Expected RT (in minutes)
  • A column that indicates the polarity of the formed ion for that compound

Extra columns can be included in the file, but will be ignored by TARDIS unless otherwise indicated.

An input file (either .xlsx or .csv) can be converted to a correct data.frame using the createTargetList() function. Input parameters needed are: the path to the file, the patterns for positive and negative ionization, the polarity of interest, the columnn that contains the ionization mode and the other columns of interest.

library(readxl)
input <- read.csv2("vignette_data/bio_targets.csv",sep = ",")
kableExtra::kable(head(input))
ID name abbreviation formula POS NEG RT data_set sample operator version quality_NEG quality_POS mz mzmin mzmax rtmin rtmax rt
1 Carnitine (D3) carnitine_d3 164.124 [M+H]+ NA 61 2019_11_matrix_effect serum Mar Garcia-Aloy 2 3_bad 2_medium 165.131276 165.126276 165.136276 51 71 1.016666667
2 Creatinine (N-Methhyl_D3) creatinine_methyl_d3 116.0777 [M+H]+ [M-H]- 126 2019_11_matrix_effect serum Mar Garcia-Aloy 2 2_medium 2_medium 117.084976 117.079976 117.089976 116 136 2.1
3 Glucose (6,6-D2) glucose_d2 182.0759 [M+Na]+ [M+CHO2]- 166 2019_11_matrix_effect serum Mar Garcia-Aloy 2 1_good 2_medium 205.06512 205.06012 205.07012 156 176 2.766666667
4 L-Alanine (13C3, 99%; 15N, 99%) alanine_13C_15N 93.0548 [M+H]+ [M-H]- 167 2019_11_matrix_effect serum Mar Garcia-Aloy 2 2_medium 1_good 94.062076 94.057076 94.067076 157 177 2.783333333
5 L-Arginine HCl (13C6, 99%; 15N4, 99%) arginine_13C_15N 184.1199 [M+H]+ [M-H]- 183 2019_11_matrix_effect serum Mar Garcia-Aloy 2 0_perfect 0_perfect 185.127176 185.122176 185.132176 173 193 3.05
6 L-Aspartic acid (13C4, 99%; 15N, 99%) aspartic_13C_15N 138.048 [M+H]+ [M-H]- 179 2019_11_matrix_effect serum Mar Garcia-Aloy 2 1_good 1_good 139.055276 139.050276 139.060276 169 189 2.983333333

The target data.frame is created using createTargetList(). For more info on how to use the function, check the help page: ?createTargetList

Our biomarkers are defined as:

library(TARDIS)
targets <- createTargetList("vignette_data/bio_targets.csv",
                            pos_pattern = "+",
                            neg_pattern = "-",
                            polarity = "positive",
                            ion_column = "POS",
                            columns_of_interest = c("ID", "name", "mz", "rt"))
kableExtra::kable(targets[21:26,])
ID NAME m/z tr
21 21 Unknown_1 182.0749 34.83789
22 22 Caffeine 195.0877 32.65668
23 23 BENZYL METHYL SULFIDE 161.0400 162.13668
24 24 Anthranilic acid 138.0547 148.39599
25 25 Isoproturon 229.1299 181.08828
26 26 Unknown_2 560.3603 33.54917

Below we extract our dataset from the MetaboLigths database and load it as an MsExperiment object. For more information on how to load your data from the MetaboLights database, we refer to the MsIO vignette.

library(MsExperiment)
library(MsIO)
library(MsBackendMetaboLights)
param <- MetaboLightsParam(mtblsId = "MTBLS8735",
                           assayName = paste0("a_MTBLS8735_LC-MS_positive_",
                                              "hilic_metabolite_profiling.txt"),
                           filePattern = ".mzML")
lcms1 <- readMsObject(MsExperiment(),
                      param,
                      keepOntology = FALSE,
                      keepProtocol = FALSE,
                      simplify = TRUE)
lcms1
#> Object of class MsExperiment 
#>  Spectra: MS1 (17210) 
#>  Experiment data: 10 sample(s)
#>  Sample data links:
#>   - spectra: 10 sample(s) to 17210 element(s).

Simplifying the sampleData:

colnames(sampleData(lcms1)) <- c("sample_name", "spectraOrigin",
                                "metabolite_asssignment_file",
                                "source_name",
                                "organism",
                                "blood_sample_type",
                                "type", "age", "unit", "phenotype")

# Add "QC" to the phenotype of the QC samples
sampleData(lcms1)$phenotype[sampleData(lcms1)$sample_name == "POOL"] <- "QC"
sampleData(lcms1)$sample_name[sampleData(lcms1)$sample_name == "POOL" ] <- c("POOL1", "POOL2", "POOL3", "POOL4")
sampleData(lcms1)$type[sampleData(lcms1)$type == "pool"] <- "QC"

#  Add injection index column
sampleData(lcms1)$injection_index <- seq_len(nrow(sampleData(lcms1)))
sampleData(lcms1)[, c("spectraOrigin",
                     "phenotype", "sample_name", "type",
                     "injection_index")] |>
    kableExtra::kable(format = "pipe")
spectraOrigin phenotype sample_name type injection_index
FILES/MS_QC_POOL_1_POS.mzML QC POOL1 QC 1
FILES/MS_A_POS.mzML CVD A experimental sample 2
FILES/MS_B_POS.mzML CTR B experimental sample 3
FILES/MS_QC_POOL_2_POS.mzML QC POOL2 QC 4
FILES/MS_C_POS.mzML CTR C experimental sample 5
FILES/MS_D_POS.mzML CVD D experimental sample 6
FILES/MS_QC_POOL_3_POS.mzML QC POOL3 QC 7
FILES/MS_E_POS.mzML CTR E experimental sample 8
FILES/MS_F_POS.mzML CVD F experimental sample 9
FILES/MS_QC_POOL_4_POS.mzML QC POOL4 QC 10

IMPORTANT! The sample type column (indicating the QCs) has to be named type. and the column indicating the file has to be named spectraOrigin.

Screening mode

First, we perform a screening step to check if our targets are visible within our m/z and RT windows.

We can run screening mode using the argument screening_mode = TRUE in the tardisPeaks function.

For more details on the inputs of the function, please read the help page ?tardisPeaks

To limit the running time of this vignette, we limit the internal standard targets to five, which should be enough for good RT alignment, given their retention times span throughout most of the analysis. For more info on how to choose the right “housekeeping compounds” for RT alignment, we refer the reader to the xcms documentation.

  • L-Cystine (13C6, 99%; 15N2, 99%)
  • L-Methionine (13C5, 99%; 15N, 99%)
  • L-Glutamic acid (13C5, 99%; 15N, 99%)
  • L-Phenylalanine (13C9, 99%; 15N, 99%)
  • L-Serine (13C3, 99%; 15N, 99%)
subset <- targets[c(17,14,7,8,15,21:26),]
results <- tardisPeaks( lcmsData = lcms1, 
                        dbData = subset,
                        mass_range = NULL,
                        polarity = "positive",
                        output_directory = "vignette_data/output/case/screening/",
                        batch_positions = list(c(1,10)),
                        QC_pattern = "QC",
                        int_std_id = as.character(1:20),
                        screening_mode = TRUE)
#> Performing retention time correction using 5 peak groups.

The resulting EICs are saved in the output folder and can be inspected.

Based on the EICs we can see that detection, RT alignment and integration of target peaks in QC runs were successful. However, the signal of component 26, a non-annotated biomarker, is of remarkably lower quality than the others. One pooled QC shows a high signal that is absent in the other QC runs.

Peak detection

Now we can perform peak detection in all our runs by setting screening_mode = FALSE.

results <- tardisPeaks( lcmsData = lcms1, 
                        dbData = subset,
                        mass_range = NULL,
                        polarity = "positive",
                        output_directory = "vignette_data/output/case/",
                        batch_positions = list(c(1,10)),
                        QC_pattern = "QC",
                        int_std_id = as.character(1:20),
                        screening_mode = FALSE)
#> Performing retention time correction using 5 peak groups.
#> Aligning sample number 2 against subset ... OK
#> Aligning sample number 3 against subset ... OK
#> Aligning sample number 5 against subset ... OK
#> Aligning sample number 6 against subset ... OK
#> Aligning sample number 8 against subset ... OK
#> Aligning sample number 9 against subset ... OK

Results

The results object is a list that contains a data.frame with the AUC of each target in each run and a tibble that contains a feature table with the average metrics for each target in the QC runs.

AUC <- results[[1]]
kableExtra::kable(head(AUC))
Component MS_A_POS.mzML MS_B_POS.mzML MS_C_POS.mzML MS_D_POS.mzML MS_E_POS.mzML MS_F_POS.mzML MS_QC_POOL_1_POS.mzML MS_QC_POOL_2_POS.mzML MS_QC_POOL_3_POS.mzML MS_QC_POOL_4_POS.mzML
14 16525.69812 18002.5842 16543.549 17354.10495 19069.943 17914.6009 19826.2410 20773.350 21403.3610 20310.1275
15 17633.56844 18678.1921 16358.943 20086.53309 19134.836 11104.8497 18041.7061 17538.437 19005.3145 18484.4783
17 9696.06094 10009.2589 10862.924 9731.80795 10586.717 11158.9709 10348.6157 10773.937 10188.5909 10082.6647
21 NA 3573.0464 3157.537 47.03151 4260.996 NA 1498.4509 2006.222 1746.3844 1378.9488
22 827.47100 79504.3672 116928.037 2204.12719 166453.150 827.6966 36758.7755 40199.217 39582.0053 39544.2507
23 28.60398 384.4481 632.872 NA 1976.859 NA 660.7771 536.172 715.3061 672.9312
feat_table <- results[[2]]
kableExtra::kable(head(feat_table))
Component AUC MaxInt SNR peak_cor foundRT pop ID NAME m.z tr mean na.rm trold
14 20578.2698 12635.3442 16.64761 0.9696404 159.81106 18.00 14 L-Methionine (13C5, 99%; 15N, 99%) 156.0722 159.81106 159.81106 TRUE 161.00000
15 18267.4839 10256.7736 12.89131 0.9476993 151.52398 21.25 15 L-Phenylalanine (13C9, 99%; 15N, 99%) 176.1135 151.52398 151.52398 TRUE 153.00000
17 10348.4520 6464.6150 19.84894 0.9757596 179.14857 17.50 17 L-Serine (13C3, 99%; 15N, 99%) 110.0570 179.14857 179.14857 TRUE 178.00000
21 1657.5015 918.1422 12.93310 0.9439880 34.89898 15.75 21 Unknown_1 182.0749 34.89898 34.89898 TRUE 34.83789
22 39021.0621 27834.8161 30.26548 0.9887586 32.87623 14.00 22 Caffeine 195.0877 32.87623 32.87623 TRUE 32.65668
23 646.2966 392.2353 17.00577 0.9647597 162.04193 14.50 23 BENZYL METHYL SULFIDE 161.0400 162.04193 162.04193 TRUE 162.13668

Other results include tables with the other metrics (Max. Int., SNR, peak_cor and points over the peak) and are saved into the output folder in .csv format.

maxint <- read.csv("vignette_data/output/case/int_table.csv",check.names = FALSE)[,-1]
SNR <- read.csv("vignette_data/output/case/snr_table.csv",check.names = FALSE)[,-1]
peak_cor <- read.csv("vignette_data/output/case/peakcor_table.csv",check.names = FALSE)[,-1]
pop <- read.csv("vignette_data/output/case/pop_table.csv",check.names = FALSE)[,-1]
kableExtra::kable(head(pop))
Component MS_A_POS.mzML MS_B_POS.mzML MS_C_POS.mzML MS_D_POS.mzML MS_E_POS.mzML MS_F_POS.mzML MS_QC_POOL_1_POS.mzML MS_QC_POOL_2_POS.mzML MS_QC_POOL_3_POS.mzML MS_QC_POOL_4_POS.mzML
14 21 22 26 25 19 20 19 18 17 18
15 23 22 25 23 24 23 21 21 22 21
17 16 20 14 15 15 14 17 18 19 16
21 NA 13 20 7 13 NA 13 21 17 12
22 10 14 13 13 14 9 15 14 14 13
23 7 12 11 NA 23 NA 15 12 16 15

Exploratory data analysis

The next section of this vignette describes an example exploratory data analysis using the results of TARDIS preprocessing.

PCA

First, we use perform a principal component analysis (PCA) using the AUC’s of our biomarkers:

# Load required packages
library(dplyr)
library(tidyr)
library(ggplot2)
# Limit input data to biomarkers and not ISTD targets
data <- AUC[4:9,]
sampledata <- data.frame(sampleData(lcms1))
# Remove "FILES/" prefix
sampledata$spectraOrigin <- gsub("FILES/", "", sampledata$spectraOrigin)
# Ensure row names correspond to Feature ID
rownames(data) <- data[, 1]
data <- data[, -1]  # Remove the feature ID column for PCA
# Transpose the data so that samples are rows and features are columns
data_t <- as.data.frame(t(data))
# Impute NA values
na_unidis <- function(z) {
    na <- is.na(z)
    if (any(na)) {
        min = min(z, na.rm = TRUE)
        z[na] <- runif(sum(na), min = min/2, max = min)
    }
    z
}
tmp <- apply(data_t, MARGIN = 1, na_unidis)
data_t <- t(tmp)
# Perform PCA
pca_result <- prcomp(data_t, scale. = TRUE,center = TRUE)
# Extract PCA scores
pca_df <- as.data.frame(pca_result$x)
pca_df$Sample <- rownames(pca_df)
# Merge with sample metadata to get phenotype information
pca_df <- left_join(pca_df, sampledata, by = c("Sample" = "spectraOrigin"))
# Plot PCA
ggplot(pca_df, aes(x = PC1, y = PC2, color = phenotype)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "PCA",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  scale_color_manual(values = c("red", "blue", "green")) # Customize colors if needed

Results are as expected:

  • QC samples cluster together, indicating little technical variance in the analysis. However, one QC sample clearly differs from the others. This is due to component 26, which, as shown earlier, showed a large peak in one of the QC runs, but not in the others.

  • Samples with the CVD phenotype cluster together, more so than the CTR samples indicating that the biomarkers vary less within in the CVD phenotype as compared to the healthy controls.

Disregarding component nr 26, the PCA looks like:

# Limit input data to biomarkers and not ISTD targets
data <- AUC[4:8,]
sampledata <- data.frame(sampleData(lcms1))
# Remove "FILES/" prefix
sampledata$spectraOrigin <- gsub("FILES/", "", sampledata$spectraOrigin)
# Ensure row names correspond to Feature ID
rownames(data) <- data[, 1]
data <- data[, -1]  # Remove the feature ID column for PCA
# Transpose the data so that samples are rows and features are columns
data_t <- as.data.frame(t(data))
# Impute NA values
na_unidis <- function(z) {
    na <- is.na(z)
    if (any(na)) {
        min = min(z, na.rm = TRUE)
        z[na] <- runif(sum(na), min = min/2, max = min)
    }
    z
}
tmp <- apply(data_t, MARGIN = 1, na_unidis)
data_t <- t(tmp)
# Perform PCA
pca_result <- prcomp(data_t, scale. = TRUE,center = TRUE)
# Extract PCA scores
pca_df <- as.data.frame(pca_result$x)
pca_df$Sample <- rownames(pca_df)
# Merge with sample metadata to get phenotype information
pca_df <- left_join(pca_df, sampledata, by = c("Sample" = "spectraOrigin"))
# Plot PCA
ggplot(pca_df, aes(x = PC1, y = PC2, color = phenotype)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "PCA",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  scale_color_manual(values = c("blue", "red", "green")) # Customize colors if needed

Differential abundance analysis

In this section we perform differential abundance analysis to investigate which differences in abudance of biomarkers we observe between the two phenotype groups.

First we remove the ISTD targets from the data and impute missing values.

# Limit input data to biomarkers and not ISTD targets
data <- AUC[4:8,]
sampledata <- data.frame(sampleData(lcms1))
sampledata$spectraOrigin <- gsub("FILES/", "", sampledata$spectraOrigin)
# Ensure row names correspond to Feature ID
rownames(data) <- data[, 1]
data <- data[, -1]  # Remove the feature ID column for PCA
data_t <- as.data.frame(t(data))
na_unidis <- function(z) {
    na <- is.na(z)
    if (any(na)) {
        min = min(z, na.rm = TRUE)
        z[na] <- runif(sum(na), min = min/2, max = min)
    }
    z
}
#' Row-wise impute missing values and add the data as a new assay
tmp <- apply(data_t, MARGIN = 1, na_unidis)
data_t <- as.data.frame(t(tmp))
# Merge with sample metadata
data_t$Sample <- rownames(data_t)
data_t <- left_join(data_t, sampledata, by = c("Sample" = "spectraOrigin"))
data_t <- data_t[-which(data_t$phenotype == "QC"),]

Next we visualize the log2 transformed abudances of each feature in each group using boxplots.

boxdata <- data_t[,c(1:5,15)]
boxdata <- pivot_longer(boxdata,cols = !phenotype,names_to = "feature",values_to = "abundance")
ggplot(boxdata, aes(x=feature, y=log2(abundance), fill=phenotype)) + 
    geom_boxplot() +
  scale_fill_manual(values = c("blue","red"))

To test the differences for each biomarker, we use to wilcox.test function. From the boxplots above, we known that the biomarkers are underrepresented in the CVD phenotype, hence, we use the alternative = "less" option in the wilcox.test function. Data is log2 transformed.

pres <- data.frame(Feature = colnames(data_t)[1:5], p_value = NA)
for (i in 1:5) {
  feature_values <- data_t[, i]
  group1 <- log2(feature_values[data_t$phenotype == "CVD"])
  group2 <- log2(feature_values[data_t$phenotype == "CTR"])
  pres$p_value[i] <- wilcox.test(group1, group2,alternative = "less")$p.value
  }
pres$adj_p_value <- p.adjust(pres$p_value, method = "fdr")
knitr::kable(pres)
Feature p_value adj_p_value
21 0.05 0.05
22 0.05 0.05
23 0.05 0.05
24 0.05 0.05
25 0.05 0.05

Conclusion

Based on the results preprocessed using TARDIS, we can conclude that the earlier discovered metabolites are indeed valid biomarkers for the CVD phenotype and are significantly underrepresented in the CVD samples.