Cleaning Lipid Names for Annotation Part 3
By Jeremy Selva
April 23, 2022
Introduction
In this blog post, we continue the process of cleaning up the lipid annotations that my workplace uses, shown in the table below.
Given Name | Clean Name For Annotation | Precursor Ion | Product Ion |
---|---|---|---|
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3 |
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3 |
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5 |
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5 |
TG 54:3 [SIM] | TG 54:3 | 902.8 | 902.8 |
This is so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).
The function we ended with last time, shown below, removes the [SIM] at the end of the transition name.
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {

  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$"))) {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }

  return(input_acyl)
}
clean_acyl("TG 54:3 [SIM]")
## [1] "TG 54:3"
We continue the cleaning task for these transitions.
Given Name | Clean Name For Annotation | Precursor Ion | Product Ion |
---|---|---|---|
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3 |
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3 |
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5 |
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5 |
To recap, each of these transitions measures the amount of one fatty acyl chain attached to the given DG or TG. For example, TG 54:3 [NL-18:1] measures the amount of fatty acyl chain 18:1 attached to TG 54:3, while TG 54:3 [-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3 instead.
In general, they are in the form
- {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]
where the NL in front of the - is optional. We need to transform them into the form
- {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
where
- {remaining_C} is {total_C - measured_C}
- {remaining_DB} is {total_DB - measured_DB}
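As a quick sanity check of this rule, the sketch below applies the two subtractions to the four transitions in the table above and rebuilds the clean names. The vectors are typed in by hand for illustration only; they are not read from the annotation file.

```r
# Values typed in by hand from the table above (illustration only)
acyl_class  <- c("DG", "DG", "TG", "TG")
total_C     <- c(32, 36, 54, 54)
total_DB    <- c(0, 1, 3, 3)
measured_C  <- c(16, 18, 18, 18)
measured_DB <- c(0, 1, 1, 2)

# remaining = total - measured, for carbons and for double bonds
remaining_C  <- total_C - measured_C
remaining_DB <- total_DB - measured_DB

# Rebuild {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
clean_name <- paste0(
  acyl_class, " ",
  measured_C, ":", measured_DB, "_",
  remaining_C, ":", remaining_DB
)
clean_name
## [1] "DG 16:0_16:0" "DG 18:1_18:0" "TG 18:1_36:2" "TG 18:2_36:1"
```

The results agree with the Clean Name For Annotation column of the table.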
R packages used
library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("glue")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), glue (v1.6.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).
The plan
We recall the following steps needed to clean such transition names.
- Get the acyl class of the transition. {acyl_class}
- Get the total carbon number of the transition. {total_C}
- Get the total number of double bonds of the transition. {total_DB}
- Get the carbon number of the measured fatty acid chain. {measured_C}
- Get the number of double bonds of the measured fatty acid chain. {measured_DB}
- Use the tools above to clean the transition name.
Read data
annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-04-23-Clean-Lipid-Names-3/Annotation.csv")
reactable::reactable(annotation_data, defaultPageSize = 5)
Get the acyl class {acyl_class}
We now try to get {acyl_class} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].
To extract the DG (or DAG) or TG (or TAG) at the front of the transition name, we first add ^ to tell the software to look at the beginning of the string. Next, as the first letter can only be a D or T, we use the character class [D|T]. Strictly speaking, | is a literal character inside square brackets rather than "or", so [DT] is the tighter equivalent; the extra | is harmless here because no transition name starts with it. As A is optional (it appears zero or one time), we add a ? after the A. Finally, we simply add the G at the end.
This gives the final pattern ^[D|T]A?G
acyl_class <- "TG 54:3 [NL-18:1]" %>%
stringr::str_extract(pattern = "^[D|T]A?G")
acyl_class
## [1] "TG"
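One detail worth checking is how the pattern behaves on the other class spellings. The sketch below runs it over a few made-up names (the PC entry is a hypothetical non-matching example). It also compares [D|T] with the tighter [DT] on a string that starts with |, since inside square brackets | is treated as a literal character.

```r
# Made-up inputs for illustration; "PC 34:1" is neither a DG nor a TG
example_names <- c("DG 32:0 [-16:0]", "DAG 32:0 [-16:0]",
                   "TG 54:3 [NL-18:1]", "TAG 54:3 [NL-18:1]",
                   "PC 34:1")

extracted <- stringr::str_extract(example_names, pattern = "^[D|T]A?G")
extracted
## [1] "DG"  "DAG" "TG"  "TAG" NA

# Inside [ ], | is a literal character, so [D|T] also accepts "|"
loose <- stringr::str_detect("|G 32:0", pattern = "^[D|T]A?G")
loose
## [1] TRUE

# [DT] only accepts D or T
strict <- stringr::str_detect("|G 32:0", pattern = "^[DT]A?G")
strict
## [1] FALSE
```

Names that do not start with one of the four classes give NA, which the checks later in this post will catch.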
Get the total carbon number of the transition {total_C}
The next step is to extract the total carbon number of the given transition, that is, to get {total_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total carbon number is 54. The key is to remove the {acyl_class} at the front first and then extract the total carbon.
To do this, we use the {acyl_class} extracted earlier as a pattern and remove it using stringr::str_remove.
I use stringr::str_glue to build this pattern. The \\s* part matches any white space after the acyl class so that it is removed as well.
total_C_step_1 <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))
total_C_step_1
## [1] "54:3 [NL-18:1]"
After removing the {acyl_class} at the front, we extract {total_C}:{total_DB} using the pattern ^\\d+:\\d+. This works because the digits are now the first characters of the string. Remember that ^ tells the software to look at the start of the string.
total_C_step_2 <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "^\\d+:\\d+")
total_C_step_2
## [1] "54:3"
From {total_C}:{total_DB}, we can then extract {total_C} using the pattern ^\\d+, which keeps only the digits at the start of the string.
total_C <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "^\\d+:\\d+") %>%
stringr::str_extract(pattern = "^\\d+")
total_C
## [1] "54"
Get the total number of double bonds of the transition {total_DB}
The next step is to extract the total number of double bonds of the given transition, that is, to get {total_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total number of double bonds is 3.
The extraction process is similar to the previous task. The only difference is that once we have obtained {total_C}:{total_DB}, we extract {total_DB} using the pattern \\d+$ instead. Recall that $ tells the software to look at the end of the string.
total_DB <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "^\\d+:\\d+") %>%
stringr::str_extract(pattern = "\\d+$")
total_DB
## [1] "3"
Get the total carbon number of the measured fatty acid chain {measured_C}
We now try to get {measured_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the carbon number of the measured fatty acid chain is 18. To do this, we first remove the acyl class at the front again, giving us {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].
measured_C_step1 <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))
measured_C_step1
## [1] "54:3 [NL-18:1]"
As the measured fatty acid chain comes with a - in front, we extract -{measured_C}:{measured_DB} from {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. This is done by using the pattern -\\s*\\d+:\\d+
measured_C_step2 <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "-\\s*\\d+:\\d+")
measured_C_step2
## [1] "-18:1"
We proceed to remove the - from -{measured_C}:{measured_DB}, giving us {measured_C}:{measured_DB}
measured_C_step3 <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
stringr::str_remove(pattern = "-")
measured_C_step3
## [1] "18:1"
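The transition names in the table use both [NL-18:1] and [-16:0] style labels. The sketch below, using three hand-typed strings, suggests that the same pattern handles both. The third string, with a space after the -, is a hypothetical case: the \\s* lets the pattern match it, but removing only the - would leave a leading space, so removing -\\s* instead may be safer if such names ever appear.

```r
# Hand-typed examples; the third one (space after "-") is hypothetical
labels <- c("54:3 [NL-18:1]", "32:0 [-16:0]", "54:3 [NL- 18:2]")

with_dash <- stringr::str_extract(labels, pattern = "-\\s*\\d+:\\d+")
with_dash
## [1] "-18:1"  "-16:0"  "- 18:2"

# Removing only "-" keeps the space in the third string
dash_only <- stringr::str_remove(with_dash, pattern = "-")
dash_only
## [1] "18:1"  "16:0"  " 18:2"

# Removing "-" together with any white space after it
dash_and_space <- stringr::str_remove(with_dash, pattern = "-\\s*")
dash_and_space
## [1] "18:1" "16:0" "18:2"
```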
Finally, we again use the pattern ^\\d+ to extract the {measured_C}
measured_C <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
stringr::str_remove(pattern = "-") %>%
stringr::str_extract(pattern = "^\\d+")
measured_C
## [1] "18"
Get the total number of double bonds of the measured fatty acid chain {measured_DB}
We now try to get {measured_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the number of double bonds of the measured fatty acid chain is 1.
The process is similar to the previous section. The difference is that once we have obtained {measured_C}:{measured_DB}, we extract {measured_DB} using the pattern \\d+$ instead.
measured_DB <- "TG 54:3 [NL-18:1]" %>%
stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
stringr::str_remove(pattern = "-") %>%
stringr::str_extract(pattern = "\\d+$")
measured_DB
## [1] "1"
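As an alternative to the step-by-step pipelines above, all five pieces can in principle be captured in one pass with stringr::str_match and capture groups. This is only a sketch of a different technique, not what the function in this post does, and the single pattern below is an assumption about the accepted formats.

```r
# One pattern with five capture groups (an assumed, stricter format):
# class, total C, total DB, measured C, measured DB
pattern <- "^([DT]A?G)\\s*(\\d+):(\\d+)\\s*\\[(?:NL)?-\\s*(\\d+):(\\d+)\\]\\s*$"

pieces <- stringr::str_match("TG 54:3 [NL-18:1]", pattern = pattern)

# Column 1 is the full match; columns 2 to 6 are the capture groups
parts <- pieces[1, 2:6]
names(parts) <- c("acyl_class", "total_C", "total_DB",
                  "measured_C", "measured_DB")

parts[["acyl_class"]]
## [1] "TG"
parts[["total_C"]]
## [1] "54"
parts[["measured_DB"]]
## [1] "1"
```

A name that does not fit the pattern gives a row of NA values, so the same kind of check as in the next section would still be needed.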
Check if extraction is successful
There will be situations where the extraction fails because a new transition name has a format that the function has not encountered before. To catch this, we add the following checks.
input_acyl <- "TG 54:3 [NL-18:1]"
if (!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
}
if (!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
}
if (!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
}
if (!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
}
The pattern ^[0-9]+$ tells the software to detect the digits 0 to 9 (indicated by [0-9]). The digits must appear one or more times (indicated by +), and anchoring the pattern at the start (indicated by ^) and the end (indicated by $) means that the whole string must consist only of these digits.
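To see the pattern in action, here is a small sketch on a few hand-typed candidates (made up for illustration). Only a string made of digits alone passes, and a missing value stays missing.

```r
# Hand-typed candidates: only the first is made of digits alone
candidates <- c("54", "54:3", "5a", "", NA)

is_all_digits <- stringr::str_detect(candidates, pattern = "^[0-9]+$")
is_all_digits
## [1]  TRUE FALSE FALSE FALSE    NA
```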
The reason we use isTRUE is to ensure that the condition is always a single TRUE or FALSE. A failed extraction gives NA, stringr::str_detect then returns NA as well, and an if statement stops with an error when its condition is NA. Take a look at this Win Vector LLC blog post for more information.
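The short base R sketch below illustrates the point with a hand-made NA standing in for the result of a failed check.

```r
# A hand-made NA standing in for the result of a failed check
check <- NA

# `if (check)` stops with an error because NA is neither TRUE nor FALSE;
# tryCatch turns that error into the string "error" so the script continues
plain_if <- tryCatch(
  if (check) "passed" else "failed",
  error = function(e) "error"
)
plain_if
## [1] "error"

# isTRUE() turns NA into FALSE, so !isTRUE() flags the failure cleanly
isTRUE(check)
## [1] FALSE
!isTRUE(check)
## [1] TRUE
```

With !isTRUE(...), a missing value therefore leads to our own error message instead of a less helpful one from the if statement.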
Use the tools above to clean the transition name
With the {acyl_class}, {total_C}, {total_DB}, {measured_C} and {measured_DB} extracted successfully, we can proceed to calculate
- {remaining_C} as {total_C - measured_C}
- {remaining_DB} as {total_DB - measured_DB}
remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
as.character()
remaining_C
## [1] "36"
remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
as.character()
remaining_DB
## [1] "2"
Lastly, we use stringr::str_glue to construct {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
output_acyl <- stringr::str_glue(
  "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
) %>%
  as.character()
output_acyl
## [1] "TG 18:1_36:2"
Putting it all together
Combining what we have accomplished in the previous post and in this one, we have the following function and corresponding documentation.
#' @title Clean Acyl Lipids
#' @description Clean Acyl Lipids for `rgoslin` input
#' @param input_acyl A character string highlighting a lipid,
#' Default: 'DG 32:0 \[NL-16:0\]'
#' @return An acyl lipid name that `rgoslin` can accept
#' @details We only accept DG, DAG, TG, TAG for now.
#' Input Acyl Lipid is of the form
#' {acyl_class} {total_C}:{total_DB} \[NL-{measured_C}:{measured_DB}\]
#' where C is carbon and DB is double bond
#' Output Acyl Lipid is of the form
#' {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
#' where {remaining_C} is {total_C - measured_C}
#' and {remaining_DB} is {total_DB - measured_DB}
#' @examples
#' clean_acyl(input_acyl = "DG 32:0 [-16:0]")
#' @rdname clean_acyl
#' @export
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {

  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the result
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$"))) {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }

  # If we have no neutral loss labeling, return the input unchanged
  if (!isTRUE(stringr::str_detect(string = input_acyl,
                                  pattern = "\\s*(\\[)?(NL)?-\\s*\\d+:\\d+(\\])?\\s*$"))) {
    return(input_acyl)
  }

  # From here, all names should end with [NL-XX:X] or [-XX:X]
  acyl_class <- input_acyl %>%
    stringr::str_extract(pattern = "^[D|T]A?G")

  total_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "^\\d+")

  total_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "\\d+$")

  measured_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "^\\d+")

  measured_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "\\d+$")

  # Stop early if any of the four numbers could not be extracted
  if (!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
  }
  if (!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
  }
  if (!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
  }
  if (!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
  }

  remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
    as.character()
  remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
    as.character()

  output_acyl <- stringr::str_glue(
    "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
  ) %>%
    as.character()

  return(output_acyl)
}
Plan execution
Here is what it looks like when the function is used.
cleaned_data <- annotation_data %>%
  dplyr::mutate(
    # clean_acyl returns one string per name,
    # so map_chr gives a character column instead of a list column
    `Clean Name For Annotation` = purrr::map_chr(
      .data[["Given Name"]],
      clean_acyl
    )
  ) %>%
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name", "Clean Name For Annotation"))
  )
cleaned_data %>%
reactable::reactable(defaultPageSize = 5)
cleaned_data[["Clean Name For Annotation"]] %>%
rgoslin::parseLipidNames() %>%
reactable::reactable(defaultPageSize = 5)
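A small note on the mapping step: purrr::map always returns a list, while purrr::map_chr returns a character vector, which is usually the more convenient column type for names. The sketch below uses a hypothetical stand-in function, shout, in place of clean_acyl to show the difference.

```r
# `shout` is a hypothetical stand-in for a function returning one string
shout <- function(x) toupper(x)
given <- c("dg 32:0", "tg 54:3")

as_list <- purrr::map(given, shout)
as_chr  <- purrr::map_chr(given, shout)

# map gives a list with one element per input
is.list(as_list)
## [1] TRUE

# map_chr gives a plain character vector
as_chr
## [1] "DG 32:0" "TG 54:3"
```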
Package references
get_citation <- function(package_name) {
transform_name <- package_name %>%
citation() %>%
format(style="text")
return(transform_name)
}
packages <- c("base","rgoslin", "reactable", "flair",
"magrittr", "stringr", "glue",
"dplyr", "purrr", "tibble", "report")
table <- tibble::tibble(Packages = packages)
table %>%
dplyr::mutate(
transform_name = purrr::map_chr(.data[["Packages"]],
get_citation)
) %>%
dplyr::pull(.data[["transform_name"]]) %>%
report::as.report_parameters()
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
- Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
- Bodwin K, Glanz H (2020). flair: Highlight, Annotate, and Format your R Source Code. R package version 0.0.2, https://CRAN.R-project.org/package=flair.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Hester J, Bryan J (2022). glue: Interpreted String Literals. R package version 1.6.2, https://CRAN.R-project.org/package=glue.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
References
1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690
2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430
3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y