Cleaning Lipid Names for Annotation Part 2
By Jeremy Selva
March 7, 2022
Introduction
In this blog, I will introduce another set of lipid annotations used at my workplace. They need to be cleaned up and modified so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).
R Packages Used
library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).
Labels to clean
Here is the list of lipid names to clean.
Given Name | Clean Name For Annotation | Precursor Ion | Product Ion |
---|---|---|---|
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3 |
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3 |
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5 |
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5 |
TG 54:3 [SIM] | TG 54:3 | 902.8 | 902.8 |
The word [SIM] stands for selected ion monitoring. SIM is used to detect TG species with a known total number of carbon atoms and double bonds. In the case of TG 54:3 [SIM], the total number of carbon atoms is 54 and the total number of double bonds is 3.
There are several limitations of this acquisition mode, as indicated by Han and Ye (4). One of them is that this method is unable to give information about the fatty acyl chains. As a result, multiple precursor ion and neutral loss (NL) scan modes were introduced to identify potential fatty acyl chains that can be attached to the TG. For example, TG 54:3 [NL-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3, while TG 54:3 [-18:3] measures the amount of fatty acyl chain 18:3 attached to TG 54:3 instead.
Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names.
c("DG 32:0 [-16:0]",
  "TG 54:3 [NL-18:2]",
  "TG 54:3 [SIM]") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)
They must be cleaned up accordingly, as indicated in Liebisch et al. (5).
A positive result should look like the following.
c("DG 16:0_16:0",
  "TG 18:2_36:1",
  "TG 54:3") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)
Read Data
annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-03-07-Clean-Lipid-Names-2/Annotation.csv")
reactable::reactable(annotation_data, defaultPageSize = 5)
The Plan
We can split this complex task into the following steps.
- Find the transition names that end with [SIM], remove the [SIM] and return the result.
Transition names from here on should only be those that measure the neutral loss of a particular fatty acid chain.
We then need to do the following steps to clean such transition names.
- Get the acyl class of the transition.
- Get the total carbon number of the transition.
- Get the total number of double bonds of the transition.
- Get the total carbon number of the measured fatty acid chain.
- Get the total number of double bonds of the measured fatty acid chain.
- Use the tools above to clean the transition name.
In this blog, we will focus only on the first part, which is to remove the [SIM].
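To make the second half of the plan more concrete, here is a minimal sketch of how those pieces of information could be pulled out of one neutral loss transition name. It is not the function built in this series: the helper name parse_transition and its regular expression are assumptions made for illustration only, and the pattern assumes names shaped like "TG 54:3 [NL-18:2]" or "DG 32:0 [-16:0]".

```r
library(stringr)

# Hypothetical helper (illustration only): split a neutral loss
# transition name into the pieces listed in the plan.
parse_transition <- function(transition_name) {
  # Capture groups: class, total carbon, total double bonds,
  # measured chain carbon, measured chain double bonds
  parts <- stringr::str_match(
    string = transition_name,
    pattern = "^\\s*([A-Za-z]+)\\s+(\\d+):(\\d+)\\s*\\[(?:NL)?-(\\d+):(\\d+)\\]\\s*$"
  )
  list(
    acyl_class      = parts[1, 2],
    total_carbon    = as.integer(parts[1, 3]),
    total_db        = as.integer(parts[1, 4]),
    measured_carbon = as.integer(parts[1, 5]),
    measured_db     = as.integer(parts[1, 6])
  )
}

parsed <- parse_transition("TG 54:3 [NL-18:2]")

# The remaining chains are the total minus the measured fatty acid chain
remaining_carbon <- parsed$total_carbon - parsed$measured_carbon
remaining_db <- parsed$total_db - parsed$measured_db

clean_name <- paste0(
  parsed$acyl_class, " ",
  parsed$measured_carbon, ":", parsed$measured_db, "_",
  remaining_carbon, ":", remaining_db
)
clean_name
# [1] "TG 18:2_36:1"
```

The subtraction step is what turns TG 54:3 [NL-18:2] into TG 18:2_36:1, matching the table of names to clean.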
Remove [SIM] at the end
We begin with an empty generic function
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  return(input_acyl)
}
The square brackets [ and ] mean "one of" in a regular expression. For example, the pattern [a-z] tells the software to look for one of the letters ranging from a to z.
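As a small illustration of the "one of" behaviour, the sketch below checks a few made-up single-character strings against [a-z] and pulls the first digit out of a lipid name with [0-9]. The input values are chosen only for this example.

```r
library(stringr)

# "[a-z]" matches any single lowercase letter
letters_found <- stringr::str_detect(string = c("a", "q", "7", "Q"),
                                     pattern = "[a-z]")
letters_found
# [1]  TRUE  TRUE FALSE FALSE

# "[0-9]" matches any single digit; str_extract returns the first match
first_digit <- stringr::str_extract(string = "TG 54:3", pattern = "[0-9]")
first_digit
# [1] "5"
```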
Remember the backslash \
To make the software look explicitly for the characters [ and ], we need to escape them with a backslash \, giving us \[ and \]. However, inside an R string a backslash must itself be written as \\. Doing so gives us \\[ and \\].
Using TG 54:3 [SIM] as an example, we have
stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]")
## [1] "TG 54:3 "
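To see why the escaping matters, the sketch below compares three variants on the same input: the text that the regular expression engine actually receives, the unescaped pattern, and stringr::fixed(), which is an alternative (not used in this blog) that treats the pattern as literal text.

```r
library(stringr)

# What the regular expression engine receives after R reads the string
writeLines("\\[SIM\\]")
# \[SIM\]

# Without escaping, "[SIM]" is a character class, so only the first
# "S", "I" or "M" found in the string is removed
unescaped <- stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "[SIM]")
unescaped
# [1] "TG 54:3 [IM]"

# stringr::fixed() matches the pattern literally, with no escaping needed
literal <- stringr::str_remove(string = "TG 54:3 [SIM]",
                               pattern = stringr::fixed("[SIM]"))
literal
# [1] "TG 54:3 "
```

The fixed() variant is convenient for plain text, but it cannot express whitespace or end-of-string rules, which is why a regular expression is kept here.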
Removing the whitespaces
Notice that the trailing white space is not removed. One way to remove it is to use stringr::str_trim()
stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]") %>%
  stringr::str_trim()
## [1] "TG 54:3"
Another way is to add white space to our pattern \\[SIM\\]. Taking a look at the stringr cheat sheet, we can use \\s, which matches a single white space character.
To show that we need to remove zero or more white spaces, we add the quantifier *
This expands the pattern to \\s*\\[SIM\\]\\s*
stringr::str_remove(string = "TG 54:3 [SIM] ", pattern = "\\s*\\[SIM\\]\\s*")
## [1] "TG 54:3"
Indicating the end of a string
To be more specific, we need to remove [SIM] only when it is at the end of the string. To do so, we add $ at the end of the pattern: \\s*\\[SIM\\]\\s*$
stringr::str_remove(string = c("TG 54:3 [SIM] ",
                               " [SIM] TG 54:3"),
                    pattern = "\\s*\\[SIM\\]\\s*$")
## [1] "TG 54:3" " [SIM] TG 54:3"
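For contrast, here is a short sketch of the same two inputs run with and without the $ anchor. The variable names are made up for this example.

```r
library(stringr)

transition_names <- c("TG 54:3 [SIM] ", " [SIM] TG 54:3")

# Without "$", a [SIM] anywhere in the string is removed,
# so both elements become "TG 54:3"
without_anchor <- stringr::str_remove(string = transition_names,
                                      pattern = "\\s*\\[SIM\\]\\s*")
without_anchor

# With "$", only a [SIM] at the end of the string is removed,
# so the second element is left unchanged
with_anchor <- stringr::str_remove(string = transition_names,
                                   pattern = "\\s*\\[SIM\\]\\s*$")
with_anchor
```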
Using isTRUE in an if statement
Now that we know how to remove the [SIM] at the end, we need to create an if statement to detect that such a pattern exists. This is because after the [SIM] is removed, the transition names do not need to be modified further and the cleaned name can be returned.
One way to do this is to use stringr::str_detect
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (stringr::str_detect(string = input_acyl,
                          pattern = "\\s*\\[SIM\\]\\s*$")) {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }
  return(input_acyl)
}
However, there is a possibility that stringr::str_detect does not return a single TRUE or FALSE value when it is given an unusual input. This causes an error in the if statement. Here are some examples
if(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {
print("No error")
}
## Error in if (stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {: argument is of length zero
if(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {
print("No error")
}
## Error in if (stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {: missing value where TRUE/FALSE needed
To rectify this issue, we make use of the function isTRUE, which returns the boolean value FALSE when it receives such unusual input.
isTRUE(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$"))
## [1] FALSE
isTRUE(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$"))
## [1] FALSE
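The behaviour of isTRUE can be checked on its own with base R values, without involving stringr. The sketch below covers a missing value, a length zero input and, as an extra case not discussed above, a logical vector with more than one value.

```r
# isTRUE returns TRUE only for a single, non-missing TRUE value
results <- c(
  single_true = isTRUE(TRUE),
  missing     = isTRUE(NA),           # missing value
  empty       = isTRUE(logical(0)),   # length zero
  two_values  = isTRUE(c(TRUE, TRUE)) # more than one value
)
results
# single_true     missing       empty  two_values 
#        TRUE       FALSE       FALSE       FALSE 
```

The last case means that an if statement guarded by isTRUE is skipped when several names are passed in at once, so such a check is best applied to one name at a time.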
Putting it all together, we have the following so far.
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$"))) {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }
  return(input_acyl)
}
Plan Execution
Here is what it looks like when the function is used.
clean_acyl("TG 54:3 [SIM]")
## [1] "TG 54:3"
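Because the function checks its input with isTRUE, it handles one name at a time. A possible way to clean a whole column of names is to apply it element by element, for example with purrr::map_chr. This is a sketch under that assumption rather than the approach taken in this series; the function is repeated here only so that the block runs on its own, and given_names is a made-up example vector.

```r
library(stringr)
library(purrr)

# Copy of the function built so far, repeated to keep this block self-contained
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$"))) {
    input_acyl <- stringr::str_remove(string = input_acyl,
                                      pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }
  return(input_acyl)
}

given_names <- c("TG 54:3 [SIM]", "DG 32:0 [-16:0]", "TG 54:3 [NL-18:2]")

# Apply the function to one name at a time
cleaned_names <- purrr::map_chr(given_names, clean_acyl)
cleaned_names
# [1] "TG 54:3"           "DG 32:0 [-16:0]"   "TG 54:3 [NL-18:2]"

# Passing the whole vector in one call leaves it unchanged,
# because isTRUE() is FALSE for a logical vector of length three
unchanged <- clean_acyl(given_names)
```

Only the [SIM] name is cleaned at this stage; the neutral loss names pass through untouched.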
Stay tuned for the next part, where we will expand the function to handle the second part of the plan: transitions that indicate a neutral loss of fatty acid chains.
Given Name | Clean Name For Annotation | Precursor Ion | Product Ion |
---|---|---|---|
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3 |
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3 |
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5 |
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5 |
Package References
get_citation <- function(package_name) {
  transform_name <- package_name %>%
    citation() %>%
    format(style = "text")
  return(transform_name)
}
packages <- c("base", "rgoslin", "reactable",
              "flair", "magrittr",
              "stringr", "dplyr", "report",
              "tibble", "purrr")
table <- tibble::tibble(Packages = packages)
table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>%
  dplyr::pull(.data[["transform_name"]]) %>%
  report::as.report_parameters()
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
- Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
- Bodwin K, Glanz H (2020). flair: Highlight, Annotate, and Format your R Source Code. R package version 0.0.2, https://CRAN.R-project.org/package=flair.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
References
1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690
2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430
3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y
4. Han X, Ye H. Overview of lipidomic analysis of triglyceride molecular species in biological lipid extracts. Journal of Agricultural and Food Chemistry [Internet]. 2021 Aug 18;69(32):8895–909. Available from: https://doi.org/10.1021/acs.jafc.0c07175
5. Liebisch G, Fahy E, Aoki J, et al. Update on LIPID MAPS classification, nomenclature, and shorthand notation for MS-derived lipid structures. Journal of Lipid Research [Internet]. 2020 Dec 1;61(12):1539–55. Available from: https://doi.org/10.1194/jlr.S120001025