Cleaning Lipid Names for Annotation Part 1
By Jeremy Selva
January 30, 2022
Introduction
Despite efforts to unify lipid shorthand notations and the rise of software dedicated to standardising lipid annotations, lipid names remain diverse. This is due to the limited ways lipid software can label a lipid and/or researchers’ personal preferences in annotating them.
Here, I would like to highlight some lipid annotations that my workplace uses and show how to modify them in R so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).
R Packages Used
library("rgoslin")
library("reactable")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).
Labels to clean
Here is the list of lipid names to clean.
Given Name | Clean Name For Annotation | Precursor Ion | Product Ion |
---|---|---|---|
LPC 13:0 (ISTD) (a) | LPC 13:0 | 454.3 | 184.1 |
LPC 13:0 (ISTD) (a\b) | LPC 13:0 | 454.3 | 184.1 |
LPC 13:0 (ISTD) (b) | LPC 13:0 | 454.3 | 184.1 |
LPC 15:0-d5 (IS) (a) | LPC 15:0 | 487.4 | 304.3 |
LPC 15:0-d5 (IS) (a\b) | LPC 15:0 | 487.4 | 304.3 |
LPC 15:0-d5 (IS) (b) | LPC 15:0 | 487.4 | 304.3 |
LPC 17:1 (a) | LPC 17:1 | 508.3 | 184.1 |
LPC 17:1 (a/b/c) | LPC 17:1 | 508.3 | 184.1 |
LPC 17:1 (b) | LPC 17:1 | 508.3 | 184.1 |
LPC 17:1 (c) | LPC 17:1 | 508.3 | 184.1 |
The label (IS) or (ISTD) stands for internal standard.
The suffix -d5 means that five hydrogen atoms in the compound are replaced with deuterium.
These internal standards can be purchased via these links
The (a\b) in this case refers to which lipid isomer is integrated. Here is an example with LPC 17:1
Identifying such isomers is still ongoing research for most lipids. Thankfully, for LPC 17:1 in human plasma, the identities can be found in these links.
Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names.
c("LPC 13:0 (ISTD) (a\\b)",
  "LPC 15:0-d5 (IS) (a\\b)",
  "LPC 17:1 (a/b/c) ") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)
They must be cleaned up accordingly.
c("LPC 13:0", "LPC 15:0", "LPC 17:1") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)
The task is to remove the variations of (ISTD) and (a\b), as well as the -d5.
Read Data
annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-01-22-Clean-Lipid-Names-1/Annotation.csv")
reactable::reactable(annotation_data, defaultPageSize = 5)
The Plan
We will do the following in order:
1. Remove the -d5
2. Remove the ISTD and its variations
3. Remove the (a\b\c) and its variations
The idea is to first use stringr::str_view to check what the matched pattern is. Once the matched pattern is correct, we can use the stringr::str_remove function to remove it.
Remove -d5
Here is an example of finding the -d5 with stringr::str_view and then removing it with stringr::str_remove. Simply set the pattern as “-d5”.
stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "-d5")
stringr::str_remove("LPC 15:0-d5 (IS) (a)", pattern = "-d5")
## [1] "LPC 15:0 (IS) (a)"
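As a rough sketch of the same check on several names at once, stringr::str_detect can report which names contain the pattern before stringr::str_remove is applied. The vector given_names below is a small hand-typed sample from the table above, and the name "LPC 18:1-d7 (IS) (a)" together with the broader pattern "-d[0-9]+" are assumptions for illustration only; they do not appear in the original data.

```r
library(stringr)

# Small sample of given names copied from the table above
given_names <- c("LPC 13:0 (ISTD) (a)", "LPC 15:0-d5 (IS) (a)", "LPC 17:1 (a)")

# str_detect() reports which names contain the pattern
has_d5 <- str_detect(given_names, pattern = "-d5")

# str_remove() only changes the names in which the pattern is found
without_d5 <- str_remove(given_names, pattern = "-d5")

# Assumption: a hypothetical name with a different number of deuterium atoms.
# "[0-9]+" means one or more digits, so "-d[0-9]+" covers -d5, -d7 and so on.
without_d7 <- str_remove("LPC 18:1-d7 (IS) (a)", pattern = "-d[0-9]+")

print(has_d5)
print(without_d5)
print(without_d7)
```

Names without the pattern are returned unchanged, so the removal can be applied safely to the whole column.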
Remove the ISTD variation
The first challenge is to create a pattern that is able to remove variations of ISTD such as (ISTD) and (IS).
Detect the word ISTD
To detect the word ISTD, we can do this in R
stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "ISTD")
Unfortunately, this will not work with the word IS
stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "ISTD")
Alternatively to detect the word IS, we can do this in R
stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "IS")
This time, it will not work fully with the word ISTD, as only the IS part is matched.
stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "IS")
To detect words of the form IS and ISTD, we can make use of the fact that the group “TD” in “ISTD” appears zero or one time. Referring to the cheat sheet of stringr, we can make use of parentheses to create a group.
Hence, adding the group to the existing pattern, we have the form IS(TD).
Next, we inform stringr that the group (TD) can appear zero or one time. Referring to the cheat sheet of stringr, we can make use of the ? symbol.
This updates our pattern to IS(TD)?
Applying this to our existing list of given names, we have
stringr::str_view(annotation_data$`Given Name`, pattern = "IS(TD)?")
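For a non-interactive check, the sketch below uses stringr::str_extract to return the text that IS(TD)? matches. The vector given_names is a hand-typed sample from the table above rather than the full annotation_data column.

```r
library(stringr)

# Small sample of given names copied from the table above
given_names <- c("LPC 13:0 (ISTD) (a)", "LPC 15:0-d5 (IS) (a)", "LPC 17:1 (a)")

# "IS" followed by an optional "TD" group.
# The ? is greedy, so "ISTD" is preferred over "IS" when both are possible.
matched <- str_extract(given_names, pattern = "IS(TD)?")

print(matched)
```

The last name has no internal standard label, so str_extract returns NA for it.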
Detect Parentheses
Detecting ( and ) is unfortunately not as simple as stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "("), which gives this error message.
stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "(")
## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=`(`)
This is because ( and ) fall under a group called “meta characters” that have other functions in regular expressions. In fact, we have just seen what they do, which is to group characters together.
To tell stringr that we want to search for ( and ) literally, we need to add two backslashes \\ in front of them as escape characters, as indicated in the stringr cheat sheet. Two are needed because R itself treats \ as an escape character in strings, so "\\(" in R code is the two-character regular expression \(.
stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\(")
stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\)")
Putting it all together, we have the pattern \\(IS(TD)?\\)
stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\(IS(TD)?\\)")
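The escaping can be checked with a small sketch. It counts the opening parentheses in one name in two ways: with the escaped regular expression, and with stringr::fixed, which tells stringr to treat the pattern as plain text so that no escaping is needed. The use of stringr::fixed is an alternative not used in the original workflow, and given_names is a hand-typed sample from the table above.

```r
library(stringr)

name <- "LPC 13:0 (ISTD) (a)"

# The R string "\\(" holds only two characters: a backslash and a parenthesis
n_chars <- nchar("\\(")

# Count literal opening parentheses with an escaped regular expression
n_open_regex <- str_count(name, pattern = "\\(")

# fixed() matches the pattern as plain text, so "(" needs no escaping
n_open_fixed <- str_count(name, pattern = fixed("("))

# Extract the internal standard label together with its parentheses
given_names <- c("LPC 13:0 (ISTD) (a)", "LPC 15:0-d5 (IS) (a)", "LPC 17:1 (a)")
labels <- str_extract(given_names, pattern = "\\(IS(TD)?\\)")

print(labels)
```

fixed() is convenient for a single literal character, but it cannot be combined with regular expression features such as the optional group (TD)?, which is why the escaped form is used in the full pattern.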
Remove (a\b\c) and its variations
The second challenge is to create a pattern to remove variations of (a\b\c).
What is consistent is that the letters are always written in lowercase.
However, these are the main issues I faced when dealing with this variation:
- The letter does not always start with a
- The list of letters can be separated by \ or /
- The list can expand indefinitely. For example, it can be
  - (a\b\…\f)
  - (b\d\f)
Here is what I have done to resolve the above issues.
Letter does not always start with a
To create a pattern that matches small letters from a to z, we can use the square brackets [ and ] and the hyphen - as indicated in the stringr cheat sheet.
Applying what we have learnt, we have the pattern \\([a-z]\\). The a-z means the range from a to z. The square brackets [ and ] mean “one of”. Hence [a-z] tells the software to look for one of the letters ranging from a to z.
stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]\\)")
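A quick sketch of what this pattern does and does not match yet, using a few hand-typed names from the table above:

```r
library(stringr)

# Small sample of given names copied from the table above
given_names <- c("LPC 17:1 (a)", "LPC 17:1 (c)", "LPC 17:1 (a/b/c)", "LPC 13:0 (ISTD) (a)")

# One lowercase letter enclosed in literal parentheses
single_letter <- str_extract(given_names, pattern = "\\([a-z]\\)")

print(single_letter)
```

(ISTD) is skipped because its letters are uppercase, while (a/b/c) is not matched at all because the pattern allows only a single letter between the parentheses. The next two steps deal with that.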
The list of letters can be separated by \ or /
Matching / is easy.
stringr::str_view_all("LPC 17:1 (a/b/c)", pattern = "/")
Matching \ is not as easy.
stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\")
## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)
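\ also falls under the group called “meta characters”. Referring to the stringr cheat sheet, to search for the pattern \ explicitly, we use four backslashes \\\\ in the R string. R reads these as two backslashes, which the regular expression then reads as one literal backslash.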
stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\\\")
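The question now is how to incorporate “or” into the pattern. Referring again to the stringr cheat sheet, the symbol for “or” is |
The pattern therefore is [\\\\|/] as we are looking for one of / or \
Note that the square brackets already mean “one of”, so the | inside them is treated as a literal character. The pattern therefore also matches a literal |, which is harmless here because none of the given names contain one. [\\\\/] is a stricter alternative.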
stringr::str_view_all(c("LPC 13:0 (ISTD) (a\\b)", "LPC 17:1 (a/b/c)"), pattern = "[\\\\|/]")
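The sketch below checks the separator pattern on four short test strings. The strings and the stricter character class [\\\\/] are assumptions added for illustration; the workflow in this post keeps using [\\\\|/].

```r
library(stringr)

# Hypothetical test strings: backslash, forward slash, vertical bar, no separator
separators <- c("a\\b", "a/b", "a|b", "ab")

# "a\\b" in R code is three characters: a, one backslash, b
n_chars <- nchar("a\\b")

# Pattern used in this post: inside [ ], the | is a literal character,
# so this class matches a backslash, a vertical bar or a forward slash
with_bar <- str_detect(separators, pattern = "[\\\\|/]")

# Stricter class that matches only a backslash or a forward slash
without_bar <- str_detect(separators, pattern = "[\\\\/]")

print(with_bar)
print(without_bar)
```

Both classes behave the same on the given names in this post, since none of them use | as a separator.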
List can expand indefinitely
This one is a bit tricky. The patterns (a\b\…\f) and (a/b/c) can be viewed as
([some small letter][\ or /][some small letter])
where the whole pattern [\ or /][some small letter] appears zero or more times.
To add the element of zero or more times, we use the * character.
This gives the pattern ([\\\\|/][a-z])*. The parentheses () ensure that the whole pattern [\ or /][some small letter], and not just its last part, is repeated zero or more times.
Putting the three solutions together, we have the pattern \\([a-z]([\\\\|/][a-z])*\\)
stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
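As a sketch, the combined pattern can be tried on one name for each of the issues listed earlier. The names carrying (b\d\f) and the one without any isomer label are made up for illustration and are not in the annotation file.

```r
library(stringr)

isomer_pattern <- "\\([a-z]([\\\\|/][a-z])*\\)"

examples <- c(
  "LPC 17:1 (a)",            # single letter
  "LPC 13:0 (ISTD) (a\\b)",  # backslash separator
  "LPC 17:1 (a/b/c)",        # forward slash separator, three letters
  "LPC 17:1 (b\\d\\f)",      # hypothetical: list that does not start with a
  "LPC 13:0 (ISTD)"          # hypothetical: no isomer label at all
)

# Extract the isomer label, or NA when there is none
isomer_label <- str_extract(examples, pattern = isomer_pattern)

# writeLines() shows the strings as stored, with single backslashes
writeLines(isomer_label[!is.na(isomer_label)])
```

The uppercase (ISTD) label is left alone by this pattern, which is why it is removed in its own step.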
Plan Execution
With the three removal patterns ready, we can clean the given names as follows.
Clean_Name <- annotation_data[["Given Name"]] %>%
  stringr::str_remove(pattern = "-d5") %>%
  stringr::str_remove(pattern = "\\(IS(TD)?\\)") %>%
  stringr::str_remove(pattern = "\\([a-z]([\\\\|/][a-z])*\\)") %>%
  # Remove the leftover spaces at the start and end of each name
  stringr::str_trim()
annotation_data %>%
  # Create a new column with the Clean Names
  dplyr::mutate(`Clean Name For Annotation` = Clean_Name) %>%
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name", "Clean Name For Annotation"))
  ) %>%
  reactable::reactable(defaultPageSize = 5)
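If the same cleaning is needed in several places, the three steps could be wrapped in a small function. This is only a sketch: clean_lipid_name is a hypothetical helper name, not something provided by stringr or rgoslin, and the input is a hand-typed sample rather than the annotation file.

```r
library(stringr)

# Hypothetical helper (not part of any package) that bundles the three
# removal steps so they can be reused on any character vector
clean_lipid_name <- function(given_name) {
  cleaned <- str_remove(given_name, pattern = "-d5")
  cleaned <- str_remove(cleaned, pattern = "\\(IS(TD)?\\)")
  cleaned <- str_remove(cleaned, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
  # Trim the spaces left behind at the start and end of the name
  str_trim(cleaned)
}

given_names <- c("LPC 13:0 (ISTD) (a)", "LPC 15:0-d5 (IS) (a\\b)", "LPC 17:1 (a/b/c) ")
cleaned_names <- clean_lipid_name(given_names)

print(cleaned_names)
```

Note that stringr::str_remove removes only the first match in each name and stringr::str_trim removes only leading and trailing spaces. That is enough for the names shown here, where each label appears once and sits at the end of the name.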
Package References
# Return the citation of a package as formatted text
get_citation <- function(package_name) {
  transform_name <- package_name %>%
    citation() %>%
    format(style = "text")
  return(transform_name)
}
packages <- c("base", "rgoslin", "reactable",
              "readr", "magrittr", "stringr",
              "dplyr", "report",
              "tibble", "purrr")
table <- tibble::tibble(Packages = packages)
table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>%
  dplyr::pull(.data[["transform_name"]]) %>%
  report::as.report_parameters()
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
- Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
- Wickham H, Hester J, Bryan J (2022). readr: Read Rectangular Text Data. R package version 2.1.2, https://CRAN.R-project.org/package=readr.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
References
1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690
2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430
3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y