Cleaning Lipid Names for Annotation Part 1

By Jeremy Selva

January 30, 2022

Introduction

Despite efforts to unify lipid shorthand notations and the rise of software dedicated to standardising lipid annotations, lipid names remain diverse. This is due to the limited ways in which lipid software can label a lipid and/or researchers’ personal preferences when annotating them.

Here, I would like to highlight some lipid annotations that my workplace uses and show how to modify them in R so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

R Packages Used

library("rgoslin")
library("reactable")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))

## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

Labels to clean

Here is the list of lipid names to clean.

Given Name	Clean Name For Annotation	Precursor Ion	Product Ion
LPC 13:0 (ISTD) (a)	LPC 13:0	454.3	184.1
LPC 13:0 (ISTD) (a\b)	LPC 13:0	454.3	184.1
LPC 13:0 (ISTD) (b)	LPC 13:0	454.3	184.1
LPC 15:0-d5 (IS) (a)	LPC 15:0	487.4	304.3
LPC 15:0-d5 (IS) (a\b)	LPC 15:0	487.4	304.3
LPC 15:0-d5 (IS) (b)	LPC 15:0	487.4	304.3
LPC 17:1 (a)	LPC 17:1	508.3	184.1
LPC 17:1 (a/b/c)	LPC 17:1	508.3	184.1
LPC 17:1 (b)	LPC 17:1	508.3	184.1
LPC 17:1 (c)	LPC 17:1	508.3	184.1

The label (IS) or (ISTD) stands for internal standard.

The suffix -d5 means the compound contains five deuterium atoms in place of hydrogen atoms.

These internal standards can be purchased via these links

The (a\b) in this case refers to which lipid isomer is integrated. Here is an example with LPC 17:1

[Images: LPC_17_1_a, LPC_17_1_abc, LPC_17_1_b, LPC_17_1_c]

The identity of such isomers is still an ongoing research process for most lipids. Thankfully for the case of LPC 17:1 in human plasma, the identity can be found in these links.

Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names.

c("LPC 13:0 (ISTD) (a\\b)",
  "LPC 15:0-d5 (IS) (a\\b)",
  "LPC 17:1 (a/b/c) ") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

They must be cleaned up accordingly.

c("LPC 13:0","LPC 15:0","LPC 17:1") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

The parsed result is a table with one row per lipid name and the following columns:

Normalized.Name, Original.Name, Grammar, Message, Adduct, Adduct.Charge, Lipid.Maps.Category, Lipid.Maps.Main.Class, Species.Name, Molecular.Species.Name, Sn.Position.Name, Structure.Defined.Name, Full.Structure.Name, Functional.Class.Abbr, Functional.Class.Synonyms, Level, Total.C, Total.OH, Total.DB, Mass, Sum.Formula, FA1.Position, FA1.C, FA1.OH, FA1.DB, FA1.Bond.Type, FA1.DB.Positions, FA2.Position, FA2.C, FA2.OH, FA2.DB, FA2.Bond.Type, FA2.DB.Positions, LCB.Position, LCB.C, LCB.OH, LCB.DB, LCB.Bond.Type, LCB.DB.Positions, FA3.Position, FA3.C, FA3.OH, FA3.DB, FA3.Bond.Type, FA3.DB.Positions, FA4.Position, FA4.C, FA4.OH, FA4.DB, FA4.Bond.Type, FA4.DB.Positions

The non-empty cells displayed for each row, in column order, are:

LPC 13:0 | Shorthand2020 | LPC | LPC 13:0 | [LPC] | [LPC, LysoPC, lysoPC] | MOLECULAR_SPECIES | 453.28553995 | C21H44NO7P | -1 | ESTER | [] | -1 | ESTER | []
LPC 15:0 | Shorthand2020 | LPC | LPC 15:0 | [LPC] | [LPC, LysoPC, lysoPC] | MOLECULAR_SPECIES | 481.31684009 | C23H48NO7P | -1 | ESTER | [] | -1 | ESTER | []
LPC 17:1 | Shorthand2020 | LPC | LPC 17:1 | [LPC] | [LPC, LysoPC, lysoPC] | MOLECULAR_SPECIES | 507.33249016 | C25H50NO7P | -1 | ESTER | [] | -1 | ESTER | []

The task is to remove the variations of (ISTD) and (a\b), as well as the -d5.

Read Data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-01-22-Clean-Lipid-Names-1/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

The Plan

We will do the following in order:

1. Remove the -d5
2. Remove the ISTD and its variations
3. Remove the (a\b\c) and its variations

The idea is to first use stringr::str_view to check what the pattern matches. Once the pattern matches the right text, we can use the stringr::str_remove function to remove the matched text.
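stringr::str_view shows its result in a viewer, which is handy interactively but awkward in a script. As a minimal sketch of the same check-then-remove idea, stringr::str_detect and stringr::str_extract return plain values that can be inspected or tested. The two names below are taken from the list above; the variable names are only illustrative.

```r
library(stringr)

given_names <- c("LPC 13:0 (ISTD) (a)", "LPC 15:0-d5 (IS) (a)")

# Does each name contain the pattern?
has_d5 <- str_detect(given_names, pattern = "-d5")
has_d5
# [1] FALSE  TRUE

# What exactly was matched? NA means no match.
matched <- str_extract(given_names, pattern = "-d5")
matched
# [1] NA    "-d5"
```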

Remove -d5

Here is an example of how to remove the -d5. First check the match with stringr::str_view by simply setting the pattern to “-d5”, then remove it with stringr::str_remove.

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "-d5")

stringr::str_remove("LPC 15:0-d5 (IS) (a)", pattern = "-d5")

## [1] "LPC 15:0 (IS) (a)"
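The literal pattern "-d5" only covers standards with exactly five deuterium atoms. If other labels occur, a slightly more general pattern may help. The sketch below assumes a hypothetical name "LPC 18:1-d7 (IS)" that is not in the data set, and uses "-d[0-9]+" to match -d followed by one or more digits.

```r
library(stringr)

# The second name is a made-up example, not part of the annotation data
deuterated_names <- c("LPC 15:0-d5 (IS) (a)", "LPC 18:1-d7 (IS)")

# [0-9]+ means one or more digits, so -d5, -d7, -d31 ... are all matched
without_label <- str_remove(deuterated_names, pattern = "-d[0-9]+")
without_label
# [1] "LPC 15:0 (IS) (a)" "LPC 18:1 (IS)"
```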

Remove the ISTD variation

The first challenge is to create a pattern that is able to remove variations of ISTD such as (ISTD) and (IS)

Detect the word ISTD

To detect the word ISTD, we can do this in R

stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "ISTD")

Unfortunately, this will not work with the word IS

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "ISTD")

Alternatively to detect the word IS, we can do this in R

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "IS")

This time, this will not work with the word ISTD

stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "IS")

To detect words of the form IS and ISTD, we can make use of the fact that the group “TD” in “ISTD” appears zero or one time. Referring to the stringr cheat sheet, we can use parentheses to create a group.

Hence, adding the group to the existing pattern, we have the form IS(TD).

Next, we inform stringr that the group (TD) can appear zero or one time. Referring to the stringr cheat sheet, we can use the ? symbol.

This gives the updated pattern IS(TD)?

Putting this to our existing list of words, we have

stringr::str_view(annotation_data$`Given Name`, pattern = "IS(TD)?")
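As a small self-contained check of the optional group, the sketch below applies IS(TD)? to three names from the list above with stringr::str_extract, which returns the matched text instead of highlighting it.

```r
library(stringr)

given_names <- c("LPC 13:0 (ISTD) (a)",
                 "LPC 15:0-d5 (IS) (a)",
                 "LPC 17:1 (a)")

# (TD)? makes the group "TD" optional, so both IS and ISTD are matched
istd_match <- str_extract(given_names, pattern = "IS(TD)?")
istd_match
# [1] "ISTD" "IS"   NA
```

The last name has no internal standard label, so the result is NA.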

Detect Parenthesis

Unfortunately, detecting ( and ) is not as simple as setting the pattern to "(". Doing so gives rise to this error message.

stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "(")

## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=`(`)

This is because ( and ) fall under a group called “meta characters” that have other functions in regular expressions. In fact, we have just explained what they do: they group characters together.

To tell stringr that we want to search for ( and ) literally, we need to put two backslashes \\ in front of them, as indicated in the stringr cheat sheet.


stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\(")

stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\)")

Putting it all together, we have the pattern \\(IS(TD)?\\)

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\(IS(TD)?\\)")
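As an aside, escaping is not the only way to match a literal parenthesis. The sketch below compares the escaped pattern with stringr::fixed(), which treats the pattern as plain text, and then shows what stringr::str_remove leaves behind once the whole (IS) or (ISTD) label is matched. The variable names are only illustrative.

```r
library(stringr)

given_name <- "LPC 13:0 (ISTD) (a)"

# Two ways to count literal opening parentheses
n_escaped <- str_count(given_name, pattern = "\\(")
n_fixed <- str_count(given_name, pattern = fixed("("))
c(n_escaped, n_fixed)
# [1] 2 2

# Removing the label leaves two spaces between the name and (a)
no_istd <- str_remove(given_name, pattern = "\\(IS(TD)?\\)")
no_istd
# [1] "LPC 13:0  (a)"
```

The leftover whitespace is why a final trimming step is still needed after all patterns are removed.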

Remove (a\b\c) and its variations

The second challenge is to create a pattern to remove variations of (a\b\c).

One thing that is consistent is that the letters are always written in lowercase.

However, I faced these main issues when dealing with this variation:

- The letter does not always start with a
- The list of letters can be separated by \ or /
- The list can expand indefinitely. For example, it can be
  - (a\b\…\f)
  - (b\d\f)

Here is what I have done to resolve the above issues.

Letter does not always start with `a`

To create a pattern that matches small letters from a to z, we can use the square brackets [ and ] and hyphen - as indicated in the stringr cheat sheet.

Applying what we have learnt, we have the pattern \\([a-z]\\). The a-z means the range from a to z. The square brackets [ and ] mean “one of”. Hence [a-z] tells the software to look for one of the letters ranging from a to z.

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]\\)")
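To see what this pattern does and does not cover, here is a short sketch on four names from the list above. It matches a single lowercase letter in parentheses, so (ISTD) is ignored, but so is the multi-letter list (a/b/c).

```r
library(stringr)

given_names <- c("LPC 17:1 (a)",
                 "LPC 17:1 (b)",
                 "LPC 17:1 (a/b/c)",
                 "LPC 13:0 (ISTD) (a)")

# One lowercase letter wrapped in literal parentheses
single_letter <- str_extract(given_names, pattern = "\\([a-z]\\)")
single_letter
# [1] "(a)" "(b)" NA    "(a)"
```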

The list of letters can be separated by `\` or `/`

Matching / is easy.

stringr::str_view_all("LPC 17:1 (a/b/c)", pattern = "/")

but not so for \

stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\")

## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)

\ also falls under the group called “meta characters”. Referring to the stringr cheat sheet, to search for the pattern \ explicitly, we use four backslashes \\\\ in the R string. R first reduces \\\\ to \\, which the regular expression engine then reads as one literal \.

stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\\\")

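The question now is how to match either of the two separators. The stringr cheat sheet lists | as the “or” operator, but inside square brackets no operator is needed, because [ and ] already mean “one of”.

The pattern used here is [\\\\|/]. Note that inside square brackets | is treated as a literal character, so this pattern matches one of \, / or |. The shorter [\\\\/] would match only \ or /. Since none of the given names contain |, both behave the same on this data.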

stringr::str_view_all(c("LPC 13:0 (ISTD) (a\\b)", "LPC 17:1 (a/b/c)"), pattern = "[\\\\|/]")
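One detail worth checking is how | behaves inside square brackets. The sketch below compares the pattern used in this post with two alternatives, a character class without | and a group with alternation. The test string "a|b" is made up purely to expose the difference and does not come from the data.

```r
library(stringr)

given_names <- c("LPC 13:0 (ISTD) (a\\b)", "LPC 17:1 (a/b/c)")

class_with_bar <- "[\\\\|/]"  # one of \ or | or /
class_only     <- "[\\\\/]"   # one of \ or /
alternation    <- "(\\\\|/)"  # \ or /, written with the "or" operator

# All three count the same separators in the real names
sep_counts <- sapply(
  c(class_with_bar, class_only, alternation),
  function(p) str_count(given_names, pattern = p),
  USE.NAMES = FALSE
)
sep_counts
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    2    2    2

# Only the first pattern also matches a literal |
bar_counts <- c(str_count("a|b", class_with_bar),
                str_count("a|b", class_only),
                str_count("a|b", alternation))
bar_counts
# [1] 1 0 0
```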

List can expand indefinitely

This one is a bit tricky. The pattern (a\b\…\f) and (a/b/c) can be viewed as

([some small letter][\ or /][some small letter])

where the whole pattern [\ or /][some small letter] appears zero or more times.

To add the element of zero or more times, we use the * character

This gives the pattern ([\\\\|/][a-z])*. The parentheses ( and ) ensure that the whole pattern [\ or /][some small letter] appears zero or more times.

Putting the three solutions together, we have the pattern \\([a-z]([\\\\|/][a-z])*\\)

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
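To check that the combined pattern really handles lists of any length, the sketch below tries it on a few names, including the (b\d\f) form mentioned earlier and a longer (a/b/c/d/e/f) list. These two longer names are illustrative and are not in the annotation data. The last lines show the same pattern written as a raw string, which is available from R 4.0.0 onwards and avoids doubling every backslash.

```r
library(stringr)

isomer_pattern <- "\\([a-z]([\\\\|/][a-z])*\\)"

# The second and third names are made-up examples of longer lists
example_names <- c("LPC 17:1 (c)",
                   "LPC 17:1 (b\\d\\f)",
                   "LPC 17:1 (a/b/c/d/e/f)",
                   "LPC 13:0 (ISTD) (a\\b)")

isomer_match <- str_extract(example_names, pattern = isomer_pattern)
isomer_match
# [1] "(c)"           "(b\\d\\f)"     "(a/b/c/d/e/f)" "(a\\b)"

# Same pattern as a raw string: what you type is what the regex engine sees
raw_pattern <- r"(\([a-z]([\\|/][a-z])*\))"
same_pattern <- identical(raw_pattern, isomer_pattern)
same_pattern
# [1] TRUE
```

R prints each backslash in a string as \\, so "(b\\d\\f)" in the output is the text (b\d\f).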

Plan Execution

With the three removal patterns set, we can clean the given names as follows.

Clean_Name <- annotation_data[["Given Name"]] %>%
  stringr::str_remove(pattern = "-d5") %>%
  stringr::str_remove(pattern = "\\(IS(TD)?\\)") %>%
  stringr::str_remove(pattern = "\\([a-z]([\\\\|/][a-z])*\\)") %>%
  stringr::str_trim()

annotation_data %>%
  # Create a new column with the Clean Names
  dplyr::mutate(`Clean Name For Annotation` = Clean_Name) %>%
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name","Clean Name For Annotation"))
    ) %>%
  reactable::reactable(defaultPageSize = 5)
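If the same cleaning is needed in several scripts, the three steps can be wrapped in a small helper. This is only a sketch: the function name clean_lipid_name is hypothetical, and it is checked here against three given names and their expected clean names from the table at the start of this post rather than the full CSV file.

```r
library(stringr)

# Hypothetical helper that applies the three removal steps in order
clean_lipid_name <- function(given_name) {
  cleaned <- str_remove(given_name, pattern = "-d5")
  cleaned <- str_remove(cleaned, pattern = "\\(IS(TD)?\\)")
  cleaned <- str_remove(cleaned, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
  # Drop the whitespace left behind at the ends
  str_trim(cleaned)
}

given_names <- c("LPC 13:0 (ISTD) (a\\b)",
                 "LPC 15:0-d5 (IS) (a)",
                 "LPC 17:1 (a/b/c)")

clean_names <- clean_lipid_name(given_names)
clean_names
# [1] "LPC 13:0" "LPC 15:0" "LPC 17:1"
```

Because every stringr function used here is vectorised, the helper works on a whole column at once, for example inside dplyr::mutate.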

Package References

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable",
              "readr", "magrittr", "stringr", 
              "dplyr", "report",
              "tibble", "purrr")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
Wickham H, Hester J, Bryan J (2022). readr: Read Rectangular Text Data. R package version 2.1.2, https://CRAN.R-project.org/package=readr.
Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y
