Cleaning Lipid Names for Annotation Part 1

By Jeremy Selva

January 30, 2022

Introduction

Despite efforts to unify lipid shorthand notations and the rise of software dedicated to standardising lipid annotations, lipid names remain diverse. This is due to the limited ways in which lipid software can label a lipid and/or researchers’ personal preferences when annotating them.

Here, I would like to highlight some lipid annotations that my workplace uses and show how to modify them in R so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

R Packages Used

library("rgoslin")
library("reactable")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

Labels to clean

Here is the list of lipid names to clean.

Given Name               | Clean Name For Annotation | Precursor Ion | Product Ion
LPC 13:0 (ISTD) (a)      | LPC 13:0                  | 454.3         | 184.1
LPC 13:0 (ISTD) (a\b)    | LPC 13:0                  | 454.3         | 184.1
LPC 13:0 (ISTD) (b)      | LPC 13:0                  | 454.3         | 184.1
LPC 15:0-d5 (IS) (a)     | LPC 15:0                  | 487.4         | 304.3
LPC 15:0-d5 (IS) (a\b)   | LPC 15:0                  | 487.4         | 304.3
LPC 15:0-d5 (IS) (b)     | LPC 15:0                  | 487.4         | 304.3
LPC 17:1 (a)             | LPC 17:1                  | 508.3         | 184.1
LPC 17:1 (a/b/c)         | LPC 17:1                  | 508.3         | 184.1
LPC 17:1 (b)             | LPC 17:1                  | 508.3         | 184.1
LPC 17:1 (c)             | LPC 17:1                  | 508.3         | 184.1

The labels (IS) and (ISTD) stand for internal standard.

The suffix -d5 means that five hydrogen atoms in the compound are replaced by deuterium.

These internal standards can be purchased via these links

The (a\b) in this case refers to which lipid isomer is integrated. Here is an example with LPC 17:1

[Images: LPC_17_1_a, LPC_17_1_abc, LPC_17_1_b, LPC_17_1_c]

The identity of such isomers is still an ongoing research process for most lipids. Thankfully for the case of LPC 17:1 in human plasma, the identity can be found in these links.

Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names.

c("LPC 13:0 (ISTD) (a\\b)",
  "LPC 15:0-d5 (IS) (a\\b)",
  "LPC 17:1 (a/b/c) ") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

They must be cleaned up accordingly.

c("LPC 13:0","LPC 15:0","LPC 17:1") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

The task is to remove the variations of (ISTD), (a\b) and the -d5

Read Data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-01-22-Clean-Lipid-Names-1/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

The Plan

We will do the following in order

  1. Remove the -d5

  2. Remove the ISTD and its variations

  3. Remove the (a\b\c) and its variations

The idea is to first use stringr::str_view to check what the pattern matches. Once the match is correct, we can use stringr::str_remove to remove the matched text.
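As a console-friendly complement to stringr::str_view, which shows the match in a viewer, the same check-then-remove workflow can be sketched with stringr::str_extract. The two example names are taken from the table above; this is only an illustration of the idea under that assumption, not a replacement for the steps that follow.

```r
library("stringr")

example_names <- c("LPC 15:0-d5 (IS) (a)", "LPC 17:1 (a)")

# Step 1: check what the pattern matches (NA means no match in that name)
matched_text <- stringr::str_extract(example_names, pattern = "-d5")

# Step 2: once the match looks right, remove it
cleaned_names <- stringr::str_remove(example_names, pattern = "-d5")

print(matched_text)
print(cleaned_names)
```

Names without a match are returned unchanged by stringr::str_remove, so the same call can be applied safely to the whole column.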

Remove -d5

Here is an example of removing the -d5, checking the match first with stringr::str_view. Simply set the pattern to “-d5”.

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "-d5")
stringr::str_remove("LPC 15:0-d5 (IS) (a)", pattern = "-d5")
## [1] "LPC 15:0 (IS) (a)"

Remove the ISTD variation

The first challenge is to create a pattern that is able to remove variations of ISTD such as (ISTD) and (IS)

Detect the word ISTD

To detect the word ISTD, we can do this in R

stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "ISTD")

Unfortunately, this will not work with the word IS

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "ISTD")

Alternatively to detect the word IS, we can do this in R

stringr::str_view("LPC 15:0-d5 (IS) (a)", pattern = "IS")

This time, this will not work with the word ISTD

stringr::str_view("LPC 13:0 (ISTD) (a)", pattern = "IS")

To detect words of the form IS and ISTD, we can make use of the fact that the group “TD” in “ISTD” appears zero or one time. Referring to the stringr cheat sheet, we can use parentheses to create a group.

Hence, adding the group to the existing pattern, we have the form IS(TD).

Next, we inform stringr that the group (TD) can appear zero or one time. Referring to the cheat sheet of stringr, we can make use of the ? symbol

This gives the updated pattern IS(TD)?

Putting this to our existing list of words, we have

stringr::str_view(annotation_data$`Given Name`, pattern = "IS(TD)?")
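To see the optional group at work without the viewer, here is a minimal sketch using stringr::str_extract on three names from the table above. The variable names are only for illustration.

```r
library("stringr")

example_names <- c("LPC 13:0 (ISTD) (a)",
                   "LPC 15:0-d5 (IS) (a)",
                   "LPC 17:1 (a)")

# The optional group (TD)? lets one pattern match both "IS" and "ISTD"
istd_match <- stringr::str_extract(example_names, pattern = "IS(TD)?")

print(istd_match)
```

The third name has no internal standard label, so its result is NA.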

Detect Parenthesis

Unfortunately, detecting ( and ) is not as simple as calling stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "("), which gives the following error message.

stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "(")
## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=`(`)

This is because ( and ) belong to a group called “meta characters” that have special functions in regular expressions. In fact, we have just seen what they do: they group characters together.

To indicate that we want to search for a literal ( or ), we need to put two backslashes \\ in front of it, as indicated in the stringr cheat sheet.

stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\(")
stringr::str_view_all("LPC 13:0 (ISTD) (a)", pattern = "\\)")

Putting it all together, we have the pattern \\(IS(TD)?\\)

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\(IS(TD)?\\)")
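The following sketch illustrates why two backslashes are typed in R, and shows stringr::fixed() as an alternative way to match a literal parenthesis. It uses one name from the table above; the variable names are only for illustration.

```r
library("stringr")

example_name <- "LPC 13:0 (ISTD) (a)"

# In R source, "\\(" is the two-character string \( that the regex engine sees
escaped_length <- nchar("\\(")

# Count literal opening parentheses with an escaped regex ...
regex_count <- stringr::str_count(example_name, pattern = "\\(")

# ... or with stringr::fixed(), which treats the pattern as plain text
fixed_count <- stringr::str_count(example_name, pattern = stringr::fixed("("))

# Remove the internal standard label using the combined pattern
without_istd <- stringr::str_remove(example_name, pattern = "\\(IS(TD)?\\)")

print(without_istd)
```

Removing the label leaves two consecutive spaces behind, which is one reason a trimming step is useful at the end of the cleaning.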

Remove (a\b\c) and its variations

The second challenge is to create a pattern to remove variations of (a\b\c).

What is consistent is that the letters are always lowercase.

However, I faced these main issues when dealing with this variation:

  • The letter does not always start with a

  • The list of letters can be separated by \ or /

  • The list can expand indefinitely. For example, it can be

    • (a\b\…\f)

    • (b\d\f)

Here is what I have done to resolve the above issues.

Letter does not always start with a

To create a pattern that matches small letters from a to z, we can use the square brackets [ and ] and hyphen - as indicated in the stringr cheat sheet.

Applying what we have learnt, we have the pattern \\([a-z]\\). The a-z means the range from a to z. The square brackets [ and ] mean “one of”. Hence [a-z] tells the regular expression engine to look for one of the letters ranging from a to z.

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]\\)")
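As a quick check of what this pattern does and does not match, here is a small sketch with three names from the table above.

```r
library("stringr")

example_names <- c("LPC 17:1 (a)",
                   "LPC 15:0-d5 (IS) (b)",
                   "LPC 17:1 (a/b/c)")

# \\([a-z]\\) matches exactly one lowercase letter wrapped in parentheses,
# so (IS) and (a/b/c) are not matched
single_letter <- stringr::str_extract(example_names, pattern = "\\([a-z]\\)")

print(single_letter)
```

The last name returns NA, which shows that the pattern still needs to handle lists of letters.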

The list of letters can be separated by \ or /

Matching / is easy.

stringr::str_view_all("LPC 17:1 (a/b/c)", pattern = "/")

but not so for \

stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\")
## Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)

\ also belongs to the group called “meta characters”. Referring to the stringr cheat sheet, to search for a literal \ we need to type four backslashes \\\\

stringr::str_view_all("LPC 13:0 (ISTD) (a\\b)", pattern = "\\\\")

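The question now is how to incorporate “or” into the pattern. Referring again to the stringr cheat sheet, the | symbol means “or”.

The pattern used here is [\\\\|/], which looks for one of \, | or /. Note that inside square brackets the | is matched as an ordinary character rather than acting as “or”, because the square brackets already mean “one of”. The pattern therefore also accepts a literal |, which does not occur in these lipid names; [\\\\/] would match only \ or /.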

stringr::str_view_all(c("LPC 13:0 (ISTD) (a\\b)", "LPC 17:1 (a/b/c)"), pattern = "[\\\\|/]")
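One detail worth checking is how | behaves inside square brackets. The sketch below compares the pattern above with two alternatives on three short made-up strings (they are not lipid names, just test input).

```r
library("stringr")

test_strings <- c("a\\b", "a/b", "a|b")

# Inside [ ], | is an ordinary character, so this class matches \, | and /
with_pipe <- stringr::str_detect(test_strings, pattern = "[\\\\|/]")

# Dropping the | gives a class that matches only \ and /
without_pipe <- stringr::str_detect(test_strings, pattern = "[\\\\/]")

# Alternation with | works as "or" when written outside square brackets
alternation <- stringr::str_detect(test_strings, pattern = "(\\\\|/)")

print(with_pipe)
print(without_pipe)
print(alternation)
```

For the lipid names in this post, all three patterns give the same result because none of the names contains a | character.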

List can expand indefinitely

This one is a bit tricky. The pattern (a\b\…\f) and (a/b/c) can be viewed as

([some small letter][\ or /][some small letter])

where the whole pattern [\ or /][some small letter] appears zero or more times.

To add the element of zero or more times, we use the * character

This gives the pattern ([\\\\|/][a-z])*. The parentheses () ensure that the whole pattern [\ or /][some small letter] appears zero or more times.

Putting the three solutions together, we have the pattern \\([a-z]([\\\\|/][a-z])*\\)

stringr::str_view_all(annotation_data$`Given Name`, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
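To confirm that the combined pattern copes with all the variations listed earlier, here is a sketch that extracts the match from four example names. The last one uses the (b\d\f) form mentioned above with LPC 17:1 as an assumed lipid name.

```r
library("stringr")

isomer_pattern <- "\\([a-z]([\\\\|/][a-z])*\\)"

example_names <- c("LPC 17:1 (c)",
                   "LPC 13:0 (ISTD) (a\\b)",
                   "LPC 17:1 (a/b/c)",
                   "LPC 17:1 (b\\d\\f)")

isomer_match <- stringr::str_extract(example_names, pattern = isomer_pattern)

# writeLines shows the matches with single backslashes, as the regex engine sees them
writeLines(isomer_match)
```

Note that (ISTD) in the second name is skipped because its letters are uppercase.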

Plan Execution

With the three removal patterns ready, we can clean the given names as follows.

Clean_Name <- annotation_data[["Given Name"]] %>%
  stringr::str_remove(pattern = "-d5") %>%
  stringr::str_remove(pattern = "\\(IS(TD)?\\)") %>%
  stringr::str_remove(pattern = "\\([a-z]([\\\\|/][a-z])*\\)") %>%
  stringr::str_trim()

annotation_data %>%
  # Create a new column with the Clean Names
  dplyr::mutate(`Clean Name For Annotation` = Clean_Name) %>%
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name","Clean Name For Annotation"))
    ) %>%
  reactable::reactable(defaultPageSize = 5)
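If the same cleaning is needed in several places, the three removals could be bundled into one function. The sketch below is one possible way to do this; clean_lipid_name is a hypothetical helper name that does not come from any package, and the input names and expected results are taken from the table at the start of this post.

```r
library("stringr")

# Hypothetical helper (not from any package) bundling the three removals
clean_lipid_name <- function(given_name) {
  cleaned <- stringr::str_remove(given_name, pattern = "-d5")
  cleaned <- stringr::str_remove(cleaned, pattern = "\\(IS(TD)?\\)")
  cleaned <- stringr::str_remove(cleaned, pattern = "\\([a-z]([\\\\|/][a-z])*\\)")
  # Remove the leftover spaces at the start and end
  stringr::str_trim(cleaned)
}

given_names <- c("LPC 13:0 (ISTD) (a\\b)",
                 "LPC 15:0-d5 (IS) (a)",
                 "LPC 17:1 (a/b/c)")

cleaned_names <- clean_lipid_name(given_names)

print(cleaned_names)
```

Comparing the result against a handful of known clean names, as done here, is a cheap way to catch a pattern that removes too much or too little.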

Package References

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable",
              "readr", "magrittr", "stringr", 
              "dplyr", "report",
              "tibble", "purrr")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y