Cleaning Lipid Names for Annotation Part 2

By Jeremy Selva

March 7, 2022

Introduction

In this blog, I will introduce another set of lipid annotations used at my workplace. They need to be cleaned up and modified before they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

R Packages Used

library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

Labels to clean

Here is the list of lipid names to clean.

Given Name          Clean Name For Annotation   Precursor Ion   Product Ion
DG 32:0 [-16:0]     DG 16:0_16:0                586.5           313.3
DG 36:1 [NL-18:1]   DG 18:1_18:0                640.6           341.3
TG 54:3 [-18:1]     TG 18:1_36:2                902.8           603.5
TG 54:3 [NL-18:2]   TG 18:2_36:1                902.8           605.5
TG 54:3 [SIM]       TG 54:3                     902.8           902.8

The label [SIM] stands for selected ion monitoring. SIM is used to detect TG species with a known total number of carbon atoms and double bonds. In the case of TG 54:3 [SIM], the total number of carbon atoms is 54 and the total number of double bonds is 3.

There are several limitations of this acquisition mode, as indicated by Han and Ye (4). One of them is that it gives no information about the fatty acyl chains. As a result, multiple precursor ion and neutral loss (NL) scan modes were introduced to identify the fatty acyl chains that may be attached to the TG. For example, TG 54:3 [NL-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3, while TG 54:3 [-18:3] measures the amount of fatty acyl chain 18:3 attached to TG 54:3 instead.

Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names.

c("DG 32:0 [-16:0]",
  "TG 54:3 [NL-18:2]",
  "TG 54:3 [SIM]") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

[Figure: RefMet results for the given names (negative result)]

They must be cleaned up accordingly, as indicated in Liebisch et al. (5).

[Figure: nomenclature guide]

[Figure: nomenclature example]

A positive result should look like the following.

c("DG 16:0_16:0",
  "TG 18:2_36:1",
  "TG 54:3") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

[Figure: RefMet results for the cleaned names (positive result)]

Read Data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-03-07-Clean-Lipid-Names-2/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

The Plan

We can split this complex task into the following steps.

  • Find the transition names that end with [SIM], remove the [SIM] and return the result.

The remaining transition names should only be those that measure the neutral loss of a particular fatty acid chain.

We then need to do the following steps to clean such transition names.

  • Get the acyl class of the transition.

  • Get the total carbon number of the transition.

  • Get the total number of double bonds of the transition.

  • Get the total carbon number of the measured fatty acid chain.

  • Get the total number of double bonds of the measured fatty acid chain.

  • Use the tools above to clean the transition name.
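As a rough preview of how the steps above could fit together, here is a minimal sketch for a single name. It is only one possible way and not necessarily the approach taken in the next part. The regular expression and the variable names (parts, acyl_class, total_carbon and so on) are assumptions made for illustration.

```r
library("stringr")

given_name <- "TG 54:3 [NL-18:2]"

# Hypothetical pattern: class, total carbon, total double bonds,
# then the measured chain's carbon and double bonds in the brackets.
# (?:NL)? allows both the [NL-18:2] and the [-18:2] style.
parts <- stringr::str_match(
  string = given_name,
  pattern = "^(\\w+) (\\d+):(\\d+) \\[(?:NL)?-(\\d+):(\\d+)\\]$"
)

acyl_class   <- parts[1, 2]
total_carbon <- as.integer(parts[1, 3])
total_db     <- as.integer(parts[1, 4])
chain_carbon <- as.integer(parts[1, 5])
chain_db     <- as.integer(parts[1, 6])

# The remaining chains hold what is left after
# subtracting the measured chain from the totals
clean_name <- paste0(acyl_class, " ",
                     chain_carbon, ":", chain_db, "_",
                     total_carbon - chain_carbon, ":",
                     total_db - chain_db)
clean_name
## [1] "TG 18:2_36:1"
```

Here 54 - 18 = 36 carbons and 3 - 2 = 1 double bond remain, which agrees with the clean name listed for TG 54:3 [NL-18:2] in the table.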

In this blog, we will focus only on the first part, which is to remove the [SIM].

Remove [SIM] at the end

We begin with an empty generic function

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  return(input_acyl)
}

In a regular expression, the square brackets [ and ] mean "one of". For example, the pattern [a-z] tells the software to look for one of the letters ranging from a to z.
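As a small illustration (the example strings are only for demonstration), stringr::str_extract returns the first single character that matches one of the characters listed inside the brackets.

```r
library("stringr")

# [a-z] matches exactly one lowercase letter
stringr::str_extract(string = "54:3 sim", pattern = "[a-z]")
## [1] "s"

# [SIM] is therefore read as "one of S, I or M",
# not as the literal text "[SIM]"
stringr::str_extract(string = "TG 54:3 [SIM]", pattern = "[SIM]")
## [1] "S"
```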

Remember the backslash \

To make the software look for the literal characters [ and ], we need to escape them with a backslash \, giving us \[ and \]. However, in an R string the backslash itself must be escaped, so every \ in a regular expression is written as \\ instead. Doing so gives us \\[ and \\].
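The short sketch below shows the double backslash in action. It also shows stringr::fixed() as a possible alternative that treats the pattern as plain text. That alternative is an aside and not what the rest of this post uses, since a plain-text pattern cannot carry the regular expression symbols that are added later.

```r
library("stringr")

# writeLines() shows the pattern as the regex engine sees it:
# each "\\" in R source code becomes a single "\"
writeLines("\\[SIM\\]")
## \[SIM\]

# With the escapes, the brackets are matched literally
stringr::str_detect(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]")
## [1] TRUE

# Alternative without escaping: fixed() treats the pattern
# as plain text instead of a regular expression
stringr::str_detect(string = "TG 54:3 [SIM]",
                    pattern = stringr::fixed("[SIM]"))
## [1] TRUE
```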

Using TG 54:3 [SIM] as an example, we have

stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]")
## [1] "TG 54:3 "

Removing the whitespaces

Here you can see that the trailing white space is not removed. One way to remove it is to use stringr::str_trim().

stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]") %>%
  stringr::str_trim()
## [1] "TG 54:3"

Another way is to add white spaces in our pattern \\[SIM\\]. Taking a look at the stringr cheat sheet, we can try to use \\s.

[Figure: stringr cheat sheet excerpt on whitespace]

To show that we need to remove zero or more white spaces, we add the “*”.

[Figure: stringr cheat sheet excerpt on zero or more]

This expands the pattern to \\s*\\[SIM\\]\\s*

stringr::str_remove(string = "TG 54:3 [SIM]  ", pattern = "\\s*\\[SIM\\]\\s*")
## [1] "TG 54:3"

Indicating the end of a string

To be more specific that [SIM] should be removed only at the end of the string, we add $ at the end of the pattern: \\s*\\[SIM\\]\\s*$

[Figure: stringr cheat sheet excerpt on end of string]

stringr::str_remove(string = c("TG 54:3 [SIM]  ",
                               " [SIM] TG 54:3"),
                    pattern = "\\s*\\[SIM\\]\\s*$")
## [1] "TG 54:3"        " [SIM] TG 54:3"

Using isTRUE in an if statement

Now that we know how to remove the [SIM] at the end, we need to create an if statement to detect that such a pattern exists. This is because after the [SIM] is removed, the transition names do not need to be modified further and the cleaned name can be returned.

One way to do this is to use stringr::str_detect.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (stringr::str_detect(string = input_acyl,
                           pattern = "\\s*\\[SIM\\]\\s*$"))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}

However, stringr::str_detect may not return a single TRUE or FALSE when it is given an unusual input such as NULL or NA. This causes an error in the if statement. Here are some examples.

if(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {
  print("No error")
}
## Error in if (stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {: argument is of length zero
if(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {
  print("No error")
}
## Error in if (stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {: missing value where TRUE/FALSE needed

To rectify this issue, we make use of the function isTRUE, which returns FALSE when it receives such unusual input.

isTRUE(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$"))
## [1] FALSE
isTRUE(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$"))
## [1] FALSE
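To see why the plain if statement fails, it may help to look at what stringr::str_detect returns for these inputs. The variable names sim_pattern, detect_null and detect_na below are only for illustration.

```r
library("stringr")

sim_pattern <- "\\s*\\[SIM\\]\\s*$"

# NULL input gives a logical vector of length zero
detect_null <- stringr::str_detect(string = NULL, pattern = sim_pattern)
length(detect_null)
## [1] 0

# NA input gives a missing value, neither TRUE nor FALSE
detect_na <- stringr::str_detect(string = NA, pattern = sim_pattern)
is.na(detect_na)
## [1] TRUE

# isTRUE() is TRUE only for a single, non-missing TRUE,
# so both cases above fall through to FALSE
isTRUE(detect_null)
## [1] FALSE
isTRUE(detect_na)
## [1] FALSE
```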

Putting it all together, we have the following so far.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$")))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}

Plan Execution

Here is what it looks like when the function is used.

clean_acyl("TG 54:3 [SIM]")
## [1] "TG 54:3"
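One caveat is worth noting. Because clean_acyl relies on if and isTRUE, it handles one name at a time. When a vector with more than one name is passed in, isTRUE returns FALSE and the names come back unchanged. Below is a minimal sketch of one way to apply the function to several names, assuming purrr::map_chr is an acceptable choice here. The function is repeated so that the block is self-contained, and given_names is a made-up example vector.

```r
library("stringr")
library("purrr")
library("magrittr")

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$")))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }
  return(input_acyl)
}

given_names <- c("TG 54:3 [SIM]", "DG 32:0 [-16:0]")

# Passing the whole vector: the names are returned unchanged
clean_acyl(given_names)
## [1] "TG 54:3 [SIM]"   "DG 32:0 [-16:0]"

# Applying the function to one name at a time
purrr::map_chr(given_names, clean_acyl)
## [1] "TG 54:3"         "DG 32:0 [-16:0]"
```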

Stay tuned for the next part where we will expand the function to deal with the second part of the plan which is to deal with transitions that indicate a neutral loss of fatty acid chains.

Given Name          Clean Name For Annotation   Precursor Ion   Product Ion
DG 32:0 [-16:0]     DG 16:0_16:0                586.5           313.3
DG 36:1 [NL-18:1]   DG 18:1_18:0                640.6           341.3
TG 54:3 [-18:1]     TG 18:1_36:2                902.8           603.5
TG 54:3 [NL-18:2]   TG 18:2_36:1                902.8           605.5

Package References

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable",
              "flair", "magrittr",
              "stringr", "dplyr", "report",
              "tibble", "purrr")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y

4. Han X, Ye H. Overview of lipidomic analysis of triglyceride molecular species in biological lipid extracts. Journal of Agricultural and Food Chemistry [Internet]. 2021 Aug 18;69(32):8895–909. Available from: https://doi.org/10.1021/acs.jafc.0c07175

5. Liebisch G, Spener F. Update on LIPID MAPS classification, nomenclature, and shorthand notation for MS-derived lipid structures. Journal of Lipid Research [Internet]. 2020 Dec 1;61(12):1539–55. Available from: https://doi.org/10.1194/jlr.S120001025