Cleaning Lipid Names for Annotation Part 2

By Jeremy Selva

March 7, 2022

Introduction

In this blog post, I will introduce another set of lipid annotations used at my workplace that need to be cleaned up and modified before they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

R Packages Used

library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))

## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

Labels to clean

Here is the list of lipid names to clean.

Given Name	Clean Name For Annotation	Precursor Ion	Product Ion
DG 32:0 [-16:0]	DG 16:0_16:0	586.5	313.3
DG 36:1 [NL-18:1]	DG 18:1_18:0	640.6	341.3
TG 54:3 [-18:1]	TG 18:1_36:2	902.8	603.5
TG 54:3 [NL-18:2]	TG 18:2_36:1	902.8	605.5
TG 54:3 [SIM]	TG 54:3	902.8	902.8

The label [SIM] stands for selected ion monitoring. SIM is used to detect TG species with a known total number of carbon atoms and double bonds. In the case of TG 54:3 [SIM], the total number of carbon atoms is 54 and the total number of double bonds is 3.

This acquisition mode has several limitations, as indicated by Han and Ye (4). One of them is that it gives no information about the fatty acyl chains. As a result, multiple precursor ion and neutral loss (NL) scan modes were introduced to identify potential fatty acyl chains attached to the TG. For example, TG 54:3 [NL-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3, while TG 54:3 [-18:3] measures the amount of fatty acyl chain 18:3 attached to TG 54:3 instead.
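As a rough illustration of the difference between the two acquisition modes, the sketch below uses only the precursor and product ion values from the table above. It assumes that the mass lost in a neutral loss transition is simply the precursor ion minus the product ion. The data frame and its column names are made up for this example.

```r
# Precursor and product ion values copied from the table above.
# The object name and column names are hypothetical.
transitions <- data.frame(
  given_name = c("DG 32:0 [-16:0]", "DG 36:1 [NL-18:1]",
                 "TG 54:3 [-18:1]", "TG 54:3 [NL-18:2]",
                 "TG 54:3 [SIM]"),
  precursor_ion = c(586.5, 640.6, 902.8, 902.8, 902.8),
  product_ion = c(313.3, 341.3, 603.5, 605.5, 902.8)
)

# Mass lost between precursor and product ion, rounded to one decimal
# place because the table reports the ions to one decimal place.
transitions$mass_lost <- round(
  transitions$precursor_ion - transitions$product_ion,
  digits = 1
)

transitions
```

The [SIM] row loses nothing because its precursor and product ions are identical, while the two transitions that monitor the 18:1 chain show the same mass difference of 299.3, whether the lipid is a DG or a TG.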

Unfortunately, lipid annotation converter tools like Goslin and RefMet are unable to parse these given names:

c("DG 32:0 [-16:0]",
  "TG 54:3 [NL-18:2]",
  "TG 54:3 [SIM]") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

[Figure: refmet_negative_results]

They must be cleaned up accordingly, as indicated in Liebisch et al. (5).

[Figure: nomenclature_guide]

[Figure: nomenclature_example]

A positive result should look like the following.

c("DG 16:0_16:0",
  "TG 18:2_36:1",
  "TG 54:3") %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

Normalized.Name	Grammar	Species.Name	Molecular.Species.Name	Functional.Class.Abbr	Functional.Class.Synonyms	Level	Mass	Sum.Formula
DG 16:0_16:0	Shorthand2020	DG 32:0	DG 16:0_16:0	[DG]	[DG, DAG]	MOLECULAR_SPECIES	568.50667553	C35H68O5
TG 18:2_36:1	Shorthand2020	TG 54:3	TG 18:2_36:1	[TG]	[TG, TAG]	MOLECULAR_SPECIES	870.80402686	C57H106O5
TG 54:3	Shorthand2020	TG 54:3		[TG]	[TG, TAG]	SPECIES	884.78329142	C57H104O6

Other columns of the output: Original.Name, Message, Adduct, Adduct.Charge, Lipid.Maps.Category, Lipid.Maps.Main.Class, Sn.Position.Name, Structure.Defined.Name, Full.Structure.Name, Total.C, Total.OH, Total.DB, and the Position, C, OH, DB, Bond.Type and DB.Positions fields for each of FA1, FA2, LCB, FA3 and FA4. For the two MOLECULAR_SPECIES rows, three chain entries each report position -1, bond type ESTER and empty double bond positions ([]).

[Figure: refmet_positive_results]

Read Data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-03-07-Clean-Lipid-Names-2/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

The Plan

We can split this complex task into the following steps.

Find the transition names that end with [SIM], remove the [SIM] and return the result.

Transition names from here on should only be those that measure the neutral loss of a particular fatty acid chain.

We then need to do the following steps to clean such transition names.

Get the acyl class of the transition.
Get the total carbon number of the transition.
Get the total number of double bonds of the transition.
Get the total carbon number of the measured fatty acid chain.
Get the total number of double bonds of the measured fatty acid chain.
Use the tools above to clean the transition name.
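
The steps above can be sketched in a few lines. This is only a minimal illustration and not the function developed in this series: the helper name sketch_clean_nl() and its regular expression are assumptions, and it only handles names shaped exactly like the examples in the table of labels to clean.

```r
library(stringr)

# Hypothetical helper: turns "DG 36:1 [NL-18:1]" into "DG 18:1_18:0".
sketch_clean_nl <- function(given_name) {

  # Capture groups: acyl class, total carbons, total double bonds,
  # then carbons and double bonds of the measured fatty acid chain.
  parts <- stringr::str_match(
    string = given_name,
    pattern = "^(\\w+) (\\d+):(\\d+) \\[(?:NL)?-(\\d+):(\\d+)\\]$"
  )

  acyl_class <- parts[, 2]
  total_c <- as.integer(parts[, 3])
  total_db <- as.integer(parts[, 4])
  fa_c <- as.integer(parts[, 5])
  fa_db <- as.integer(parts[, 6])

  # The remaining chain(s) hold whatever the measured chain does not.
  clean_name <- paste0(acyl_class, " ",
                       fa_c, ":", fa_db, "_",
                       total_c - fa_c, ":", total_db - fa_db)

  # Names that do not match the pattern are returned unchanged.
  ifelse(is.na(parts[, 1]), given_name, clean_name)
}

sketch_clean_nl(c("DG 32:0 [-16:0]", "DG 36:1 [NL-18:1]",
                  "TG 54:3 [-18:1]", "TG 54:3 [NL-18:2]"))
```

For these four names, the sketch reproduces the "Clean Name For Annotation" column: DG 16:0_16:0, DG 18:1_18:0, TG 18:1_36:2 and TG 18:2_36:1.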

In this blog post, we will focus only on the first part, which is to remove the [SIM].

Remove [SIM] at the end

We begin with an empty generic function

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  return(input_acyl)
}

In a regular expression, the square brackets [ and ] mean "one of". For example, the pattern [a-z] tells the software to look for one of the letters ranging from a to z.
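
A small example of this "one of" behaviour, using stringr functions on made-up inputs:

```r
library(stringr)

# [a-z] matches exactly one lower-case letter from a to z,
# so only the first element is detected.
letter_found <- stringr::str_detect(string = c("a", "A", "5"),
                                    pattern = "[a-z]")
letter_found

# [0-9] matches one digit; str_extract() returns the first match.
first_digit <- stringr::str_extract(string = "TG 54:3",
                                    pattern = "[0-9]")
first_digit
```

The first call gives TRUE FALSE FALSE and the second gives "5".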

Remember the backslash `\`

To make the software look explicitly for the characters [ and ], we need to escape them with the backslash \, giving us \[ and \]. However, in R, each \ in a regular expression must itself be written as \\ inside a string. Doing so gives us \\[ and \\] respectively.
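
To see why the escaping matters, here is a short sketch comparing the unescaped and escaped patterns. The lipid name SM 36:1 is only an illustrative input and is not part of the annotation list. stringr::fixed() is shown as an alternative that treats the pattern as plain text.

```r
library(stringr)

lipid_names <- c("TG 54:3 [SIM]", "SM 36:1")

# The R string "\\[SIM\\]" holds the regular expression \[SIM\].
writeLines("\\[SIM\\]")

# Unescaped, [SIM] means "one of S, I or M", so both names match.
unescaped <- stringr::str_detect(string = lipid_names,
                                 pattern = "[SIM]")
unescaped

# Escaped, only the literal text [SIM] matches.
escaped <- stringr::str_detect(string = lipid_names,
                               pattern = "\\[SIM\\]")
escaped

# fixed() matches the pattern literally, so no escaping is needed.
literal <- stringr::str_detect(string = lipid_names,
                               pattern = stringr::fixed("[SIM]"))
literal
```

The unescaped pattern returns TRUE TRUE, while the escaped pattern and the fixed() version both return TRUE FALSE.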

Using TG 54:3 [SIM] as an example, we have

stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]")

## [1] "TG 54:3 "

Removing the whitespaces

Here you can see that the trailing whitespace is not removed. One way to remove it is to use stringr::str_trim()

stringr::str_remove(string = "TG 54:3 [SIM]", pattern = "\\[SIM\\]") %>%
  stringr::str_trim()

## [1] "TG 54:3"

Another way is to include the whitespace in our pattern \\[SIM\\]. According to the stringr cheat sheet, \\s matches a whitespace character.

[Figure: whitespace]

To show that we need to remove zero or more whitespace characters, we add the “*” quantifier.

[Figure: zero_or_more]

This expands the pattern to \\s*\\[SIM\\]\\s*

stringr::str_remove(string = "TG 54:3 [SIM]  ", pattern = "\\s*\\[SIM\\]\\s*")

## [1] "TG 54:3"
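
As a quick check of the "zero or more" part, the sketch below runs the same pattern on made-up labels with no, one and several whitespace characters around [SIM].

```r
library(stringr)

# Hypothetical labels with zero, one and several surrounding spaces.
labels <- c("TG 54:3[SIM]", "TG 54:3 [SIM]", "TG 54:3   [SIM]  ")

cleaned <- stringr::str_remove(string = labels,
                               pattern = "\\s*\\[SIM\\]\\s*")
cleaned
```

All three labels are reduced to "TG 54:3".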

Indicating the end of a string

To be more specific, we only want to remove [SIM] at the end of the string, so we add $ at the end of the pattern: \\s*\\[SIM\\]\\s*$

[Figure: end_of_string]

stringr::str_remove(string = c("TG 54:3 [SIM]  ",
                               " [SIM] TG 54:3"),
                    pattern = "\\s*\\[SIM\\]\\s*$")

## [1] "TG 54:3"        " [SIM] TG 54:3"

Using `isTRUE` in an if statement

Now that we know how to remove the [SIM] at the end, we need to create an if statement to detect that such a pattern exists. This is because after the [SIM] is removed, the transition names do not need to be modified further and the cleaned name can be returned.

One way to do this is to use stringr::str_detect.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (stringr::str_detect(string = input_acyl,
                           pattern = "\\s*\\[SIM\\]\\s*$"))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}

However, stringr::str_detect may not return a single TRUE or FALSE value when given an unusual input, in which case the if statement throws an error. Here are some examples.

if(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {
  print("No error")
}

## Error in if (stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$")) {: argument is of length zero

if(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {
  print("No error")
}

## Error in if (stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$")) {: missing value where TRUE/FALSE needed

To rectify this issue, we make use of the function isTRUE, which returns FALSE when it receives such unusual input.

isTRUE(stringr::str_detect(string = NULL, pattern = "\\s*\\[SIM\\]\\s*$"))

## [1] FALSE

isTRUE(stringr::str_detect(string = NA, pattern = "\\s*\\[SIM\\]\\s*$"))

## [1] FALSE

Putting it all together, we have the following so far.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$")))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}

Plan Execution

Here is what it looks like when the function is used.

clean_acyl("TG 54:3 [SIM]")

## [1] "TG 54:3"
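
One caveat worth illustrating: isTRUE() only returns TRUE for a single TRUE value, so calling clean_acyl() on a whole vector of names returns the vector unchanged. A minimal sketch of one way around this is to apply the function to one name at a time with purrr::map_chr(). The vector given_names and the name TG 52:2 [SIM] are made up for this example.

```r
library(stringr)
library(purrr)

# Same logic as the function above, repeated so the example runs on its own.
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$"))) {
    input_acyl <- stringr::str_remove(string = input_acyl,
                                      pattern = "\\s*\\[SIM\\]\\s*$")
    return(input_acyl)
  }
  return(input_acyl)
}

# Hypothetical vector of transition names.
given_names <- c("TG 54:3 [SIM]", "TG 52:2 [SIM]", "DG 32:0 [-16:0]")

# Called on the whole vector, str_detect() returns three values,
# isTRUE() gives FALSE and nothing is cleaned.
whole_vector <- clean_acyl(given_names)
whole_vector

# Called on one name at a time, each trailing [SIM] is removed.
one_by_one <- purrr::map_chr(given_names, clean_acyl)
one_by_one
```

The first call returns the three names as they were, while the second returns "TG 54:3", "TG 52:2" and "DG 32:0 [-16:0]".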

Stay tuned for the next part, where we will expand the function to handle the second part of the plan: transitions that indicate a neutral loss of a fatty acid chain, such as the following.

Given Name	Clean Name For Annotation	Precursor Ion	Product Ion
DG 32:0 [-16:0]	DG 16:0_16:0	586.5	313.3
DG 36:1 [NL-18:1]	DG 18:1_18:0	640.6	341.3
TG 54:3 [-18:1]	TG 18:1_36:2	902.8	603.5
TG 54:3 [NL-18:2]	TG 18:2_36:1	902.8	605.5

Package References

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable",
              "flair", "magrittr",
              "stringr", "dplyr", "report",
              "tibble", "purrr")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
Bodwin K, Glanz H (2020). flair: Highlight, Annotate, and Format your R Source Code. R package version 0.0.2, https://CRAN.R-project.org/package=flair.
Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y

4. Han X, Ye H. Overview of lipidomic analysis of triglyceride molecular species in biological lipid extracts. Journal of Agricultural and Food Chemistry [Internet]. 2021 Aug 18;69(32):8895–909. Available from: https://doi.org/10.1021/acs.jafc.0c07175

5. Liebisch G, Fahy E, Aoki J, et al. Update on LIPID MAPS classification, nomenclature, and shorthand notation for MS-derived lipid structures. Journal of Lipid Research [Internet]. 2020 Dec 1;61(12):1539–55. Available from: https://doi.org/10.1194/jlr.S120001025
