Cleaning Lipid Names for Annotation Part 3

By Jeremy Selva

April 23, 2022

Introduction

In this blog post, we continue cleaning up the lipid annotations that my workplace uses.

Given Name	Clean Name For Annotation	Precursor Ion	Product Ion
DG 32:0 [-16:0]	DG 16:0_16:0	586.5	313.3
DG 36:1 [NL-18:1]	DG 18:1_18:0	640.6	341.3
TG 54:3 [-18:1]	TG 18:1_36:2	902.8	603.5
TG 54:3 [NL-18:2]	TG 18:2_36:1	902.8	605.5
TG 54:3 [SIM]	TG 54:3	902.8	902.8

This is so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

The function from the last post, shown below, removes the [SIM] at the end of a transition name.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$")))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}

clean_acyl("TG 54:3 [SIM]")

## [1] "TG 54:3"
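
As a side note, stringr::str_remove leaves a string untouched when the pattern does not match, so the same call can be applied to a whole vector of names. Below is a minimal sketch of that behaviour; given_names is a made-up example vector and not part of the annotation file.

```r
library(stringr)

# Hypothetical example vector, not taken from the annotation file
given_names <- c("TG 54:3 [SIM]", "TG 54:3 [NL-18:1]", "DG 32:0")

# Only the first name ends with [SIM]; the others come back unchanged
no_sim <- stringr::str_remove(given_names, pattern = "\\s*\\[SIM\\]\\s*$")

no_sim
# "TG 54:3", "TG 54:3 [NL-18:1]", "DG 32:0"
```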

We continue the cleaning task for these transitions.

Given Name	Clean Name For Annotation	Precursor Ion	Product Ion
DG 32:0 [-16:0]	DG 16:0_16:0	586.5	313.3
DG 36:1 [NL-18:1]	DG 18:1_18:0	640.6	341.3
TG 54:3 [-18:1]	TG 18:1_36:2	902.8	603.5
TG 54:3 [NL-18:2]	TG 18:2_36:1	902.8	605.5

To recap, these transition names measure the amount of a fatty acyl chain attached to the given DG or TG. For example, TG 54:3 [NL-18:1] measures the amount of fatty acyl chain 18:1 attached to TG 54:3, while TG 54:3 [-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3 instead.

In general, they are in the form

{acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].

We need to transform them into the form

{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}

where

{remaining_C} is {total_C - measured_C}
{remaining_DB} is {total_DB - measured_DB}
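
As a quick worked example of this arithmetic, the sketch below types in the numbers of the four transitions from the table by hand (nothing is extracted from the names yet) and builds the clean names with base R only.

```r
# Numbers read off the four transitions in the table above
acyl_class  <- c("DG", "DG", "TG", "TG")
total_C     <- c(32, 36, 54, 54)
total_DB    <- c(0, 1, 3, 3)
measured_C  <- c(16, 18, 18, 18)
measured_DB <- c(0, 1, 1, 2)

# Whatever is not in the measured chain belongs to the remaining chains
remaining_C  <- total_C - measured_C
remaining_DB <- total_DB - measured_DB

clean_names <- paste0(acyl_class, " ",
                      measured_C, ":", measured_DB, "_",
                      remaining_C, ":", remaining_DB)

clean_names
# "DG 16:0_16:0", "DG 18:1_18:0", "TG 18:1_36:2", "TG 18:2_36:1"
```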

R packages used

library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("glue")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))

## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), glue (v1.6.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

The plan

We recall the following steps needed to clean such transition names.

Get the acyl class of the transition. {acyl_class}
Get the total carbon number of the transition. {total_C}
Get the total number of double bonds of the transition. {total_DB}
Get the total carbon number of the measured fatty acid chain. {measured_C}
Get the total number of double bonds of the measured fatty acid chain. {measured_DB}
Use the tools above to clean the transition name.
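
For readers who prefer to see all five pieces at once, the steps above can also be sketched with a single regular expression and stringr::str_match. This is only an alternative illustration, not the step-by-step approach used in the rest of this post; the pattern and the variable names are my own assumptions.

```r
library(stringr)

# Hypothetical single pattern with one capture group per piece:
# class, total C, total DB, measured C, measured DB
pattern <- "^([DT]A?G)\\s*(\\d+):(\\d+)\\s*\\[(?:NL)?-\\s*(\\d+):(\\d+)\\]\\s*$"

names_in <- c("TG 54:3 [NL-18:1]", "TG 54:3 [-18:2]")
parts <- stringr::str_match(names_in, pattern)

# Column 1 is the full match; columns 2 to 6 are the capture groups
parts[1, 2:6]
# "TG", "54", "3", "18", "1"
```

A name that does not fit the pattern, such as TG 54:3 [SIM], gives a row of NA values instead of an error.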

Read data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-04-23-Clean-Lipid-Names-3/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

Get the acyl class {acyl_class}

We now try to get {acyl_class} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].

To extract the (DG or DAG) or (TG or TAG) at the front of the transition name, we first add ^ to tell the software to look at the beginning of the string. Next, as the first letter can only be a D or T, we use the pattern [D|T]. Strictly speaking, square brackets already mean "any one of these characters", so the | inside them is a literal character rather than "or"; [DT] is the tighter equivalent, and both give the same result for these transition names. As A is optional (appearing zero or one time), we add a ? at the end of the pattern A. Finally, we simply add the G at the end.

This gives the final pattern ^[D|T]A?G

acyl_class <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_extract(pattern = "^[D|T]A?G")

acyl_class

## [1] "TG"
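
To see how the pattern behaves on the other class spellings, here is a small sketch on a made-up vector of names. It uses the tighter character class [DT], and then checks that the looser [D|T] also accepts a leading | because that character is literal inside square brackets. The example names are assumptions for illustration only.

```r
library(stringr)

# Hypothetical names; "PC 34:1" is a class the pattern should skip
example_names <- c("DG 32:0", "DAG 32:0", "TG 54:3", "TAG 54:3", "PC 34:1")

classes <- stringr::str_extract(example_names, pattern = "^[DT]A?G")
classes
# "DG", "DAG", "TG", "TAG", NA

# [D|T] matches D, T or a literal |, so this odd name is accepted ...
pipe_loose <- stringr::str_detect("|G 32:0", pattern = "^[D|T]A?G")

# ... while [DT] rejects it
pipe_strict <- stringr::str_detect("|G 32:0", pattern = "^[DT]A?G")
```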

Get the total carbon number of the transition {total_C}

The next step is to extract the total carbon number of the given transition, that is, to get {total_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total carbon number is 54. The key is to remove the {acyl_class} at the front first and then extract the total carbon number.

To do this, we make use of the {acyl_class} extracted earlier as a pattern and remove it using stringr::str_remove.

I use stringr::str_glue to build this pattern. \\s* matches zero or more white space characters, so any spaces after the acyl class are removed as well.

total_C_step_1 <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))

total_C_step_1

## [1] "54:3 [NL-18:1]"
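
To make the glued pattern less mysterious, the sketch below prints what stringr::str_glue actually produces and tries it on names with different amounts of white space. The three test strings are made up for illustration.

```r
library(stringr)

acyl_class <- "TG"

# str_glue fills in {acyl_class}, giving the regular expression TG\s*
class_pattern <- stringr::str_glue("{acyl_class}\\s*")
pattern_text <- as.character(class_pattern)

# The same pattern copes with zero, one or several spaces after the class
stripped <- stringr::str_remove(c("TG54:3", "TG 54:3", "TG   54:3"),
                                pattern = class_pattern)
stripped
# "54:3", "54:3", "54:3"
```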

After removing the {acyl_class} at the front, we extract {total_C}:{total_DB} using the pattern ^\\d+:\\d+. This works because the digits are now the first characters of the string. Remember that ^ tells the software to look at the start of the string.

total_C_step_2 <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+")

total_C_step_2

## [1] "54:3"

From {total_C}:{total_DB}, we can then extract {total_C} using the pattern ^\\d+.

total_C <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "^\\d+")

total_C

## [1] "54"

Get the total number of double bonds of the transition {total_DB}

The next step is to extract the total number of double bonds of the given transition, that is, to get {total_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total number of double bonds is 3.

The extraction process is similar to the previous task. The only difference is that once we have obtained {total_C}:{total_DB}, we extract {total_DB} using the pattern \\d+$ instead. Recall that $ tells the software to look at the end of the string.

total_DB <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "\\d+$")

total_DB

## [1] "3"

Get the total carbon number of the measured fatty acid chain {measured_C}

We now try to get {measured_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the carbon number of the measured fatty acyl chain is 18. To do this, we first remove the acyl class at the front again, giving us {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].

measured_C_step1 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))
  
measured_C_step1

## [1] "54:3 [NL-18:1]"

As the measured fatty acid chain comes with a -, we extract -{measured_C}:{measured_DB} from {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. This is done using the pattern -\\s*\\d+:\\d+, which also works when the NL is missing, as in [-18:1].

measured_C_step2 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+")
  
measured_C_step2

## [1] "-18:1"

We proceed to remove the - from -{measured_C}:{measured_DB}, giving us {measured_C}:{measured_DB}.

measured_C_step3 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
  stringr::str_remove(pattern = "-")
  
measured_C_step3

## [1] "18:1"

Finally, we again use the pattern ^\\d+ to extract {measured_C}.

measured_C <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
  stringr::str_remove(pattern = "-") %>%
  stringr::str_extract(pattern = "^\\d+")

measured_C

## [1] "18"

Get the total number of double bonds of the measured fatty acid chain {measured_DB}

We now try to get {measured_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the number of double bonds in the measured fatty acyl chain is 1.

The process is similar to the previous section. The difference is that once we have obtained {measured_C}:{measured_DB}, we extract {measured_DB} using the pattern \\d+$ instead.

measured_DB <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "\\d+$")

measured_DB

## [1] "1"
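
Since the transition names come in both the [NL-18:1] and the [-18:2] style, here is a small sketch that runs the same extraction on one name of each style. It skips the step that removes the acyl class, on the assumption that no - appears before the bracket in these names.

```r
library(stringr)

# The two bracket styles used in the transition names
both_forms <- c("TG 54:3 [NL-18:1]", "TG 54:3 [-18:2]")

# Extract -{measured_C}:{measured_DB}, then drop the leading -
measured <- stringr::str_extract(both_forms, pattern = "-\\s*\\d+:\\d+")
measured <- stringr::str_remove(measured, pattern = "-")
measured
# "18:1", "18:2"

measured_C  <- stringr::str_extract(measured, pattern = "^\\d+")
measured_DB <- stringr::str_extract(measured, pattern = "\\d+$")
```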

Check if extraction is successful

There will be situations where the extraction fails because the function meets a transition name in a form it has not encountered before. To catch this, we add the following checks.

input_acyl <- "TG 54:3 [NL-18:1]"

if (!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
}

The pattern ^[0-9]+$ matches a string made up entirely of digits: [0-9] matches one digit from 0 to 9, + lets it appear one or more times, and ^ and $ anchor the match to the start and the end of the string, so nothing else may appear.

The reason we use isTRUE is to ensure that the condition is always a single TRUE or FALSE. stringr::str_detect returns NA when its input is NA, which happens when an extraction finds nothing, and an if statement cannot handle NA. Take a look at this Win Vector LLC blog post for more information.
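
The NA case can be reproduced in a few lines. The sketch below uses a name without a bracketed chain, chosen only for illustration, so that the extraction comes back empty.

```r
library(stringr)

# A name without a bracketed chain: the extraction finds nothing
measured_C <- stringr::str_extract("TG 54:3", pattern = "-\\s*\\d+:\\d+")
measured_C
## [1] NA

# str_detect passes the missing value through
check <- stringr::str_detect(measured_C, "^[0-9]+$")
check
## [1] NA

# if (NA) would raise an error; isTRUE turns the NA into FALSE
isTRUE(check)
## [1] FALSE
```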

Use the tools above to clean the transition name

With the {acyl_class}, {total_C}, {total_DB}, {measured_C} and {measured_DB} extracted successfully, we can proceed to calculate

{remaining_C} as {total_C - measured_C}
{remaining_DB} as {total_DB - measured_DB}

remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
    as.character()

remaining_C

## [1] "36"

remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
  as.character()

remaining_DB

## [1] "2"

Lastly, we use stringr::str_glue to construct {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}.

# Named output_acyl so that it does not overwrite the clean_acyl function
output_acyl <- stringr::str_glue(
    "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
  ) %>%
    as.character()

output_acyl

## [1] "TG 18:1_36:2"
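
One extra sanity check that may be worth considering, although it is not part of the function in this post, is to confirm that the subtraction never goes negative and that the two parts add back up to the sum composition. The sketch below retypes the values of the worked example so that it runs on its own.

```r
# Values from the worked example, retyped so the block runs on its own
total_C <- "54"
total_DB <- "3"
measured_C <- "18"
measured_DB <- "1"

remaining_C  <- as.numeric(total_C) - as.numeric(measured_C)
remaining_DB <- as.numeric(total_DB) - as.numeric(measured_DB)

# Hypothetical extra guard: a measured chain larger than the whole lipid
# would give a negative remainder
if (remaining_C < 0 || remaining_DB < 0) {
  stop("Measured chain is larger than the total composition")
}

# The measured and remaining parts should add back up to the total
as.numeric(measured_C) + remaining_C == as.numeric(total_C)
## [1] TRUE
```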

Putting it all together

Combining what we have accomplished in the previous post and in this one, we have the following function and corresponding documentation.

#' @title Clean Acyl Lipids
#' @description Clean Acyl Lipids for `rgoslin` input
#' @param input_acyl A character string highlighting a lipid,
#' Default: 'DG 32:0 \[NL-16:0\]'
#' @return An acyl lipid that `rgoslin` can accept
#' @details We only accept DG, DAG, TG, TAG for now.
#' Input Acyl Lipid is of the form
#' {acyl_class} {total_C}:{total_DB} \[NL-{measured_C}:{measured_DB}\]
#' where C is carbon and DB is double bond
#' Output Acyl Lipid is of the form
#' {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
#' where {remaining_C} is {total_C - measured_C}
#' and {remaining_DB} is {total_DB - measured_DB}
#' @examples
#' clean_acyl(input_acyl = "DG 32:0 [-16:0]")
#' @rdname clean_acyl
#' @export
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {

  # If we have a sum composition labelled as [SIM] at the end
  # after removing the (a\b) and (ISTD), remove it and return the result
  if(isTRUE(stringr::str_detect(string = input_acyl,
                         pattern = "\\s*\\[SIM\\]\\s*$"))) {

    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")

    return(input_acyl)

  }

  # If we have no neutral loss labeling
  if(!isTRUE(stringr::str_detect(string = input_acyl,
                                pattern = "\\s*(\\[)?(NL)?-\\s*\\d+:\\d+(\\])?\\s*$"))) {

    return(input_acyl)

  }

  # From here all should end with [NL-XX:X]

  acyl_class <- input_acyl %>%
    stringr::str_extract(pattern = "^[D|T]A?G")

  total_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "^\\d+")

  total_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "\\d+$")

  measured_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "^\\d+")

  measured_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "\\d+$")

  if(!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
  }

  remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
    as.character()

  remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
    as.character()

  output_acyl <- stringr::str_glue(
    "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
  ) %>%
    as.character()

  return(output_acyl)

}
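
Because the function calls stop() when an extraction fails, a single unexpected name would abort a whole column-wise run. One possible way around this, which is not part of the original workflow, is to wrap the function with purrr::possibly so that a failure becomes NA. The sketch below uses a made-up stand-in function, toy_clean, so that it runs on its own.

```r
library(purrr)

# Hypothetical stand-in for a cleaning function that stops on bad input
toy_clean <- function(x) {
  if (!grepl("^[DT]A?G", x)) {
    stop("Unexpected lipid class in ", x)
  }
  x
}

# possibly() returns the fallback value instead of raising the error
safe_clean <- purrr::possibly(toy_clean, otherwise = NA_character_)

results <- purrr::map_chr(c("TG 54:3", "PC 34:1"), safe_clean)
results
# "TG 54:3", NA
```

The trade-off is that failures become silent, so the NA entries would still need to be reviewed afterwards.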

Plan execution

Here is what it looks like when the function is used.

cleaned_data <- annotation_data %>% 
  dplyr::mutate(
    # clean_acyl returns one string per name, so map_chr gives a character column
    `Clean Name For Annotation` = purrr::map_chr(.data[["Given Name"]],
                                                 clean_acyl)
  ) %>% 
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name", "Clean Name For Annotation"))
  )

cleaned_data %>%
  reactable::reactable(defaultPageSize = 5)

cleaned_data[["Clean Name For Annotation"]] %>% 
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)

Normalized.Name	Grammar	Species.Name	Molecular.Species.Name	Functional.Class.Abbr	Functional.Class.Synonyms	Level	Mass	Sum.Formula
DG 16:0_16:0	Shorthand2020	DG 32:0	DG 16:0_16:0	[DG]	[DG, DAG]	MOLECULAR_SPECIES	568.50667553	C35H68O5
DG 18:1_18:0	Shorthand2020	DG 36:1	DG 18:1_18:0	[DG]	[DG, DAG]	MOLECULAR_SPECIES	622.55362574	C39H74O5
TG 18:1_36:2	Shorthand2020	TG 54:3	TG 18:1_36:2	[TG]	[TG, TAG]	MOLECULAR_SPECIES	870.80402686	C57H106O5
TG 18:2_36:1	Shorthand2020	TG 54:3	TG 18:2_36:1	[TG]	[TG, TAG]	MOLECULAR_SPECIES	870.80402686	C57H106O5
TG 54:3	Shorthand2020	TG 54:3		[TG]	[TG, TAG]	SPECIES	884.78329142	C57H104O6

Remaining columns (values not shown): Original.Name, Message, Adduct, Adduct.Charge, Lipid.Maps.Category, Lipid.Maps.Main.Class, Sn.Position.Name, Structure.Defined.Name, Full.Structure.Name, Total.C, Total.OH, Total.DB, and the Position, C, OH, DB, Bond.Type and DB.Positions columns for each of FA1, FA2, LCB, FA3 and FA4. In the first four rows, three of the chain entries show Position -1, Bond.Type ESTER and DB.Positions [].

Package references

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable", "flair",
              "magrittr", "stringr", "glue", 
              "dplyr", "purrr", "tibble", "report")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Kopczynski D, Hoffmann N, Peng B, Ahrends R (2020). “Goslin: A Grammar of Succinct Lipid Nomenclature.” Analytical Chemistry, 92(16), 10957-10960. https://pubs.acs.org/doi/10.1021/acs.analchem.0c01690.
Lin G (2020). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.2.3, https://CRAN.R-project.org/package=reactable.
Bodwin K, Glanz H (2020). flair: Highlight, Annotate, and Format your R Source Code. R package version 0.0.2, https://CRAN.R-project.org/package=flair.
Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
Hester J, Bryan J (2022). glue: Interpreted String Literals. R package version 1.6.2, https://CRAN.R-project.org/package=glue.
Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.7, https://CRAN.R-project.org/package=tibble.
Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y