Cleaning Lipid Names for Annotation Part 3

By Jeremy Selva

April 23, 2022

Introduction

In this post, we continue cleaning up the lipid annotations that my workplace uses.

Given Name | Clean Name For Annotation | Precursor Ion | Product Ion
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5
TG 54:3 [SIM] | TG 54:3 | 902.8 | 902.8

This is so that they can be processed by lipid annotation converter tools like Goslin (1), (2) and RefMet (3).

Our last function, shown below, removes the [SIM] at the end of the transition name.

clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {
  
  # If we have a sum composition labelled as [SIM] at the end,
  # remove it and return the results
  if (isTRUE(stringr::str_detect(string = input_acyl,
                                 pattern = "\\s*\\[SIM\\]\\s*$")))
  {
    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")
    
    return(input_acyl)
  }
  
  return(input_acyl)
}
clean_acyl("TG 54:3 [SIM]")
## [1] "TG 54:3"

We continue the cleaning task for these transitions.

Given Name | Clean Name For Annotation | Precursor Ion | Product Ion
DG 32:0 [-16:0] | DG 16:0_16:0 | 586.5 | 313.3
DG 36:1 [NL-18:1] | DG 18:1_18:0 | 640.6 | 341.3
TG 54:3 [-18:1] | TG 18:1_36:2 | 902.8 | 603.5
TG 54:3 [NL-18:2] | TG 18:2_36:1 | 902.8 | 605.5

To recap, each of these transitions measures the amount of one fatty acyl chain attached to the given DG or TG. For example, TG 54:3 [NL-18:1] measures the amount of fatty acyl chain 18:1 attached to TG 54:3, while TG 54:3 [-18:2] measures the amount of fatty acyl chain 18:2 attached to TG 54:3 instead.

In general, they are in the form

  • {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]

The NL prefix is sometimes omitted, as in DG 32:0 [-16:0]. We need to transform them into the form

  • {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}

where

  • {remaining_C} is {total_C - measured_C}
  • {remaining_DB} is {total_DB - measured_DB}
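As a quick worked example of the arithmetic, here is the calculation for TG 54:3 [NL-18:1] with the four numbers typed in by hand. The variable name worked_example is only for illustration.

```r
# Values read off "TG 54:3 [NL-18:1]" by hand
total_C <- 54
total_DB <- 3
measured_C <- 18
measured_DB <- 1

# What is left on the lipid once the measured chain is accounted for
remaining_C <- total_C - measured_C      # 54 - 18
remaining_DB <- total_DB - measured_DB   # 3 - 1

worked_example <- paste0("TG ", measured_C, ":", measured_DB,
                         "_", remaining_C, ":", remaining_DB)
worked_example
## [1] "TG 18:1_36:2"
```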

R packages used

library("rgoslin")
library("reactable")
library("flair")
library("readr")
library("magrittr")
library("stringr")
library("glue")
library("dplyr")
library("purrr")
library("tibble")
library("report")
summary(report::report(sessionInfo()))
## The analysis was done using the R Statistical language (v4.2.0; R Core Team, 2022) on Windows 10 x64, using the packages rgoslin (v1.0.0), report (v0.5.1), dplyr (v1.0.9), flair (v0.0.2), glue (v1.6.2), magrittr (v2.0.3), purrr (v0.3.4), reactable (v0.2.3), readr (v2.1.2), stringr (v1.4.0) and tibble (v3.1.7).

The plan

We recall the steps needed to clean such transition names.

  • Get the acyl class of the transition, {acyl_class}.

  • Get the total carbon number of the transition, {total_C}.

  • Get the total number of double bonds of the transition, {total_DB}.

  • Get the carbon number of the measured fatty acid chain, {measured_C}.

  • Get the number of double bonds of the measured fatty acid chain, {measured_DB}.

  • Use the values above to build the clean transition name.
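For comparison, the first five steps could also be done in one go with capture groups. This is only a sketch of an alternative, not the approach used in the rest of this post: the name transition_pattern and the column labels are my own, and the pattern assumes the name ends with a bracketed [NL-...] or [-...] label.

```r
library(stringr)

# One capture group per piece; (?:NL)? makes the NL prefix optional
# without creating an extra group
transition_pattern <- paste0(
  "^([DT]A?G)\\s*(\\d+):(\\d+)",            # acyl class, total C, total DB
  "\\s*\\[(?:NL)?-\\s*(\\d+):(\\d+)\\]\\s*$" # measured C, measured DB
)

parts <- stringr::str_match(
  c("TG 54:3 [NL-18:1]", "DG 32:0 [-16:0]"),
  transition_pattern
)

# Label the columns: the first column holds the whole match
colnames(parts) <- c("match", "acyl_class", "total_C", "total_DB",
                     "measured_C", "measured_DB")
parts
```

Each row of parts holds the pieces for one transition name, and a name that does not fit the pattern gives a row of NA. The step-by-step route below is longer but makes it easier to see which piece failed.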

Read data

annotation_data <- readr::read_csv("https://raw.github.com/JauntyJJS/jaunty-blogdown/main/content/blog/2022-04-23-Clean-Lipid-Names-3/Annotation.csv")

reactable::reactable(annotation_data, defaultPageSize = 5)

Get the acyl class {acyl_class}

We now try to get {acyl_class} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].

To extract DG (or DAG) or TG (or TAG) from the front of the transition name, we first add ^ to tell the software to look at the beginning of the string. Next, as the first letter can only be D or T, we use the character class [DT], which matches exactly one of the listed characters. (The pattern [D|T] also works on these names, but inside square brackets | is a literal character rather than "or", so it would accept a stray | as well.) As A is optional (it appears zero or one time), we add a ? after the A. Finally, we add the G at the end.

This gives the final pattern ^[DT]A?G

acyl_class <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_extract(pattern = "^[DT]A?G")

acyl_class
## [1] "TG"
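Here is a small check of how the acyl class pattern behaves on the four accepted spellings and on a lipid class that should not match. It also shows a detail of square brackets that is easy to miss: inside [ ], the | is treated as an ordinary character. The test strings are made up for illustration.

```r
library(stringr)

class_names <- c("DG 32:0", "DAG 32:0", "TG 54:3", "TAG 54:3", "PC 34:1")

# Gives DG, DAG, TG and TAG for the first four, and NA for "PC 34:1"
extracted_classes <- stringr::str_extract(class_names, pattern = "^[DT]A?G")
extracted_classes

# Inside [ ], | is a literal character, so [D|T] also accepts "|"
stringr::str_detect("|G 32:0", pattern = "^[D|T]A?G")
## [1] TRUE
stringr::str_detect("|G 32:0", pattern = "^[DT]A?G")
## [1] FALSE
```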

Get the total carbon number of the transition {total_C}

The next step is to extract the total carbon number of the given transition, that is, to get {total_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total carbon number is 54. The key is to remove the {acyl_class} at the front first and then extract the total carbon.

To do this, we use the {acyl_class} extracted earlier as a pattern and remove it using stringr::str_remove.

I use stringr::str_glue to build this pattern from the acyl_class variable. The \\s* at the end matches any white space that follows the acyl class, so that it is removed as well.

total_C_step_1 <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))

total_C_step_1
## [1] "54:3 [NL-18:1]"

After removing the {acyl_class} at the front, we extract {total_C}:{total_DB} using the pattern ^\\d+:\\d+. This works because the digits are now the first characters of the string. Remember that ^ tells the software to look at the start of the string.

total_C_step_2 <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+")

total_C_step_2
## [1] "54:3"

From {total_C}:{total_DB}, we can then extract {total_C} using the pattern ^\\d+

total_C <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "^\\d+")

total_C
## [1] "54"

Get the total number of double bond of the transition {total_DB}

The next step is to extract the total double bond number of the given transition or to get {total_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the total number of double bonds is 3.

The extraction process is similar to the previous task. The only difference is that during the stage when we have obtained {total_C}:{total_DB}, we can extract {total_DB} using the pattern \\d+$ instead. Recall that $ tells the software to look at the end of the string.

total_DB <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "\\d+$")

total_DB
## [1] "3"
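Since both totals come from the same {total_C}:{total_DB} string, they could also be pulled out together by splitting on the colon. This is a small sketch of that idea using stringr::str_split_fixed; the names sum_composition, split_total_C and split_total_DB are made up for this example.

```r
library(stringr)

sum_composition <- "54:3"

# Split into exactly two pieces at the colon; the result is a
# character matrix with one row per input string
pieces <- stringr::str_split_fixed(sum_composition, pattern = ":", n = 2)

split_total_C <- pieces[1, 1]
split_total_DB <- pieces[1, 2]

split_total_C
## [1] "54"
split_total_DB
## [1] "3"
```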

Get the total carbon number of the measured fatty acid chain {measured_C}

We now try to get {measured_C} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the carbon number of the measured fatty acid chain is 18. To do this, we first remove the acyl class at the front again, giving us {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}].

measured_C_step1 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*"))
  
measured_C_step1
## [1] "54:3 [NL-18:1]"

As the measured fatty acid chain is always preceded by a -, we extract -{measured_C}:{measured_DB} from {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. This is done by using the pattern -\\s*\\d+:\\d+

measured_C_step2 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+")
  
measured_C_step2
## [1] "-18:1"

We proceed to remove the - from -{measured_C}:{measured_DB}, giving us {measured_C}:{measured_DB}

measured_C_step3 <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
  stringr::str_remove(pattern = "-")
  
measured_C_step3
## [1] "18:1"

Finally, we again use the pattern ^\\d+ to extract the {measured_C}

measured_C <- "TG 54:3 [NL-18:1]" %>%
  stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
  stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
  stringr::str_remove(pattern = "-") %>%
  stringr::str_extract(pattern = "^\\d+")

measured_C
## [1] "18"

Get the total number of double bond of the measured fatty acid chain {measured_DB}

We now try to get {measured_DB} from {acyl_class} {total_C}:{total_DB} [NL-{measured_C}:{measured_DB}]. In the case of TG 54:3 [NL-18:1], the number of double bonds in the measured fatty acid chain is 1.

The process is similar to the previous section. The difference is that once we have obtained {measured_C}:{measured_DB}, we extract {measured_DB} using the pattern \\d+$ instead.

measured_DB <- "TG 54:3 [NL-18:1]" %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-") %>%
    stringr::str_extract(pattern = "\\d+$")

measured_DB
## [1] "1"
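The stringr functions are vectorised, so the same steps can be tried on several names at once. The sketch below runs them on the four transitions from the table, which confirms that both the [NL-18:1] and the [-18:1] styles are handled. As a small variation of my own, it removes -\\s* rather than only -, on the assumption that a space might follow the dash in some names.

```r
library(stringr)

transitions <- c("DG 32:0 [-16:0]", "DG 36:1 [NL-18:1]",
                 "TG 54:3 [-18:1]", "TG 54:3 [NL-18:2]")

# Pull out "-{measured_C}:{measured_DB}", then drop the dash
# (and any white space after it)
measured_chain <- stringr::str_extract(transitions, pattern = "-\\s*\\d+:\\d+")
measured_chain <- stringr::str_remove(measured_chain, pattern = "-\\s*")

measured_C_all <- stringr::str_extract(measured_chain, pattern = "^\\d+")
measured_DB_all <- stringr::str_extract(measured_chain, pattern = "\\d+$")

measured_C_all
## [1] "16" "18" "18" "18"
measured_DB_all
## [1] "0" "1" "1" "2"
```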

Check if extraction is successful

There will be situations where the extraction fails because a new transition name has a form that the function has not encountered before. To catch this, we add the following checks.

input_acyl <- "TG 54:3 [NL-18:1]"

if (!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
}

if (!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
  stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
}

The pattern ^[0-9]+$ detects strings made up only of the digits 0 to 9 (indicated by [0-9]). The digits must appear one or more times (indicated by +), and anchoring the pattern at the start (indicated by ^) and the end (indicated by $) of the string ensures that nothing else is present.

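The reason we use isTRUE is to ensure that the condition is always either TRUE or FALSE. stringr::str_detect returns NA when its input is NA, which is what we get when an earlier extraction finds no match, and an NA condition would make the if statement itself fail with a less helpful error. Take a look at this Win Vector LLC blog post for more information.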
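To see why this matters, here is a minimal demonstration with a missing value standing in for a failed extraction. The variable names are only for illustration.

```r
library(stringr)

# A failed stringr::str_extract gives NA, so we start from NA here
missing_value <- NA_character_

detected <- stringr::str_detect(missing_value, "^[0-9]+$")
detected
## [1] NA

# if (detected) { ... } would stop with an error because the condition is NA.
# Wrapping it in isTRUE turns NA into a plain FALSE.
isTRUE(detected)
## [1] FALSE

# So the failure check used in this post fires as intended
!isTRUE(detected)
## [1] TRUE
```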

Use the tools above to clean the transition name

With the {acyl_class}, {total_C}, {total_DB}, {measured_C} and {measured_DB} extracted successfully, we can proceed to calculate

  • {remaining_C} as {total_C - measured_C}
  • {remaining_DB} as {total_DB - measured_DB}

remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
    as.character()

remaining_C
## [1] "36"

remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
  as.character()

remaining_DB
## [1] "2"

Lastly, we use stringr::str_glue to construct {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}

# Stored as output_acyl so that the clean_acyl function is not overwritten
output_acyl <- stringr::str_glue(
    "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
  ) %>%
    as.character()

output_acyl
## [1] "TG 18:1_36:2"
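One case the digit checks do not cover is a measured chain that is larger than the total, which would give a negative remainder. The sketch below shows one way this could be caught. The helper check_remaining is hypothetical and is not part of the function in this post.

```r
# Hypothetical helper: stop if a computed remainder is missing or negative
check_remaining <- function(remaining, label, input_acyl) {
  if (is.na(remaining) || remaining < 0) {
    stop(paste0("Remaining ", label, " in ", input_acyl, " is not valid"))
  }
  invisible(TRUE)
}

# 3 double bonds in total but 4 on the measured chain cannot be right.
# tryCatch is used here only to capture the error message for display.
result <- tryCatch(
  check_remaining(3 - 4, "double bond", "TG 54:3 [NL-18:4]"),
  error = function(e) conditionMessage(e)
)

result
## [1] "Remaining double bond in TG 54:3 [NL-18:4] is not valid"
```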

Putting it all together

Combining what we have accomplished in the previous post and in this one, we have the following function and its documentation.

#' @title Clean Acyl Lipids
#' @description Clean Acyl Lipids for `rgoslin` input
#' @param input_acyl A character string highlighting a lipid,
#' Default: 'DG 32:0 \[NL-16:0\]'
#' @return An acyl lipid that `rgoslin` can accept
#' @details We only accept DG, DAG, TG, TAG for now.
#' Input Acyl Lipid is of the form
#' {acyl_class} {total_C}:{total_DB} \[NL-{measured_C}:{measured_DB}\]
#' where C is carbon and DB is double bond
#' Output Acyl Lipid is of the form
#' {acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}
#' where {remaining_C} is {total_C - measured_C}
#' and {remaining_DB} is {total_DB - measured_DB}
#' @examples
#' clean_acyl(input_acyl = "DG 32:0 [-16:0]")
#' @rdname clean_acyl
#' @export
clean_acyl <- function(input_acyl = "DG 32:0 [NL-16:0]") {

  # If we have a sum composition labelled as [SIM] at the end
  # after removing the (a\b) and (ISTD)
  if(isTRUE(stringr::str_detect(string = input_acyl,
                         pattern = "\\s*\\[SIM\\]\\s*$"))) {

    input_acyl <- input_acyl %>%
      stringr::str_remove(pattern = "\\s*\\[SIM\\]\\s*$")

    return(input_acyl)

  }

  # If we have no neutral loss labeling
  if(!isTRUE(stringr::str_detect(string = input_acyl,
                                pattern = "\\s*(\\[)?(NL)?-\\s*\\d+:\\d+(\\])?\\s*$"))) {

    return(input_acyl)

  }

  # From here all should end with [NL-XX:X] or [-XX:X]

  # [DT] matches one D or T; inside [ ], a | would be a literal character
  acyl_class <- input_acyl %>%
    stringr::str_extract(pattern = "^[DT]A?G")

  total_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "^\\d+")

  total_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "^\\d+:\\d+") %>%
    stringr::str_extract(pattern = "\\d+$")

  # The extraction pattern allows spaces after the -,
  # so remove those together with the -
  measured_C <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-\\s*") %>%
    stringr::str_extract(pattern = "^\\d+")

  measured_DB <- input_acyl %>%
    stringr::str_remove(pattern = stringr::str_glue("{acyl_class}\\s*")) %>%
    stringr::str_extract(pattern = "-\\s*\\d+:\\d+") %>%
    stringr::str_remove(pattern = "-\\s*") %>%
    stringr::str_extract(pattern = "\\d+$")

  if(!isTRUE(stringr::str_detect(total_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total carbon in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(total_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting total double bond in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(measured_C, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured carbon in {input_acyl} has failed"))
  }

  if(!isTRUE(stringr::str_detect(measured_DB, "^[0-9]+$"))) {
    stop(glue::glue("Extracting measured double bond in {input_acyl} has failed"))
  }

  remaining_C <- (as.numeric(total_C) - as.numeric(measured_C)) %>%
    as.character()

  remaining_DB <- (as.numeric(total_DB) - as.numeric(measured_DB)) %>%
    as.character()

  output_acyl <- stringr::str_glue(
    "{acyl_class} {measured_C}:{measured_DB}_{remaining_C}:{remaining_DB}"
  ) %>%
    as.character()

  return(output_acyl)

}

Plan execution

Here is what it looks like when the function is used.

cleaned_data <- annotation_data %>%
  dplyr::mutate(
    # map_chr gives a character column; purrr::map would give a list column
    `Clean Name For Annotation` = purrr::map_chr(.data[["Given Name"]],
                                                 clean_acyl)
  ) %>%
  # Make Given Name and Clean Name the first two columns
  dplyr::relocate(
    dplyr::any_of(c("Given Name", "Clean Name For Annotation"))
  )

cleaned_data %>%
  reactable::reactable(defaultPageSize = 5)

cleaned_data[["Clean Name For Annotation"]] %>%
  rgoslin::parseLipidNames() %>%
  reactable::reactable(defaultPageSize = 5)
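When a function is applied to every element of a column, the choice between purrr::map and purrr::map_chr decides the type of the result. The sketch below shows the difference on two made-up names, with a tiny stand-in function (strip_sim, hypothetical) in place of the full cleaning function.

```r
library(purrr)

given_names <- c("TG 54:3 [SIM]", "DG 32:0 [SIM]")

# Hypothetical stand-in for the cleaning function: only strips [SIM]
strip_sim <- function(x) sub("\\s*\\[SIM\\]\\s*$", "", x)

as_list <- purrr::map(given_names, strip_sim)     # a list of two strings
as_chr <- purrr::map_chr(given_names, strip_sim)  # a character vector

is.list(as_list)
## [1] TRUE
is.character(as_chr)
## [1] TRUE
```

A character vector is usually the safer input for functions that expect a vector of lipid names, and map_chr also stops with an error if any result is not a single string.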

Package references

get_citation <- function(package_name) {
  transform_name <- package_name %>% 
    citation() %>% 
    format(style="text")
  return(transform_name)
} 

packages <- c("base","rgoslin", "reactable", "flair",
              "magrittr", "stringr", "glue", 
              "dplyr", "purrr", "tibble", "report")

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(.data[["Packages"]],
                                    get_citation)
  ) %>% 
  dplyr::pull(.data[["transform_name"]]) %>% 
  report::as.report_parameters()

References

1. Kopczynski D, Hoffmann N, Peng B, Ahrends R. Goslin: A grammar of succinct lipid nomenclature. Analytical Chemistry [Internet]. 2020;92(16):10957–60. Available from: https://doi.org/10.1021/acs.analchem.0c01690

2. Kopczynski D, Hoffmann N, Peng B, Liebisch G, Spener F, Ahrends R. Goslin 2.0 implements the recent lipid shorthand nomenclature for MS-derived lipid structures. Analytical Chemistry [Internet]. 2022;94(16):6097–101. Available from: https://doi.org/10.1021/acs.analchem.1c05430

3. Fahy E, Subramaniam S. RefMet: A reference nomenclature for metabolomics. Nature Methods [Internet]. 2020 Dec 1;17(12):1173–4. Available from: https://doi.org/10.1038/s41592-020-01009-y