How to apply gsub or similar to change column names, but only if the column name contains a specific word

Time:09-23

I have a proteomic dataset whose automatically generated column names are ridiculously long.

  PG.BiologicalProcess PG.MolecularFunction X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity X.2..20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity
1                  NaN                  NaN                                                             642500.0                                                             174625.3
2                  NaN                  NaN                                                             790875.8                                                             910906.9
  X.3..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity
1                                                                       866325                                                                300197.3
2 

If a column name contains the word CSF, gsub or similar should be applied to extract everything between CSF- and the first . or _ that follows. The extracted number and letter(s) should be separated by a -.

So, X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity becomes 01-R (important that it is 01 and not just 1).

And

X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity becomes 01-Sr.

I tried different approaches, such as gsub(".*CSF|\\s.*", ".", a), but that did not solve it. All columns not containing the word CSF should remain unchanged.

Expected output

  PG.BiologicalProcess PG.MolecularFunction     01-R     02-R  01-30R    01-Sr
1                  NaN                  NaN 642500.0 174625.3  866325 300197.3
2                  NaN                  NaN 790875.8 910906.9 2164413 682274.3

Data sample

a <- structure(list(PG.BiologicalProcess = c(NaN, NaN), PG.MolecularFunction = c(NaN, 
NaN), `[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity` = c(642500, 
790875.75), `[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity` = c(174625.3281, 
910906.875), `[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity` = c(866325, 
2164413), `[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity` = c(300197.3125, 
682274.3125)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", 
"data.frame"))

User response:

You can use

colnames(a) <- sub(".*CSF-([^._]*).*", "\\1", colnames(a))

Details:

  • .* - any zero or more chars as many as possible
  • CSF- - CSF- text
  • ([^._]*) - capturing group 1 (\1, written as \\1 inside an R string, refers to this group's value in the replacement pattern): any zero or more chars other than . and _
  • .* - the rest of the string.
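
As a quick check, the pattern can be tried on a plain character vector of the column names from the data sample. This is only a sketch: the variable names nms, new_nms, has_csf and explicit are made up for illustration. It relies on sub() returning its input unchanged when the pattern does not match, which is why the non-CSF columns keep their names. The second half shows a more explicit variant that uses grepl() to touch only the matching names.

```r
# Column names copied from the data sample in the question
nms <- c(
  "PG.BiologicalProcess",
  "PG.MolecularFunction",
  "[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity",
  "[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity",
  "[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity",
  "[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity"
)

# Names without "CSF-" do not match, so sub() passes them through untouched
new_nms <- sub(".*CSF-([^._]*).*", "\\1", nms)
print(new_nms)

# More explicit variant: rename only the names that contain "CSF-"
has_csf <- grepl("CSF-", nms, fixed = TRUE)
explicit <- nms
explicit[has_csf] <- sub(".*CSF-([^._]*).*", "\\1", nms[has_csf])
```

Both variants give the same names here; the grepl() version only makes the "leave other columns alone" condition visible in the code.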