How to apply gsub or similar to change column names, but only if the column name contains a specific word

I have a proteomics dataset whose automatically generated column names are ridiculously long.

  PG.BiologicalProcess PG.MolecularFunction X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity X.2..20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity
1                  NaN                  NaN                                                             642500.0                                                             174625.3
2                  NaN                  NaN                                                             790875.8                                                             910906.9
  X.3..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity
1                                                                       866325                                                                300197.3
2                                                                      2164413                                                                682274.3

If a column name contains the word CSF, gsub or similar should be applied to extract everything between CSF- and the first . or _ that follows. The number and letter(s) that are extracted should stay separated by a -.

So, X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity becomes 01-R (important that it is 01 and not just 1).

And

X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity becomes 01-Sr.

I tried different approaches, such as gsub(".*CSF|\\s.*", ".", a), but that did not solve it. All columns not containing the word CSF should remain unchanged.

Expected output

  PG.BiologicalProcess PG.MolecularFunction     01-R     02-R  01-30R    01-Sr
1                  NaN                  NaN 642500.0 174625.3  866325 300197.3
2                  NaN                  NaN 790875.8 910906.9 2164413 682274.3

Data sample

a <- structure(list(PG.BiologicalProcess = c(NaN, NaN), PG.MolecularFunction = c(NaN, 
NaN), `[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity` = c(642500, 
790875.75), `[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity` = c(174625.3281, 
910906.875), `[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity` = c(866325, 
2164413), `[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity` = c(300197.3125, 
682274.3125)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", 
"data.frame"))

Answer:

You can use

colnames(a) <- sub(".*CSF-([^._]*).*", "\\1", colnames(a))

Details of the pattern:

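.* - zero or more characters, as many as possible (greedy)
CSF- - the literal text CSF-
([^._]*) - capturing group 1: zero or more characters other than . and _ (\1 in the replacement refers to this group's value)
.* - the rest of the string.

Because the pattern spans the whole string, the replacement keeps only group 1. sub() returns non-matching strings unchanged, so columns without CSF- keep their names.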
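
As a quick check, here is a minimal sketch that applies the same pattern to the column names from the data sample in the question. The vector names nms, short, has_csf and guarded are made up for illustration, and the grepl() guard is an optional variant, not part of the answer above; it only makes the "leave other columns alone" rule explicit.

```r
# Column names copied from the data sample in the question
nms <- c(
  "PG.BiologicalProcess",
  "PG.MolecularFunction",
  "[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity",
  "[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity",
  "[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity",
  "[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity"
)

# Pattern from the answer: names without "CSF-" do not match
# and are returned unchanged by sub()
short <- sub(".*CSF-([^._]*).*", "\\1", nms)
print(short)
# [1] "PG.BiologicalProcess" "PG.MolecularFunction" "01-R"
# [4] "02-R"                 "01-30R"               "01-Sr"

# Optional explicit guard (illustrative variant): only rewrite names
# that actually contain the literal text "CSF-"
has_csf <- grepl("CSF-", nms, fixed = TRUE)
guarded <- nms
guarded[has_csf] <- sub(".*CSF-([^._]*).*", "\\1", nms[has_csf])

# Both approaches give the same result on this sample
stopifnot(identical(short, guarded))
```

With the real data, the same call is applied to colnames(a) and assigned back, as in the answer. If two samples ever shared the same CSF label, the shortened names would collide, so it may be worth checking anyDuplicated(short) before assigning.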