I have a proteomic dataset whose automatically generated column names are ridiculously long.
PG.BiologicalProcess PG.MolecularFunction X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity X.2..20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity
1 NaN NaN 642500.0 174625.3
2 NaN NaN 790875.8 910906.9
X.3..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity
1 866325 300197.3
2 2164413 682274.3
If a column name contains the word CSF, gsub or similar should be applied to extract everything that comes between CSF- and the first . or _. The extracted number and letter(s) should stay separated by a -.
So, X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity becomes 01-R (important that it is 01 and not just 1).
And X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity becomes 01-Sr.
I tried different approaches, such as gsub(".*CSF|\\s.*", ".", a), but that did not solve it. All columns not containing the word CSF should remain unchanged.
Expected output
PG.BiologicalProcess PG.MolecularFunction 01-R 02-R 01-30R 01-Sr
1 NaN NaN 642500.0 174625.3 866325 300197.3
2 NaN NaN 790875.8 910906.9 2164413 682274.3
Data sample
a <- structure(list(PG.BiologicalProcess = c(NaN, NaN), PG.MolecularFunction = c(NaN,
NaN), `[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity` = c(642500,
790875.75), `[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity` = c(174625.3281,
910906.875), `[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity` = c(866325,
2164413), `[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity` = c(300197.3125,
682274.3125)), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
CodePudding user response:
You can use
colnames(a) <- sub(".*CSF-([^._]*).*", "\\1", colnames(a))
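As a quick check, here is a minimal sketch that applies the same sub() call to the column names from the question's data sample. The vectors nms and new_nms are illustrative names introduced here, not part of the original data.

```r
# Column names copied from the dput() sample in the question
nms <- c(
  "PG.BiologicalProcess",
  "PG.MolecularFunction",
  "[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity",
  "[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity",
  "[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity",
  "[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity"
)

# sub() returns a name unchanged when the pattern does not match,
# so the two PG.* columns keep their original names
new_nms <- sub(".*CSF-([^._]*).*", "\\1", nms)
print(new_nms)
# [1] "PG.BiologicalProcess" "PG.MolecularFunction" "01-R"
# [4] "02-R"                 "01-30R"               "01-Sr"
```

Because the captured text is kept as a string, the leading zero in 01 is preserved.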
Details:
.* - any zero or more chars, as many as possible
CSF- - the literal CSF- text
([^._]*) - capturing group 1 (\1 refers to the group value from the replacement pattern): any zero or more chars other than . and _
.* - the rest of the string.
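To see what the capturing group picks up, the pattern can be inspected with regexec() and regmatches() from base R. This is only an illustrative sketch: the variables pattern, x, m and no_hyphen are names introduced here, and the name in no_hyphen is a hypothetical one, not taken from the dataset.

```r
pattern <- ".*CSF-([^._]*).*"
x <- "X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity"

# regexec() records the full match plus each capturing group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(x, regexec(pattern, x))[[1]]
m[1]  # the full match, which here is the whole column name
m[2]  # capturing group 1: "01-R"

# Hypothetical name: it contains CSF but no "CSF-", so the pattern
# does not match and sub() returns the name unchanged
no_hyphen <- "X.5..sample_CSF_pool.raw.PG.MS1Quantity"
sub(pattern, "\\1", no_hyphen)
```

Since the pattern consumes the entire string, replacing the match with \\1 leaves only the captured part. Names that mention CSF without the hyphen are left as they are, which may or may not be what you want for such columns.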