R - substr over multiple columns in Dataframe

Time:08-28

Let's say I have a data frame that looks like this:

Column1   Column2   Column3
a_2019    b_2020    c_2021
d_2019    e_2020    f_2021
a_2019    b_2020    c_2021
d_2019    e_2020    f_2021

And I would like to take out "_2019", "_2020", and "_2021". I could use

df$Column1 <- substr(df$Column1, 1, nchar(df$Column1)-5)

for every column, but I have multiple data frames with quite a few columns. substr needs a character vector to work on, so passing df[,3:10] doesn't work, and lapply doesn't either.

Any suggestions on how to achieve this in an elegant way? Thank you.

CodePudding user response:

We can try using lapply along with sub for a base R option:

df[cols] <- lapply(df[cols], function(x) sub("_(?:2019|2020|2021)$", "", x))

Here cols should be a vector containing the column names on which you seek to make the replacement.
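As a minimal sketch of how this plays out, assume the sample data from the question is rebuilt by hand and cols is simply all three column names (both are assumptions for illustration). The alternation is written here with an ordinary group, which matches the same suffixes:

```r
# Rebuild the example data from the question (values typed in by hand).
df <- data.frame(
  Column1 = c("a_2019", "d_2019", "a_2019", "d_2019"),
  Column2 = c("b_2020", "e_2020", "b_2020", "e_2020"),
  Column3 = c("c_2021", "f_2021", "c_2021", "f_2021"),
  stringsAsFactors = FALSE
)

# Assumption: every column should be cleaned, so cols is all column names.
cols <- names(df)

# lapply() hands each column to sub() as a character vector;
# assigning back into df[cols] keeps the data frame structure.
df[cols] <- lapply(df[cols], function(x) sub("_(2019|2020|2021)$", "", x))

df
```

The key point is assigning back into df[cols] rather than overwriting df itself, so the result stays a data frame instead of becoming a plain list.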

More generally, to target underscore followed by any number, we can use:

df[cols] <- lapply(df[cols], function(x) sub("_\\d+$", "", x))  # or "_\\d{4}$" for a four-digit year
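Since the question mentions several data frames, the same idea can be wrapped in a small helper and mapped over a list. This is only a sketch: strip_suffix and dfs are hypothetical names, the example data is made up, and it assumes every character column should be cleaned with the pattern "_\\d+$" (an underscore followed by one or more digits at the end of the string).

```r
# Hypothetical helper: strip a trailing "_<digits>" from all character columns.
strip_suffix <- function(df, pattern = "_\\d+$") {
  # Find the character columns; numeric columns are left untouched.
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) sub(pattern, "", x))
  df
}

# Assumption: the data frames are collected in a named list.
dfs <- list(
  first = data.frame(
    Column1 = c("a_2019", "d_2019"),
    Column2 = c("b_2020", "e_2020"),
    stringsAsFactors = FALSE
  ),
  second = data.frame(
    Column3 = c("c_2021", "f_2021"),
    n = c(1, 2),
    stringsAsFactors = FALSE
  )
)

# Apply the helper to every data frame in the list.
cleaned <- lapply(dfs, strip_suffix)
```

Selecting columns by type avoids maintaining a cols vector per data frame; if only some character columns should change, passing explicit column names as in the answer above is the safer choice.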