Using sub() to extract after a character over multiple columns

Consider the following code:

x <- c('2','75% (3/4)','80% (4/5)','70% (7/10)','90% (9/10)') 
y <- c('1', '50% (1/2)', '25% (1/4)', '30% (3/10)', '40% (2/5)')

df <- data.frame(rbind(x, y))

I would like to extract the values before the % sign, i.e. the whole numbers.

I understand that I can do this using the following:

df$X2 <- sub("%.*", "", df$X2)

But to avoid copying and pasting this for each column, is there a way to do it in one step?

I have tried to do the following:

df[-1] <- sub("%.*", "", df[-1])

But this leaves values like 'c("75', which is not what I am after. What has gone wrong here? Is there another suitable way to do this?

Thanks

CodePudding user response：

The easiest way would likely be to do this using dplyr:

library(dplyr)

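# a lambda is used because passing extra arguments through across() is deprecated as of dplyr 1.1.0
mutate(df, across(everything(), ~ stringr::str_remove(.x, "%.*")))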

  X1 X2 X3 X4 X5
x  2 75 80 70 90
y  1 50 25 30 40

CodePudding user response：

Base R:

df[] <- lapply(df, sub, pattern = "%.*", replacement = "")
df
#   X1 X2 X3 X4 X5
# x  2 75 80 70 90
# y  1 50 25 30 40

The df[] <- is necessary because lapply returns a list, not a data.frame. Using df[] on the LHS of the assignment replaces the contents of the columns while keeping the structure of the frame (its class, row names and column names). This also works well when operating on a subset of columns, as in

df[c(2,3,5)] <- lapply(df[c(2,3,5)], sub, pattern = "%.*", replacement = "")

which is admittedly not what you want here, but provides a way to customize which columns are affected.
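If the columns are not known by position, one option is to select them by content instead. The following is only a sketch: it rebuilds the data from the question, and has_pct is a name made up here for illustration. It assumes the columns are character, so stringsAsFactors = FALSE is set explicitly.

```r
x <- c('2','75% (3/4)','80% (4/5)','70% (7/10)','90% (9/10)')
y <- c('1', '50% (1/2)', '25% (1/4)', '30% (3/10)', '40% (2/5)')
df <- data.frame(rbind(x, y), stringsAsFactors = FALSE)

# has_pct (hypothetical name): TRUE for every column in which
# at least one value contains a literal "%"
has_pct <- vapply(df, function(col) any(grepl("%", col, fixed = TRUE)), logical(1))

# Only the flagged columns are rewritten; the others are left untouched
df[has_pct] <- lapply(df[has_pct], sub, pattern = "%.*", replacement = "")
df
```

With the example data this flags every column except X1, so the result should match the output shown above.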

The lapply(df, sub, ...) call is equivalent to using an anonymous function:

lapply(df, function(z) sub("%.*", "", z))

lapply passes each element of its first argument (df here) as the first unnamed argument to the function. For sub, that position would normally be pattern=, so we supply pattern and replacement by name; each column is then matched to the next unfilled argument, x. Anything after the first two arguments to lapply (X, our df; and FUN) is passed unchanged to the function on every call.
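As for what went wrong in the original attempt: sub works on character vectors, and a data.frame is a list of columns. When given a list, sub coerces it with as.character, which turns each whole column into a single deparsed string; the pattern then cuts that string at the first %. A small sketch using the data from the question (flat is a name introduced here for illustration, and character columns are assumed):

```r
x <- c('2','75% (3/4)','80% (4/5)','70% (7/10)','90% (9/10)')
y <- c('1', '50% (1/2)', '25% (1/4)', '30% (3/10)', '40% (2/5)')
df <- data.frame(rbind(x, y), stringsAsFactors = FALSE)

# Each column collapses into one string such as: c("75% (3/4)", "50% (1/2)")
as.character(df[-1])

# sub() then removes everything from the first "%" onwards in that string,
# leaving one fragment per column instead of one value per cell
flat <- sub("%.*", "", df[-1])
flat
```

So sub never sees the individual cells. Applying it column by column, as lapply does, avoids the coercion.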

CodePudding user response：

Maybe this is the output you were looking for?

for (i in colnames(df)){
  df[,i] <- sub("%.*", "", df[,i])
}
print(df)
  X1 X2 X3 X4 X5
x  2 75 80 70 90
y  1 50 25 30 40
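Note that the values above are still character strings. If actual whole numbers are wanted, the same loop could convert them as it goes. This is a sketch only, assuming every value is a plain integer once the text from % onwards is removed (as in the example data):

```r
x <- c('2','75% (3/4)','80% (4/5)','70% (7/10)','90% (9/10)')
y <- c('1', '50% (1/2)', '25% (1/4)', '30% (3/10)', '40% (2/5)')
df <- data.frame(rbind(x, y), stringsAsFactors = FALSE)

for (i in colnames(df)) {
  # Strip everything from "%" onwards, then convert the remainder to integer
  df[[i]] <- as.integer(sub("%.*", "", df[[i]]))
}
str(df)
```

Any value that does not parse as a number would become NA with a warning, so it is worth checking the result on real data.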