Home > Software design >  How to apply same operation to multiple columns spread out in dataframe in R?
How to apply same operation to multiple columns spread out in dataframe in R?

Time:12-30

beginner here! I have a dataframe with over 600 columns. I want to apply some simple operations to specific columns that are all named similarly (see example below) but spread out throughout the dataframe. How can I build a loop for all these columns based on their names and the ascending numbers at the end of the column name?

Dataframe example:

Name  Col_x_1  Company  Col_x_2  Start_Year  End_Year  Col_x_3
asd   4 Col    Test     2 Col    1902        1933      1 Col
kfj   5 Cols   Test_2   10 Col   1933        1954      0
ale   0        Test_3   11 Cols  1988        1999      5 Col
...

Output example: I want to remove the string following the space behind the value in the columns named "Col_x_1, Col_x_2, ..." (going up until 56) and turn the columns into dbl.

Name  Col_x_1  Company  Col_x_2  Start_Year  End_Year  Col_x_3
asd   4        Test     2        1902        1933      1 
kfj   5        Test_2   10       1933        1954      0
ale   0        Test_3   11       1988        1999      5 
...

How can I build a loop to perform this simple operation? Thank you in advance!

CodePudding user response:

You could use tidyverse:

library(tidyverse)

df %>% 
  mutate(across(starts_with("Col_x"), ~str_extract(., "^\\d ") %>% as.numeric()))

This returns

# A tibble: 3 × 7
  Name  Col_x_1 Company Col_x_2 Start_Year End_Year Col_x_3
  <chr>   <dbl> <chr>     <dbl>      <dbl>    <dbl>   <dbl>
1 asd         4 Test          2       1902     1933       1
2 kfj         5 Test_2       10       1933     1954       0
3 ale         0 Test_3       11       1988     1999       5

Or, depending on your actual data

df %>% 
  mutate(across(starts_with("Col_x"), ~str_remove_all(., "(?<=\\d)\\s [A-z] ") %>% as.numeric()))

The first one extracts the starting number (assuming it's an integer) and removes the remaining content. The second one looks for a number followed by a space and some letters and removes them.

Finally both convert the remaining part into a numeric.

We use starts_with() to select the columns by a names pattern. There are other possibilities like ends_with(), matches() or contains(). The selection of the appropriate function depends on the actual structure of the names.

CodePudding user response:

If you must use a loop, you could use grep to get the columns names that contains "Col_x_" in the beginning and gsub to remove the words that you want

names_dat <- grep("^Col_x_",names(dat), value = TRUE)
for (i in names_dat) {
  dat[[i]] <- as.numeric(gsub("Cols?","", dat[[i]]))
}
  • Related