Home > Enterprise >  How do I drop numbers from the start of column names? (preferably through tidyverse)
How do I drop numbers from the start of column names? (preferably through tidyverse)

Time:11-24

I'm working through a task where I need to bind a few survey datasets, but unfortunately the survey questions are inconsistently numbered (wording is consistent). To solve this, I want to drop the question number from the start of each question.

Currently I am doing this manually with rename(), but it is time consuming to repeat for every question across each dataset. Any tips to do this in a quicker, more efficient way?

Here's an example dataset and my current process:

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  `1. First Question`  = c('a', 'b', 'c', 'd', 'e'),
                  `2. Second Question` = c(1, 1, 3, 0, 1),
                  `3. Third Question`  = c(1, 2, 0, 2, 1),
                   Year = 2021) %>%
       rename(`First Question` = `1. First Question`,
              `Second Question` = `2. Second Question`,
              `Third Question` = `3. Third Question`)

df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  `1. First Question`  = c('a', 'b', 'c', 'd', 'e'),
                  `2. Third Question`  = c(2, 1, 3, 1, 2),
                  `3. Second Question` = c(2, 2, 1, 3, 2),
                  Year = 2022) %>% 
       rename(`First Question`  = `1. First Question`,
              `Second Question` = `3. Second Question`,
              `Third Question`  = `2. Third Question`)

end_df <- rbind(df1, df2)

CodePudding user response:

You can use rename_with, which uses a function, here sub, to change the column names based on a regex pattern:

df1 %>%
    rename_with(~ sub("^X\\d\\.\\.", "", .))
  ID First.Question Second.Question Third.Question Year
1  1              a               1              1 2021
2  2              b               1              2 2021
3  3              c               3              0 2021
4  4              d               0              2 2021
5  5              e               1              1 2021

As noted by @zephryl you can do it in one go for a list of all your dataframes:

list(df1, df2) %>%
  map(rename_with, ~ sub("^X\\d\\.\\.", "", .))

Data:

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  `1. First Question` = c('a', 'b', 'c', 'd', 'e'),
                  `2. Second Question` = c(1, 1, 3, 0, 1),
                  `3. Third Question` = c(1, 2, 0, 2, 1),
                  Year = 2021)

df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  `1. First Question` = c('a', 'b', 'c', 'd', 'e'),
                  `2. Third Question` = c(2, 1, 3, 1, 2),
                  `3. Second Question` = c(2, 2, 1, 3, 2),
                  Year = 2022)

CodePudding user response:

Using a regex with stringr::str_remove() inside dplyr::rename_with() for each dataframe:

library(purrr)
library(dplyr)
library(stringr)

list(df1, df2) %>%
  map(rename_with, ~ str_remove(.x, "^\\d\\.\\s")) %>%
  bind_rows()
# A tibble: 10 × 5
      ID `First Question` `Second Question` `Third Question`  Year
   <dbl> <chr>                        <dbl>            <dbl> <dbl>
 1     1 a                                1                1  2021
 2     2 b                                1                2  2021
 3     3 c                                3                0  2021
 4     4 d                                0                2  2021
 5     5 e                                1                1  2021
 6     1 a                                2                2  2022
 7     2 b                                2                1  2022
 8     3 c                                1                3  2022
 9     4 d                                3                1  2022
10     5 e                                2                2  2022

CodePudding user response:

A base R alternative

colnames(df1)[2:4] <- sub("^[0-9]\\. ", "", colnames(df1)[2:4])
colnames(df2)[2:4] <- sub("^[0-9]\\. ", "", colnames(df2)[2:4])

rbind(df1, df2)
   ID First Question Second Question Third Question Year
1   1              a               1              1 2021
2   2              b               1              2 2021
3   3              c               3              0 2021
4   4              d               0              2 2021
5   5              e               1              1 2021
6   1              a               2              2 2022
7   2              b               2              1 2022
8   3              c               1              3 2022
9   4              d               3              1 2022
10  5              e               2              2 2022

Important side note. Create the data frames with check.names = F, otherwise names get substituted with something like this X1..First.Question etc.

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  `1. First Question` = c('a', 'b', 'c', 'd', 'e'),
                  `2. Second Question` = c(1, 1, 3, 0, 1),
                  `3. Third Question` = c(1, 2, 0, 2, 1),
                  Year = 2021, check.names = F)
  • Related