Extract parts of column names for renaming

I have a dataframe where some of the columns are named as dates. For example, something like this:

library(tibble)

# tibble() replaces the deprecated data_frame()
df_1 <- tibble("id" = c('a','b','c','d'),
               "gender" = c('m','f','f','m'),
               "05/16/2017" = c(1,2,3,4),
               "11/08/2016" = c(1,2,3,4),
               "08/15/2016" = c(1,2,3,4))

df_1
# A tibble: 4 x 5
  id    gender `05/16/2017` `11/08/2016` `08/15/2016`
  <chr> <chr>         <dbl>        <dbl>        <dbl>
1 a     m                 1            1            1
2 b     f                 2            2            2
3 c     f                 3            3            3
4 d     m                 4            4            4

For the columns that are currently dates in the format mm/dd/yyyy, I would like to extract the mm and yyyy components and use them to rename the columns to election_yyyy_mm. That is, I would end up with a data frame that looks like this:

df_2 <- tibble("id" = c('a','b','c','d'),
               "gender" = c('m','f','f','m'),
               "election_2017_05" = c(1,2,3,4),
               "election_2016_11" = c(1,2,3,4),
               "election_2016_08" = c(1,2,3,4))

df_2
# A tibble: 4 x 5
  id    gender election_2017_05 election_2016_11 election_2016_08
  <chr> <chr>             <dbl>            <dbl>            <dbl>
1 a     m                     1                1                1
2 b     f                     2                2                2
3 c     f                     3                3                3
4 d     m                     4                4                4

I think I have a partial solution involving stringr, but currently I have to run str_extract twice to get the mm and the yyyy components respectively. I'm also not sure how I can pass a vector to rename().

These are the two snippets I have so far:

stringr::str_extract(c("05/16/2017", "11/08/2016", "08/15/2016"), "^[^/]+")
[1] "05" "11" "08"

stringr::str_extract(c("05/16/2017", "11/08/2016", "08/15/2016"), "[0-9]{4}")
[1] "2017" "2016" "2016"

Can anyone help me a) extract both elements (the yyyy and mm bits) in one call to str_extract (or some other function), and b) pass the resulting vector to rename?

CodePudding user response：

We can use rename_with to rename with a function. Inside the renaming function, we can first parse the characters as dates with mdy(), then extract the month() and year(). Finally, glue() the elements back together.

library(dplyr)
library(glue)
library(lubridate)

df_1 %>% rename_with( ~glue('election_{year(mdy(.x))}_{month(mdy(.x))}'),
                      matches("\\d{2}/\\d{2}/\\d{4}"))

output

# A tibble: 4 × 5
  id    gender election_2017_5 election_2016_11 election_2016_8
  <chr> <chr>            <dbl>            <dbl>           <dbl>
1 a     m                    1                1               1
2 b     f                    2                2               2
3 c     f                    3                3               3
4 d     m                    4                4               4
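
Note that month() returns a number, so the leading zero is dropped (election_2017_5 rather than the requested election_2017_05). A minimal sketch of one way to keep the zero-padded month, assuming the same df_1 as in the question: format the parsed date with format() and the %Y and %m specifiers instead of calling year() and month(). The name df_padded is only illustrative.

```r
library(dplyr)
library(lubridate)

# Same example data as in the question
df_1 <- tibble::tibble(
  id = c("a", "b", "c", "d"),
  gender = c("m", "f", "f", "m"),
  `05/16/2017` = c(1, 2, 3, 4),
  `11/08/2016` = c(1, 2, 3, 4),
  `08/15/2016` = c(1, 2, 3, 4)
)

# Parse each matching name as a date, then format it as yyyy_mm;
# %m always yields a two-digit month
df_padded <- df_1 %>%
  rename_with(
    ~ paste0("election_", format(mdy(.x), "%Y_%m")),
    matches("^\\d{2}/\\d{2}/\\d{4}$")
  )

names(df_padded)
```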

We can also use stringr::str_extract_all, which returns every match for each element of the input vector rather than only the first. Using a modified regex from the OP's attempt, we can extract both month and year in a single call: match either (|) the digits (\\d+) at the beginning (^) or at the end ($) of the string: "^\\d+|\\d+$".

The answer would look like this:

df_1 %>% rename_with( ~stringr::str_extract_all(.x, "^\\d+|\\d+$") %>%
                              purrr::map_chr(~glue('election_{.x[2]}_{.x[1]}')),
                      matches("\\d{2}/\\d{2}/\\d{4}"))
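
To make the intermediate result visible, here is a small sketch run on a plain character vector instead of the data frame. It assumes only stringr is installed; the names dates, parts and new_names are illustrative, and base vapply() stands in for map_chr().

```r
library(stringr)

dates <- c("05/16/2017", "11/08/2016", "08/15/2016")

# One call returns a list with one character vector per input string:
# the leading digits (month) first, then the trailing digits (year)
parts <- str_extract_all(dates, "^\\d+|\\d+$")

# Build the new names as election_<year>_<month>
new_names <- vapply(
  parts,
  function(p) paste0("election_", p[2], "_", p[1]),
  character(1)
)

new_names
```

Because str_extract_all() returns a list, a per-element step such as map_chr() or vapply() is needed to turn each pair back into a single string.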

CodePudding user response：

Using tidyverse (dplyr and stringr), we can rename the columns like this:

library(dplyr)

df_1 %>% 
  rename_with(
    .cols = contains("/"), # selects only the date columns
    ~ paste0(
      "election_",  
      stringr::str_sub(.x, -4, -1), # last 4 digits/letters
      "_",
      stringr::str_sub(.x, 1, 2) # first 2 digits/letters
    )
  )

Result:

# A tibble: 4 x 5
  id    gender election_2017_05 election_2016_11 election_2016_08
  <chr> <chr>             <dbl>            <dbl>            <dbl>
1 a     m                     1                1                1
2 b     f                     2                2                2
3 c     f                     3                3                3
4 d     m                     4                4                4

CodePudding user response：

Here's a one-liner using regex:

names(df_1) <- sub("(\\d+).*?(\\d+)$", "election_\\2_\\1", names(df_1))

How this works: First, you divide the column names into two capture groups:

(\\d+): the first capture group, capturing the leading digits (the month)
.*?: anything thereafter, matched lazily, until ...
(\\d+)$: ... the second capture group, capturing the trailing digits (the year).

Then, using sub's replacement argument, you prepend the string election_ to the matching names and refer back to the two capture groups in reversed order using the backreferences \\2 and \\1.
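
As a quick check of this behaviour, the sketch below applies the same pattern to a plain character vector holding the column names from the question. The names old_names and new_names are illustrative, and perl = TRUE is an addition made here so that the lazy quantifier follows PCRE semantics; the one-liner above omits it.

```r
# Column names as in the question's df_1
old_names <- c("id", "gender", "05/16/2017", "11/08/2016", "08/15/2016")

# \\1 holds the month and \\2 the year; the replacement swaps their order.
# Names without digits do not match and are returned unchanged.
new_names <- sub("(\\d+).*?(\\d+)$", "election_\\2_\\1", old_names, perl = TRUE)

new_names
```

Since sub() leaves non-matching strings as they are, id and gender need no special handling.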

Using stringr:

library(stringr)
names(df_1) <- str_replace(names(df_1), "(\\d+).*?(\\d+)$", "election_\\2_\\1")

Result:

df_1 
# A tibble: 4 × 5
  id    gender election_2017_05 election_2016_11 election_2016_08
  <chr> <chr>             <dbl>            <dbl>            <dbl>
1 a     m                     1                1                1
2 b     f                     2                2                2
3 c     f                     3                3                3
4 d     m                     4                4                4

CodePudding user response：

Here is an alternative approach:

library(dplyr)
library(stringr)
df_1 %>% 
  rename_with(
    ~ str_c("election", str_sub(.x, -4, -1), str_sub(.x, -10, -9), sep = "_"),
    where(is.numeric)
  )

  id    gender election_2017_05 election_2016_11 election_2016_08
  <chr> <chr>             <dbl>            <dbl>            <dbl>
1 a     m                     1                1                1
2 b     f                     2                2                2
3 c     f                     3                3                3
4 d     m                     4                4                4

CodePudding user response：

Another approach with dplyr but without stringr.

Here, rename_with selects the columns whose names contain /, strsplit splits each name on /, and sapply pastes the pieces back together into a character vector that is used for renaming.


library(dplyr)

df_1 %>%
  rename_with(
    .cols = contains('/'),
    ~ strsplit(.x, '/') %>%
      sapply(
        # each x is c(mm, dd, yyyy): use the year (x[3]) and the month (x[1])
        function(x) paste0('election_', x[3], '_', x[1]),
        simplify = TRUE
      )
  )

Edited to remove as.character calls as explained by @GuedesBF in the comments.