Rename values in a single column that end with the same string

I have several large data frames that each contain one column (call it timeperiod) whose values are text strings. All of the values end in specific strings (like V.1to2 or V.2to3), but the beginnings differ. I want every value with the same ending to be replaced by the same new value. Here is an example:

With a data frame like this:

df <- data.frame (Location  = c("a","b","c","d","e","f","g","h"),
                   timeperiod = c("A.V.1to2", "D.V.1to2", "A.V.1to2","D.V.2to3","A.V.3to4","H.V.3to4","A.V.4to5","D.V.4to5"))

Looking like this:

  Location timeperiod
1        a   A.V.1to2
2        b   D.V.1to2
3        c   A.V.1to2
4        d   D.V.2to3
5        e   A.V.3to4
6        f   H.V.3to4
7        g   A.V.4to5
8        h   D.V.4to5

My expected/hoped for output would look like this:

df2
  Location timeperiod
1        a          1
2        b          1
3        c          1
4        d          2
5        e          3
6        f          3
7        g          4
8        h          4

df2 <- data.frame (Location  = c("a","b","c","d","e","f","g","h"),
                  timeperiod = c(1, 1, 1, 2, 3, 3, 4, 4))

I know about:

df$timeperiod[df$timeperiod =="A.V.1to2"] <- "1"

But because of the size of my data set, and because I need to repeat this for multiple data frames whose timeperiod prefixes are not consistent, I would like to use something like this with dplyr:

library(dplyr)
df$timeperiod <- revalue(df$timeperiod, c(ends_with(V.1to2)="1"))
df$timeperiod <- revalue(df$timeperiod, c(ends_with(V.2to3)="2"))
#etc..

That way I could repeat the process over many different values and across many different sheets. This doesn't work, though, and even if it did it seems inefficient, so any solution that is faster than renaming every specific value would be sufficient.

Thanks for any help.

CodePudding user response：

We could use str_extract:

library(dplyr)
library(stringr)

df %>% 
  mutate(timeperiod = str_extract(timeperiod, '\\d+'))

  Location timeperiod
1        a          1
2        b          1
3        c          1
4        d          2
5        e          3
6        f          3
7        g          4
8        h          4
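
The question asks for a numeric column and for something that can be reused across several data frames, so here is a minimal sketch of how the same idea might be wrapped up. The helper name extract_period and the two small data frames are made up for illustration, and the lookahead pattern assumes every value ends in "<digits>to<digits>"; it is anchored to the end so that digits in a prefix (like "X9") are not picked up by mistake.

```r
library(stringr)

# Hypothetical helper (not from the original answer): return the number that
# starts the trailing "<n>to<m>" suffix, as an integer.
extract_period <- function(x) {
  as.integer(str_extract(x, "\\d+(?=to\\d+$)"))
}

# Made-up example data frames with inconsistent prefixes.
df_a <- data.frame(Location = c("a", "b"),
                   timeperiod = c("A.V.1to2", "D.V.2to3"))
df_b <- data.frame(Location = c("c", "d"),
                   timeperiod = c("H.V.3to4", "X9.V.4to5"))

# Apply the same conversion to every data frame in a named list.
dfs <- list(a = df_a, b = df_b)
dfs <- lapply(dfs, function(d) {
  d$timeperiod <- extract_period(d$timeperiod)
  d
})
```

Keeping the data frames in a list means one lapply call covers all of them, instead of repeating the assignment per data frame.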

CodePudding user response：

We can use dplyr and stringr. First extract the last 6 characters of timeperiod, then group_by timeperiod, and finally use cur_group_id() to number the groups. Note that this assumes every suffix is exactly 6 characters long.

library(dplyr)
library(stringr)

df %>%
  mutate(timeperiod = str_extract(timeperiod, '.{6}$')) %>%
  group_by(timeperiod) %>%
  mutate(timeperiod = cur_group_id()) %>%
  ungroup()

# A tibble: 8 × 2
  Location timeperiod
  <chr>         <int>
1 a                 1
2 b                 1
3 c                 1
4 d                 2
5 e                 3
6 f                 3
7 g                 4
8 h                 4
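
One thing to watch with group ids: they follow the sorted order of the suffixes that actually occur, so they line up with the period numbers only when no period is missing from the data. If the codes need to be fixed regardless of which suffixes are present, an explicit lookup may be safer. Below is a rough base R sketch; the lookup vector and its codes are assumptions for illustration, and the pattern assumes the suffix always has the form V.<digits>to<digits>.

```r
# Suffix-to-code lookup; names are the suffixes, values are the new codes.
# (Illustrative values, adjust to whatever codes you actually need.)
lookup <- c("V.1to2" = 1, "V.2to3" = 2, "V.3to4" = 3, "V.4to5" = 4)

# Made-up data with no "V.2to3" rows, which would shift group-based ids.
df <- data.frame(
  Location   = c("a", "b", "c"),
  timeperiod = c("A.V.1to2", "D.V.4to5", "H.V.3to4")
)

# Drop whatever precedes the suffix, then look each suffix up by name.
suffix <- sub("^.*(V\\.\\d+to\\d+)$", "\\1", df$timeperiod)
df$timeperiod <- unname(lookup[suffix])
```

A suffix that is not in the lookup comes back as NA, which makes unexpected values easy to spot.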

CodePudding user response：

Maybe this is what you are looking for. In base R, strip all letters and punctuation, then keep the first remaining digit:

df <- data.frame (Location  = c("a","b","c","d","e","f","g","h"),
              timeperiod = c("A.V.1to2", "D.V.1to2", "A.V.1to2","D.V.2to3","A.V.3to4","H.V.3to4","A.V.4to5","D.V.4to5"))

df$timeperiod <- substr(gsub('[[:alpha:]]|[[:punct:]]', '', df$timeperiod), 1, 1)

df

  Location timeperiod
1        a          1
2        b          1
3        c          1
4        d          2
5        e          3
6        f          3
7        g          4
8        h          4
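
Taking the first remaining digit works for the example, but it would probably break if a period number had two digits or if a prefix contained a digit. A possible base R variant, assuming every value ends in ".<digits>to<digits>", captures the number directly (the input vector below is made up to show those cases):

```r
# Made-up values, including a two-digit period and a prefix with a digit.
timeperiod <- c("A.V.1to2", "D.V.2to3", "H.V.10to11", "B2.V.3to4")

# Capture the digits between the last "." and "to", then convert to integer.
period <- as.integer(sub("^.*\\.(\\d+)to\\d+$", "\\1", timeperiod))
```

Since sub() returns the input unchanged when the pattern does not match, a value with an unexpected format would turn into NA (with a warning) at the as.integer() step rather than silently getting a wrong code.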