Replacing 2nd and 3rd Colons but Keeping 1st in String in R

Time:05-13

I have a large data set structured like the demo data frame below and need to replace the second and third (but not the first) colons with dashes in the datetime rows. I have tried using various regex constructions with str_replace(), gsub(), and substr() in R, but I cannot figure out how to keep the first colon while replacing the second and third.

# Demo data
df <- data.frame(V1=c(
             "case:   1",
             "myvar2: 36",
             "myvar3: First",
             "datetime: 2018-11-29 02:27:16",
             "case:   2",
             "myvar2: 37",
             "myvar3: Second",
             "datetime: 2018-11-30 04:33:18",
             "case:   3",
             "myvar2: 38",
             "myvar3: Third",
             "datetime: 2018-12-01 15:21:48",            
             "case:   4",            
             "myvar2: 39",
             "myvar3: Fourth",
             "datetime: 2018-12-02 12:27:01"))
df

I'm trying to extend my rudimentary understanding of regex with R and would appreciate guidance on how to solve this problem.

CodePudding user response:

Use str_replace() with capture groups: capture the text before each of the two colons to be changed, then rebuild the string in the replacement, inserting a - between the backreferences to the captured groups.

library(dplyr)
library(stringr)
df %>% 
  mutate(V1 = str_replace(V1,
    '(datetime: \\d{4}-\\d{2}-\\d{2} \\d+):(\\d+):', '\\1-\\2-'))

-output

                              V1
1                      case:   1
2                     myvar2: 36
3                  myvar3: First
4  datetime: 2018-11-29 02-27-16
5                      case:   2
6                     myvar2: 37
7                 myvar3: Second
8  datetime: 2018-11-30 04-33-18
9                      case:   3
10                    myvar2: 38
11                 myvar3: Third
12 datetime: 2018-12-01 15-21-48
13                     case:   4
14                    myvar2: 39
15                myvar3: Fourth
16 datetime: 2018-12-02 12-27-01
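
The capture-group idea can also be tried in base R on a short vector, which makes it easier to see what each group holds. This is only a minimal sketch: the vector x is made up for illustration, and it assumes every datetime line follows the demo's "datetime: YYYY-MM-DD HH:MM:SS" layout.

```r
# Illustrative input (not the full demo data frame)
x <- c("datetime: 2018-11-29 02:27:16", "myvar2: 36")

# Group 1 captures everything up to and including the hour,
# group 2 captures the minutes. The two colons sit outside the
# groups, so they are dropped and re-emitted as dashes.
res <- sub("^(datetime: \\d{4}-\\d{2}-\\d{2} \\d+):(\\d+):",
           "\\1-\\2-", x)
print(res)
```

Lines that do not match the pattern, such as "myvar2: 36", are returned unchanged, so the first colon on every row is left alone.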

CodePudding user response:

1) The colons to be changed are always followed by a digit, while the colons to be kept are not, so we can replace each colon-digit pair with dash-digit. No packages are used.

transform(df, V1 = gsub(":(\\d)", "-\\1", V1))

This variant uses a Perl regex with a zero-width lookahead (see ?regex) to check for a digit in the next position without consuming it, so the digit does not need to be restored in the replacement string.

transform(df, V1 = gsub(":(?=\\d)", "-", V1, perl = TRUE))
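
Both variants rely on the kept colons never being directly followed by a digit. The sketch below checks that the two forms agree, then shows one possible guard for data where that assumption might fail. The "myvar2:36" line (no space after the colon) is a hypothetical input that does not appear in the demo data.

```r
# Hypothetical input: the second element has no space after its colon
x <- c("case:   1", "myvar2:36", "datetime: 2018-11-29 02:27:16")

a <- gsub(":(\\d)", "-\\1", x)              # capture and restore the digit
b <- gsub(":(?=\\d)", "-", x, perl = TRUE)  # lookahead, digit not consumed
same <- identical(a, b)

# Possible guard: only touch rows that start with "datetime:"
is_dt <- startsWith(x, "datetime:")
guarded <- x
guarded[is_dt] <- gsub(":(?=\\d)", "-", x[is_dt], perl = TRUE)
print(guarded)
```

Without the guard, "myvar2:36" would become "myvar2-36". With it, only the datetime row changes.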

2) If a different format for df is preferable, we could use this instead. Insert a newline before each case line to separate the records, then use read.dcf, which returns a character matrix. Convert that to a data.frame and use type.convert to convert the columns that can be numeric. Finally, replace the colons with dashes in the datetime column. No packages are used.

df |>
  unlist() |>
  gsub(pattern = "^case:", replacement = "\ncase:") |>
  textConnection() |>
  read.dcf() |>
  as.data.frame() |>
  type.convert(as.is = TRUE) |>
  transform(datetime = gsub(":", "-", datetime))

giving:

  case myvar2 myvar3            datetime
1    1     36  First 2018-11-29 02-27-16
2    2     37 Second 2018-11-30 04-33-18
3    3     38  Third 2018-12-01 15-21-48
4    4     39 Fourth 2018-12-02 12-27-01
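
To see why the newline matters, it may help to run the first half of that pipeline on a cut-down input. This sketch uses a made-up four-line vector (lines) rather than the full df, and stops at the character matrix that read.dcf returns.

```r
# Two short records in the same "key: value" layout as the demo data
lines <- c("case:   1", "myvar2: 36", "case:   2", "myvar2: 37")

# A blank line before each "case:" marks the start of a new record
marked <- gsub("^case:", "\ncase:", lines)

con <- textConnection(marked)
m <- read.dcf(con)   # one row per record, one column per key
close(con)
print(m)
```

Each blank-line-separated block becomes one row, and the keys before the colons become the column names.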