I have a large data set structured like the demo data frame below and need to replace the second and third (but not the first) colons with dashes in the datetime rows. I have tried using various regex constructions with str_replace(), gsub(), and substr() in R, but I cannot figure out how to keep the first colon while replacing the second and third.
# Demo data
df <- data.frame(V1=c(
"case: 1",
"myvar2: 36",
"myvar3: First",
"datetime: 2018-11-29 02:27:16",
"case: 2",
"myvar2: 37",
"myvar3: Second",
"datetime: 2018-11-30 04:33:18",
"case: 3",
"myvar2: 38",
"myvar3: Third",
"datetime: 2018-12-01 15:21:48",
"case: 4",
"myvar2: 39",
"myvar3: Fourth",
"datetime: 2018-12-02 12:27:01"))
df
I'm trying to extend my rudimentary understanding of regex with R and would appreciate guidance on how to solve this problem.
CodePudding user response:
Use str_replace with capture groups, then build the replacement by inserting a - between the backreferences to the captured groups.
library(dplyr)
library(stringr)
df %>%
  mutate(V1 = str_replace(V1,
    '(datetime: \\d{4}-\\d{2}-\\d{2} \\d+):(\\d+):', '\\1-\\2-'))
-output
V1
1 case: 1
2 myvar2: 36
3 myvar3: First
4 datetime: 2018-11-29 02-27-16
5 case: 2
6 myvar2: 37
7 myvar3: Second
8 datetime: 2018-11-30 04-33-18
9 case: 3
10 myvar2: 38
11 myvar3: Third
12 datetime: 2018-12-01 15-21-48
13 case: 4
14 myvar2: 39
15 myvar3: Fourth
16 datetime: 2018-12-02 12-27-01
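The same capture-group idea can be tried on a couple of strings without any packages. This is only a sketch using base R's sub(); the vector x and the leading ^ anchor are illustrative assumptions, not part of the answer above.

```r
# Two sample lines in the same shape as the demo data (illustrative only)
x <- c("myvar3: First", "datetime: 2018-11-29 02:27:16")

# Group 1 captures everything up to and including the hour,
# group 2 captures the minutes. The two colons sit outside the groups,
# so the replacement writes a dash in their place.
pattern <- "^(datetime: \\d{4}-\\d{2}-\\d{2} \\d+):(\\d+):"
result <- sub(pattern, "\\1-\\2-", x)

print(result)
```

Lines that do not match the pattern, such as "myvar3: First", are returned unchanged, which is why the first colon after datetime survives: it is part of group 1 and is copied back by \\1.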
CodePudding user response:
1) Since the colons that are to be changed are always followed by a digit and the colons that are to be kept are not, we can replace colon-digit with dash-digit. No packages are used.
transform(df, V1 = gsub(":(\\d)", "-\\1", V1))
This variant uses a Perl regex with a zero-width lookahead (see ?regex) to check for a digit in the next position without consuming it, so the digit does not have to be put back by the replacement string.
transform(df, V1 = gsub(":(?=\\d)", "-", V1, perl = TRUE))
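As a quick check, both variants can be compared on a few sample strings. This is a sketch; the vector x and the string tricky are made-up inputs used only to show where the assumption "a colon to be changed is always followed by a digit" holds and where it would not.

```r
# Sample lines shaped like the demo data (illustrative only)
x <- c("case: 1", "myvar3: First", "datetime: 2018-11-29 02:27:16")

# Variant with a capture group: the digit is matched, then restored via \\1
with_group <- gsub(":(\\d)", "-\\1", x)

# Variant with a lookahead: the digit is checked but not consumed
with_lookahead <- gsub(":(?=\\d)", "-", x, perl = TRUE)

print(with_group)
print(identical(with_group, with_lookahead))

# Hypothetical value with a colon directly followed by a digit outside a
# datetime line: it is changed as well, so this approach assumes no such values
tricky <- gsub(":(\\d)", "-\\1", "note: ratio 3:1")
print(tricky)
```

In "case: 1" the colon is followed by a space rather than a digit, so it is left alone by both patterns.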
2) If a different format for df is preferable, we could use the following. Insert a newline before each case line to separate the records, then use read.dcf, which returns a character matrix. Convert that to a data.frame and use type.convert to convert the columns that can be numeric. Finally, replace the colons with dashes in the datetime column. No packages are used.
df |>
  unlist() |>
  gsub(pattern = "^case:", replacement = "\ncase:") |>
  textConnection() |>
  read.dcf() |>
  as.data.frame() |>
  type.convert(as.is = TRUE) |>
  transform(datetime = gsub(":", "-", datetime))
giving:
  case myvar2 myvar3            datetime
1    1     36  First 2018-11-29 02-27-16
2    2     37 Second 2018-11-30 04-33-18
3    3     38  Third 2018-12-01 15-21-48
4    4     39 Fourth 2018-12-02 12-27-01
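To see why the blank line before each case matters, the read.dcf step can be looked at in isolation. This is a minimal sketch with a hand-written vector (the name lines and its contents are assumptions for illustration) in which the blank separator is already present.

```r
# "key: value" lines with an empty string separating the two records
lines <- c("case: 1", "myvar2: 36", "",
           "case: 2", "myvar2: 37")

# read.dcf treats each blank-line-separated block as one record and
# each "key: value" line as a field, returning a character matrix
con <- textConnection(lines)
m <- read.dcf(con)
close(con)

print(m)
```

Each record becomes one row and each key becomes a column, which is the shape the pipeline above then passes to as.data.frame() and type.convert().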