How to separate numbers (including a dot decimal separator) from letters with a tidyr::separate
regex? In my current attempts, the first letter of the second string gets chopped off.
Reprex:
df <- data.frame(x = c("24.1234AAA", "14.4321BBB"))
df
#> x
#> 1 24.1234AAA
#> 2 14.4321BBB
# This works but it is missing the first letter of the string
tidyr::separate(df, x, c("part1", "part2"), sep = "[^0-9 | {.}]", extra = "merge", convert = TRUE)
#> part1 part2
#> 1 24.1234 AA
#> 2 14.4321 BB
# This gets the letter string completely, but not the numbers
tidyr::separate(df, x, c("part1", "part2"), sep = "([0-9.]+)", extra = "merge", convert = TRUE)
#> part1 part2
#> 1 NA AAA
#> 2 NA BBB
Created on 2022-12-31 with reprex v2.0.2
Note: the numbers and letters are not always the same length, so we cannot use a numeric vector for the sep
argument of tidyr::separate.
Answer:
Use a regex lookaround to split at the position between a digit (\\d) and a letter ([A-Z]):
tidyr::separate(df, x, c("part1", "part2"),
sep = "(?<=\\d)(?=[A-Z])", extra = "merge", convert = TRUE)
-output
part1 part2
1 24.1234 AAA
2 14.4321 BBB
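To see why the lookaround keeps every letter while a character-class separator drops one, it can help to check how many characters each pattern actually matches. The following is only a base R sketch for illustration: the variable names are made up, and "[^0-9.]" is a simplified stand-in for the separator used in the question.

```r
# Sketch: compare what each separator pattern matches (base R only).
x <- c("24.1234AAA", "14.4321BBB")

# A character class matches one real character, and separate() discards
# whatever the separator matched, so the first letter is lost.
consuming <- regexpr("[^0-9.]", x)
attr(consuming, "match.length")
#> [1] 1 1

# The lookbehind/lookahead pair matches the empty position between the
# last digit and the first letter, so nothing is discarded.
zero_width <- regexpr("(?<=\\d)(?=[A-Z])", x, perl = TRUE)
attr(zero_width, "match.length")
#> [1] 0 0
```

Both patterns match at the same place (the eighth character), but only the first one consumes it.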
Or use extract with capture groups:
tidyr::extract(df, x, c("part1", "part2"), "^([0-9.]+)(\\D+)", convert = TRUE)
part1 part2
1 24.1234 AAA
2 14.4321 BBB
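Because the numbers and letters can vary in length, here is a small sketch of the same capture-group idea in base R, run on strings of different lengths. The sample vector and the names pattern and parts are hypothetical, and as.numeric() is used here as a rough stand-in for convert = TRUE.

```r
# Sketch: the same two capture groups, applied with base R's sub().
# The input values below are made up to show varying lengths.
x <- c("24.1234AAA", "7.5BBBBB", "100C")

# Group 1: leading run of digits and dots; group 2: the non-digit remainder.
pattern <- "^([0-9.]+)(\\D+)$"

parts <- data.frame(
  part1 = as.numeric(sub(pattern, "\\1", x)),  # numeric part
  part2 = sub(pattern, "\\2", x),              # letter part, kept whole
  stringsAsFactors = FALSE
)
parts
```

Note that [0-9.]+ would also accept strings with several dots (such as "1.2.3"), so a stricter pattern may be needed if the input is not known to be clean.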