How to separate numbers (including dot decimal separator) from letters in `tidyr::separate` regex?

Time:01-01

How to separate numbers (including dot decimal separator) from letters in tidyr::separate regex? In my current attempts, it seems the first letter of the second string is getting chopped off.

Reprex:

df <- data.frame(x = c("24.1234AAA", "14.4321BBB"))
df
#>            x
#> 1 24.1234AAA
#> 2 14.4321BBB

# This works but it is missing the first letter of the string
tidyr::separate(df, x, c("part1", "part2"), sep = "[^0-9 | {.}]", extra = "merge", convert = TRUE)
#>     part1 part2
#> 1 24.1234    AA
#> 2 14.4321    BB

# This gets the letter string completely, but not the numbers
tidyr::separate(df, x, c("part1", "part2"), sep = "([0-9.]+)", extra = "merge", convert = TRUE)
#>   part1 part2
#> 1    NA   AAA
#> 2    NA   BBB

Created on 2022-12-31 with reprex v2.0.2

Note: the numbers and letters are not always the same length so we cannot use a numeric vector for the sep argument of tidyr::separate.

Answer:

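Use a regex lookaround to split at the position between a digit (\\d) and an uppercase letter ([A-Z]). The lookaround matches a position rather than a character, so nothing is removed from either part: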

tidyr::separate(df, x, c("part1", "part2"), 
    sep = "(?<=\\d)(?=[A-Z])", extra = "merge", convert = TRUE)

-output

    part1 part2
1 24.1234   AAA
2 14.4321   BBB
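
To see why the original separator drops a letter while the lookaround does not, here is a minimal base R sketch (no tidyr needed) that compares what each pattern matches. The variable names are illustrative only. The negated class "[^0-9 | {.}]" matches the first character that is not a digit, space, |, {, . or }, which is the first letter, and separate() discards whatever the separator matches.

```r
x <- c("24.1234AAA", "14.4321BBB")

# Original separator: a negated character class that consumes one character.
consuming <- regexpr("[^0-9 | {.}]", x)

# Lookaround separator: matches a position and consumes nothing.
zero_width <- regexpr("(?<=\\d)(?=[A-Z])", x, perl = TRUE)

consumed_chars <- regmatches(x, consuming)         # "A" "B"
consuming_len  <- attr(consuming, "match.length")  # 1 1
zero_width_pos <- as.integer(zero_width)           # 8 8
zero_width_len <- attr(zero_width, "match.length") # 0 0
```

Both patterns find the same split point (position 8), but only the lookaround has match length 0, so the first letter stays in part2.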

Or use extract with capture groups, one for the run of digits and dots and one for the run of non-digits:

tidyr::extract(df, x, c("part1", "part2"), "^([0-9.]+)(\\D+)", convert = TRUE)
    part1 part2
1 24.1234   AAA
2 14.4321   BBB
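
If tidyr is not available, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration under the assumption that every value matches the pattern; the names m, parts and result are made up for the example.

```r
x <- c("24.1234AAA", "14.4321BBB")

# regexec() records the full match plus each capture group;
# regmatches() then returns them as one character vector per input.
m <- regmatches(x, regexec("^([0-9.]+)(\\D+)", x))

# Stack the per-row vectors into a matrix:
# column 1 = full match, column 2 = numbers, column 3 = letters.
parts <- do.call(rbind, m)

result <- data.frame(
  part1 = as.numeric(parts[, 2]),  # convert the numeric part, like convert = TRUE
  part2 = parts[, 3]
)
result
```

An input that does not match the pattern yields an empty vector from regmatches(), so real data would need a check before the rbind step.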