I have comma-separated data that originally consisted of two variables, one numerical and one categorical. The numerical variable uses a comma as its decimal separator. Here is a sample:
-5,50,D
-5,50,S
0,00,T
-5,50,S
-5,28,S
-5,25,C
As the sample shows, splitting the file on commas gives a dataset of three columns when there should be only two. The result I want is:
-5.50,D
-5.50,S
0.00,T
-5.50,S
-5.28,S
-5.25,C
I think a regex would be the best way to do this. Any suggestions for the code?
CodePudding user response:
Since you mentioned "columns," I assume this is a column in a dataframe? If so, you can use tidyr::extract():
library(tidyr)
extract(dat, x, into = c("num", "char"), "(-?\\d*,\\d*),(\\w*)")
num char
1 -5,50 D
2 -5,50 S
3 0,00 T
4 -5,50 S
5 -5,28 S
6 -5,25 C
Example data:
dat <- data.frame(
  x = c("-5,50,D", "-5,50,S", "0,00,T", "-5,50,S", "-5,28,S", "-5,25,C")
)
CodePudding user response:
Here is another option: replace the "," between digits with "." and then separate the columns on the remaining comma.
library(tidyverse)
dat |>
  mutate(x = sub("(.*)(?<=\\d),(?=\\d)(.*?$)", "\\1.\\2", x, perl = TRUE)) |>
  separate(x, into = c("num", "char"), sep = ",")
#> num char
#> 1 -5.50 D
#> 2 -5.50 S
#> 3 0.00 T
#> 4 -5.50 S
#> 5 -5.28 S
#> 6 -5.25 C
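The question describes the first variable as numerical, while the result above still holds it as text. A minimal base R sketch of the same replace-then-split idea that also converts the column with as.numeric() could look like this. The names dotted, parts and res are only illustrative, and the data frame is the example data from this thread:

```r
# Example data as in the thread: one string column holding both variables
dat <- data.frame(
  x = c("-5,50,D", "-5,50,S", "0,00,T", "-5,50,S", "-5,28,S", "-5,25,C")
)

# Turn only the comma that sits between two digits into a decimal point
dotted <- sub("(?<=\\d),(?=\\d)", ".", dat$x, perl = TRUE)

# Split on the one remaining comma; each element is c(number, category)
parts <- strsplit(dotted, ",", fixed = TRUE)

# Build the two-column result, converting the first piece to numeric
res <- data.frame(
  num  = as.numeric(vapply(parts, `[`, character(1), 1)),
  char = vapply(parts, `[`, character(1), 2)
)
str(res)
```

This assumes every row has exactly one decimal comma followed by one separating comma; rows that deviate from that shape would need separate handling.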
CodePudding user response:
library(tidyr)
library(dplyr)  # mutate() comes from dplyr
dat %>%
  # extract into two columns:
  extract(x,
          into = c("num", "char"),
          regex = "(.*),(.*)") %>%
  # change "," to ".":
  mutate(num = sub(",", ".", num))
num char
1 -5.50 D
2 -5.50 S
3 0.00 T
4 -5.50 S
5 -5.28 S
6 -5.25 C
Here, the regex used is maximally frugal in that it simply splits the strings into two capture groups at the last comma (the first comma is matched by .* in the first capture group, because * is greedy).
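To see why the split lands on the last comma, here is a small base R sketch that compares the greedy pattern with a lazy variant. The vector x and the variable names are illustrative only; the lazy pattern is added just for contrast and is not part of the answer above:

```r
x <- c("-5,50,D", "0,00,T")

# Greedy first group: .* takes as much as it can, so the split is at the LAST comma
greedy_num  <- sub("(.*),(.*)", "\\1", x)   # "-5,50" "0,00"
greedy_char <- sub("(.*),(.*)", "\\2", x)   # "D" "T"

# Lazy first group, for contrast: .*? stops at the FIRST comma
lazy_num <- sub("(.*?),(.*)", "\\1", x, perl = TRUE)   # "-5" "0"
```

With the lazy version the decimal part would end up in the second group, which is why the greedy form is the one that fits this data.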
Data (thanks to zephryl):
dat <- data.frame(
  x = c("-5,50,D", "-5,50,S", "0,00,T", "-5,50,S", "-5,28,S", "-5,25,C")
)