I'm having an issue with a variable in some data that I have in R. I have a table that looks something like this,
Variable X | Variable Y |
---|---|
11 | [1] |
15 | [400] |
17 | [1,2] |
21 | [13,14] |
What I want to do, is that for every entry in 'Variable Y', if there is only one number in the square brackets, I want to get rid of the square brackets. If there are more than one number, I want to get rid of the brackets and then duplicate the entire row, just changing the value in 'Variable Y' in each duplicate. I basically want this table to look like this,
Variable X | Variable Y |
---|---|
11 | 1 |
15 | 400 |
17 | 1 |
17 | 2 |
21 | 13 |
21 | 14 |
I've been able to convert the single bracket entries to numbers using the parse number function in the readr package, but it's very slow so I'd like to improve it. I also have no idea how I can get the duplicates based on more than one entry in the square brackets. Any help would be greatly appreciated, thank you.
CodePudding user response:
If you are using tidyverse
functions, you can parse the strings as JSON arrays using fromJSON
and then use unnest
to unfold array items to multiple rows:
library(tidyverse)
library(jsonlite)
d %>%
mutate(Y_lst = map(Y, fromJSON)) %>%
unnest(Y_lst)
In your example:
> d
# A tibble: 4 x 2
X Y
<dbl> <chr>
1 11 [1]
2 15 [400]
3 17 [1,2]
4 21 [13,14]
> d %>% mutate(Y_lst = map(Y, fromJSON)) %>% unnest(Y_lst)
# A tibble: 6 x 3
X Y Y_lst
<dbl> <chr> <int>
1 11 [1] 1
2 15 [400] 400
3 17 [1,2] 1
4 17 [1,2] 2
5 21 [13,14] 13
6 21 [13,14] 14
CodePudding user response:
You can first remove the square brackets and then separate_rows
:
library(dplyr)
library(tidyr)
df %>%
mutate(Y = gsub("\\[|\\]", "", Y)) %>%
separate_rows(Y, sep = ",")
# A tibble: 6 × 2
X Y
<dbl> <chr>
1 11 1
2 15 400
3 17 1
4 17 2
5 21 13
6 21 14
CodePudding user response:
gsub
the brackets away and strsplit
at the comma, unlist
the nested column then with a type.convert
to get numbers.
dat |>
transform(Y=type.convert(strsplit(gsub('\\[|\\]', '', Y), ','), as.is=TRUE)) |>
split(x=_, dat$X) |>
lapply(\(x) data.frame(X=x$X, Y=unlist(x$Y))) |>
c(make.row.names=FALSE) |>
do.call(what='rbind')
# X Y
# 1 11 1
# 2 15 400
# 3 17 1
# 4 17 2
# 5 21 13
# 6 21 14
Note: R 4.2 used.
Alternatively more classically:
r <- transform(dat, Y=type.convert(strsplit(gsub('\\[|\\]', '', Y), ','), as.is=TRUE))
r <- by(r, r$X, function(x) data.frame(X=x$X, Y=unlist(x$Y)))
do.call(rbind, c(r, make.row.names=FALSE))
# X Y
# 1 11 1
# 2 15 400
# 3 17 1
# 4 17 2
# 5 21 13
# 6 21 14
Data:
dat <- structure(list(X = c(11L, 15L, 17L, 21L), Y = c("[1]", "[400]",
"[1,2]", "[13,14]")), class = "data.frame", row.names = c(NA,
-4L))