Change column value from character to number or duplicate row in R-CodePudding

I'm having an issue with a variable in some data that I have in R. I have a table that looks something like this,

Variable X	Variable Y
11	`[1]`
15	`[400]`
17	`[1,2]`
21	`[13,14]`

What I want to do, is that for every entry in 'Variable Y', if there is only one number in the square brackets, I want to get rid of the square brackets. If there are more than one number, I want to get rid of the brackets and then duplicate the entire row, just changing the value in 'Variable Y' in each duplicate. I basically want this table to look like this,

Variable X	Variable Y
11	1
15	400
17	1
17	2
21	13
21	14

I've been able to convert the single bracket entries to numbers using the parse number function in the readr package, but it's very slow so I'd like to improve it. I also have no idea how I can get the duplicates based on more than one entry in the square brackets. Any help would be greatly appreciated, thank you.

CodePudding user response：

If you are using tidyverse functions, you can parse the strings as JSON arrays using fromJSON and then use unnest to unfold array items to multiple rows:

library(tidyverse)
library(jsonlite)

d %>%
    mutate(Y_lst = map(Y, fromJSON)) %>%
    unnest(Y_lst)

In your example:

> d
# A tibble: 4 x 2
      X Y      
  <dbl> <chr>  
1    11 [1]    
2    15 [400]  
3    17 [1,2]  
4    21 [13,14]

> d %>% mutate(Y_lst = map(Y, fromJSON)) %>% unnest(Y_lst)
# A tibble: 6 x 3
      X Y       Y_lst
  <dbl> <chr>   <int>
1    11 [1]         1
2    15 [400]     400
3    17 [1,2]       1
4    17 [1,2]       2
5    21 [13,14]    13
6    21 [13,14]    14

CodePudding user response：

You can first remove the square brackets and then separate_rows:

library(dplyr)
library(tidyr)
df %>%
  mutate(Y = gsub("\\[|\\]", "", Y)) %>%
  separate_rows(Y, sep = ",")
# A tibble: 6 × 2
      X Y    
  <dbl> <chr>
1    11 1    
2    15 400  
3    17 1    
4    17 2    
5    21 13   
6    21 14

CodePudding user response：

gsub the brackets away and strsplit at the comma, unlist the nested column then with a type.convert to get numbers.

dat |>
  transform(Y=type.convert(strsplit(gsub('\\[|\\]', '', Y), ','), as.is=TRUE)) |>
  split(x=_, dat$X) |>
  lapply(\(x) data.frame(X=x$X, Y=unlist(x$Y))) |>
  c(make.row.names=FALSE) |>
  do.call(what='rbind')
#    X   Y
# 1 11   1
# 2 15 400
# 3 17   1
# 4 17   2
# 5 21  13
# 6 21  14

Note: R 4.2 used.

Alternatively more classically:

r <- transform(dat, Y=type.convert(strsplit(gsub('\\[|\\]', '', Y), ','), as.is=TRUE))
r <- by(r, r$X, function(x) data.frame(X=x$X, Y=unlist(x$Y)))
do.call(rbind, c(r, make.row.names=FALSE))
#    X   Y
# 1 11   1
# 2 15 400
# 3 17   1
# 4 17   2
# 5 21  13
# 6 21  14

Data:

dat <- structure(list(X = c(11L, 15L, 17L, 21L), Y = c("[1]", "[400]", 
"[1,2]", "[13,14]")), class = "data.frame", row.names = c(NA, 
-4L))