Home > Blockchain >  `dplyr` way to break variable into multiple columns acording to layout-dictionary (with varname and
`dplyr` way to break variable into multiple columns acording to layout-dictionary (with varname and

Time:11-02

I am looking for a dplyr way to break variable into multiple columns according to dictionary:

vardic <- data.frame(varname=c('a','b','c','d'),
                     length=c(2,6,3,1) ) %>% 
               mutate(end=cumsum(length),start=end-length 1)

d <- data.frame(orig_string=c('11333333444A',
                              '22444444111C',
                              '55666666000B'))

The desired output is:

d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B')

This has to be done using only dplyr commands because this will be implemented via arrow on a larger than memory dataset (asked in this other question)

UPDATE (responding to comments): functions outside dplyr could be used, as long as supported by arrow. arrow's changelog describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:

library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>% 
    sequence_of_arrowsupported_commands_to_split_columns

Update2: added cols indicating start and end position in vardic

Update3: made the arrow pipeline, above, more reproducible. then tested @akrun's solution. But separate is not supported by arrow

CodePudding user response:

Not sure, as others, what exactly you mean by 'only' dplyr. If by that you mean only tidyverse, here's a solution that relies on dplyr, tidyr and stringr, and which I'm sure can be refined by others esp. as regards the column names conversion:

library(dplyr)
library(stringr)
library(tidyr)
d %>%
  mutate(orig_string = str_extract_all(orig_string, "(.)\\1 (?!\\1)|[A-Z]$")) %>%
  unnest_wider(orig_string) %>%
  rename(a =...1, b = ...2, c = ...3, d = ...4)
# A tibble: 3 × 4
  a     b      c     d    
  <chr> <chr>  <chr> <chr>
1 11    333333 444   A    
2 22    444444 111   C    
3 55    666666 000   B  

CodePudding user response:

This is all base R with the exception of dplyr::lag().

stops <- cumsum(vardic$length)
starts <- dplyr::lag(stops, default = 0)   1

# instantiate d2 with nrow(d) rows and 0 columns
d2 <- d
d2$orig_string <- NULL

for (i in seq(to = nrow(vardic))) {
  d2[[vardic$varname[[i]]]] <- substr(d$orig_string, starts[[i]], stops[[i]])
}

d2
   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B

If you can use other tidyverse packages, here is a solution using purrr:pmap_dfc():

library(dplyr)
library(purrr)
library(stringr)

vardic <- vardic %>% 
  mutate(
    end = cumsum(length),
    start = dplyr::lag(end, default = 0)   1,
    length = NULL
  )

pmap_dfc(vardic, \(varname, start, end) tibble(
  !!varname := str_sub(d$orig_string, start = start, end = end)
))
# A tibble: 3 × 4
  a     b      c     d    
  <chr> <chr>  <chr> <chr>
1 11    333333 444   A    
2 22    444444 111   C    
3 55    666666 000   B    

CodePudding user response:

Using base R (without any exceptions)

read.fwf(textConnection(d$orig_string), widths = c(2, 6, 3, 1), 
    col.names = letters[1:4], colClasses = "character")

-output

   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B

Or with separate

library(tidyr)
separate(d, orig_string, into = vardic$varname, sep = head(vardic$end, -1))

-output

   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
  • Related