I am looking for a dplyr
way to break variable into multiple columns according to dictionary:
vardic <- data.frame(varname=c('a','b','c','d'),
length=c(2,6,3,1) ) %>%
mutate(end=cumsum(length),start=end-length 1)
d <- data.frame(orig_string=c('11333333444A',
'22444444111C',
'55666666000B'))
The desired output is:
d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B')
This has to be done using only dplyr commands because this will be implemented via arrow on a larger than memory dataset (asked in this other question)
UPDATE (responding to comments): functions outside dplyr
could be used, as long as supported by arrow. arrow's changelog describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:
library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>%
sequence_of_arrowsupported_commands_to_split_columns
Update2: added cols indicating start
and end
position in vardic
Update3: made the arrow pipeline, above, more reproducible. then tested @akrun's solution. But separate
is not supported by arrow
CodePudding user response:
Not sure, as others, what exactly you mean by 'only' dplyr
. If by that you mean only tidyverse
, here's a solution that relies on dplyr
, tidyr
and stringr
, and which I'm sure can be refined by others esp. as regards the column names conversion:
library(dplyr)
library(stringr)
library(tidyr)
d %>%
mutate(orig_string = str_extract_all(orig_string, "(.)\\1 (?!\\1)|[A-Z]$")) %>%
unnest_wider(orig_string) %>%
rename(a =...1, b = ...2, c = ...3, d = ...4)
# A tibble: 3 × 4
a b c d
<chr> <chr> <chr> <chr>
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
CodePudding user response:
This is all base R with the exception of dplyr::lag()
.
stops <- cumsum(vardic$length)
starts <- dplyr::lag(stops, default = 0) 1
# instantiate d2 with nrow(d) rows and 0 columns
d2 <- d
d2$orig_string <- NULL
for (i in seq(to = nrow(vardic))) {
d2[[vardic$varname[[i]]]] <- substr(d$orig_string, starts[[i]], stops[[i]])
}
d2
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
If you can use other tidyverse packages, here is a solution using purrr:pmap_dfc()
:
library(dplyr)
library(purrr)
library(stringr)
vardic <- vardic %>%
mutate(
end = cumsum(length),
start = dplyr::lag(end, default = 0) 1,
length = NULL
)
pmap_dfc(vardic, \(varname, start, end) tibble(
!!varname := str_sub(d$orig_string, start = start, end = end)
))
# A tibble: 3 × 4
a b c d
<chr> <chr> <chr> <chr>
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
CodePudding user response:
Using base R
(without any exceptions)
read.fwf(textConnection(d$orig_string), widths = c(2, 6, 3, 1),
col.names = letters[1:4], colClasses = "character")
-output
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
Or with separate
library(tidyr)
separate(d, orig_string, into = vardic$varname, sep = head(vardic$end, -1))
-output
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B