I have a large dataframe in which some columns contain long strings of comma-separated numerical data of unequal lengths, here the columns A
, B
, and C
:
df <- data.frame(
id = 1:3,
A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
)
I need to calculate for this numerical data in A
, B
, and C
the percentage change. I take it that to facilitate this, I need to separate each number into its own row using separate_rows
. But how can I do this step for all three columns A
, B
, and C
in one go?
All I can do is do it column by column - first for A
, then for B
, and finally for C
:
library(tidyverse)
df %>%
# Step 1 - column `A`:
separate_rows(A, sep = ",", convert = TRUE) %>%
mutate(A_0 = lag((lead(A)-A)/A*100)) %>%
group_by(id) %>%
summarise(across(c(B,C), first),
A = paste0(A, collapse = ", "),
A_0 = paste0(A_0, collapse = ", ")
) %>%
ungroup() %>%
# Step 2 - column `B`:
separate_rows(B, sep = ",", convert = TRUE) %>%
mutate(B_0 = lag((lead(B)-B)/B*100)) %>%
group_by(id) %>%
summarise(across(c(A,A_0,C), first),
B = paste0(B, collapse = ", "),
B_0 = paste0(B_0, collapse = ", ")
) %>%
ungroup() %>%
# Step 3 - column `C`:
separate_rows(C, sep = ",", convert = TRUE) %>%
mutate(C_0 = lag((lead(C)-C)/C*100)) %>%
group_by(id) %>%
summarise(across(c(A,A_0,B,B_0), first),
C = paste0(C, collapse = ", "),
C_0 = paste0(C_0, collapse = ", ")
)
# A tibble: 3 × 7
id A A_0 B B_0 C C_0
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 200, 100, 80, 100 NA, -50, -20, 25 0.1, 0.… NA, 100, 50 700.1… NA, 0.12…
2 2 120, 210, 220 20, 75, 4.76190476190476 0.2, 0.… -33.33333333… 702.5… 0.471967…
3 3 170, 200 -22.7272727272727, 17.6470588235294 0.55, 0… -38.88888888… 705.4… 0.355669…
Is there a better way?
CodePudding user response:
We may loop across
the columns, split at the ,
followed by one or more spaces (\\s
), loop over the list
with map
, convert to numeric
, get the lag
of the difference between lead
and the current value proportion, pasted
(toString
) return as a character vector (_chr
), and if needed order the columns within select
library(dplyr)
library(purrr)
df %>%
mutate(across(A:C, ~ {
map_chr(strsplit(.x, ",\\s "), ~ {
tmp <- as.numeric(.x)
toString(lag((lead(tmp)- tmp)/tmp *100))})
}, .names = "{.col}_0")) %>%
select(id, gtools::mixedsort(names(.)[-1]))
-output
id A A_0 B B_0 C
1 1 200, 100, 80, 100 NA, -50, -20, 25 0.1, 0.2, 0.3 NA, 100, 50 700.1, 701.0, 699.2
2 2 120, 210, 220 NA, 75, 4.76190476190476 0.2, 0.3, 1.0, 0.4, 0.9 NA, 50, 233.333333333333, -60, 125 702.5, 702.9
3 3 170, 200 NA, 17.6470588235294 0.55, 0.77, 0.99, 0.35 NA, 40, 28.5714285714286, -64.6464646464647 705.4, 705.4, 706.0
C_0
1 NA, 0.128553063848018, -0.256776034236798
2 NA, 0.0569395017793562
3 NA, 0, 0.0850581230507546
CodePudding user response:
Something like this can get you started
library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.1.3
#> Warning: package 'readr' was built under R version 4.1.3
#> Warning: package 'dplyr' was built under R version 4.1.3
df_example <- data.frame(
id = 1:3,
A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
)
df_example |>
as_tibble() |>
mutate(across(-id,
.names = "{.col}_0",
\(x) x |>
str_split(',') |>
map(as.numeric) |>
map(\(x) lag((lead(x)-x)/x*100)) |>
map_chr(toString))
)
#> # A tibble: 3 x 7
#> id A B C A_0 B_0 C_0
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 200, 100, 80, 100 0.1, 0.2, 0.3 700.1, 701.~ NA, ~ NA, ~ NA, ~
#> 2 2 120, 210, 220 0.2, 0.3, 1.0, 0.4, 0.9 702.5, 702.9 NA, ~ NA, ~ NA, ~
#> 3 3 170, 200 0.55, 0.77, 0.99, 0.35 705.4, 705.~ NA, ~ NA, ~ NA, ~
Created on 2022-03-28 by the reprex package (v2.0.1)