Home > front end >  Do mathematical operation on multiple columns with comma-separated strings of numbers of unequal len
Do mathematical operation on multiple columns with comma-separated strings of numbers of unequal len

Time:03-29

I have a large dataframe in which some columns contain long strings of comma-separated numerical data of unequal lengths, here the columns A, B, and C:

df <- data.frame(
  id = 1:3,
  A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
  B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
  C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
) 

I need to calculate for this numerical data in A, B, and C the percentage change. I take it that to facilitate this, I need to separate each number into its own row using separate_rows. But how can I do this step for all three columns A, B, and C in one go?

All I can do is do it column by column - first for A, then for B, and finally for C:

library(tidyverse)
df %>%
  # Step 1 - column `A`:
  separate_rows(A, sep = ",", convert = TRUE) %>%
  mutate(A_0 = lag((lead(A)-A)/A*100)) %>%
  group_by(id) %>%
  summarise(across(c(B,C), first),
    A = paste0(A, collapse = ", "),       
    A_0 = paste0(A_0, collapse = ", ")
    ) %>%
  ungroup() %>%
  # Step 2 - column `B`:
  separate_rows(B, sep = ",", convert = TRUE) %>%
  mutate(B_0 = lag((lead(B)-B)/B*100)) %>%
  group_by(id) %>%
  summarise(across(c(A,A_0,C), first),
            B = paste0(B, collapse = ", "),       
            B_0 = paste0(B_0, collapse = ", ")
            ) %>%
  ungroup() %>%
  # Step 3 - column `C`:
  separate_rows(C, sep = ",", convert = TRUE) %>%
  mutate(C_0 = lag((lead(C)-C)/C*100)) %>%
  group_by(id) %>%
  summarise(across(c(A,A_0,B,B_0), first),
            C = paste0(C, collapse = ", "),       
            C_0 = paste0(C_0, collapse = ", ")
  )
# A tibble: 3 × 7
     id A                 A_0                                 B        B_0           C      C_0      
  <int> <chr>             <chr>                               <chr>    <chr>         <chr>  <chr>    
1     1 200, 100, 80, 100 NA, -50, -20, 25                    0.1, 0.… NA, 100, 50   700.1… NA, 0.12…
2     2 120, 210, 220     20, 75, 4.76190476190476            0.2, 0.… -33.33333333… 702.5… 0.471967…
3     3 170, 200          -22.7272727272727, 17.6470588235294 0.55, 0… -38.88888888… 705.4… 0.355669…

Is there a better way?

CodePudding user response:

We may loop across the columns, split at the , followed by one or more spaces (\\s ), loop over the list with map, convert to numeric, get the lag of the difference between lead and the current value proportion, pasted (toString) return as a character vector (_chr), and if needed order the columns within select

library(dplyr)
library(purrr)
df %>% 
  mutate(across(A:C, ~ {
         map_chr(strsplit(.x, ",\\s "), ~ {
            tmp <- as.numeric(.x)
            toString(lag((lead(tmp)- tmp)/tmp *100))})
      }, .names = "{.col}_0")) %>% 
  select(id, gtools::mixedsort(names(.)[-1]))

-output

id                 A                      A_0                       B                                         B_0                   C
1  1 200, 100, 80, 100         NA, -50, -20, 25           0.1, 0.2, 0.3                                 NA, 100, 50 700.1, 701.0, 699.2
2  2     120, 210, 220 NA, 75, 4.76190476190476 0.2, 0.3, 1.0, 0.4, 0.9          NA, 50, 233.333333333333, -60, 125        702.5, 702.9
3  3          170, 200     NA, 17.6470588235294  0.55, 0.77, 0.99, 0.35 NA, 40, 28.5714285714286, -64.6464646464647 705.4, 705.4, 706.0
                                        C_0
1 NA, 0.128553063848018, -0.256776034236798
2                    NA, 0.0569395017793562
3                 NA, 0, 0.0850581230507546

CodePudding user response:

Something like this can get you started

library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.1.3
#> Warning: package 'readr' was built under R version 4.1.3
#> Warning: package 'dplyr' was built under R version 4.1.3

df_example <- data.frame(
  id = 1:3,
  A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
  B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
  C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
) 

df_example |>
  as_tibble() |> 
  mutate(across(-id,
                .names =  "{.col}_0",
                \(x) x |> 
                  str_split(',') |> 
                  map(as.numeric) |> 
                  map(\(x) lag((lead(x)-x)/x*100)) |> 
                  map_chr(toString))
         )
#> # A tibble: 3 x 7
#>      id A                 B                       C            A_0   B_0   C_0  
#>   <int> <chr>             <chr>                   <chr>        <chr> <chr> <chr>
#> 1     1 200, 100, 80, 100 0.1, 0.2, 0.3           700.1, 701.~ NA, ~ NA, ~ NA, ~
#> 2     2 120, 210, 220     0.2, 0.3, 1.0, 0.4, 0.9 702.5, 702.9 NA, ~ NA, ~ NA, ~
#> 3     3 170, 200          0.55, 0.77, 0.99, 0.35  705.4, 705.~ NA, ~ NA, ~ NA, ~

Created on 2022-03-28 by the reprex package (v2.0.1)

  • Related