Home > Enterprise >  Is there a better R way to expand a dataframe by a function on rows?
Is there a better R way to expand a dataframe by a function on rows?

Time:12-02

Question:

Below works, but is there a better "R way" of achieving similar result? I am essentially trying to create / distribute groups into individual line items according to a user defined function (currently just using a loop).

Example:

df1 <- data.frame(group = c("A", "B", "C"), 
              volume = c(200L, 45L, 104L)
              )
print(df1)
#>   group volume
#> 1     A    200
#> 2     B     45
#> 3     C    104

I want the volume to be broken across multiple rows according to group so that the final result is a dataframe where the new volume (vol2 in the below) would add up to original volume above. In this example, I'm applying integer math with a divisor of 52, so my final result should be:

print(df3)
#>    group vol2
#> 1      A   52
#> 2      A   52
#> 3      A   52
#> 4      A   44
#> 21     B   45
#> 31     C   52
#> 32     C   52

This works

The code below DOES get me to the desired result shown above:

div <- 52L
df1$intgr <- df1$volume %/% div
df1$remainder <- df1$volume %% div

print(df1)
#>   group volume intgr remainder
#> 1     A    200     3        44
#> 2     B     45     0        45
#> 3     C    104     2         0

df2 <- data.frame()

for (r in 1:nrow(df1)){
  if(df1[r,"intgr"] > 0){
    for (k in 1:as.integer(df1[r,"intgr"])){
      df1[r,"vol2"] <- div
      df2 <- rbind(df2, df1[r,])
    }
  }
  if(df1[r,"remainder"]>0){
    df1[r, "vol2"] <- as.integer(df1[r, "remainder"])
    df2 <- rbind(df2, df1[r,])
  }
}

print(df2)
#>    group volume intgr remainder vol2
#> 1      A    200     3        44   52
#> 2      A    200     3        44   52
#> 3      A    200     3        44   52
#> 4      A    200     3        44   44
#> 21     B     45     0        45   45
#> 31     C    104     2         0   52
#> 32     C    104     2         0   52

df3 <- subset(df2, select = c("group", "vol2"))

print(df3)
#>    group vol2
#> 1      A   52
#> 2      A   52
#> 3      A   52
#> 4      A   44
#> 21     B   45
#> 31     C   52
#> 32     C   52

Being still relatively new to R, I'm just curious if someone knows a better way / function / method that gets to the same place. Seems like there might be. I could potentially have a more complex way of breaking up the rows and I was thinking maybe there's a method that applies a UDF to the dataframe to do something like this. I was searching for "expand group/groups" but was finding mostly "expand.grid" which isn't what I'm doing here.

Thank you for any suggestions!

CodePudding user response:

A quick function to help split each number by the modulus,

fun <- function(num, mod) c(rep(mod, ceiling(num / mod)-1), (num-1) %% mod   1)
fun(200, 52)
# [1] 52 52 52 44
fun(45, 52)
# [1] 45
fun(104, 52)
# [1] 52 52

And we can apply this a number of ways:

library(dplyr)
df1 %>%
  group_by(group) %>%
  summarize(vol2 = fun(volume, 52), .groups = "drop")
# # A tibble: 7 x 2
#   group  vol2
#   <chr> <dbl>
# 1 A        52
# 2 A        52
# 3 A        52
# 4 A        44
# 5 B        45
# 6 C        52
# 7 C        52
do.call(rbind, by(df1, seq(nrow(df1)),
                  FUN = function(z) data.frame(group = z$group, vol2 = fun(z$volume, 52))))
#     group vol2
# 1.1     A   52
# 1.2     A   52
# 1.3     A   52
# 1.4     A   44
# 2       B   45
# 3.1     C   52
# 3.2     C   52
library(data.table)
setDT(df1)
df1[, .(vol2 = fun(volume, 52)), by = group]
#     group  vol2
#    <char> <num>
# 1:      A    52
# 2:      A    52
# 3:      A    52
# 4:      A    44
# 5:      B    45
# 6:      C    52
# 7:      C    52

CodePudding user response:

A tidyverse approach using purrr::pmap and tidyr::unnest_longer may look like so:

library(dplyr, w = FALSE)
library(tidyr)
library(purrr)

div <- 52

df1 |> 
  mutate(intgr = volume %/% div, remainder = volume %% div, intgr1 =  (remainder > 0)) |> 
  mutate(vol2 = purrr::pmap(list(intgr, intgr1, remainder), ~ c(rep(div, ..1), rep(..3, ..2)))) |> 
  tidyr::unnest_longer(vol2) |> 
  select(-intgr1)
#> # A tibble: 7 × 5
#>   group volume intgr remainder  vol2
#>   <chr>  <int> <dbl>     <dbl> <dbl>
#> 1 A        200     3        44    52
#> 2 A        200     3        44    52
#> 3 A        200     3        44    52
#> 4 A        200     3        44    44
#> 5 B         45     0        45    45
#> 6 C        104     2         0    52
#> 7 C        104     2         0    52

CodePudding user response:

With data.table and rep:

library(data.table)

setDT(df1)[, .(vol2 = c(rep(52, volume%/R), (volume%R)[sign(volume%R)])), group][]
#>    group vol2
#> 1:     A   52
#> 2:     A   52
#> 3:     A   52
#> 4:     A   44
#> 5:     B   45
#> 6:     C   52
#> 7:     C   52

Or

setDT(df1)[, .(vol2 = c(rep(52, volume%/R), volume%R)), group][vol2 != 0][]
#>    group vol2
#> 1:     A   52
#> 2:     A   52
#> 3:     A   52
#> 4:     A   44
#> 5:     B   45
#> 6:     C   52
#> 7:     C   52

CodePudding user response:

Vectorised and without grouping:

df1 <- data.frame(group = c("A", "B", "C"),
                  volume = c(200L, 45L, 104L))

n <- 52
idx <- df1$volume %/% n   ((sel <- df1$volume %% n) != 0)
out <- df1[rep(seq_len(nrow(df1)), idx),]
out$volume <- n
out$volume[cumsum(idx)[sel != 0]] <- sel[sel != 0]
##    group volume
##1       A     52
##1.1     A     52
##1.2     A     52
##1.3     A     44
##2       B     45
##3       C     52
##3.1     C     52
  • Related