Home > Back-end >  R: Most efficient way to slice data frame records in groups and sub-groups (via directories and CSV)
R: Most efficient way to slice data frame records in groups and sub-groups (via directories and CSV)

Time:06-04

Given the following sample data (R data frame), I would like to create folders for each group, containing folders for each sub group, (finally) containing a CSV file with corresponding values val1, using R.

Grp <- c("A", "A", "A", "B", "B", "B")
Subgrp <- c("k", "l", "m", "n", "n", "n")
val1 <- c(1.1, 3.2, 4.5, 5.6, 6.7, 7.7)
df <- data.frame(Grp, Subgrp, val1)

This is what I have tried so far, creating a CSV file for each sub-group.

by(df, df$Subgrp, FUN=function(i) write.csv(i,paste0("C:/Temp/",i$Subgrp[1], ".csv")))

I need to iterate this for large data frame. What is the most efficient way of achieving this? I am open for suggestions based on looping and dpyr.

CodePudding user response:

You're close.

My interpretation:

  • "create folders for each group, containing folders for each sub group": using directories such as .../A/k and .../B/n
  • "a CSV file with corresponding values val1": remove the Grp and Subgrp columns from the frame when saving to CSV. (If not this, then replace subset(...) with just i below.)
basepath <- "c:/Temp"

# pre-create directories
for (p in unique(file.path(basepath, df$Grp, df$Subgrp))) dir.create(p, recursive = TRUE)

by(df, df$Subgrp, FUN=function(i) {
  write.csv(subset(i, select = -c(Grp, Subgrp)),
            file.path(basepath, i$Grp[1], i$Subgrp[1], "value.csv"),
            row.names = FALSE)
})
# df$Subgrp: k
# NULL
# ------------------------------------------------------------ 
# df$Subgrp: l
# NULL
# ------------------------------------------------------------ 
# df$Subgrp: m
# NULL
# ------------------------------------------------------------ 
# df$Subgrp: n
# NULL

lf <- list.files(".", pattern = "csv$", recursive = TRUE, full.names = TRUE)
lf
# [1] "./A/k/value.csv" "./A/l/value.csv" "./A/m/value.csv" "./B/n/value.csv"

read.csv(lf[1])
#   val1
# 1  1.1

Since the return value from by is inconsequential, you can either wrap it in invisible(.) or capture and ignore the output, ign <- by(..).

CodePudding user response:

I would rather use lapply so you can conveniently use the names of the subgroups.

lapply(df$Subgrp, \(x) write.csv(subset(df, Subgrp == x), file=sprintf('C:/Temp/%s.csv', x)))

CodePudding user response:

Using tidyverse:

df %>%
  unite(file, Grp, Subgrp, sep='/')%>%
  group_by(file = paste0(file, '.csv'))%>%
  summarise(write.csv(cur_data(), 
          if(dir.exists(dirname(file[1]))) file[1]
          else {dir.create(dirname(file[1]),recursive = T);file[1]},row.names = FALSE))
  •  Tags:  
  • r
  • Related