I am trying to split a dataset into non-overlapping chunks, perform some operations (e.g. multiply the columns by 100) on each chunk and save the results.
For example - I have this data:
my_data = data.frame(id = 1:837, a = rnorm(837,100,100), b = rnorm(837,100,100), c = rnorm(837,100,100))
I want to split this into 10 chunks:
res = round(seq(1, 837, by = 837/10))[-1]
[1] 85 168 252 336 420 503 587 671 754
res <- append(res, 1, after=0)
res <- append(res, 837, after=10)
[1] 1 85 168 252 336 420 503 587 671 754 837
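For reference, I believe the same breakpoints can be built in one call (a sketch; the rounded values may differ by one at a few boundaries):
# 11 evenly spaced breakpoints covering rows 1 to 837
res <- round(seq(1, nrow(my_data), length.out = 11))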
This means I would aim for the following:
- data_1: id = 1 to id = 84
- data_2: id = 85 to id = 168
- data_3: id = 169 to id = 251
- data_4: id = 252 to id = 335
- etc.
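If it helps to state the goal precisely: each id should fall into exactly one chunk. I imagine the mapping could be expressed with findInterval (a sketch, untested; boundary ids may land one chunk off from the list above):
# interval i is [res[i], res[i+1]); rightmost.closed keeps id 837 in chunk 10
chunk <- findInterval(my_data$id, res, rightmost.closed = TRUE)
table(chunk)  # roughly 84 rows per chunk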
I tried to write a loop to do this process:
for (i in 1:length(res))
{
index_1_i = res[i]
index_2_i = res[i + 1]
data_i = my_data[index_1_i : index_2_i,]
data_i$a = data_i$a*100
data_i$b = data_i$b*100 - data_i$a*100
data_i$c = data_i$c*100 - data_i$b
# unwanted space appearing in the file when saved (e.g. "data_ 10.csv")
write.csv(data_i, paste("data_", i, ".csv"))
# I would like to remove each chunk's data frame from the R environment to stop it from filling up
data_i = NULL
}
This is giving me an error: Error in index_1_i:index_2_i : NA/NaN argument
Can someone please show me how to fix this?
Thank you!
CodePudding user response:
I think I solved my own question!
library(dplyr)
for (i in 1:length(res))
{
index_1_i = res[i]
index_2_i = res[i + 1]
data_i = my_data %>% filter(id > index_1_i, id < index_2_i )
data_i$a = data_i$a*100
data_i$b = data_i$b*100 - data_i$a*100
data_i$c = data_i$c*100 - data_i$b
# FIXED: unwanted space appearing in the file when saved (e.g. "data_ 10.csv")
write.csv(data_i, paste0("data_", i, ".csv"))
# I would like to remove each chunk's data frame from the R environment to stop it from filling up
data_i = NULL
}
For some reason, the dplyr package handles this better?
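My best guess as to why: in the base R version, res[i + 1] is NA on the final iteration (the loop runs all the way to length(res)), and an NA endpoint in : is an error, whereas filter() simply drops rows where the comparison is NA, so the last pass just writes an empty file. A minimal base R fix along those lines (a sketch; row.names = FALSE is my own addition):
# stop at length(res) - 1 so res[i + 1] always exists
for (i in 1:(length(res) - 1))
{
  # half-open chunks [res[i], res[i+1]); keep id 837 in the final chunk
  last <- if (i == length(res) - 1) res[i + 1] else res[i + 1] - 1
  data_i <- my_data[res[i]:last, ]
  data_i$a <- data_i$a * 100
  data_i$b <- data_i$b * 100 - data_i$a * 100
  data_i$c <- data_i$c * 100 - data_i$b
  write.csv(data_i, paste0("data_", i, ".csv"), row.names = FALSE)
  data_i <- NULL
}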
The only question now: Is there a way to fix the "res" object so that the index numbers won't select duplicate rows?
res
[1] 1 85 168 252 336 420 503 587 671 754 837
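One idea (again a sketch, untested): treat res as half-open boundaries with >= and <, and bump the final boundary by one so id 837 is still kept:
bounds <- res
bounds[length(bounds)] <- bounds[length(bounds)] + 1  # 837 -> 838, so "id < bound" keeps the last row
for (i in 1:(length(bounds) - 1))
{
  # each id now satisfies exactly one pair of conditions: no duplicates, no gaps
  data_i <- my_data %>% filter(id >= bounds[i], id < bounds[i + 1])
  write.csv(data_i, paste0("data_", i, ".csv"))
}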
CodePudding user response:
With tidyverse, we could create a grouping column and then write to csv within group_modify, or split the data using group_split (or split from base R) and then use map/imap to loop over the list and write each piece.
library(dplyr)
library(stringr)
out <- my_data %>%
    # the transformations don't depend on the grouping, so they run first
    mutate(a = a * 100, b = b * 100 - a * 100, c = c * 100 - b) %>%
    # gl() labels the rows 1..10 in consecutive blocks of ~84 rows
    group_by(grp = as.integer(gl(n(), round(n()/10), n()))) %>%
    # write_csv() returns its input invisibly, so group_modify() gets a data frame back
    group_modify(~ readr::write_csv(.x,
        file.path(getwd(), str_c('data_', .y$grp, ".csv")))) %>%
    ungroup
The grouping column 'grp' is created with gl. Also, the calculations shown in the post don't depend on the grouping info, so they are done before creating the groups (if the original calculation involves rows that should be limited within a group, move the mutate statement to after the group_by).
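For completeness, here is a sketch of the group_split route mentioned above; iwalk (the side-effect variant of imap from purrr) loops over the list of chunks, and the grouping uses the same gl trick:
library(purrr)
my_data %>%
    mutate(a = a * 100, b = b * 100 - a * 100, c = c * 100 - b) %>%
    group_by(grp = as.integer(gl(n(), round(n()/10), n()))) %>%
    # .keep = FALSE drops the helper 'grp' column from each chunk
    group_split(.keep = FALSE) %>%
    # .y is the position in the list, so files are data_1.csv .. data_10.csv
    iwalk(~ readr::write_csv(.x, str_c("data_", .y, ".csv")))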