I am trying to split a dataset into non-overlapping chunks, perform some operations (e.g. multiply the columns by 100) on each chunk and save the results.
For example - I have this data:
my_data = data.frame(id = 1:837, a = rnorm(837,100,100), b = rnorm(837,100,100), c = rnorm(837,100,100))
I want to split this into 10 chunks:
res = round(seq(1, 837, by = 837/10))[-1]
[1] 85 168 252 336 420 503 587 671 754
res <- append(res, 1, after=0)
res <- append(res, 837, after=10)
[1] 1 85 168 252 336 420 503 587 671 754 837
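For reference, I believe the same breakpoints can be built in one call (a sketch; the rounded values may differ by one at a few boundaries):
# 11 evenly spaced breakpoints covering rows 1 to 837
res <- round(seq(1, nrow(my_data), length.out = 11))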
This means I would aim for the following:
- data_1: id = 1 to id = 84
- data_2: id = 85 to id = 168
- data_3: id = 169 to id = 251
- data_4: id = 252 to id = 335
- etc.
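If it helps to state the goal precisely: each id should fall into exactly one chunk. I imagine the mapping could be expressed with findInterval (a sketch, untested; boundary ids may land one chunk off from the list above):
# interval i is [res[i], res[i+1]); rightmost.closed keeps id 837 in chunk 10
chunk <- findInterval(my_data$id, res, rightmost.closed = TRUE)
table(chunk)  # roughly 84 rows per chunk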
I tried to write a loop to do this process:
for (i in 1:length(res))
{
index_1_i = res[i]
index_2_i = res[i + 1]
data_i = my_data[index_1_i : index_2_i,]
data_i$a = data_i$a*100
data_i$b = data_i$b*100 - data_i$a*100
data_i$c = data_i$c*100 - data_i$b
# unwanted space appearing in the file when saved (e.g. "data_ 10.csv")
write.csv(data_i, paste("data_", i, ".csv"))
# I would like to remove each chunk's data frame from the R environment to stop it from filling up
data_i = NULL
}
This is giving me an error: Error in index_1_i:index_2_i : NA/NaN argument
Can someone please show me how to fix this?
Thank you!
CodePudding user response:
I think I solved my own question!
library(dplyr)
for (i in 1:length(res))
{
index_1_i = res[i]
index_2_i = res[i + 1]
data_i = my_data %>% filter(id > index_1_i, id < index_2_i )
data_i$a = data_i$a*100
data_i$b = data_i$b*100 - data_i$a*100
data_i$c = data_i$c*100 - data_i$b
# FIXED: unwanted space appearing in the file when saved (e.g. "data_ 10.csv")
write.csv(data_i, paste0("data_", i, ".csv"))
# I would like to remove each chunk's data frame from the R environment to stop it from filling up
data_i = NULL
}
For some reason, the dplyr package handles this better?
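My best guess as to why: in the base R version, res[i + 1] is NA on the final iteration (the loop runs all the way to length(res)), and an NA endpoint in : is an error, whereas filter() simply drops rows where the comparison is NA, so the last pass just writes an empty file. A minimal base R fix along those lines (a sketch; row.names = FALSE is my own addition):
# stop at length(res) - 1 so res[i + 1] always exists
for (i in 1:(length(res) - 1))
{
  # half-open chunks [res[i], res[i+1]); keep id 837 in the final chunk
  last <- if (i == length(res) - 1) res[i + 1] else res[i + 1] - 1
  data_i <- my_data[res[i]:last, ]
  data_i$a <- data_i$a * 100
  data_i$b <- data_i$b * 100 - data_i$a * 100
  data_i$c <- data_i$c * 100 - data_i$b
  write.csv(data_i, paste0("data_", i, ".csv"), row.names = FALSE)
  data_i <- NULL
}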
The only question now: Is there a way to fix the "res" object so that the index numbers won't select duplicate rows?
res
[1] 1 85 168 252 336 420 503 587 671 754 837
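One idea (again a sketch, untested): treat res as half-open boundaries with >= and <, and bump the final boundary by one so id 837 is still kept:
bounds <- res
bounds[length(bounds)] <- bounds[length(bounds)] + 1  # 837 -> 838, so "id < bound" keeps the last row
for (i in 1:(length(bounds) - 1))
{
  # each id now satisfies exactly one pair of conditions: no duplicates, no gaps
  data_i <- my_data %>% filter(id >= bounds[i], id < bounds[i + 1])
  write.csv(data_i, paste0("data_", i, ".csv"))
}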
CodePudding user response:
With tidyverse, we could create a grouping column and then write to csv within group_modify, or split the data using group_split (or split from base R) and then use map/imap to loop over the list and write each piece.
library(dplyr)
library(stringr)
out <- my_data %>%
    # the transformations don't depend on the grouping, so they run first
    mutate(a = a * 100, b = b * 100 - a * 100, c = c * 100 - b) %>%
    # gl() labels the rows 1..10 in consecutive blocks of ~84 rows
    group_by(grp = as.integer(gl(n(), round(n()/10), n()))) %>%
    # write_csv() returns its input invisibly, so group_modify() gets a data frame back
    group_modify(~ readr::write_csv(.x,
        file.path(getwd(), str_c('data_', .y$grp, ".csv")))) %>%
    ungroup
The grouping column 'grp' is created with gl. Also, the calculations shown in the post don't depend on the grouping info, so they are done before creating the groups (if the original calculation involves rows that should be limited within a group, move the mutate statement to after the group_by).
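For completeness, here is a sketch of the group_split route mentioned above; iwalk (the side-effect variant of imap from purrr) loops over the list of chunks, and the grouping uses the same gl trick:
library(purrr)
my_data %>%
    mutate(a = a * 100, b = b * 100 - a * 100, c = c * 100 - b) %>%
    group_by(grp = as.integer(gl(n(), round(n()/10), n()))) %>%
    # .keep = FALSE drops the helper 'grp' column from each chunk
    group_split(.keep = FALSE) %>%
    # .y is the position in the list, so files are data_1.csv .. data_10.csv
    iwalk(~ readr::write_csv(.x, str_c("data_", .y, ".csv")))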