R: Splitting a File Based on Conditions in the File itself

Time:06-04

I am working with the R programming language.

I have the following data frames (groups, my_data):

library(dplyr)

id_sample <- 1:25
id <- sample(id_sample, replace = TRUE, 1000)
var_1 = rnorm(1000,100,100)
var_2 = rnorm(1000,100,100)
var_3 = rnorm(1000,100,100)

data = data.frame(id, var_1, var_2, var_3)


my_data =  data.frame(data %>% group_by(id) %>% mutate(index = row_number(id)))
my_data <- my_data[order(my_data$id),]

groups = data.frame(my_data %>% group_by(id) %>% summarise( count = n()))
groups = transform(groups, rand = ceiling(runif(count) * count))

head(groups)
  id count rand
1  1    48   33
2  2    32    7
3  3    36    4
4  4    57   43
5  5    34   23
6  6    51    5

Using "groups" and "my_data", I would like to create two datasets (e.g. "data_a", "data_b") that split "my_data" based on "groups$count" and "groups$rand". For example:

  • For id = 1, "data_a" would contain the first 33 rows and "data_b" would contain the remaining 48-33 rows
  • For id = 2, "data_a" would contain the first 7 rows and "data_b" would contain the remaining 32-7 rows
  • For id = 3, "data_a" would contain the first 4 rows and "data_b" would contain the remaining 36-4 rows
  • etc.

In the end, I would use rbind() to combine the pieces from each of these bullets into the final "data_a" and "data_b" datasets.

Can someone please show me how to do this?

Thanks!

CodePudding user response:

You can join groups to my_data by id and create a group column in which group "a" marks the first rand rows of each id and group "b" marks the remaining rows. Finally, split the dataset on that column to get a list of two data frames.

library(dplyr)

list_data <- my_data %>%
  inner_join(groups, by = 'id') %>%
  group_by(id) %>%
  mutate(group = letters[as.integer(row_number() > rand) + 1]) %>%
  ungroup %>%
  split(.$group)
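
The group label comes from indexing R's built-in letters vector: the comparison yields TRUE/FALSE, as.integer() turns that into 1/0, and adding 1 gives index 2 or 1. A minimal sketch of that step in isolation, using hypothetical toy values (rand = 2 and a five-row group) that are not from the question's data:

```r
# Hypothetical toy values: one id with 5 rows and a cutoff of 2.
rand  <- 2
index <- 1:5                        # stands in for row_number() within a group

flag  <- as.integer(index > rand)   # 0 for the first `rand` rows, 1 afterwards
group <- letters[flag + 1]          # index 1 -> "a", index 2 -> "b"

print(group)
```

Rows at or below the cutoff land in "a" and the rest in "b", which is why split() later returns a list with elements named a and b.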

To get them as two separate datasets in the global environment:

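# split() names the list elements "a" and "b", so list2env() creates objects a and b.
# Uncomment the next line to get objects called data_a and data_b instead.
# names(list_data) <- c('data_a', 'data_b')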
list2env(list_data, .GlobalEnv)
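
As a possible alternative, the same split can be sketched in base R by comparing the per-id row counter with rand after a merge. This is only an illustration under stated assumptions: the fixed seed, the smaller data size (5 ids, 100 rows), and the helper name merged are made up for the example and are not part of the original question or answer.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Smaller stand-in for my_data: 5 ids, 100 rows, one variable.
id      <- sample(1:5, 100, replace = TRUE)
my_data <- data.frame(id = id, var_1 = rnorm(100, 100, 100))
my_data <- my_data[order(my_data$id), ]

# Within-id row counter, playing the role of the index column in the question.
my_data$index <- ave(seq_along(my_data$id), my_data$id, FUN = seq_along)

# Per-id row count and a random cutoff between 1 and count.
groups      <- data.frame(id    = sort(unique(my_data$id)),
                          count = as.vector(table(my_data$id)))
groups$rand <- ceiling(runif(nrow(groups)) * groups$count)

# Attach count and rand to every row, then split on index versus rand.
merged <- merge(my_data, groups, by = "id")
data_a <- merged[merged$index <= merged$rand, ]
data_b <- merged[merged$index >  merged$rand, ]

# Sanity check: every row ends up in exactly one of the two datasets.
stopifnot(nrow(data_a) + nrow(data_b) == nrow(my_data))
```

Because the filter uses the stored index column rather than the current row order, it does not depend on how merge() happens to order the rows.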