I am working with the R programming language.
I have the following data frames (groups, my_data):
library(dplyr)
id_sample <- 1:25
id <- sample(id_sample, replace = TRUE, 1000)
var_1 = rnorm(1000,100,100)
var_2 = rnorm(1000,100,100)
var_3 = rnorm(1000,100,100)
data = data.frame(id, var_1, var_2, var_3)
my_data = data.frame(data %>% group_by(id) %>% mutate(index = row_number(id)))
my_data <- my_data[order(my_data$id),]
groups = data.frame(my_data %>% group_by(id) %>% summarise( count = n()))
groups = transform(groups, rand = ceiling(runif(count) * count))
head(groups)
  id count rand
1  1    48   33
2  2    32    7
3  3    36    4
4  4    57   43
5  5    34   23
6  6    51    5
Using "groups" and "my_data", I would like to create two datasets (e.g. "data_a", "data_b") that split "my_data" based on "groups$count" and "groups$rand". For example:
- For id = 1, "data_a" would contain the first 33 rows and "data_b" would contain the remaining 48-33 rows
- For id = 2, "data_a" would contain the first 7 rows and "data_b" would contain the remaining 32-7 rows
- For id = 3, "data_a" would contain the first 4 rows and "data_b" would contain the remaining 36-4 rows
- etc.
In the end, I would just use the rbind() command on the pieces from each of these "bullets" to create the final "data_a" and "data_b" files.
Can someone please show me how to do this?
Thanks!
CodePudding user response:
You may join the groups data with my_data by id, then create a group column where group "a" marks the first rand rows of each id and group "b" marks the remaining rows. Finally, split the dataset into the two groups.
library(dplyr)
list_data <- my_data %>%
  inner_join(groups, by = 'id') %>%
  group_by(id) %>%
  # rows 1..rand of each id get "a" (index 1), the rest get "b" (index 2)
  mutate(group = letters[as.integer(row_number() > rand) + 1]) %>%
  ungroup() %>%
  split(.$group)
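The indexing inside the mutate() step can be checked in isolation. Here is a minimal sketch, where the values of `rand` and `row` are made-up toy numbers (assumptions for illustration, not taken from the data above):

```r
# Hypothetical toy values: one id with 5 rows, of which the first 3 go to group "a"
rand <- 3
row  <- 1:5

# (row > rand) is FALSE for rows 1..3 and TRUE for rows 4..5;
# as.integer() turns that into 0/1, and adding 1 gives index 1 or 2 into letters
group <- letters[as.integer(row > rand) + 1]
group
#> [1] "a" "a" "a" "b" "b"
```

Without the `+ 1`, the index would be 0 for the first rows, and R silently drops zero indices, so the addition is needed.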
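To get them as two separate datasets in the global environment:
# If you want to name the two datasets differently
# (by default they are created as `a` and `b`, the names of list_data).
#names(list_data) <- c('data_a', 'data_b')
list2env(list_data, .GlobalEnv)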
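As a sanity check, the number of rows per id in "data_a" should equal the matching rand value, and the two pieces together should have as many rows as my_data. Below is a minimal, self-contained sketch on a small made-up dataset. The toy `my_data` and `groups`, the helper column `row`, and the filter()-based split are assumptions for illustration, not part of the approach above:

```r
library(dplyr)

set.seed(1)  # fixed seed only so the toy example is reproducible

# Toy stand-ins for my_data and groups (hypothetical, much smaller than the original)
my_data <- data.frame(id = sample(1:3, 30, replace = TRUE), var_1 = rnorm(30))
my_data <- my_data[order(my_data$id), ]
groups  <- as.data.frame(my_data %>% group_by(id) %>% summarise(count = n()))
groups  <- transform(groups, rand = ceiling(runif(count) * count))

# Number the rows within each id, then split on row <= rand
joined <- my_data %>%
  inner_join(groups, by = "id") %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  ungroup()

data_a <- filter(joined, row <= rand)  # first `rand` rows of each id
data_b <- filter(joined, row >  rand)  # remaining `count - rand` rows

# Rows per id in data_a should match groups$rand,
# and no row should be lost or duplicated by the split
check <- data_a %>% count(id, name = "n_a") %>% inner_join(groups, by = "id")
stopifnot(all(check$n_a == check$rand))
stopifnot(nrow(data_a) + nrow(data_b) == nrow(my_data))
```

The same two stopifnot() checks could be run against the datasets produced by split() and list2env().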