I'm pulling data for multiple accounts and running it through a predictive model. Before I can do that, I need to split the data set by account number so that all the records belonging to one account end up in the same data set. Each pull of the main data set will contain different accounts, and the number of groups will vary, so I can't hard-code a fixed number of data sets. How can I handle this? I suspect I will need a loop. Hoping there is an easy solution! I'm relatively new to R.
Dataset example:
AccountNum, col1, col2
123,x,2
123,y,3
334,t,9
334,t,8
334,t,9
84,i,10
84,i,17
Problem: The number of accounts (three in this case) will change from run to run. How do I programmatically split this data set into multiple data sets (one per account number) and run each one through my predictive model?
Here is what I have so far:
data <- [source]
my_splits <- split(data,data$AccountNum) ## splitting data set by account number
my_splits ## shows me all the groupings and data
length(my_splits) ## gives me the number of groups
## Here is where I get lost
for (i in 1:length(my_splits)) {
  ## Here I need to somehow reference each data set in my_splits and run it through my model,
  ## then output the results to a .csv file.
}
CodePudding user response:
Here is one way to fit a linear model to each group after splitting:
library(dplyr)
library(purrr)   # provides map_dfr()
library(broom)
library(tibble)
df %>%
  mutate(AccountNum = factor(AccountNum)) %>%
  group_split(AccountNum) %>%               # one data frame per account
  map_dfr(.f = function(df) {
    lm(col2 ~ col3, data = df) %>%          # fit the model on this account only
      tidy() %>%
      add_column(AccountNum = unique(df$AccountNum), .before = 1)
  })
  AccountNum term        estimate std.error statistic p.value
  <fct>      <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 84         (Intercept)  27.8    NaN         NaN     NaN
2 84         col3         -0.318  NaN         NaN     NaN
3 123        (Intercept)   0.850  NaN         NaN     NaN
4 123        col3          0.05   NaN         NaN     NaN
5 334        (Intercept)   8.13     0.663      12.3     0.0518
6 334        col3          0.0186   0.0196      0.950   0.516
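Fake data (col3 is an extra column added for this example; accounts 84 and 123 have only two rows each, which is why their standard errors are NaN):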
df <- structure(list(AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L,
84L), col1 = c("x", "y", "t", "t", "t", "i", "i"), col2 = c(2L,
3L, 9L, 8L, 9L, 10L, 17L), col3 = c(23L, 43L, 22L, 12L, 53L,
56L, 34L)), class = "data.frame", row.names = c(NA, -7L))
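The same per-account fit can also be sketched in base R, continuing from the split() call in the question. This is only a minimal sketch: the lm(col2 ~ col3) formula is taken from the answer above and stands in for whatever model you actually use, and the names fits and coefs are made up for illustration.

```r
# Example data, same values as the fake data above.
df <- data.frame(
  AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L, 84L),
  col1 = c("x", "y", "t", "t", "t", "i", "i"),
  col2 = c(2L, 3L, 9L, 8L, 9L, 10L, 17L),
  col3 = c(23L, 43L, 22L, 12L, 53L, 56L, 34L)
)

# One data frame per account; the list is named by account number.
my_splits <- split(df, df$AccountNum)

# Fit the same model to every piece, however many pieces there are.
# Assumption: lm() is a placeholder for the real predictive model.
fits <- lapply(my_splits, function(d) lm(col2 ~ col3, data = d))

# Collect the coefficients into one row per account.
coefs <- do.call(rbind, lapply(names(fits), function(acct) {
  b <- coef(fits[[acct]])
  data.frame(AccountNum = acct,
             intercept  = b[["(Intercept)"]],
             slope      = b[["col3"]])
}))
print(coefs)
```

Because lapply() walks over whatever split() returns, nothing here depends on knowing the number of accounts in advance.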
CodePudding user response:
Would something like this do what you want?
for (i in seq_along(my_splits)) {
  # Get the current data set
  current_data <- my_splits[[i]]
  # Run the data set through your predictive model
  # (model() is a placeholder for your own modelling function)
  result <- model(current_data)
  # Output the result to a .csv file
  write.csv(result, file = paste0("result_", i, ".csv"))
}
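A variation on the loop above, as a rough sketch, is to iterate over the names of the list so that each output file carries the account number instead of a position index. The function run_model() and the out_dir folder are hypothetical stand-ins (here a simple lm() written to a temporary directory); swap in your own model and path.

```r
# Example data, same values as the fake data in the other answer.
df <- data.frame(
  AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L, 84L),
  col1 = c("x", "y", "t", "t", "t", "i", "i"),
  col2 = c(2L, 3L, 9L, 8L, 9L, 10L, 17L),
  col3 = c(23L, 43L, 22L, 12L, 53L, 56L, 34L)
)
my_splits <- split(df, df$AccountNum)

# Hypothetical stand-in for the real predictive model:
# returns a small data frame of fitted coefficients.
run_model <- function(d) {
  fit <- lm(col2 ~ col3, data = d)
  data.frame(term = names(coef(fit)), estimate = unname(coef(fit)))
}

# Assumption: write to a temporary folder; replace with your own output folder.
out_dir <- tempdir()

# names(my_splits) holds the account numbers as character strings.
for (acct in names(my_splits)) {
  result <- run_model(my_splits[[acct]])
  out_file <- file.path(out_dir, paste0("result_", acct, ".csv"))
  write.csv(result, file = out_file, row.names = FALSE)
}
```

Naming files by account keeps the outputs traceable even when the set of accounts changes between pulls.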