How to split dataset into N groups and place each dataset into predictive model

Time:01-10

I'm pulling data for multiple accounts and running it through a predictive model. Before I can do that, I need to split the data set by account number so that all the records belonging to an account end up in the same data set. The accounts pulled will differ from run to run, and so will the number of groups, so I can't hard-code a fixed number of data sets. How can I handle this? I suspect I'll need a loop. Hoping there is an easy solution! I'm relatively new to R.

Dataset example:

AccountNum, col1, col2
123,x,2
123,y,3
334,t,9
334,t,8
334,t,9
84,i,10
84,i,17

Problem: The number of accounts (three in this case) will change from run to run. How do I programmatically split this data set into multiple data sets (one per account number) and run each one through my predictive model?

Here is where I am at thus far:

data <- [source]

my_splits <- split(data,data$AccountNum) ## splitting data set by account number

my_splits ## shows me all the groupings and data

length(my_splits) ## gives me the number of groups

[Here is where I get lost]

for (i in 1:length(my_splits)) {
  # Here I need to somehow reference each data set in my_splits and run it through my model.
  # Then output the results to a .csv file.
}

CodePudding user response:

Here is one way to fit a linear model for each group after splitting:

library(dplyr)
library(purrr)   # provides map_dfr()
library(broom)
library(tibble)

df %>% 
  mutate(AccountNum = factor(AccountNum)) %>% 
  group_split(AccountNum) %>% 
  map_dfr(.f = function(df) {
    lm(col2 ~ col3, data = df) %>% 
      tidy() %>% 
      add_column(AccountNum = unique(df$AccountNum), .before = 1)
  })

  AccountNum term        estimate std.error statistic  p.value
  <fct>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 84         (Intercept)  27.8     NaN        NaN     NaN     
2 84         col3         -0.318   NaN        NaN     NaN     
3 123        (Intercept)   0.850   NaN        NaN     NaN     
4 123        col3          0.05    NaN        NaN     NaN     
5 334        (Intercept)   8.13      0.663     12.3     0.0518
6 334        col3          0.0186    0.0196     0.950   0.516 

Fake data used above (col3 is added for the example):

df <- structure(list(AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L, 
84L), col1 = c("x", "y", "t", "t", "t", "i", "i"), col2 = c(2L, 
3L, 9L, 8L, 9L, 10L, 17L), col3 = c(23L, 43L, 22L, 12L, 53L, 
56L, 34L)), class = "data.frame", row.names = c(NA, -7L))
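
For comparison, here is a minimal base R sketch of the same idea without the tidyverse, assuming the fake data above and the same `col2 ~ col3` formula. The column names `intercept` and `slope` are made up for this illustration, and `lm()` only stands in for whatever model you actually use.

```r
# Same fake data as above (col3 is the made-up predictor).
df <- data.frame(
  AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L, 84L),
  col1 = c("x", "y", "t", "t", "t", "i", "i"),
  col2 = c(2L, 3L, 9L, 8L, 9L, 10L, 17L),
  col3 = c(23L, 43L, 22L, 12L, 53L, 56L, 34L)
)

# split() returns a named list with one data frame per account number.
my_splits <- split(df, df$AccountNum)

# Fit one model per account and keep its coefficients as a one-row data frame.
coef_rows <- lapply(names(my_splits), function(acct) {
  fit <- lm(col2 ~ col3, data = my_splits[[acct]])
  data.frame(
    AccountNum = acct,
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2])
  )
})

# Stack the per-account rows into a single result table.
coef_table <- do.call(rbind, coef_rows)
print(coef_table)
```

Because `lapply()` walks over `names(my_splits)`, the code works for any number of accounts, and each row stays labelled with its account number.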

CodePudding user response:

Would something like this do what you want?

for (i in seq_along(my_splits)) {  # seq_along() is also safe when my_splits is empty
  # Get the current data set
  current_data <- my_splits[[i]]
  
  # Run the data set through your predictive model
  result <- model(current_data)
  
  # Output the result to a .csv file
  write.csv(result, file = paste0("result_", i, ".csv"))
}
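
A possible variation on the loop above: since `split()` names each list element after its account number, you can loop over `names(my_splits)` and put the account number in the file name instead of a position index. This is only a sketch. `run_model()` is a hypothetical stand-in for your real model, `col3` is a made-up predictor, and `tempdir()` is a placeholder for your own output folder.

```r
# Example data; col3 is a made-up predictor so the stand-in model has something to fit.
data <- data.frame(
  AccountNum = c(123L, 123L, 334L, 334L, 334L, 84L, 84L),
  col1 = c("x", "y", "t", "t", "t", "i", "i"),
  col2 = c(2L, 3L, 9L, 8L, 9L, 10L, 17L),
  col3 = c(23L, 43L, 22L, 12L, 53L, 56L, 34L)
)

my_splits <- split(data, data$AccountNum)

# Hypothetical stand-in for the real predictive model:
# fits a linear model and returns the input rows plus fitted values.
run_model <- function(d) {
  fit <- lm(col2 ~ col3, data = d)
  data.frame(d, predicted = unname(fitted(fit)))
}

out_dir <- tempdir()  # assumption: replace with your own output folder

for (acct in names(my_splits)) {
  result <- run_model(my_splits[[acct]])
  # One file per account, e.g. result_123.csv
  out_file <- file.path(out_dir, paste0("result_", acct, ".csv"))
  write.csv(result, file = out_file, row.names = FALSE)
}
```

Naming files by account number keeps the output stable even when the set of accounts, and therefore each account's position in the list, changes between runs.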