Home > database >  Add a new column generated from predict() to a list of dataframes
Add a new column generated from predict() to a list of dataframes

Time:12-20

I have a logistic regression model. I would like to predict the morphology of items in multiple dataframes that have been put into a list.

I have lots of dataframes (most say working with a list of dataframes is better).

I need help with 1:

  1. Applying the predict function to a list of dataframes.
  2. Adding these predictions to their corresponding dataframe inside the list.

I am not sure whether it is better to have the 1000 dataframes separately and predict using loops etc, or to continue having them inside a list.

Prior to this code I have split my data into train and test sets. I then trained the model using:

library(nnet)
#Training the multinomial model
multinom_model <- multinom(Morphology ~ ., data=morph, maxit=500)

#Checking the model
summary(multinom_model)

This was then followed by validation etc.

My new dataset, consisting of multiple dataframes stored in a list, called rose.list was formatted by the following:

filesrose <- list.files(pattern = "_rose.csv")

#Rename all files of rose dataset 'rose.i'
for (i in seq_along(filesrose)) {
  assign(paste("rose", i, sep = "."), read.csv(filesrose[i]))
}

#Make a list of the dataframes
rose.list <- lapply(ls(pattern="rose."), function(x) get(x))

I have been using this function to predict on a singular new dataframe

# Predicting the classification for individual datasets
rose.1$Morph <- predict(multinom_model, newdata=rose.1, "class")

Which gives me the dataframe, with the new prediction column 'Morph'

But how would I do this for multiple dataframes in my rose.list? I have tried:

lapply(rose.list, predict(multinom_model, "class"))

Error in eval(predvars, data, env) : object 'Area' not found

and, but also has the error:

lapply(rose.list, predict(multinom_model, newdata = rose.list, "class"))

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows:

CodePudding user response:

You can use an anonymous function (those with function(x) or abbreviated \(x)).

library(nnet)

multinom_model <- multinom(low ~ ., birthwt)

lapply(df_list, \(x) predict(multinom_model, newdata=x, type='class'))
# $rose_1
# [1] 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0
# [40] 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1
# [79] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0
# [118] 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1
# [157] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1
# Levels: 0 1
# 
# $rose_2
# [1] 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1
# [40] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1
# [79] 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0
# [118] 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0
# [157] 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0
# Levels: 0 1
# 
# $rose_3
# [1] 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1
# [40] 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1
# [79] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
# [118] 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
# [157] 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0
# Levels: 0 1

update

To add the predictions as new column to each data frame in the list, modify the code like so:

res <- lapply(df_list, \(x) cbind(x, pred=predict(multinom_model, newdata=x, type='class'))

lapply(res, head)
# $rose_1
#     low age lwt race smoke ptl ht ui ftv  bwt pred
# 136   0  24 115    1     0   0  0  0   2 3090    0
# 154   0  26 133    3     1   2  0  0   0 3260    0
# 34    1  19 112    1     1   0  0  1   0 2084    1
# 166   0  16 112    2     0   0  0  0   0 3374    0
# 27    1  20 150    1     1   0  0  0   2 1928    1
# 218   0  26 160    3     0   0  0  0   0 4054    0
# 
# $rose_2
#     low age lwt race smoke ptl ht ui ftv  bwt pred
# 167   0  16 135    1     1   0  0  0   0 3374    0
# 26    1  25  92    1     1   0  0  0   0 1928    1
# 149   0  23 119    3     0   0  0  0   2 3232    0
# 98    0  22  95    3     0   0  1  0   0 2751    0
# 222   0  31 120    1     0   0  0  0   2 4167    0
# 220   0  22 129    1     0   0  0  0   0 4111    0
# 
# $rose_3
#     low age lwt race smoke ptl ht ui ftv  bwt pred
# 183   0  36 175    1     0   0  0  0   0 3600    0
# 86    0  33 155    3     0   0  0  0   3 2551    0
# 51    1  20 121    1     1   1  0  1   0 2296    1
# 17    1  23  97    3     0   0  0  1   1 1588    1
# 78    1  14 101    3     1   1  0  0   0 2466    1
# 167   0  16 135    1     1   0  0  0   0 3374    0

Data:

data('birthwt', package='MASS')
set.seed(42)
df_list <- replicate(3, birthwt[sample(nrow(birthwt), replace=TRUE), ], simplify=FALSE) |>
  setNames(paste0('rose_', 1:3))
  • Related