I have a pretty large data frame (about 235K rows) and I want to run a multivariate regression:
model <- lm(var~., data=data)
but I get an error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
Neither na.omit nor any other method of getting rid of NAs helped.
So I tried to find the NAs myself. I split the data frame into two parts:
Second UPD
data1 <- data[1:(dim(data)[1]/2), ]
data2 <- data[(dim(data)[1]/2):(dim(data)[1]), ]
and again I get a result for both lm calls, with none of the errors from the previous UPD section! NB: I've restarted RStudio.
First UPD
data1 <- data[1:(dim(data)[1]/2),]
and when I call lm on it, instead of the previous error I get the following one:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
To reach this error I reduced the data from 235K to 14.5K rows. So, what is the problem now? Some of the slices I cut off don't throw any errors.
Original version
data1 <- data[1:(dim(data)[1]/2)]
data2 <- data[(dim(data)[1]/2):(dim(data)[1])]
and called lm for each of them:
model1 <- lm(var~., data=data1)
model2 <- lm(var~., data=data2)
and I receive no errors! So I suppose the problem is the large size of the data frame. Is there any way to fix it?
CodePudding user response:
From the output of str(data), it looks like some of your numeric predictors are coded as characters. Recode them to numeric using as.numeric and see if that fixes the issue.
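A minimal sketch of that recoding step, using a small made-up data frame in place of your real one. The column names x1 and x2 and all values are hypothetical, and the sketch assumes every character column is really meant to be numeric; a genuinely categorical column should become a factor instead.

```r
# Toy data frame standing in for the real `data` (hypothetical values).
data <- data.frame(
  var = c("1.5", "2.0", "3.1", "4.2", "5.0"),  # response read in as character
  x1  = c("10", "20", "30", "40", "50"),       # numeric predictor read in as character
  x2  = c(0.3, 0.1, 0.4, 0.2, 0.5),            # already numeric
  stringsAsFactors = FALSE
)

# Names of the columns currently stored as character.
chr_cols <- names(data)[vapply(data, is.character, logical(1))]

# Convert each of them to numeric. Values that cannot be parsed become NA
# (with a warning), so inspect the result before fitting.
data[chr_cols] <- lapply(data[chr_cols], as.numeric)

str(data)                          # every column should now be num
model <- lm(var ~ ., data = data)  # fits without the NA/NaN/Inf error
```

If as.numeric produces new NAs here, those are the values worth looking at before you refit.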
If it does, you might want to check why they're coded as characters. Is there rogue punctuation, or are there stray spaces, in your data?
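One way to hunt for such values is to compare each column before and after conversion and list the strings that turn into NA. This is only a sketch on a made-up vector; find_unparseable is a hypothetical helper name, not a base R function.

```r
# Toy column with typical problems (hypothetical values).
x <- c("1.5", "2.0", "3,1", "n/a", "1 000", NA)

# Hypothetical helper: TRUE where a non-missing string fails to parse as a number.
find_unparseable <- function(col) {
  converted <- suppressWarnings(as.numeric(col))
  !is.na(col) & is.na(converted)
}

bad <- find_unparseable(x)

x[bad]      # the offending raw strings: "3,1", "n/a", "1 000"
which(bad)  # their positions: 3, 4, 5
```

On a data frame, the same helper could be applied to every character column with lapply to see which rows need cleaning.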