Phantom NAs in a data frame when running a regression in R

Time:02-10

I have a fairly large data frame (about 235K rows) and I want to run a multivariate regression:

model <- lm(var~., data=data)

but I get an error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

Neither na.omit nor any other method of removing NAs helped.

So I tried to find the NAs myself. I split the data frame into two parts:

Second UPD

data1 <- data[1:(dim(data)[1]/2), ]
data2 <- data[(dim(data)[1]/2):(dim(data)[1]), ]

and again I get a result from both lm calls, with none of the errors from the previous UPD section! NB: I restarted RStudio.

First UPD

data1 <- data[1:(dim(data)[1]/2),]

and when I call lm, instead of the previous error I get the following:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

To reach this error I reduced the data from 235K to 14.5K rows. So what is the problem now? Some of the discarded slices don't throw any errors.

Original version

data1 <- data[1:(dim(data)[1]/2)]
data2 <- data[(dim(data)[1]/2):(dim(data)[1])]

and called lm on each of them:

model1 <- lm(var~., data=data1)
model2 <- lm(var~., data=data2)

and I receive no errors! So I suppose the problem is the large size of the data frame. Is there any way to fix it?

CodePudding user response:

From the outputs of str(data) it looks like some of your numeric predictors are coded as "characters".

Re-code them to numeric using as.numeric and see if that fixes the issue.
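
A minimal sketch of that re-coding, assuming a toy data frame in place of yours (the column names x1 and x2 and all the values are made up for illustration):

```r
# Toy data standing in for the real data frame (hypothetical columns and values).
data <- data.frame(
  var = c("1.5", "2.0", "3.5", "4.0"),
  x1  = c("10", "20", "30", "40"),
  x2  = c(1, 2, 3, 5),
  stringsAsFactors = FALSE
)

# Names of the columns that are stored as character.
char_cols <- names(data)[sapply(data, is.character)]

# Convert each of those columns to numeric in place.
data[char_cols] <- lapply(data[char_cols], as.numeric)

# Check the column types, then fit the model.
str(data)
model <- lm(var ~ ., data = data)
```

If a column turns out to be a factor rather than character, convert it with as.numeric(as.character(x)), because as.numeric on a factor returns the internal level codes and not the printed values.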

If it does you might want to check why they're coded as characters. Are there rogue punctuation or spaces in your data?
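
One way to look for such values could be to convert a column and then list the entries that become NA although the original was not NA. This is only a sketch, and the vector x below is a hypothetical stand-in for one of your character columns:

```r
# Hypothetical raw column containing a decimal comma and a text placeholder.
x <- c("1.5", "2.0", "3,5", "n/a", NA, "4")

# Convert, silencing the "NAs introduced by coercion" warning.
converted <- suppressWarnings(as.numeric(x))

# Entries that were present in the original but failed to convert.
bad <- x[is.na(converted) & !is.na(x)]
print(unique(bad))
```

Running this on each suspicious column shows which raw strings need cleaning before the conversion.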
