R: importing a large .csv shows a value where it should be NA

Time:01-13

I'm importing a CSV with 700k rows, and R reports no NA values even though the data has blank fields and the import code defines blanks as NA. In addition, when I filter for a particular row that I know has no value in a given column, R shows one of the two values that column can take (other than NA), seemingly at random. When I export the filtered data to CSV, the file contains blank fields in places where R displayed a value. However, when I tested the same code on a small sample of the data, the problem did not occur and the NAs appeared as expected. What could be the problem? Is there a solution or an alternative? I don't understand how the size of the data could cause this behavior in R.

The import code

DB <- read.csv("DB.CSV", sep = "~", header=FALSE,  na.strings = c(""))

The filter code

library(dplyr)  # provides %>% and filter()
DB_MA <- DB %>% filter(ID_COM == 11001, REGISTER == "MA")

Export code

write.csv(DB_MA,"DB_MA.csv", row.names = FALSE)

CodePudding user response:

Here are three MCVEs showing that using a tilde as the separator does not change how read.csv applies the na.strings parameter to length-0 character values.

DB <- read.csv(text='a~b~""~d\ne~""f~g\n', sep = "~", 
               header=FALSE,  na.strings = c(""))
DB
#------------
  V1 V2   V3   V4
1  a  b <NA>    d
2  e  f    g <NA>
#--- Now use an explicit length-0 character value---
DB2 <- read.csv(text='a~b~~d\ne~f~g\n', sep = "~", 
                header=FALSE,  na.strings = c(""))
DB2
#----------
  V1 V2   V3   V4
1  a  b <NA>    d
2  e  f    g <NA>
#--- Now leave out the na.strings parameter so it takes its default, "NA" ---
DB3 <- read.csv(text='a~b~~d\ne~f~g\n', sep = "~", 
                header=FALSE)
DB3
#-------
  V1 V2 V3 V4
1  a  b     d
2  e  f  g   

nchar(DB3[1,3])
#[1] 0

I suspect that some of those fields are not truly empty but contain either a space (" ") or some non-printing character. You should be able to use the nchar function on those locations to test my hypothesis (really more of a firm prediction).
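
As a minimal sketch of that check, the following builds a small hypothetical sample (the text, the DB4 name and the blank_to_na helper are made up for illustration, not taken from the question's data) in which two fields contain only spaces. It uses nchar to show that those fields are not length 0, locates them, and then converts whitespace-only fields to NA after import. It assumes the stray characters are ordinary spaces; other non-printing characters may need a different pattern.

```r
# Hypothetical sample: row 1, field 3 is a single space;
# row 2, field 2 is two spaces. Neither is a length-0 string.
txt <- "a~b~ ~d\ne~  ~g~h\n"
DB4 <- read.csv(text = txt, sep = "~", header = FALSE,
                na.strings = c(""), stringsAsFactors = FALSE)

# na.strings = "" only matches exactly empty fields, so check for NAs.
anyNA(DB4)

# Tabulate field lengths per column; whitespace-only fields have length > 0.
lapply(DB4, function(col) table(nchar(col), useNA = "ifany"))

# Locate whitespace-only cells as (row, column) indices.
ws_only <- vapply(DB4,
                  function(col) grepl("^[[:space:]]+$", col),
                  logical(nrow(DB4)))
which(ws_only, arr.ind = TRUE)

# Hypothetical helper: trim character columns and turn empty results into NA.
blank_to_na <- function(x) {
  if (is.character(x)) {
    x <- trimws(x)                              # strips spaces, tabs, CR, LF
    x[!is.na(x) & !nzchar(x)] <- NA_character_  # "" becomes NA
  }
  x
}

DB4_clean <- DB4
DB4_clean[] <- lapply(DB4_clean, blank_to_na)   # [] keeps the data frame shape
DB4_clean
```

On the real data, the same nchar tabulation can be run on the columns that are expected to contain NAs. If it reports lengths of 1 or more where blanks were expected, cleaning after import as above, or adding the offending string (for example " ") to na.strings, may be enough. read.csv also passes strip.white through to read.table, which is another option worth trying at import time.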
