I'm importing a CSV with 700k rows, and R shows that there are no NA values, even though the data has blank values and the code defines them as NA. Additionally, when I filter a particular row that I know for sure doesn't have a specific value, it shows one of the two available values (besides NA) seemingly at random. When I export the CSV after filtering, the file shows blank values in the cells where R displayed a random value.
However, when I tested the code with a small sample of the data, I didn't have this problem and the NAs were shown normally. What could be the problem? Is there a solution or an alternative? I don't understand how the size of the data could cause this error in R.
The import code
DB <- read.csv("DB.CSV", sep = "~", header=FALSE, na.strings = c(""))
The filter code
DB_MA <- DB %>% filter(ID_COM == 11001, REGISTER == "MA")
Export code
write.csv(DB_MA,"DB_MA.csv", row.names = FALSE)
CodePudding user response:
Here are three MCVEs showing that using a tilde as a separator does not change how read.csv handles length-0 character values via the na.strings parameter.
DB <- read.csv(text='a~b~""~d\ne~f~g~""\n', sep = "~",
               header=FALSE, na.strings = c(""))
DB
#------------
  V1 V2   V3   V4
1  a  b <NA>    d
2  e  f    g <NA>
#--- Now use an explicit length-0 character value ---
DB2 <- read.csv(text='a~b~~d\ne~f~g\n', sep = "~",
                header=FALSE, na.strings = c(""))
DB2
#----------
  V1 V2   V3   V4
1  a  b <NA>    d
2  e  f    g <NA>
#--- Now leave out the na.strings parameter so it takes its default, "NA" ---
DB3 <- read.csv(text='a~b~~d\ne~f~g\n', sep = "~",
                header=FALSE)
DB3
#-------
  V1 V2 V3 V4
1  a  b     d
2  e  f  g
nchar(DB3[1,3])
#[1] 0
I suspect some of your cells contain either a single space (" ") or some non-printing character rather than a truly empty string. You should be able to use the nchar function to examine those locations and test my hypothesis (more of a firm prediction, really).
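A minimal sketch of that kind of check, assuming the affected column is a character column (the column name V3 is illustrative — substitute the one you know should be NA):

```r
# Flag cells that read.csv did NOT turn into NA but that contain
# nothing visible once whitespace is trimmed
suspect <- which(!is.na(DB$V3) & nchar(trimws(DB$V3)) == 0)
length(suspect)          # how many such cells exist
head(DB[suspect, ])

# Inspect the raw bytes of one suspect cell to spot non-printing
# characters (e.g. 0x20 space, 0xa0 non-breaking space)
charToRaw(DB$V3[suspect[1]])

# One possible fix: trim unquoted fields on import and also treat
# whitespace-only strings as NA
DB <- read.csv("DB.CSV", sep = "~", header = FALSE,
               na.strings = c("", " "), strip.white = TRUE)
```

Note that strip.white only trims leading/trailing whitespace from unquoted fields, so combining it with an expanded na.strings covers both plain spaces and fields that become empty after trimming; a non-breaking space (0xa0) would still need to be added to na.strings or cleaned separately.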