Why does R introduce NA's? No commas, just plain numbers like 4438

I don't think this has been answered before. Why do I get "NAs introduced by coercion" when the values are plain numbers? As far as I can tell, there are no commas involved (which seems to be a common cause of this problem).

This is my script:

# load packages

library(tidyverse)


# Get data

UI_url <- "https://raw.githubusercontent.com/OpportunityInsights/EconomicTracker/main/data/UI Claims - State - Weekly.csv"

ui_state_weekly <- read.csv(url(UI_url))

# convert chr to numeric
test <- ui_state_weekly %>% mutate(test = as.numeric(contclaims_rate_combined))
summary(test) # 102 NA's

# Find the values that cause the problem with:

which(is.na(as.numeric(ui_state_weekly$contclaims_rate_combined)) != is.na(ui_state_weekly$contclaims_rate_combined))

As you can see, R reports numbers like:

[1] 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458
[22] 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479

They should work with as.numeric(). What am I doing wrong?

CodePudding user response：

You have some single dot values (".") which cannot be changed to numeric.

ui_state_weekly$contclaims_rate_combined[c(4438, 4439, 4440)]
#[1] "." "." "."

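You can turn those values into NA and then convert the column to numeric.

library(dplyr)

# na_if() replaces only values that are exactly "." with NA,
# so decimal points inside real numbers are left untouched
test <- ui_state_weekly %>% 
          mutate(test = as.numeric(na_if(contclaims_rate_combined, ".")))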
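
A minimal check on a made-up vector (the values below are illustrative, not taken from the data set) suggests why only an exact "." should be treated as missing: removing the first dot from every value would also strip decimal points.

```r
# Toy values for illustration only; not from the UI claims file
x <- c("7.13", ".", "12.6")

# A lone "." is not a number, so as.numeric() returns NA for it
plain <- suppressWarnings(as.numeric(x))

# Stripping the first literal dot from every string also removes decimal points
stripped <- suppressWarnings(as.numeric(sub(".", "", x, fixed = TRUE)))

# Replacing only exact "." entries with NA keeps the other values intact
x_clean <- x
x_clean[x_clean %in% "."] <- NA
cleaned <- as.numeric(x_clean)

print(plain)
print(stripped)
print(cleaned)
```

Comparing `stripped` with `cleaned` makes the difference visible: the first silently changes the magnitude of every decimal value, while the second only touches the placeholder entries.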

CodePudding user response：

In regards to this part of your code:

# Find the values that cause the problem with:

which(is.na(as.numeric(ui_state_weekly$contclaims_rate_combined)) != is.na(ui_state_weekly$contclaims_rate_combined))

[1] 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458
[22] 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479

What you are getting is not the list of values, but the indexes of the values that become NA when converted to numeric. Try this:

index = which(is.na(as.numeric(ui_state_weekly$contclaims_rate_combined)) != is.na(ui_state_weekly$contclaims_rate_combined))

ui_state_weekly$contclaims_rate_combined[index]

And you will see that rows with these indexes contain dots in column contclaims_rate_combined:

[1] "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."
 [38] "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."
 [75] "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."
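
The same idea can be sketched on a toy vector, assuming you want to see every distinct string that fails to convert rather than its position. The vector and the name `bad` are hypothetical and only serve the illustration.

```r
# Toy column for illustration only
x <- c("7.13", ".", NA, "12.6", ".", "n/a")

# Entries that are present as text but become NA after conversion
converted <- suppressWarnings(as.numeric(x))
bad <- !is.na(x) & is.na(converted)

# Count each distinct offending string
print(table(x[bad]))
```

Tabulating the offending strings, instead of printing their indexes, usually shows the culprit at a glance, even when thousands of rows are affected.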

CodePudding user response：

The other solutions fix the problem after the fact. This isn't really a bug, more a data definition problem: in this data set, "." appears to be used for NA. So declare that when reading the file rather than repairing it afterwards.

Use read.table from base R and set na.strings.

ui_state_weekly <- read.table(url(UI_url), header = TRUE, sep = ",", quote = "\"",
                              na.strings = ".")

Or this may be easier since you have to worry about fewer settings.

library(readr)

ui_state_weekly <- read_csv(url(UI_url), na = c("", "NA", "."))

When you do this, you can see that the column of interest now reads in as double and does not require coercion.
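
A self-contained sketch of the effect, using an inline CSV instead of the real file (the column names and values are made up, and the character result assumes R 4.0 or later, where strings are not converted to factors by default):

```r
# Inline CSV standing in for the real file; names and values are made up
csv_text <- "state,rate\n1,7.13\n2,.\n3,12.6"

# Default settings: the "." forces the whole column to be read as character
raw <- read.csv(text = csv_text)

# Declaring "." as a missing-value marker lets the column be parsed as numeric
clean <- read.csv(text = csv_text, na.strings = c("", "NA", "."))

print(class(raw$rate))
print(class(clean$rate))
print(clean$rate)
```

Note that na.strings replaces the default rather than adding to it, so "NA" is listed explicitly here alongside ".".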

CodePudding user response：

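Add a zero where needed with a fixer function, fix_num. A single "." with no digits is turned into NA.

fix_num <- function(x) {
  # values that start with a dot followed by digits, e.g. ".5"
  rpl <- grep('^\\.\\d+', x)
  # prepend a zero to the leading dot
  x[rpl] <- sub('^\\.', '0.', x[rpl])
  # a lone "." marks a missing value
  x[grep('^\\.$', x)] <- NA
  return(as.numeric(x))
}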

library(dplyr)
test <- ui_state_weekly %>% 
  mutate(contclaims_rate_combined_fix=fix_num(contclaims_rate_combined)) 
summary(test[18:19])
# contclaims_rate_combined contclaims_rate_combined_fix
# Length:4539              Min.   : 0.307              
# Class :character         1st Qu.: 2.980              
# Mode  :character         Median : 7.130              
#                          Mean   : 8.890              
#                          3rd Qu.:12.600              
#                          Max.   :71.500              
#                          NA's   :126