Has the behavior of is.character() changed in R 4.x?
Here I read a simple tab-delimited text file into a data frame, and then confirm all columns are correctly marked as character data:
> raw <- read.table( creditDataPath, header = TRUE, colClasses="character", sep = "\t")
> str(raw)
'data.frame': 407 obs. of 18 variables:
$ NAME : chr "Hope Gorman" "Sarah Coriano" "Ernest Farmer" "John Coleman" ...
$ ADDRESS : chr "179 Del Mar Blvd." "640 Prospect Lane" "474 Green Street" "452 Green Street" ...
$ ZIP : chr "99975" "99904" "99900" "99924" ...
$ SSN : chr "470-17-7670" "355-91-5677" "129-21-0468" "121-57-2753" ...
$ SEX : chr "F" "F" "M" "M" ...
$ MARITALSTATUS : chr "M" "M" "M" "M" ...
$ CHILDREN : chr "2" "1" "0" "0" ...
$ OCCUPATION : chr "Professional" "Unknown" "Unknown" "Unknown" ...
$ HOMEOWNERSHIP : chr "O" "O" "O" "O" ...
$ INCOME : chr "3212" "3145" "3165" "3248" ...
$ EXPENSES : chr "1124" "1100" "1266" "974" ...
$ CHECKING : chr "N" "N" "N" "N" ...
$ SAVINGS : chr "Y" "Y" "Y" "Y" ...
$ MSTRCARD : chr "1" "1" "1" "1" ...
$ VISA : chr "5" "5" "5" "5" ...
$ AMEX : chr "0" "0" "0" "0" ...
$ MERCHANT : chr "9" "9" "9" "9" ...
$ PAYMENTHISTORY: chr "2" "0" "2" "3" ...
However, is.character(raw) for the data frame and is.character(raw[3,1:17]) for a portion of a row in the data frame both return FALSE:
> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE
>
With R version 3.5.2 (the original development environment was 64-bit R 3.5.2 on 64-bit Win 7), reading the file into a data frame (WITHOUT needing to add colClasses = "character") simply worked. The use case is that an R wrapper uses is.character() to determine whether a row in the data frame contains all string values (in effect: is.character(raw[n,1:17])); that then determines which version of a C function in a legacy DLL to call: one that expects ALL strings, or one that expects ALL doubles.
I have been away from R since 2019, so today on a computer running Win10 Pro I installed 64-bit R 4.2.1, loaded the original workspace, and expected everything to work. And, if I manually craft a record (vector) that explicitly has every value in double quotes (e.g., "Hope Gorman", "99975", etc.) everything does work - the R wrapper calls the correct C function.
The problem is, loading the data frame from the simple flat ASCII text file and then accessing it row by row does not work, even though after loading R seems to think the data consists of values that are quoted strings. The error is the dreaded "NAs introduced by coercion" message - in the wrapper, R appears NOT to recognize the character strings.
What am I missing? Is this a bug in 4.x?
EDIT:
Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column). This is a toy (ENTIRELY FAKED) data file for consumer credit analysis.
NAME ADDRESS ZIP SSN SEX MARITALSTATUS CHILDREN OCCUPATION HOMEOWNERSHIP INCOME EXPENSES CHECKING SAVINGS MSTRCARD VISA AMEX MERCHANT PAYMENTHISTORY
Hope Gorman 179 Del Mar Blvd. 99975 470-17-7670 F M 2 Professional O 3212 1124 N Y 1 5 0 9 2
Sarah Coriano 640 Prospect Lane 99904 355-91-5677 F M 1 Unknown O 3145 1100 N Y 1 5 0 9 0
Ernest Farmer 474 Green Street 99900 129-21-0468 M M 0 Unknown O 3165 1266 N Y 1 5 0 9 2
Also FWIW, I have checked everything on the original development machine (same file, same R workspace but R 3.5.2 running on Win7), and the R wrapper calls the correct C code as expected.
This leads me to think there is something different in R 4.2 running on Win 10 - I have noted that R now apparently uses UTF-8 characters, but since the file consists solely of US-ASCII characters and no BOM, I am hard-pressed to think character handling on Win10 is the problem, but the fact remains the original code/ R Workspace doesn't work.
Thanks, Jack
CodePudding user response:
First, thanks to all the R Gurus who responded.
Second, and somewhat embarrassing, after re-learning how to use debugging tools in R, I discovered that the reason the code "ran" on R 3.5.2 was that there was a bug in the legacy C DLL.
When I looked at the problematic R code statically, it appeared that the only way the R function could possibly call the correct DLL function was if is.character(data) returned TRUE.
However, when I stepped through the code in the debugger (in the original Win7/R 3.5.2 environment), I found that is.character(data) was actually returning FALSE - as everyone here expected (and Casper V. further demonstrated), BUT the C function in the Win7 DLL was still treating data as an array of character strings (which it should not have done, given the logic path in the R function).
I then discovered that the legacy DLL used on Win 10, which I thought was the same as that used in the Win 7 environment, was actually a later version, in which the bug was fixed (which of course caused the R error I was seeing in Win 10).
In the end, checking the data type in R as suggested by r2evans ultimately solved the problem.
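As a rough illustration of what such a type check might look like (the original wrapper code is not shown in this thread, so the function name row_kind and the toy rows below are hypothetical), the columns of a row can be tested one by one instead of calling is.character() on the row as a whole:

```r
# Hypothetical helper: classify a row as all-character or all-numeric.
# 'row' may be a one-row data frame or a plain atomic vector.
row_kind <- function(row) {
  cols <- as.list(row)  # one list element per column (or per vector element)
  if (all(vapply(cols, is.character, logical(1)))) {
    "character"
  } else if (all(vapply(cols, is.numeric, logical(1)))) {
    "numeric"
  } else {
    stop("row is neither all character nor all numeric")
  }
}

# Toy rows (made-up values in the style of the question's data)
chr_df <- data.frame(NAME = "Hope Gorman", ZIP = "99975",
                     stringsAsFactors = FALSE)
num_df <- data.frame(INCOME = 3212, EXPENSES = 1124)

row_kind(chr_df[1, ])                 # "character"
row_kind(num_df[1, ])                 # "numeric"
row_kind(c("Hope Gorman", "99975"))   # "character" (hand-built vector)
```

The returned label could then decide which of the two C entry points the wrapper calls; that dispatch is left out here because the DLL interface is not described in the thread.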
CodePudding user response:
The behaviour of is.character() that you describe doesn't seem to have changed. What you describe as the behavior for version 3.5.2 doesn't seem to be correct, as can be seen in my attempt to replicate it below.
As mentioned by @guasi, the class of your data frame raw is just data.frame. The class of a column, e.g. raw$NAME, can be character.
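A minimal sketch of that distinction, using a small made-up data frame (df and its values are illustrative, not the asker's file):

```r
# Toy data frame with two all-character columns (made-up values)
df <- data.frame(
  NAME = c("Hope Gorman", "Sarah Coriano"),
  ZIP  = c("99975", "99904"),
  stringsAsFactors = FALSE
)

class(df)                # "data.frame"
class(df$NAME)           # "character"

is.character(df)         # FALSE: a data frame is a list of columns
is.character(df$NAME)    # TRUE:  a single column is a character vector
is.character(df[1, ])    # FALSE: a row subset is still a data frame

# Ways to ask "is everything character?" instead:
all_chr <- all(vapply(df, is.character, logical(1)))  # check column by column
row_vec <- unlist(df[1, ])                            # flatten the row to a vector
is.character(row_vec)                                 # TRUE here
```

Note that unlist() on a row with mixed column types would coerce everything to a common type, so the column-by-column check is the safer of the two.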
It is possible that you had/have a custom handler for is.character() to handle data frames on the 3.5.2 setup, and you didn't copy it to the new version:
https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/character
as.character and is.character are generic: you can write methods to handle specific classes of objects, see InternalMethods.
You mention that the same code on the old 3.5.2 install still calls the correct C code. Have you checked yourself what the output of is.character(raw) is on that machine?
Reproduction attempt
I have installed R 3.5.2 64 bit on Win 7, and couldn't reproduce what you remember from the past:
File: test.csv, created from example data, but tab delimited:
Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column).
Environment:
the original development environment was 64-bit R 3.5.2 on 64-bit Win 7
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.2
Command:
simply reading the file into a table (WITHOUT needing to add colClasses = "character") simply worked.
> raw <- read.table('C:\\Users\\caspar\\Desktop\\test.csv', header=T, sep='\t')
> str(raw)
'data.frame': 3 obs. of 18 variables:
$ NAME : Factor w/ 3 levels "Ernest Farmer",..: 2 3 1
$ ADDRESS : Factor w/ 3 levels "179 Del Mar Blvd.",..: 1 3 2
$ ZIP : int 99975 99904 99900
$ SSN : Factor w/ 3 levels "129-21-0468",..: 3 2 1
$ SEX : Factor w/ 2 levels "F","M": 1 1 2
$ MARITALSTATUS : Factor w/ 1 level "M": 1 1 1
$ CHILDREN : int 2 1 0
$ OCCUPATION : Factor w/ 2 levels "Professional",..: 1 2 2
$ HOMEOWNERSHIP : Factor w/ 1 level "O": 1 1 1
$ INCOME : int 3212 3145 3165
$ EXPENSES : int 1124 1100 1266
$ CHECKING : Factor w/ 1 level "N": 1 1 1
$ SAVINGS : Factor w/ 1 level "Y": 1 1 1
$ MSTRCARD : int 1 1 1
$ VISA : int 5 5 5
$ AMEX : int 0 0 0
$ MERCHANT : int 9 9 9
$ PAYMENTHISTORY: int 2 0 2
> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE
spacing between commands mine
Here I assumed you mean the same command you used before, but without the colClasses parameter. is.character() is still FALSE with colClasses = "character". It's more common to use read.csv(), but that yielded the same results as well.
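For anyone comparing the two R versions without the original file, here is a small self-contained sketch using inline text in place of test.csv (the three columns and their values are a made-up subset). One difference that does exist between 3.5.2 and 4.x is that the default of stringsAsFactors changed from TRUE to FALSE in R 4.0.0, which is why the str() output above shows Factor columns; it is set explicitly below so the result is the same on both versions:

```r
# Inline stand-in for the tab-delimited file (made-up subset of columns)
txt <- "NAME\tZIP\tCHILDREN\nHope Gorman\t99975\t2\nSarah Coriano\t99904\t1\n"

# Let read.table() guess the column types
as_read <- read.table(text = txt, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)

# Force every column to character
as_chr <- read.table(text = txt, header = TRUE, sep = "\t",
                     colClasses = "character")

sapply(as_read, class)   # NAME is character; ZIP and CHILDREN are integer
sapply(as_chr, class)    # all three are character

is.character(as_chr)     # FALSE: still a data frame, whatever its columns hold
```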