is.character() does not correctly identify a data frame

Time:07-04

Has the behavior of is.character() changed in R 4.x? Here I read a simple tab-delimited text file into a data frame, and then confirm that all columns are correctly typed as character data:

> raw <- read.table( creditDataPath, header = TRUE, colClasses="character", sep = "\t")
> str(raw)
'data.frame':   407 obs. of  18 variables:
 $ NAME          : chr  "Hope Gorman" "Sarah Coriano" "Ernest Farmer" "John Coleman" ...
 $ ADDRESS       : chr  "179 Del Mar Blvd." "640 Prospect Lane" "474 Green Street" "452 Green Street" ...
 $ ZIP           : chr  "99975" "99904" "99900" "99924" ...
 $ SSN           : chr  "470-17-7670" "355-91-5677" "129-21-0468" "121-57-2753" ...
 $ SEX           : chr  "F" "F" "M" "M" ...
 $ MARITALSTATUS : chr  "M" "M" "M" "M" ...
 $ CHILDREN      : chr  "2" "1" "0" "0" ...
 $ OCCUPATION    : chr  "Professional" "Unknown" "Unknown" "Unknown" ...
 $ HOMEOWNERSHIP : chr  "O" "O" "O" "O" ...
 $ INCOME        : chr  "3212" "3145" "3165" "3248" ...
 $ EXPENSES      : chr  "1124" "1100" "1266" "974" ...
 $ CHECKING      : chr  "N" "N" "N" "N" ...
 $ SAVINGS       : chr  "Y" "Y" "Y" "Y" ...
 $ MSTRCARD      : chr  "1" "1" "1" "1" ...
 $ VISA          : chr  "5" "5" "5" "5" ...
 $ AMEX          : chr  "0" "0" "0" "0" ...
 $ MERCHANT      : chr  "9" "9" "9" "9" ...
 $ PAYMENTHISTORY: chr  "2" "0" "2" "3" ...

However, is.character(raw) for the data frame and is.character(raw[3,1:17]) for a portion of a row in the data frame both return FALSE:

> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE

With R version 3.5.2 (the original development environment was 64-bit R 3.5.2 on 64-bit Win 7), simply reading the file into a data frame (WITHOUT needing to add colClasses = "character") worked. The use case: an R wrapper uses is.character() to determine whether a row in the data frame contains all string values (in effect, is.character(raw[n,1:17])). That result then determines which version of a C function in a legacy DLL to call: one that expects ALL strings, or one that expects ALL doubles.

I have been away from R since 2019, so today on a computer running Win10 Pro I installed 64-bit R 4.2.1, loaded the original workspace, and expected everything to work. And, if I manually craft a record (vector) that explicitly has every value in double quotes (e.g., "Hope Gorman", "99975", etc.), everything does work: the R wrapper calls the correct C function.

The problem is that loading the data frame from the simple flat ASCII text file and then accessing it row by row does not work, even though after loading R seems to think the data consists of quoted strings. What I get is the dreaded "NAs introduced by coercion" warning: in the wrapper, R appears NOT to recognize the character strings.
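For readers unfamiliar with that message, here is a minimal sketch of how it typically arises. The input vector x is made up for illustration; it is not taken from the wrapper, whose code is not shown in this post.

```r
# Hypothetical mixed input: one numeric-looking string, one name.
x <- c("3212", "Hope Gorman")

# as.numeric() converts what it can and returns NA for the rest.
# It signals a warning (not an error) when that happens.
converted <- withCallingHandlers(
  as.numeric(x),
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")  # continue without re-printing the warning
  }
)

converted
```

So the message usually means that text which does not look like a number was pushed through a numeric conversion somewhere along the way.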

What am I missing? Is this a bug in 4.x?

EDIT: Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column). This is a toy (ENTIRELY FAKED) data file for consumer credit analysis.

NAME    ADDRESS ZIP SSN SEX MARITALSTATUS   CHILDREN    OCCUPATION  HOMEOWNERSHIP   INCOME  EXPENSES    CHECKING    SAVINGS MSTRCARD    VISA    AMEX    MERCHANT    PAYMENTHISTORY
Hope Gorman 179 Del Mar Blvd.   99975   470-17-7670 F   M   2   Professional    O   3212    1124    N   Y   1   5   0   9   2
Sarah Coriano   640 Prospect Lane   99904   355-91-5677 F   M   1   Unknown O   3145    1100    N   Y   1   5   0   9   0
Ernest Farmer   474 Green Street    99900   129-21-0468 M   M   0   Unknown O   3165    1266    N   Y   1   5   0   9   2

Also FWIW, I have checked everything on the original development machine (same file, same R workspace but R 3.5.2 running on Win7), and the R wrapper calls the correct C code as expected.

This leads me to think there is something different in R 4.2 running on Win 10. I have noted that R now apparently uses UTF-8 characters, but since the file consists solely of US-ASCII characters with no BOM, I am hard-pressed to think character handling on Win10 is the problem. The fact remains that the original code/R workspace doesn't work.

Thanks, Jack

CodePudding user response:

First, thanks to all the R Gurus who responded.

Second, and somewhat embarrassing, after re-learning how to use debugging tools in R, I discovered that the reason the code "ran" on R 3.5.2 was that there was a bug in the legacy C DLL.

When I looked at the problematic R code statically, it appeared that the only way the R function could possibly call the correct DLL function was if is.character(data) returned TRUE.

However, when I stepped through the code in the debugger (in the original Win7/R 3.5.2 environment), I found that is.character(data) was actually returning FALSE, as everyone here expected (and Casper V. further demonstrated). The C function in the Win7 DLL was nevertheless treating data as an array of character strings, which it should not have done given the logic path in the R function.

I then discovered that the legacy DLL used on Win 10, which I thought was the same as that used in the Win 7 environment, was actually a later version, in which the bug was fixed (which of course caused the R error I was seeing in Win 10).

In the end, checking the data type in R as suggested by r2evans ultimately solved the problem.
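A rough sketch of what such a type check could look like follows. The function row_for_dll and the example data frames are hypothetical, written only to illustrate the idea; they are not the actual wrapper and they do not call any DLL.

```r
# Hypothetical helper: decide which kind of C routine a row should go to,
# based on the types of the selected columns, and return the row as a
# plain atomic vector of that type.
row_for_dll <- function(df, i, cols = seq_along(df)) {
  sel <- df[i, cols, drop = FALSE]           # still a one-row data frame
  if (all(vapply(sel, is.character, logical(1)))) {
    list(kind = "character",
         values = unlist(sel, use.names = FALSE))
  } else if (all(vapply(sel, is.numeric, logical(1)))) {
    list(kind = "double",
         values = as.double(unlist(sel, use.names = FALSE)))
  } else {
    stop("selected columns are neither all character nor all numeric")
  }
}

# Made-up example data in the spirit of the credit file.
chr_df <- data.frame(NAME = "Hope Gorman", ZIP = "99975",
                     stringsAsFactors = FALSE)
num_df <- data.frame(INCOME = 3212, EXPENSES = 1124)

r_chr <- row_for_dll(chr_df, 1)
r_num <- row_for_dll(num_df, 1)
r_chr$kind
r_num$kind
```

The point of the design is that the decision is made from the column types, and the row is flattened to an atomic vector before anything is passed on, rather than relying on is.character() applied to a data frame.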

CodePudding user response:

The behaviour of is.character() that you describe doesn't seem to have changed. What you describe as the behavior for version 3.5.2 doesn't seem to be correct, as can be seen in my attempt to replicate it below.

As mentioned by @guasi, the class of your data frame raw is just data.frame, so is.character(raw) is FALSE regardless of what the columns hold. The class of an individual column, e.g. raw$NAME, can be character.
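A small illustration of that distinction, using a made-up two-column data frame (df is not your raw object, just a stand-in with the same kind of columns):

```r
# Stand-in data frame; stringsAsFactors is set explicitly so the result
# does not depend on the R version's default.
df <- data.frame(NAME = c("Hope Gorman", "Sarah Coriano"),
                 ZIP  = c("99975", "99904"),
                 stringsAsFactors = FALSE)

class(df)               # "data.frame"
is.list(df)             # TRUE: a data frame is a list of columns
is.character(df)        # FALSE
is.character(df[1, ])   # FALSE: a one-row subset is still a data frame
is.character(df$NAME)   # TRUE: a single column is an atomic vector

# One way to ask "are all columns character?"
all_chr <- all(vapply(df, is.character, logical(1)))
all_chr                 # TRUE
```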

It is possible that you had/have a custom handler for is.character() to handle data frames on the 3.5.2 setup, and you didn't copy it to the new version:

https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/character

as.character and is.character are generic: you can write methods to handle specific classes of objects, see InternalMethods

You mention that the same code on the old 3.5.2 install still calls the correct C code. Have you checked for yourself what is.character(raw) returns on that machine?

Reproduction attempt

I have installed R 3.5.2 64 bit on Win 7, and couldn't reproduce what you remember from the past:

File: test.csv, created from example data, but tab delimited:

Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column).

Environment:

the original development environment was 64-bit R 3.5.2 on 64-bit Win 7

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

Command:

simply reading the file into a table (WITHOUT needing to add colClasses = "character") simply worked.

> raw <- read.table('C:\\Users\\caspar\\Desktop\\test.csv', header=T, sep='\t')
> str(raw)
'data.frame':   3 obs. of  18 variables:
 $ NAME          : Factor w/ 3 levels "Ernest Farmer",..: 2 3 1
 $ ADDRESS       : Factor w/ 3 levels "179 Del Mar Blvd.",..: 1 3 2
 $ ZIP           : int  99975 99904 99900
 $ SSN           : Factor w/ 3 levels "129-21-0468",..: 3 2 1
 $ SEX           : Factor w/ 2 levels "F","M": 1 1 2
 $ MARITALSTATUS : Factor w/ 1 level "M": 1 1 1
 $ CHILDREN      : int  2 1 0
 $ OCCUPATION    : Factor w/ 2 levels "Professional",..: 1 2 2
 $ HOMEOWNERSHIP : Factor w/ 1 level "O": 1 1 1
 $ INCOME        : int  3212 3145 3165
 $ EXPENSES      : int  1124 1100 1266
 $ CHECKING      : Factor w/ 1 level "N": 1 1 1
 $ SAVINGS       : Factor w/ 1 level "Y": 1 1 1
 $ MSTRCARD      : int  1 1 1
 $ VISA          : int  5 5 5
 $ AMEX          : int  0 0 0
 $ MERCHANT      : int  9 9 9
 $ PAYMENTHISTORY: int  2 0 2
> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE

(The spacing between commands is mine.)

Here I assumed you meant the same command you used before, but without the colClasses parameter. With colClasses = "character", is.character() still returns FALSE. It's more common to use read.csv(), but that yielded the same results as well.
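For anyone who wants to try this without the file, here is a minimal sketch that feeds a three-column excerpt of the example data to read.table() through its text argument. The excerpt and the variable names are mine; stringsAsFactors is set explicitly because its default changed from TRUE to FALSE in R 4.0.0.

```r
# Three columns from the example data, tab-delimited, passed inline.
txt <- "NAME\tZIP\tINCOME
Hope Gorman\t99975\t3212
Sarah Coriano\t99904\t3145
Ernest Farmer\t99900\t3165"

# Everything forced to character, as in the question.
raw_chr <- read.table(text = txt, header = TRUE, sep = "\t",
                      colClasses = "character")

# Pre-4.0 style defaults: strings become factors, numbers become integers.
raw_fac <- read.table(text = txt, header = TRUE, sep = "\t",
                      stringsAsFactors = TRUE)

vapply(raw_chr, class, character(1))  # all "character"
vapply(raw_fac, class, character(1))  # "factor", "integer", "integer"

is.character(raw_chr)                 # FALSE in both cases:
is.character(raw_fac)                 # the object itself is a data frame
```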
