filter() (dplyr) does not distinguish between character and number?

Time:03-12

I am using filter() from the dplyr package on this dataset. It contains a variable called "depth_m", which was numeric. I converted it to character with sapply (see code below) without any problems.

The variable is now character. However, when I filter the dataset on "depth_m", I get the same result whether I compare with == "20" (a character) or == 20 (a number). Shouldn't I get an error when filtering by a number (== 20)?

Here is my code:

data <- read.table("env.txt", sep = "\t", header = TRUE)

class(data$depth_m)

Output:

[1] "integer"
# Variable transformation

data$depth_m <- sapply(data$depth_m, as.character)
class(data$depth_m)

Output:

[1] "character"

To check the values:

data$depth_m

Output:

[1] "1000" "500"  "20"   "1"    "1000" "500"  "20"   "1"    "1000" "320"  "1"    "20"   "1"   
[14] "20"   "1"    "120"  "20"   "20"   "365"  "20"   "1"    "375"  "20"   "1"    "1000" "500" 
[27] "20"   "1"    "200"  "20"   "1"    "1000" "500"  "25"   "1"    "1000" "500"  "25"   "1"   
[40] "20"   "300"  "20"   "1000" "20"  

Here I filter with a character value. I expected to get a subset, because "20" is a character and it exists in the converted column.

y <- filter(data,  depth_m == "20") %>%
  select(env_sample, depth_m)
head(y)

Output:

    env_sample depth_m
1 Jan_B16_0020      20
2 Jan_B08_0020      20
3 Mar_M03_0020      20
4 Mar_M04_0020      20
5 Mar_M05_0020      20
6 Mar_M06_0020      20

Here I filter again, this time with a number. I did not expect to get a subset, because 20 is a number and the column no longer contains numbers.

y1 <- filter(data, depth_m == 20) %>%
  select(env_sample, depth_m)
head(y1)

Output:

    env_sample depth_m
1 Jan_B16_0020      20
2 Jan_B08_0020      20
3 Mar_M03_0020      20
4 Mar_M04_0020      20
5 Mar_M05_0020      20
6 Mar_M06_0020      20

Any comment will be helpful. Thank you.

CodePudding user response:

In R, the expression 20 == "20" is valid, though people coming from other programming languages might consider it a little "sloppy". When it is evaluated, R up-classes (coerces) the number 20 to the string "20" for the comparison. This silent casting can be useful and flexible, but it can also cause unintended or surprising results. (The fact that it is silent is what I dislike about it, but convenience is convenience.)
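As a rough illustration of that coercion, the base R lines below compare a number with a few strings. The depth_m vector here is a made-up stand-in for the column in the question, not the real data.

```r
# Comparing a number with a string: R converts the number to character first.
20 == "20"            # TRUE: 20 becomes "20"
20 == "20.0"          # FALSE: the string "20" differs from "20.0"
20L == "20"           # TRUE: integers are coerced the same way

# identical() checks type as well as value, so nothing is coerced.
identical(20, "20")   # FALSE

# The same coercion applies element-wise to a character vector
# (hypothetical values, assumed for this sketch).
depth_m <- c("1000", "500", "20", "1")
depth_m == 20         # FALSE FALSE TRUE FALSE
```

The "20.0" case shows why the silent conversion can surprise: the comparison is done on strings, not on numeric values.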

If you want to be explicit about your comparison, you can test for class as well. In your example, you show 20, which is numeric and not technically integer (that would be 20L), but you can shape the precision of the conditional to your own tastes:

filter(data, is.numeric(depth_m) & depth_m == 20)

This will still up-class the 20 to "20", but because the first part, is.numeric(.), is FALSE, the combination of the two is FALSE as well. Realize that this test is all-or-nothing: if the column is indeed character, then you will always get zero rows, which may not be what you want. If instead you want to remove non-20 rows only when the column is numeric, then perhaps

filter(data, !is.numeric(depth_m) | depth_m == 20)

This goes down the dizzying logic of "if it is not numeric, then it obviously cannot truly be 20, so keep it ... but if it is numeric, make sure it is definitely 20". Of course, this rests on the premise that one portion of a column cannot be numeric while another is not (a column has a single type), so ... perhaps that's over-indulging the specificity of filtering.
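To see how the three conditions behave side by side, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only, with depth_m stored as character as in the question.

```r
library(dplyr)

# Hypothetical toy data standing in for env.txt; depth_m is character.
toy <- data.frame(
  env_sample = c("Jan_B16_0020", "Jan_B16_0500", "Mar_M03_0020"),
  depth_m = c("20", "500", "20"),
  stringsAsFactors = FALSE
)

# Plain comparison: 20 is coerced to "20", so the two matching rows are kept.
plain <- filter(toy, depth_m == 20)

# Guarded comparison: the column is not numeric, so no row passes.
strict <- filter(toy, is.numeric(depth_m) & depth_m == 20)

# Inverted guard: the column is not numeric, so every row passes.
lenient <- filter(toy, !is.numeric(depth_m) | depth_m == 20)

nrow(plain)    # 2
nrow(strict)   # 0
nrow(lenient)  # 3
```

Because is.numeric() returns a single TRUE or FALSE for the whole column, the guard switches the filter on or off for every row at once.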
