Forgive me for asking what might be a simple question. Perhaps I am misunderstanding how the curly brackets {} work specifically in R, but I am seeing some odd behavior, likely due to my own misunderstanding, and wanted to reach out to the community so I can understand my programming better. I am also not sure why I am seeing the is.na call return an inappropriate result.
I have several columns of data with a number of NAs in one or more columns. After removing the rows containing NAs in one column, I want to check the data to make sure I know how many rows are left and document that all the NAs are removed. I can do this in 3 separate lines, but am trying to use the pipe operator for simplicity.
library(magrittr)
library(dplyr) #count() used below comes from dplyr, not magrittr
df <- data.frame(a=rnorm(10, 3, 5), #create a quick data frame without any na values
b=rnorm(10, -3, 5))
df %>% head() #works
df %>% count() #works
df %>% sum(is.na()) #doesn't work - error
#Error in is.na() : 0 arguments passed to 'is.na' which requires 1
df %>% sum(is.na(.)) #returns random number (perhaps sum of all values) instead of zero??
Perhaps a separate question, but why doesn't the first one work, and why does the second one not evaluate the is.na argument the way I expect? If I put curly braces around the third call, it returns the correct value:
df %>% { #works, but why is this different?
sum(is.na(.))
}
#[1] 0
Now when I try to evaluate all 3, I don't understand the behavior I see:
df %>% { #doesn't work - error
head()
count()
sum(is.na())
}
# Error in checkHT(n, dx <- dim(x)) :
# argument "x" is missing, with no default
df %>% { #returns appropriate na count of zero, but nothing else is evaluated
head(.)
count(.)
sum(is.na(.))
}
# [1] 0
df %>% { #returns first and third result, but not count(.)
print(head(.))
count(.)
sum(is.na(.))
}
# a b
# 1 0.3555877 -7.29064483
# 2 -2.6278037 4.30943634
# 3 5.6163705 -10.31436769
# 4 -2.8920773 -4.83949384
# 5 9.0941861 -0.09287319
# 6 2.6118720 -11.86665105
# [1] 0
df %>% { #returns all three like I want
print(head(.))
print(count(.))
sum(is.na(.))
}
# a b
# 1 0.3555877 -7.29064483
# 2 -2.6278037 4.30943634
# 3 5.6163705 -10.31436769
# 4 -2.8920773 -4.83949384
# 5 9.0941861 -0.09287319
# 6 2.6118720 -11.86665105
# n
# 1 10
# [1] 0
Thanks for any advice in how to interpret this behavior so I can improve my code for next time.
Answer:
The %>% pipe passes the left-hand side to the function on the right-hand side as its first argument, so think of it like this:
head(df)
# is the same as
df %>% head()
However, if you pass multiple things, you may run into a problem:
head(df)
count(df)
# is not the same as
df %>% head() %>% count()
In the above, R first evaluates head(df), then counts the rows of that result, so it returns a value of 6.
This is why your pipes are not returning what you expect.
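As a rough check of the nesting described above, the sketch below compares the chained pipe with the equivalent nested call. It assumes dplyr is installed (count() comes from dplyr, which also re-exports %>%); the seed is arbitrary and only there for reproducibility.

```r
library(dplyr)  # provides count() and re-exports %>%

set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(a = rnorm(10, 3, 5), b = rnorm(10, -3, 5))

# A chain nests the calls: count() receives head(df), not df
chained <- df %>% head() %>% count()
nested  <- count(head(df))

chained$n    # 6, because head() keeps only the first six rows
nested$n     # 6, the same call written without the pipe
count(df)$n  # 10, the row count of the full data frame
```

To get both results from the original data frame, each call has to start from df rather than from the previous step's output.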
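In addition, sum(is.na(df)) returns 0 because is.na(df) is FALSE everywhere (there are no NA values), and when you sum logical values FALSE counts as 0 and TRUE counts as 1. Your piped version, df %>% sum(is.na(.)), is a different call: the dot only appears inside a nested call, so magrittr still inserts df as the first argument and evaluates sum(df, is.na(df)), which adds up every value in df. The unpiped version looks like this: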
is.na(df)
# a b
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE FALSE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
# [10,] FALSE FALSE
# so
sum(is.na(df))
# [1] 0
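To see the counting at work on data that does contain missing values, here is a small sketch. The data frame df_na is hypothetical and made up purely for illustration; only base R is used.

```r
# Hypothetical data frame with two missing values, for illustration only
df_na <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

na_total  <- sum(is.na(df_na))      # each TRUE counts as 1
na_by_col <- colSums(is.na(df_na))  # NA count per column

na_total   # 2
na_by_col  # a = 1, b = 1
```

colSums() is handy when you want to document which column the remaining NAs are in rather than just the total.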
It may be simplest to wrap what you want in a function and return everything in a list:
example_function <- function(x){
list(head(x), count(x), sum(is.na(x)))
}
example_function(df)
# [[1]]
# a b
# 1 0.1976218 3.1204090
# 2 1.8491126 -1.2009309
# 3 10.7935416 -0.9961427
# 4 3.3525420 -2.4465864
# 5 3.6464387 -5.7792057
# 6 11.5753249 5.9345657
#
# [[2]]
# n
# 1 10
#
# [[3]]
# [1] 0
Answer:
This stems from aspects of how braces behave both in R generally and in magrittr.
First, why does df %>% sum(is.na(.)) return an unexpected number, while df %>% {sum(is.na(.))} works as you expect? By default, %>% passes the left-hand side as the first argument of the function on the right-hand side, even when the . placeholder also appears inside a nested call. So df %>% sum(is.na(.)) is equivalent to sum(df, is.na(df)), which adds every value in df to the count of NAs; that is why you get the sum of your data rather than 0. However, per the magrittr docs, this "behavior can be overruled by enclosing the right-hand side in braces," so the lhs is only inserted where you explicitly add the . placeholder. So df %>% {sum(is.na(.))} is equivalent to sum(is.na(df)).
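The difference can be checked directly with a minimal sketch. The data frame df_small is hypothetical, with values chosen so the sums are easy to verify by hand.

```r
library(magrittr)

# Hypothetical data frame with known values and no NAs
df_small <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Without braces, the lhs is still inserted as the first argument:
# this is evaluated as sum(df_small, is.na(df_small))
unbraced <- df_small %>% sum(is.na(.))

# With braces, the lhs goes only where the dot is:
# this is evaluated as sum(is.na(df_small))
braced <- df_small %>% { sum(is.na(.)) }

unbraced  # 21, the sum of all six values plus zero NAs
braced    # 0, the number of NAs
```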
Second, in
df %>% { #returns all three like I want
print(head(.))
print(count(.))
sum(is.na(.))
}
why do you have to wrap head(.) and count(.) in print(), but not sum()? This is because, per the R docs, expressions wrapped in { return "the result of the last expression evaluated." So the result of sum(is.na(.)) is returned and auto-printed, but the results of the prior expressions are evaluated and then discarded, so they must be print()ed explicitly.
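The same rule can be seen in plain R without any pipe. This is only an illustrative sketch with made-up arithmetic:

```r
# A braced block evaluates every expression but returns only the last value
result <- {
  10 + 1  # evaluated, value discarded
  20 + 2  # evaluated, value discarded
  30 + 3  # becomes the value of the whole block
}

result  # 33
```

Inside the piped block, head(.) and count(.) play the role of the first two lines: they run, but nothing displays them unless you call print() yourself.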