Proper syntax for 'curly' brackets after the pipe operator in R

Forgive me if this is a simple question. I may be misunderstanding how curly brackets {} work in R, but I am seeing some odd behavior and would like to understand it so I can improve my programming. I am also not sure why my is.na call returns an unexpected result.

I have several columns of data with a number of NA values in one or more columns. After removing the rows that contain an NA in one column, I want to check how many rows are left and document that all the NAs are gone. I can do this in 3 separate lines, but am trying to use the pipe operator for simplicity.

library(magrittr)
library(dplyr)       #count() comes from dplyr, not magrittr

df <- data.frame(a=rnorm(10, 3, 5),   #create a quick data frame without any na values
                 b=rnorm(10, -3, 5))
df %>% head()        #works
df %>% count()       #works
df %>% sum(is.na())  #doesn't work - error
#Error in is.na() : 0 arguments passed to 'is.na' which requires 1

df %>% sum(is.na(.)) #returns random number (perhaps sum of all values) instead of zero??

Perhaps a separate question, but why doesn't the first one work, and why does the second one seem to ignore the is.na call? If I put curly braces around the third expression, it returns the correct value:

df %>% {             #works, but why is this different?
  sum(is.na(.))
}
#[1] 0

Now when I try and evaluate all 3, I don't understand the behavior I see:

df %>% {             #doesn't work - error
  head()
  count()
  sum(is.na())
}
# Error in checkHT(n, dx <- dim(x)) : 
#   argument "x" is missing, with no default

df %>% {             #returns appropriate na count of zero, but nothing else is evaluated
  head(.)
  count(.)
  sum(is.na(.))
}
# [1] 0

df %>% {             #returns first and third result, but not count(.)
  print(head(.))
  count(.)
  sum(is.na(.))
}
#    a           b
# 1  0.3555877  -7.29064483
# 2 -2.6278037   4.30943634
# 3  5.6163705 -10.31436769
# 4 -2.8920773  -4.83949384
# 5  9.0941861  -0.09287319
# 6  2.6118720 -11.86665105

# [1] 0

df %>% {             #returns all three like I want
  print(head(.))
  print(count(.))
  sum(is.na(.))
}
#    a           b
# 1  0.3555877  -7.29064483
# 2 -2.6278037   4.30943634
# 3  5.6163705 -10.31436769
# 4 -2.8920773  -4.83949384
# 5  9.0941861  -0.09287319
# 6  2.6118720 -11.86665105

#   n
# 1 10

# [1] 0

Thanks for any advice in how to interpret this behavior so I can improve my code for next time.

CodePudding user response：

The %>% pipe passes the left hand side to the right hand side, so think of it like this:

head(df)
# is the same as 
df %>% head()

However, if you chain several calls, the result is not the same as running them separately:

head(df) 
count(df) 

# is not the same as 

df %>% head() %>% count()

In the piped version, R first evaluates head(df) and then passes that result to count(), so it counts the 6 rows kept by head() rather than the 10 rows of df.

This is why your pipes are not returning what you expect.
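
The nesting described above can be checked with a small sketch. It assumes only magrittr is installed, so base nrow() stands in for count(), and df_demo is an illustrative data frame rather than the asker's random data.

```r
library(magrittr)

# Illustrative data frame with 10 rows (not the asker's random data)
df_demo <- data.frame(a = 1:10, b = 11:20)

# A chain of pipes is a nested call: each step receives the previous result
chained <- df_demo %>% head() %>% nrow()
nested  <- nrow(head(df_demo))

chained          # 6: head() keeps the first 6 rows, and nrow() counts those
nested           # 6
nrow(df_demo)    # 10: counting the full data frame needs its own call
```

Each step in a chain only ever sees the output of the step before it, which is why independent checks on the same data frame cannot simply be strung together with %>%.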

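In addition, sum(is.na(df)) returns 0 because is.na() is FALSE for every cell (there are no NA values), and summing logical values treats FALSE as 0 and TRUE as 1. Note that df %>% sum(is.na(.)) is not the same call: the pipe still inserts df as the first argument, so it runs sum(df, is.na(df)) and adds up every value in df.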

is.na(df)
#          a     b
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE FALSE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
# [10,] FALSE FALSE

# so 
sum(is.na(df))
# [1] 0

One convenient option is to wrap the calls you want in a function that returns everything in a list:

example_function <- function(x){
  list(head(x), count(x), sum(is.na(x)))
}

example_function(df)

# [[1]]
# a          b
# 1  0.1976218  3.1204090
# 2  1.8491126 -1.2009309
# 3 10.7935416 -0.9961427
# 4  3.3525420 -2.4465864
# 5  3.6464387 -5.7792057
# 6 11.5753249  5.9345657
# 
# [[2]]
# n
# 1 10
# 
# [[3]]
# [1] 0

CodePudding user response：

This stems from aspects of how braces behave both in R generally and in magrittr.

First, why does df %>% sum(is.na(.)) return an unexpectedly large number, while df %>% {sum(is.na(.))} works as you expect? By default, %>% passes the left-hand side as the first argument of the function on the right-hand side, even when the . placeholder also appears inside a nested call. So df %>% sum(is.na(.)) is equivalent to sum(df, is.na(df)), which adds every value in df to the NA count and explains the large number. However, per the magrittr docs, this "behavior can be overruled by enclosing the right-hand side in braces," so the lhs is only inserted where you explicitly add the . placeholder. So df %>% {sum(is.na(.))} is equivalent to sum(is.na(df)).
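
The equivalence described above can be checked directly. This is a minimal sketch using a small fixed data frame (df_demo is an illustrative name) so that the sums are predictable.

```r
library(magrittr)

# Illustrative data frame with no NA values; its cells add up to 21
df_demo <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Without braces, df_demo is still inserted as the first argument of sum()
no_braces <- df_demo %>% sum(is.na(.))
no_braces                                # 21, the same as the explicit call below
sum(df_demo, is.na(df_demo))             # 21

# With braces, df_demo is inserted only where the . placeholder appears
with_braces <- df_demo %>% { sum(is.na(.)) }
with_braces                              # 0, the same as sum(is.na(df_demo))
```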

Second, in

df %>% {             #returns all three like I want
  print(head(.))
  print(count(.))
  sum(is.na(.))
}

why do you have to wrap head(.) and count(.) in print(), but not sum()? This is because, per the R docs, expressions wrapped in { return "the result of the last expression evaluated." So the result of sum(is.na(.)) is returned and auto-printed, but the results of the prior expressions aren't returned, and must be print()ed.
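
The same rule can be seen in base R without any pipe. A small sketch (result and shown are illustrative names):

```r
# A braced block evaluates every expression but takes the value of the last one
result <- {
  10 + 1    # evaluated, value discarded
  20 + 2    # evaluated, value discarded
  30 + 3    # this value becomes the value of the whole block
}
result      # 33

# print() displays its argument as a side effect, so it shows up even mid-block
shown <- {
  print("first")   # prints [1] "first"
  "last"
}
shown       # "last"
```

At the console, only the final value of the block is auto-printed, which is why the intermediate head(.) and count(.) results disappear unless they are printed explicitly.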