What is the best way to handle potentially missing columns when summarizing?-CodePudding

A financial statement is a good illustration of this issue. Here is an example dataframe:

df <- data.frame(   date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
                    category = sample(c('a','b', 'c'), 10, replace=TRUE),
                    direction = sample(c('credit', 'debit'), 10, replace=TRUE),
                    value = sample(0:25, 10, replace = TRUE) )

I want to produce a summary table with incoming, outgoing and total columns for each category.

df %>% 
    pivot_wider(names_from = direction, values_from = value) %>% 
    group_by(category) %>% 
    summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>% 
    mutate(total= incoming-outgoing)

In most cases this works perfectly with the example dataframe above.

But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.

Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found

Given that I have no control over the dataframe, what is the best way to handle this?

I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.

...
summarize( outgoing = case_when(
    "debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE), 
    TRUE ~ 0 ) )
...

Have I made a syntax error, or am I going in completely the wrong direction with this?

CodePudding user response：

The issue happens only when one of the elements is presents i.e. 'credit' and no 'debit' or viceversa. Then, the pivot_wider doesn't create the column missing. Instead of pivoting and then summarising, do this directly with summarise and == i.e. if the 'debit' is absent, sum will take care of it by returning 0

library(dplyr)
df %>%  
   slice(-c(9:10)) %>% # just removed the 'debit' rows completely
   group_by(category) %>% 
   summarise(total  = sum(value[direction == 'credit']) - 
          sum(value[direction == "debit"]))

-output

# A tibble: 3 × 2
  category total
  <chr>    <int>
1 a           15
2 b           30
3 c           63

With pivot_wider, it is not the case

df %>% 
      slice(-c(9:10)) %>%
     pivot_wider(names_from = direction, values_from = value) 
# A tibble: 8 × 3
  date       category credit
  <date>     <chr>     <int>
1 2020-07-25 c            19
2 2020-05-09 b            15
3 2020-08-27 a            15
4 2020-03-27 b            15
5 2020-04-06 c             6
6 2020-07-06 c            11
7 2020-09-22 c            25
8 2020-10-06 c             2

it creates only the 'credit' column, thus when we call a column 'debit' that is not created, it throws error

 df %>% 
      slice(-c(9:10)) %>%
      pivot_wider(names_from = direction, values_from = value)  %>%
      group_by(category) %>% 
      summarize(incoming = sum(credit, na.rm=TRUE), 
           outgoing=sum(debit,na.rm=TRUE) )

Error: Problem with summarise() column outgoing. ℹ outgoing = sum(debit, na.rm = TRUE). ✖ object 'debit' not found ℹ The error occurred in group 1: category = "a". Run rlang::last_error() to see where the error occurred.

In this case, we can do a complete to create some rows with debit as well which will have NA for other columns

library(tidyr)
df %>% 
   slice(-c(9:10)) %>%
   complete(category, direction = c("credit", "debit")) %>% 
   pivot_wider(names_from = direction, values_from = value) %>% 
   group_by(category) %>% 
   summarize(incoming = sum(credit, na.rm=TRUE), 
        outgoing=sum(debit,na.rm=TRUE) ) %>% 
   mutate(total= incoming-outgoing)
# A tibble: 3 × 4
  category incoming outgoing total
  <chr>       <int>    <int> <int>
1 a              15        0    15
2 b              30        0    30
3 c              63        0    63