Home > Back-end >  What is the best way to handle potentially missing columns when summarizing?
What is the best way to handle potentially missing columns when summarizing?

Time:10-06

A financial statement is a good illustration of this issue. Here is an example dataframe:

df <- data.frame(   date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
                    category = sample(c('a','b', 'c'), 10, replace=TRUE),
                    direction = sample(c('credit', 'debit'), 10, replace=TRUE),
                    value = sample(0:25, 10, replace = TRUE) )

I want to produce a summary table with incoming, outgoing and total columns for each category.

df %>% 
    pivot_wider(names_from = direction, values_from = value) %>% 
    group_by(category) %>% 
    summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>% 
    mutate(total= incoming-outgoing)

In most cases this works perfectly with the example dataframe above.

But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.

Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found

Given that I have no control over the dataframe, what is the best way to handle this?

I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.

...
summarize( outgoing = case_when(
    "debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE), 
    TRUE ~ 0 ) )
...

Have I made a syntax error, or am I going in completely the wrong direction with this?

CodePudding user response:

The issue happens only when one of the elements is presents i.e. 'credit' and no 'debit' or viceversa. Then, the pivot_wider doesn't create the column missing. Instead of pivoting and then summarising, do this directly with summarise and == i.e. if the 'debit' is absent, sum will take care of it by returning 0

library(dplyr)
df %>%  
   slice(-c(9:10)) %>% # just removed the 'debit' rows completely
   group_by(category) %>% 
   summarise(total  = sum(value[direction == 'credit']) - 
          sum(value[direction == "debit"])) 

-output

# A tibble: 3 × 2
  category total
  <chr>    <int>
1 a           15
2 b           30
3 c           63

With pivot_wider, it is not the case

df %>% 
      slice(-c(9:10)) %>%
     pivot_wider(names_from = direction, values_from = value) 
# A tibble: 8 × 3
  date       category credit
  <date>     <chr>     <int>
1 2020-07-25 c            19
2 2020-05-09 b            15
3 2020-08-27 a            15
4 2020-03-27 b            15
5 2020-04-06 c             6
6 2020-07-06 c            11
7 2020-09-22 c            25
8 2020-10-06 c             2

it creates only the 'credit' column, thus when we call a column 'debit' that is not created, it throws error

 df %>% 
      slice(-c(9:10)) %>%
      pivot_wider(names_from = direction, values_from = value)  %>%
      group_by(category) %>% 
      summarize(incoming = sum(credit, na.rm=TRUE), 
           outgoing=sum(debit,na.rm=TRUE) )

Error: Problem with summarise() column outgoing. ℹ outgoing = sum(debit, na.rm = TRUE). ✖ object 'debit' not found ℹ The error occurred in group 1: category = "a". Run rlang::last_error() to see where the error occurred.

In this case, we can do a complete to create some rows with debit as well which will have NA for other columns

library(tidyr)
df %>% 
   slice(-c(9:10)) %>%
   complete(category, direction = c("credit", "debit")) %>% 
   pivot_wider(names_from = direction, values_from = value) %>% 
   group_by(category) %>% 
   summarize(incoming = sum(credit, na.rm=TRUE), 
        outgoing=sum(debit,na.rm=TRUE) ) %>% 
   mutate(total= incoming-outgoing)
# A tibble: 3 × 4
  category incoming outgoing total
  <chr>       <int>    <int> <int>
1 a              15        0    15
2 b              30        0    30
3 c              63        0    63
  • Related