Home > Blockchain >  Is there a way to write this in a single Dplyr statement / more efficiently?
Is there a way to write this in a single Dplyr statement / more efficiently?

Time:07-17

My (simplified) dataset consists of donor occupation and contribution amounts. I'm trying to determine what the average contribution amount by occupation is (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements -- one to get a sum of contributions amount by each occupation and another to get a count of the number of donations from that specific occupation. I am then binding the dataframes with cbind and creating a new column with mutate, where I can divide the sum by the count.

Data example:

contributor_occupation contribution_receipt_amount
1                    LISTING COORDINATOR                        5.00
2                           NOT EMPLOYED                        2.70
3                                TEACHER                        2.70
4                    ELECTRICAL DESIGNER                        2.00
5                                STUDENT                       50.00
6                      SOFTWARE ENGINEER                       10.00
7                           TRUCK DRIVER                        2.70
8                           NOT EMPLOYED                       50.00
9                             CONTRACTOR                        5.00
10                              ENGINEER                        6.00
11                                FARMER                        2.70
12                                ARTIST                       50.00
13                         CIRCUS ARTIST                      100.00
14                         CIRCUS ARTIST                       27.00
15          INFORMATION SECURITY ANALYST                        2.00
16                                LAWYER                        5.00
occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))


occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column 

occ_avg <- final_occ %>% 
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total/n) %>%
  rename("Number of Donations"= n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))

This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:

avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
  arrange(desc(avg_donation_by_occupation))

Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?

Thanks for the help clearing up any confusion!

CodePudding user response:

We may need both sum and mean along with n() which gives the number of observations in the grouped data. According to ?context

n() gives the current group size.

and `?mean

mean - Generic function for the (trimmed) arithmetic mean.

which is basically the sum of observations divided by the number of observations

library(dplyr)
out <- b %>% 
  group_by(Occupation = contributor_occupation) %>% 
  summarise(`Total Donated` = sum(contribution_receipt_amount), 
      `Number of Donations` = n(),
   `Average Donation` = mean(contribution_receipt_amount),
   #or
   #`Average Donation` = `Total Donated`/`Number of Donations`,    
      .groups = 'drop') %>% 
   
  arrange(desc(`Average Donation`))

-output

out
# A tibble: 14 × 4
   Occupation                   `Total Donated` `Number of Donations` `Average Donation`
   <chr>                                  <dbl>                 <int>              <dbl>
 1 CIRCUS ARTIST                          127                       2               63.5
 2 ARTIST                                  50                       1               50  
 3 STUDENT                                 50                       1               50  
 4 NOT EMPLOYED                            52.7                     2               26.4
 5 SOFTWARE ENGINEER                       10                       1               10  
 6 ENGINEER                                 6                       1                6  
 7 CONTRACTOR                               5                       1                5  
 8 LAWYER                                   5                       1                5  
 9 LISTING COORDINATOR                      5                       1                5  
10 FARMER                                   2.7                     1                2.7
11 TEACHER                                  2.7                     1                2.7
12 TRUCK DRIVER                             2.7                     1                2.7
13 ELECTRICAL DESIGNER                      2                       1                2  
14 INFORMATION SECURITY ANALYST             2                       1                2  

data

b <- structure(list(contributor_occupation = c("LISTING COORDINATOR", 
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT", 
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR", 
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST", 
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5, 
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16"))
  • Related