Home > Enterprise >  Similliar aggregate function to unique/distinct in R Postgres Backend
Similliar aggregate function to unique/distinct in R Postgres Backend

Time:10-15

How does it work to aggregate a variable in a postgres db backend table to its unique value. For example i have the following table:

library(tidyverse)
library(dbplyr)

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

But neither unique() nor distinct() are doing the job. Any ideas how to achieve my desired outcome like so when i collect() the table before summarise:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
collect() %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

# A tibble: 1 x 2
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

CodePudding user response:

You can just add a group_by to your dplyr pipe:

> dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
    group_by(a) %>% 
    summarise(aggregatedSum = sum(b)) %>% 
    rename(aggregatedUnique = a) %>% 
    select(aggregatedSum, aggregatedUnique)

# Source:   lazy query [?? x 2]
# Database: sqlite 3.34.1 [:memory:]
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

If there are multiple distinct values in column a, this will return a row per value (with the sum of the b values that occur alongside them).

CodePudding user response:

I might be misinterpreting, but this seems like a grouping operation, where you might want the sum of b for each unique value of a. If so, then group_by(a):

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b))
# Source:   lazy query [?? x 2]
# # Database: sqlite 3.33.0 [:memory:]
#       a aggregatedSum
#   <dbl>         <dbl>
# 1     2             9

This is related to How to combine SELECT DISTINCT and SUM() in that I believe SQL does not let you sum(.) and distinct(.) in the same query. The unchanged query looks like:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  summarise(aggregatedSum = sum(b), 
          aggregatedUnique = distinct(a)) %>%
  show_query()
# <SQL>
# SELECT SUM(`b`) AS `aggregatedSum`, distinct(`a`) AS `aggregatedUnique`
# FROM `dbplyr_014`

whereas the updated query is

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b)) %>%
  show_query()
# <SQL>
# SELECT `a`, SUM(`b`) AS `aggregatedSum`
# FROM `dbplyr_016`
# GROUP BY `a`

which is aligned with the linked question/answer.

  • Related