How does it work to aggregate a variable in a postgres db backend table to its unique value. For example i have the following table:
library(tidyverse)
library(dbplyr)
dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b),
aggregatedUnique = unique(a))
But neither unique()
nor distinct()
are doing the job. Any ideas how to achieve my desired outcome like so when i collect() the table before summarise:
dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
collect() %>%
summarise(aggregatedSum = sum(b),
aggregatedUnique = unique(a))
# A tibble: 1 x 2
aggregatedSum aggregatedUnique
<dbl> <dbl>
1 9 2
CodePudding user response:
You can just add a group_by
to your dplyr
pipe:
> dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
group_by(a) %>%
summarise(aggregatedSum = sum(b)) %>%
rename(aggregatedUnique = a) %>%
select(aggregatedSum, aggregatedUnique)
# Source: lazy query [?? x 2]
# Database: sqlite 3.34.1 [:memory:]
aggregatedSum aggregatedUnique
<dbl> <dbl>
1 9 2
If there are multiple distinct values in column a
, this will return a row per value (with the sum of the b
values that occur alongside them).
CodePudding user response:
I might be misinterpreting, but this seems like a grouping operation, where you might want the sum of b
for each unique value of a
. If so, then group_by(a)
:
dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
group_by(a) %>%
summarise(aggregatedSum = sum(b))
# Source: lazy query [?? x 2]
# # Database: sqlite 3.33.0 [:memory:]
# a aggregatedSum
# <dbl> <dbl>
# 1 2 9
This is related to How to combine SELECT DISTINCT and SUM() in that I believe SQL does not let you sum(.)
and distinct(.)
in the same query. The unchanged query looks like:
dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b),
aggregatedUnique = distinct(a)) %>%
show_query()
# <SQL>
# SELECT SUM(`b`) AS `aggregatedSum`, distinct(`a`) AS `aggregatedUnique`
# FROM `dbplyr_014`
whereas the updated query is
dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
group_by(a) %>%
summarise(aggregatedSum = sum(b)) %>%
show_query()
# <SQL>
# SELECT `a`, SUM(`b`) AS `aggregatedSum`
# FROM `dbplyr_016`
# GROUP BY `a`
which is aligned with the linked question/answer.