Home > Blockchain >  Summing the values in a large df by day and type
Summing the values in a large df by day and type

Time:11-02

I have a large dataframe of this structure:

structure(list(Date = structure(c(18220, 18220, 18220, 
18955, 19110, 19110, 18514, 18514, 18892, 18647, 18647, 18528, 
18822, 18822, 18822, 18822, 18745, 18745, 18745), class = "Date"), 
    Type = c("Main", "Main", "Main", 
    "Main", "Secondary", "Tri-Annual", "Main", 
    "Main", "Tri-Annual", "Main", "Syndication", 
    "Main", "Tri-Annual", "Syndication", "Secondary", 
    "Main", "Main", "Tri-Annual", "Secondary"
    ), Value = c(4000, 2250, 2250, 1100, 1800, 12000, 
    8000, 9000, 10000, 6500, 7000, 6500, 7000, 4250, 5500, 2500, 
    6000, 6000, 4500), buckets = c("Long", "Long", "Long", 
    "Long", "Long", "Medium", "Medium", "Long", "Medium", "Long", 
    "Long", "Long", "Long", "Long", "Long", "Long", "Long", "Long", 
    "Long")), row.names = c(NA, -19L), class = c("tbl_df", "tbl", 
"data.frame"))

I'm trying to aggregate the data by Type and Date using:

df <- df %>% group_by(Date, Type) %>% dplyr::summarise(Aggregated = sum(df$Value))

But instead of giving the sum of each Type's Value on each day, it sums all the Values together (regardless of day) and posts it in the Aggregated column. Does anyone know what's wrong with this?

CodePudding user response:

Check the use of your variable call.

If you use df$Value as part of your summarise() call, the sum will be built on the vector of that object. This means you "leave" the pipe and look for an object (happens to be the same, i.e. df) and take its column Value), instead of summarising the grouped tibble/dataframe.

df %>% 
  group_by(Date, Type) %>% 
  summarise(N = n()                       # just to count we have several entries per Date and Type
         , aggregated_old = sum(df$Value) # you call the dataframe column
         , aggregated_new = sum(Value)    # do not use object df, just column name
)

# A tibble: 16 × 5
# Groups:   Date [9]
   Date       Type            N aggregated_old aggregated_new
   <date>     <chr>       <int>          <dbl>          <dbl>
 1 2019-11-20 Main            3         106150           8500
 2 2020-09-09 Main            2         106150          17000
 3 2020-09-23 Main            1         106150           6500
 4 2021-01-20 Main            1         106150           6500
 5 2021-01-20 Syndication     1         106150           7000
 6 2021-04-28 Main            1         106150           6000
 7 2021-04-28 Secondary       1         106150           4500
 8 2021-04-28 Tri-Annual      1         106150           6000
 9 2021-07-14 Main            1         106150           2500
10 2021-07-14 Secondary       1         106150           5500
11 2021-07-14 Syndication     1         106150           4250
12 2021-07-14 Tri-Annual      1         106150           7000
13 2021-09-22 Tri-Annual      1         106150          10000
14 2021-11-24 Main            1         106150           1100
15 2022-04-28 Secondary       1         106150           1800
16 2022-04-28 Tri-Annual      1         106150          12000
  • Related