Home > Blockchain >  Aggregate dataframe with many columns per week
Aggregate dataframe with many columns per week

Time:04-05

I have a large dataframe with many columns and a timestamp (see picture) What I am trying to do is to aggregate the data per week for all columns. Any suggestions?

Kind regards, Daniel

CodePudding user response:

Without a minimal reproducible example it's difficult to know exactly which function to use to summarize the data (sum, mean, median, etc.).

For now we'll assume that each row represents a day or some more granular unit (since the date column is called Timestamp and we can't see whether there are actual time values in the field).

We use a combination of tidyr, dplyr and lubridate to create a summarized data frame that sums the data in the columns.

First, we generated some raw data that is in a format similar to the data in the screen shot, and read it into R.

rawData <- "Timestamp,Var.2,Amazonas,Antioquia,Arauca
2022-01-01,0,0,0,1
2022-01-02,0,0,1,3
2022-01-03,0,1,1,2
2022-01-04,0,0,1,0
2022-01-05,0,2,0,0
2022-01-06,3,0,2,2
2022-01-07,2,3,0,2
2022-01-08,1,0,0,0
2022-01-09,0,1,3,0
2022-01-10,0,0,0,0
2022-01-11,0,2,0,5
2022-01-12,0,0,3,0
2022-01-13,0,3,0,4
2022-01-14,0,0,4,0
2022-01-15,0,0,0,3
2022-01-16,0,0,0,0
2022-01-17,0,3,0,0
2022-01-18,0,0,2,3
2022-01-19,0,0,0,0
2022-01-20,0,2,0,0
2022-01-21,0,0,5,2
2022-01-22,0,0,0,0
2022-01-23,0,1,0,0
2022-01-24,0,0,3,1
2022-01-25,0,1,0,1
2022-01-26,0,0,0,1
2022-01-27,0,2,0,0
2022-01-28,0,2,0,1
2022-01-29,0,0,1,0
2022-01-30,0,0,1,0"

df <- read.csv(text = rawData,
               colClasses = c("Date","numeric","numeric","numeric","numeric"))

Next, we load the required libraries. From lubridate package we'll use the year() and week() functions to group the data by week of the year.

library(lubridate)
library(tidyr)
library(dplyr)

Finally, we use tidyr::pivot_longer() to create long format tidy data where each row represents one day's observations for one column in the wide format data frame, create the Year and Week columns, and summarise() the remaining columns in the data frame.

df %>% pivot_longer(-Timestamp,names_to="Area") %>%
     mutate(Year = year(Timestamp),
            Week = week(Timestamp)) %>%
     group_by(Year,Week,Area) %>%
     summarise(summedValue = sum(value)) -> summarisedData

head(summarisedData)

...and the first few rows of output:

> head(summarisedData)
# A tibble: 6 × 4
# Groups:   Year, Week [2]
   Year  Week Area      summedValue
  <dbl> <dbl> <chr>           <dbl>
1  2022     1 Amazonas            6
2  2022     1 Antioquia           5
3  2022     1 Arauca             10
4  2022     1 Var.2               5
5  2022     2 Amazonas            6
6  2022     2 Antioquia          10
> 

CodePudding user response:

Couple of things:

[1] Never provide pictures of data; provide data.

[2] The answer depends on how you define "week". For instance, 2022-01-01 was a Saturday. The ISO 8601 standard defines a week to begin on Monday, so week 1 in 2022 started on Jan 3. Jan 1 & 2 were technically part of the last week in 2021. The week(...) function in lubridate counts in 7 day increments from whatever is the first day of the year. So obviously you will get different answers.

Here is a very simple way to do this using data.table, borrowing the minimum reproducible example kindly provided by @LenGreski (you should do this).

library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(year=year(Timestamp), week=isoweek(Timestamp))]

##    year week Var.2 Amazonas Antioquia Arauca
## 1: 2022   52     0        0         1      4
## 2: 2022    1     6        7         7      6
## 3: 2022    2     0        5         7     12
## 4: 2022    3     0        6         7      5
## 5: 2022    4     0        5         5      4

The first clause, setDT(df), converts df to a data.table. The remaining part applies the sum(...) function to a subset of the columns in the data.table referenced by .SD. The default subset is all columns except ones referenced in the by=... clause (so, all columns except Timestamp).

  • Related