I have a large dataframe with many columns and a timestamp (see picture) What I am trying to do is to aggregate the data per week for all columns. Any suggestions?
Kind regards, Daniel
CodePudding user response:
Without a minimal reproducible example it's difficult to know exactly which function to use to summarize the data (sum, mean, median, etc.).
For now we'll assume that each row represents a day or some more granular unit (since the date column is called Timestamp
and we can't see whether there are actual time values in the field).
We use a combination of tidyr
, dplyr
and lubridate
to create a summarized data frame that sums the data in the columns.
First, we generated some raw data that is in a format similar to the data in the screen shot, and read it into R.
rawData <- "Timestamp,Var.2,Amazonas,Antioquia,Arauca
2022-01-01,0,0,0,1
2022-01-02,0,0,1,3
2022-01-03,0,1,1,2
2022-01-04,0,0,1,0
2022-01-05,0,2,0,0
2022-01-06,3,0,2,2
2022-01-07,2,3,0,2
2022-01-08,1,0,0,0
2022-01-09,0,1,3,0
2022-01-10,0,0,0,0
2022-01-11,0,2,0,5
2022-01-12,0,0,3,0
2022-01-13,0,3,0,4
2022-01-14,0,0,4,0
2022-01-15,0,0,0,3
2022-01-16,0,0,0,0
2022-01-17,0,3,0,0
2022-01-18,0,0,2,3
2022-01-19,0,0,0,0
2022-01-20,0,2,0,0
2022-01-21,0,0,5,2
2022-01-22,0,0,0,0
2022-01-23,0,1,0,0
2022-01-24,0,0,3,1
2022-01-25,0,1,0,1
2022-01-26,0,0,0,1
2022-01-27,0,2,0,0
2022-01-28,0,2,0,1
2022-01-29,0,0,1,0
2022-01-30,0,0,1,0"
df <- read.csv(text = rawData,
colClasses = c("Date","numeric","numeric","numeric","numeric"))
Next, we load the required libraries. From lubridate
package we'll use the year()
and week()
functions to group the data by week of the year.
library(lubridate)
library(tidyr)
library(dplyr)
Finally, we use tidyr::pivot_longer()
to create long format tidy data where each row represents one day's observations for one column in the wide format data frame, create the Year
and Week
columns, and summarise()
the remaining columns in the data frame.
df %>% pivot_longer(-Timestamp,names_to="Area") %>%
mutate(Year = year(Timestamp),
Week = week(Timestamp)) %>%
group_by(Year,Week,Area) %>%
summarise(summedValue = sum(value)) -> summarisedData
head(summarisedData)
...and the first few rows of output:
> head(summarisedData)
# A tibble: 6 × 4
# Groups: Year, Week [2]
Year Week Area summedValue
<dbl> <dbl> <chr> <dbl>
1 2022 1 Amazonas 6
2 2022 1 Antioquia 5
3 2022 1 Arauca 10
4 2022 1 Var.2 5
5 2022 2 Amazonas 6
6 2022 2 Antioquia 10
>
CodePudding user response:
Couple of things:
[1] Never provide pictures of data; provide data.
[2] The answer depends on how you define "week". For instance, 2022-01-01 was a Saturday. The ISO 8601 standard defines a week to begin on Monday, so week 1 in 2022 started on Jan 3. Jan 1 & 2 were technically part of the last week in 2021. The week(...)
function in lubridate counts in 7 day increments from whatever is the first day of the year. So obviously you will get different answers.
Here is a very simple way to do this using data.table, borrowing the minimum reproducible example kindly provided by @LenGreski (you should do this).
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(year=year(Timestamp), week=isoweek(Timestamp))]
## year week Var.2 Amazonas Antioquia Arauca
## 1: 2022 52 0 0 1 4
## 2: 2022 1 6 7 7 6
## 3: 2022 2 0 5 7 12
## 4: 2022 3 0 6 7 5
## 5: 2022 4 0 5 5 4
The first clause, setDT(df)
, converts df
to a data.table. The remaining part applies the sum(...)
function to a subset of the columns in the data.table referenced by .SD
. The default subset is all columns except ones referenced in the by=...
clause (so, all columns except Timestamp
).