I have a database with a 142 columns with one called "Date" (of class POSIXct) that I'd like to make a new column from that groups consecutive dates together. Dates with more than 2 days separating one another are categorized into separate groups.
I'd also like to name the level of the group with the name of month the consecutive dates start in (For example: Jan. 3rd, 2018 -> Jan. 12th 2018 = group level called "January sampling event"; Feb 27th, 2018 -> March 1st, 2018 = group level called "February sampling event"; etc...).
I've seen very similar questions like Group consecutive dates in R and R: group dates that are next to each other, but just can't get it to work for my data.
EDIT: My data example (Last row shows dates separated by over a year are grouped together, for some reason)
> dput(df)
structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
My attempt:
df$group <- 1 c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
CodePudding user response:
Remove time from date time
It's hard to tell exactly what the problem is without seeing your data (or similar example data), but my guess is that the date time format (the 00:00:00 part) is messing up as.Date
One solution would be to extract just the date part and then try again with just the date part:
# here are your date times
date_time <- "2018-01-03 00:00:00"
# this looks for 4 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9,followed by a dash, followed by 2 digits between 0 and 9
date_pattern <- " ?([0-9]{4}-[0-9]{2}-[0-9]{2}) ?"
#need this library
library(stringr)
library(magrittr) #for pipes
#this pulls out text matching the pattern we specified in date pattern
date_new <- str_extract(date_time, date_pattern) %>%
str_squish() # this removes white space
# this is the new date without the time
date_new
# then we convert to as date
date_new <- as.Date(date_new)
See if converting your date column to just dates and then rerunning your grouping works.
If you have dates in different formats and need to adapt the regular expression, here's something about regular expressions: https://stackoverflow.com/a/49286794/16502170
Group dates
Let's start with an example data frame that contains a date column
# here's a bunch of example dates:
library(lubridate)
dates2 <- seq.Date(as.Date("2018-03-01"),by="days",length.out = 60)
#here's the dataframe
exampl_df <- data.frame(animals = rep(c("cats","dogs","rabbits"),20), dates=dates2,
numbers= rep(1:3,20))
Here's what it looks like:
head(exampl_df)
animals dates numbers
1 cats 2018-03-01 1
2 dogs 2018-03-02 2
3 rabbits 2018-03-03 3
4 cats 2018-03-04 1
5 dogs 2018-03-05 2
6 rabbits 2018-03-06 3
Then let's make a sequence of every day between the minimum and maximum date in the sequence. This step is important because there may be missing dates in our data that we still want counting towards the separation between days.
# this is a day by day sequence from the earliest day in your data to the latest day
date_sequence <- seq.Date(from = min(dates2),max(dates2),by="day")
Then let's make a sequence of numbers each repeated seven times. If you wanted to group every three days, you could change each to 3. Then the length.out= length(date_sequence) tells R to make this vector have as many entries as the min to max date sequence has:
# and then if you want a new group every seven days you can make this number sequence
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
Then let's attach the groups to the date_sequence to make a grouping index
date_grouping_index <- data.frame(a=date_sequence,b=groups)
then you can do a join to attach the groups to the original dataframe
library(dplyr)
example_df 2 <- exampl_df %>%
inner_join(date_grouping_index, by=c("dates"="a"))
This is what we get:
head(example_df2,n=10)
animals dates numbers b
1 cats 2018-03-01 1 1
2 dogs 2018-03-02 2 1
3 rabbits 2018-03-03 3 1
4 cats 2018-03-04 1 1
5 dogs 2018-03-05 2 1
6 rabbits 2018-03-06 3 1
7 cats 2018-03-07 1 1
8 dogs 2018-03-08 2 2
9 rabbits 2018-03-09 3 2
10 cats 2018-03-10 1 2
Then you should be able to group_by()
or aggregate()
your data using column b
Using the data provided in the question
#original data
df <- structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
#plus extra step
df$group2 <- 1 c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Method described above
date_sequence <- seq.Date(from = min(df$Date),max(df$Date),by="day")
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
date_grouping_index <- data.frame(a=date_sequence,groups=groups)
example_df2<- df %>%
inner_join(date_grouping_index, by=c("Date"="a"))
Looks like it worked?
example_df2
# A tibble: 9 x 4
Date group group2 groups
<date> <dbl> <dbl> <int>
1 2018-01-03 1 1 1
2 2018-01-04 1 1 1
3 2018-01-05 1 1 1
4 2018-01-06 1 1 1
5 2020-01-18 2 2 107
6 2020-01-19 2 2 107
7 2020-01-20 2 2 107
8 2020-01-21 2 2 107
9 2019-02-05 2 2 57
Here's something you could do to make group names with the date and year in them:
example_df2$group_name <- paste0("sampling number ",
example_df2$groups,
" (",
month.name[month(example_df2$Date)],
"-",
year(example_df2$Date),
")")