Issue:
I have a data frame that has 16000 rows, and I am going to use this data to plot a boat track line in QGIS. One of the columns is called ' Course', which is the subsequent number order that the boat went out per field season.
For instance, course 1 ranges from 6/16/17 to 8/13/17. The boat track data frame has five columns for ID, date, time, course, latitude and longitude. These values were recorded every second throughout each day, which generally ranged from 6 am to 6 pm. Therefore, for each field course, there are hundreds of rows. Altogether there are 11 courses over a 5 year time period (2016-2018).
Is there any way that the data can be filtered per row by date to input a repeating number such as the value '2' (course 2) in the 'Course' column for the dates 8/14/17 to 10/16/17, and the repeating value of '3' (course 3) for the dates of 10/17/17 to 03/01/18, and so forth?
Unfortunately, I cannot share my data and I cannot find public any data that resembles my data frame that I can supply for this question using the function dput().
I have been using packages such as dplyr and tidyverse to try to solve this conundrum and I have so far been unsuccessful in finding a solution.
Would anyone be able to help?
Many thanks in advance.
CodePudding user response:
As mentioned, it is helpful to make up example data for something like this. Here, I created 2 data.frames. One will have your data (16k rows), and the second will have your filter criteria (e.g., which course and date range you want to keep).
df1 <- data.frame(
ID = 1:10,
date = seq.Date(as.Date("2016-01-01"), as.Date("2019-01-01"), by = 120),
course = rep(1:5, each = 2)
)
df1
ID date course
1 1 2016-01-01 1
2 2 2016-04-30 1
3 3 2016-08-28 2
4 4 2016-12-26 2
5 5 2017-04-25 3
6 6 2017-08-23 3
7 7 2017-12-21 4
8 8 2018-04-20 4
9 9 2018-08-18 5
10 10 2018-12-16 5
df2 <- data.frame(
course = c(2, 3),
start_date = as.Date(c("2016-01-01", "2017-05-09")),
end_date = as.Date(c("2016-12-01", "2018-09-09"))
)
df2
course start_date end_date
1 2 2016-01-01 2016-12-01
2 3 2017-05-09 2018-09-09
They don't have all the columns, but hopefully this will give you the idea.
In my example, I would be filtering df1
where:
- course is 2, and date falls between 1/1/16 and 12/1/16
- course is 3, and date falls between 5/9/17 and 9/9/18
Once you have that, you can try using fuzzyjoin
package to merge the two, using fuzzy_semi_join
. It will keep rows in the first data frame that match the criteria in the second data frame.
library(fuzzyjoin)
fuzzy_semi_join(
df1,
df2,
by = c("course", "date" = "start_date", "date" = "end_date"),
match_fun = c(`==`, `>=`, `<=`)
)
Output
ID date course
3 3 2016-08-28 2
6 6 2017-08-23 3
In this case, only two rows meet those criteria.
As an alternative, you can merge using the data.table
package. This may be a faster solution. It should give you the same result.
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, .(ID, x.date, course), on = .(course, date >= start_date, date <= end_date)]