R: Filtering and adding a repeating identical value in a column for hundreds of rows in a dataframe-CodePudding

Issue:

I have a data frame that has 16000 rows, and I am going to use this data to plot a boat track line in QGIS. One of the columns is called ' Course', which is the subsequent number order that the boat went out per field season.

For instance, course 1 ranges from 6/16/17 to 8/13/17. The boat track data frame has five columns for ID, date, time, course, latitude and longitude. These values were recorded every second throughout each day, which generally ranged from 6 am to 6 pm. Therefore, for each field course, there are hundreds of rows. Altogether there are 11 courses over a 5 year time period (2016-2018).

Is there any way that the data can be filtered per row by date to input a repeating number such as the value '2' (course 2) in the 'Course' column for the dates 8/14/17 to 10/16/17, and the repeating value of '3' (course 3) for the dates of 10/17/17 to 03/01/18, and so forth?

Unfortunately, I cannot share my data and I cannot find public any data that resembles my data frame that I can supply for this question using the function dput().

I have been using packages such as dplyr and tidyverse to try to solve this conundrum and I have so far been unsuccessful in finding a solution.

Would anyone be able to help?

Many thanks in advance.

CodePudding user response：

As mentioned, it is helpful to make up example data for something like this. Here, I created 2 data.frames. One will have your data (16k rows), and the second will have your filter criteria (e.g., which course and date range you want to keep).

df1 <- data.frame(
  ID = 1:10,
  date = seq.Date(as.Date("2016-01-01"), as.Date("2019-01-01"), by = 120),
  course = rep(1:5, each = 2)
)
df1

   ID       date course
1   1 2016-01-01      1
2   2 2016-04-30      1
3   3 2016-08-28      2
4   4 2016-12-26      2
5   5 2017-04-25      3
6   6 2017-08-23      3
7   7 2017-12-21      4
8   8 2018-04-20      4
9   9 2018-08-18      5
10 10 2018-12-16      5

df2 <- data.frame(
  course = c(2, 3),
  start_date = as.Date(c("2016-01-01", "2017-05-09")),
  end_date = as.Date(c("2016-12-01", "2018-09-09"))
)
df2

      course start_date   end_date
1      2 2016-01-01 2016-12-01
2      3 2017-05-09 2018-09-09

They don't have all the columns, but hopefully this will give you the idea.

In my example, I would be filtering df1 where:

course is 2, and date falls between 1/1/16 and 12/1/16
course is 3, and date falls between 5/9/17 and 9/9/18

Once you have that, you can try using fuzzyjoin package to merge the two, using fuzzy_semi_join. It will keep rows in the first data frame that match the criteria in the second data frame.

library(fuzzyjoin)

fuzzy_semi_join(
  df1,
  df2,
  by = c("course", "date" = "start_date", "date" = "end_date"),
  match_fun = c(`==`, `>=`, `<=`)
)

Output

  ID       date course
3  3 2016-08-28      2
6  6 2017-08-23      3

In this case, only two rows meet those criteria.

As an alternative, you can merge using the data.table package. This may be a faster solution. It should give you the same result.

library(data.table)

setDT(df1)
setDT(df2)

df1[df2, .(ID, x.date, course), on = .(course, date >= start_date, date <= end_date)]