I have two data set that I have split()
based on ID
, resulting in two list: july
and july2
. In one of the list the ID
D
only has two components while the other has 3 components. The objective is to create a function to look for these irregularities, based on the fact that one of the list doesn't have a specific number (in this case that number is 3), and remove the ID
from both list. Is there a efficient way to do this?
For example, each of these list have 4 ID
s (A,B,C,D). For each ID
, I create a data frame in the list for a specific 10-day interval within the month of July (e.g., [[1]]
is the first 10-days in July for ID
A, [[2]]
is the second 10-days in July for ID
A, and [[3]]
is the third 10-days in July for ID
, and this begins again for B
, C
, and D
.) For july2
list, the third 10-day interval is simulated to not exist, and I would like a way to remove that ID
from both july
and july2
lists because it is missing this last interval.
library(dplyr)
library(lubridate)
ID <- rep(c("A","B","C", "D"), 5000)
date <- rep_len(seq(dmy("01-01-2010"), dmy("31-01-2011"), by = "days"), 500)
x <- runif(length(date), min = 60000, max = 80000)
y <- runif(length(date), min = 800000, max = 900000)
ID2 <- rep(c("A", "B", "C", "D"), 5000)
date2 <- rep_len(seq(dmy("01-01-2010"), dmy("21-01-2011"), by = "days"), 500)
x2 <- runif(length(date2), min = 60000, max = 80000)
y2 <- runif(length(date2), min = 800000, max = 900000)
df <- data.frame(date = date,
x = x,
y =y,
ID)
df2 <- data.frame(date = date2,
x = x2,
y =y2,
ID2)
df$jDate <- julian(as.Date(df$date), origin = as.Date("1970-01-01"))
df$Month <- month(df$date)
df2$jDate <- julian(as.Date(df2$date), origin = as.Date("1970-01-01"))
df2$Month <- month(df2$date)
july <- df %>%
# Creates a new column assigning the first day in the 10-day interval in which
# the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
# so the `floor_date` assigned to it would be 01-01-2021)
mutate(new = floor_date(date, "10 days")) %>%
# For any months that has 31 days, the 31st day would normally be assigned its
# own interval. The code below takes the 31st day and joins it with the
# previous interval.
group_by(ID) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(new, .add = TRUE) %>%
# Filter the data by the season based on the `season_categ` column
filter(Month == "7") %>%
group_split()
july2 <- df2 %>%
# Creates a new column assigning the first day in the 10-day interval in which
# the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
# so the `floor_date` assigned to it would be 01-01-2021)
mutate(new = floor_date(date, "10 days")) %>%
# For any months that has 31 days, the 31st day would normally be assigned its
# own interval. The code below takes the 31st day and joins it with the
# previous interval.
group_by(ID2) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(new, .add = TRUE) %>%
# Filter the data by the season based on the `season_categ` column
filter(Month == "7") %>%
group_split()
july2 <- july2[-12]
names(july) <- sapply(july, function(x) paste(x$ID[1]))
names(july2) <- sapply(july2, function(x) paste(x$ID2[1]))
CodePudding user response:
If you just want to find and remove ID's where the counts are off, this might be easy enough.
library(tibble)
remove_component <- function(list1, list2) {
bad_id <- full_join(
enframe(table(names(list1)), "ID", "list1"),
enframe(table(names(list2)), "ID", "list2"),
by = "ID") %>%
filter(list1 != list2) %>%
distinct(ID) %>%
pull()
list(new_list1 = list1[which(!((names(list1) %in% bad_id)))],
new_list2 = list2[which(!((names(list2) %in% bad_id)))])
}
remove_component(july, july2)
I assume you made the fake data like this for a reason. Otherwise, I would not split the lists. If you keep the data stacked, this may be easier and more precise.
july_full <- map_dfr(july, I) # binding rows, no idea why bind_rows is failing for me
july2_full <- map_dfr(july2, I)
# find ID and data periods that are not in both
missing_elements <- anti_join(
distinct(july_full, ID, new),
distinct(july2_full, ID2, new),
by = c(ID = "ID2", "new")
)
# remove that ID from both data sets
july_full %>%
anti_join(missing_elements, by = "ID")
july2_full %>%
anti_join(missing_elements, by = "ID")