Removing list components based on a criteria-CodePudding

I have two data set that I have split() based on ID, resulting in two list: july and july2. In one of the list the ID D only has two components while the other has 3 components. The objective is to create a function to look for these irregularities, based on the fact that one of the list doesn't have a specific number (in this case that number is 3), and remove the ID from both list. Is there a efficient way to do this?

For example, each of these list have 4 IDs (A,B,C,D). For each ID, I create a data frame in the list for a specific 10-day interval within the month of July (e.g., [[1]] is the first 10-days in July for ID A, [[2]] is the second 10-days in July for ID A, and [[3]] is the third 10-days in July for ID, and this begins again for B, C, and D.) For july2 list, the third 10-day interval is simulated to not exist, and I would like a way to remove that ID from both july and july2 lists because it is missing this last interval.

library(dplyr)
library(lubridate)

ID <-  rep(c("A","B","C", "D"), 5000)
date <-  rep_len(seq(dmy("01-01-2010"), dmy("31-01-2011"), by = "days"), 500)
x <-  runif(length(date), min = 60000, max = 80000)
y <-  runif(length(date), min = 800000, max = 900000)

ID2 <- rep(c("A", "B", "C", "D"), 5000)
date2 <-  rep_len(seq(dmy("01-01-2010"), dmy("21-01-2011"), by = "days"), 500)
x2 <-  runif(length(date2), min = 60000, max = 80000)
y2 <-  runif(length(date2), min = 800000, max = 900000)

df <- data.frame(date = date, 
                 x = x,
                 y =y,
                 ID)

df2 <- data.frame(date = date2, 
                  x = x2,
                  y =y2,
                  ID2)

df$jDate <- julian(as.Date(df$date), origin = as.Date("1970-01-01"))
df$Month <- month(df$date)
df2$jDate <- julian(as.Date(df2$date), origin = as.Date("1970-01-01"))
df2$Month <- month(df2$date)

july <- df %>%
  # Creates a new column assigning the first day in the 10-day interval in which
  # the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
  # so the `floor_date` assigned to it would be 01-01-2021)
  mutate(new = floor_date(date, "10 days")) %>%
  # For any months that has 31 days, the 31st day would normally be assigned its 
  # own interval. The code below takes the 31st day and joins it with the 
  # previous interval. 
  group_by(ID) %>% 
  mutate(new = if_else(day(new) == 31, new - days(10), new)) %>% 
  group_by(new, .add = TRUE) %>%
  # Filter the data by the season based on the `season_categ` column
  filter(Month == "7") %>% 
  group_split()

july2 <- df2 %>%
  # Creates a new column assigning the first day in the 10-day interval in which
  # the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
  # so the `floor_date` assigned to it would be 01-01-2021)
  mutate(new = floor_date(date, "10 days")) %>%
  # For any months that has 31 days, the 31st day would normally be assigned its 
  # own interval. The code below takes the 31st day and joins it with the 
  # previous interval. 
  group_by(ID2) %>% 
  mutate(new = if_else(day(new) == 31, new - days(10), new)) %>% 
  group_by(new, .add = TRUE) %>%
  # Filter the data by the season based on the `season_categ` column
  filter(Month == "7") %>% 
  group_split()

july2 <- july2[-12]

names(july) <- sapply(july, function(x) paste(x$ID[1]))
names(july2) <- sapply(july2, function(x) paste(x$ID2[1]))

CodePudding user response：

If you just want to find and remove ID's where the counts are off, this might be easy enough.

library(tibble)

remove_component <- function(list1, list2) {
  
  bad_id <- full_join(
    enframe(table(names(list1)), "ID", "list1"),
    enframe(table(names(list2)), "ID", "list2"),
    by = "ID") %>% 
    filter(list1 != list2) %>% 
    distinct(ID) %>% 
    pull()
  
  list(new_list1 = list1[which(!((names(list1) %in% bad_id)))],
       new_list2 = list2[which(!((names(list2) %in% bad_id)))])

}

remove_component(july, july2)

I assume you made the fake data like this for a reason. Otherwise, I would not split the lists. If you keep the data stacked, this may be easier and more precise.

july_full <- map_dfr(july, I) # binding rows, no idea why bind_rows is failing for me
july2_full <- map_dfr(july2, I)

# find ID and data periods that are not in both
missing_elements <- anti_join(
  distinct(july_full, ID, new),
  distinct(july2_full, ID2, new),
  by = c(ID = "ID2", "new")
)  
  
# remove that ID from both data sets
july_full %>% 
  anti_join(missing_elements, by = "ID")

july2_full %>% 
  anti_join(missing_elements, by = "ID")