Home > Enterprise >  Matching logical subscripts to the size of the indexed input
Matching logical subscripts to the size of the indexed input

Time:05-31

I have created two data frames that I then turn into lists (e.g., list1 and list2). I removed one element from list2 to better represent my example data set.

library(dplyr)

intervals <- rep_len(c("01-01-2022", "01-11-2022", "01-31-2022"), 100)
ID <- rep(c("A","B", "C"), 100)
df <- data.frame(ID = as.factor(ID),
                 intervals = as.factor(intervals))

list1 <- df %>% 
  group_by(ID, intervals) %>% 
  group_split()

intervals <- rep_len(c("01-01-2022", "01-11-2022", "01-31-2022"), 25)
ID <- rep(c("A","B"), 25)
df2 <- data.frame(ID = as.factor(ID),
                 intervals = as.factor(intervals))

list2 <- df2 %>% 
  group_by(ID, intervals) %>% 
  group_split()

list2 <- list2[-6]

For each of these list I have added an attribute, and I have included a function to check the added attribute more readily (check).

# Convenience function to grab the attributes for you
check <- function(list, attribute_name) {
  return(attr(list, attribute_name))
}

# Add an attribute to hold the attributes of each list element
attr(list1, "match") <- data.frame(id = sapply(list1, function(x) paste(x$ID[1])),
                                  interval_start_date = sapply(list1, function(x) paste(x$intervals[1]))
)

# Check the attributes
check(list1, "match")

# Add an attribute "tab" to hold the attributes of each list element
attr(list2, "match") <- data.frame(id = sapply(list2, function(x) paste(x$ID[1])),
                                  interval_start_date = sapply(list2, function(x) paste(x$intervals[1]))
) 

# Check the attributes
check(list2, "match")

I have created an index for the two list, and the objective here is to remove any list components that don't have the same ID and the same intervals. The goal is to have only the matching IDs with the same intervals.

# Creates an index for the two list based on the attributes, 
dat2 <- check(list1, "match")
dat1 <- check(list2, "match")

# Removes rows where the id isn't present in both data frames, and creates a 
# index where both the interval and id are the same.
if (!length(unique(dat2$id)) == length(unique(dat1$id))){
  dat3 <- dat2[dat2$id %in% dat1$id, ]
  dat4 <- dat1[dat1$id %in% dat2$id, ]
  
  i1 <-   paste(dat3[["id"]], format(as.Date(dat3[["interval_"]]),
                                     "%Y-%d")) %in%  
    paste(dat4[["id"]], format(as.Date(dat4[["interval_"]]), 
                               "%Y-%d"))
}

Now here is where I begin to get an error:

# Error occurs because the lengths of `i1` is not the same as `list2`
out <- list1[i1]

I know that this is occuring because list1 does not have the same length as i1. I'm wondering if there is a way to appending logical values to i1 to get it the same length as list1, but in a way that it doesn't remove values from list1 that we actually do want to keep. Any thoughts?

Here is my expected output for list1, where I hope it ends up with only the same IDs and intervals as list2.

# Expected output
expected_list1 <- list(list1[1], list1[2],list1[3], list1[4], list1[5])

This answer is close to what I would like, but it has an additional element. I think ultimately the attribute table should be similiar to that of dat4.

 test <- list1[dat2$id %in% dat1$id][i1]
 # Add an attribute "tab" to hold the attributes of each list element
 attr(test, "match") <- data.frame(id = sapply(test, function(x) paste(x$ID[1])),
                                    interval_start_date = sapply(test, function(x) paste(x$intervals[1]))
 ) 
 
 # Check the attributes
 check(test, "match")

CodePudding user response:

There was a mismatch in the column name i.e. it is not interval_, but interval_start_date in dat1 and dat2. [[ will look for exact match whereas $ can match partial names as well


if (!length(unique(dat2$id)) == length(unique(dat1$id))){
 ids_common <- intersect(dat2$id, dat1$id)
 inds1 <- dat2$id %in% ids_common
 inds2 <- dat1$id %in% ids_common
 i1 <-   paste(dat2[["id"]], format(as.Date(dat2[["interval_start_date"]]),
                                    "%Y-%d")) %in%  
   paste(dat1[["id"]], format(as.Date(dat1[["interval_start_date"]]), 
                              "%Y-%d"))
 
  out <- list1[i1 & inds1]
  
 
}

-checking

> length(out)
[1] 5
> i1
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
  • Related