Home > database >  How to subset multiple dataframes from a large list R
How to subset multiple dataframes from a large list R

Time:08-02

I have a large list of dataframes that I would like to subset and reorder, so that they can be horizontally joined. Ideally, I would merge them, however this consistently exceeds my computer's capabiltiies, so I need to join them manually.

My large list is a list of 10 dataframes, all with 10 unique variables and one shared variable, with over 1000 rows for each dataframe:

df1:

ID, V2, V3, etc.

df2:

ID, V2_2, V3_2, etc.

and so forth.

All of these dataframes are contained within the same large list. How can I subset all of these dataframes within that large list simultaneously, such that they only include observations where ID > 100? Similarly, how can I order all of these dataframes simultaneously by the ID column, so that I can then bind them horizontally?

This seems to be the only way to have the dataframes merge, because merging the dataframes by ID directly does not appear to work, since the size of the dataframes (even though IDs are all the same for each dataframe, and each ID is unique) causes my R to freeze.

CodePudding user response:

Try the following. Untested, since there are no test data.

1. Filter and sort

Filter ID > 100 and sort by ID.

df_list <- lapply(df_list, \(x) {
  x <- x[x$ID > 100, ]
  x[order(x$ID), ]
})

2. Merge the data.frames

This is the complicated part. The code below creates a result data.frame df_all and then puts each data.frame in df_list in its place by matching the id's. This is probably faster and no merge and Reduce function calls are needed.

all_cols <- unique(sapply(df_list, names))
all_rows <- unique(sapply(df_list, `[[`, 'ID'))
all_rows <- sort(all_rows)
df_all <- matrix(nrow = length(all_rows), ncol = length(all_cols))
df_all <- as.data.frame(df_all)
names(df_all) <- all_cols
df_all$ID <- all_rows

for(i in seq_along((df_list))) {
  i <- match(df_list[[i]]$ID, all_rows)
  df_all[i, names(df_list[[i]])] <- df_list[[i]]
}

CodePudding user response:

This should return the desired list:

lapply(df_list, \(df) df[df$ID > 100, ])

EDIT

Update after comment that you have a list of lists. Try this:

lapply(df_list, \(df) {
    df  <- data.frame(unlist(df, recursive=FALSE))
    df[df$ID > 100, ]
    }
)
  • Related