Home > Enterprise >  filter values in a list of dataframes based on a vector, and add rows for vector values not containe
filter values in a list of dataframes based on a vector, and add rows for vector values not containe

Time:07-27

I would like o filter the rows in data frames in a list based on the values of a vector. Then, in the filtered dataframe I would like to create new rows with NAs for all the values in the vector not contained in the dataframe.

The datarfame look like this:

vector <- c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6")

df1 <- data.frame(province = c("prov1", "prov3", "prov5", "non_prov"),
                  value = c(23, 56, 93, 46),
                  value2 = c(25, 57, 83, 67))

df2 <- data.frame(province = c("prov4", "prov6", "prov5", "non_prov"),
                  value = c(3, 79, 90, 52),
                  value2 = c(2, 59, 67, NA))

list_df <- list(df1, df2)

Where, the filtering has to be done on the column 'provinces' based on the values from 'vector'.

Then, for for the dataframes that do not contain some of the values from 'vector', I would like to have new rows created for these missing values so that all output dataframes have the same number of rows.

Final dataframes should like this:


df1_outp <- data.frame(province = c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6"),
                  value = c(23, NA, 56, NA, 93, NA),
                  value2 = c(25, NA, 57, NA, 83, NA))

df2_outp <- data.frame(province = c("prov1", "prov3", "prov2", "prov4", "prov6", "prov5"),
                  value = c(NA, NA, NA, 3, 79, 90),
                  value2 = c(NA, NA, NA, 2, 59, 67))

list_df_outp <- list(df1_outp, df2_outp)

Rows that do not have one the values in 'vector' (e.g., 'non_prov') get removed whereas new rows are created for the values in 'vector' that are not in the dataframes.

As can be seen in df2_outp, the order on the 'prov' values in the column 'province' column does not matter.

Thanks!

CodePudding user response:

library(tidyverse)
vector <- c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6")

df1 <- data.frame(province = c("prov1", "prov3", "prov5", "non_prov"),
                  value = c(23, 56, 93, 46),
                  value2 = c(25, 57, 83, 67))

df2 <- data.frame(province = c("prov4", "prov6", "prov5", "non_prov"),
                  value = c(3, 79, 90, 52),
                  value2 = c(2, 59, 67, NA))

list_df <- list(df1, df2)

First I've created a data frame with all the values in vector with NA for value and value2

tibble(province = vector,
       value = NA,
       value2 = NA) -> vector_df

vector_df

# A tibble: 6 × 3
  province value value2
  <chr>    <lgl> <lgl> 
1 prov1    NA    NA    
2 prov2    NA    NA    
3 prov3    NA    NA    
4 prov4    NA    NA    
5 prov5    NA    NA    
6 prov6    NA    NA   

Then I filtered the province colummn in the data frame with vector value provided. After that I row binded the vector_df. Now all I have to do is figure out the province which has more than 1 value in the datframe. If the province has more than 1 value I will remove the dummy NA rows otherwise I'll retain the NA rows.



list_df %>% 
  map(~ .x %>% filter(province %in% vector) %>% 
        bind_rows(vector_df) %>% 
        left_join(count(., province)) %>% 
        filter(!(is.na(value) & is.na(value2) & n > 1)) %>% 
        select(-n))
[[1]]
  province value value2
1    prov1    23     25
2    prov3    56     57
3    prov5    93     83
4    prov2    NA     NA
5    prov4    NA     NA
6    prov6    NA     NA

[[2]]
  province value value2
1    prov4     3      2
2    prov6    79     59
3    prov5    90     67
4    prov1    NA     NA
5    prov2    NA     NA
6    prov3    NA     NA

CodePudding user response:

  1. Create a data.frame with all the rows you want

    data.frame(province=vector)

  2. Merge this with the data frame you do have, setting all.x=TRUE (so every row from point 1 is retained, and filled with NA if necessary)

    merge(data.frame(province=vector), df1, all.x=TRUE)

  3. Done!

> merge(data.frame(province=vector), df1, all.x=TRUE)
  province value value2
1    prov1    23     25
2    prov2    NA     NA
3    prov3    56     57
4    prov4    NA     NA
5    prov5    93     83
6    prov6    NA     NA
  • Bonus 1: you can trivially loop this with lapply

    lapply(list_df, function(df) merge(data.frame(province=vector), df, all.x=TRUE))

    (if you have a lot of data frames you want to apply this to, you will probably want to avoid re-building the vector data frame anonymously each time but create it as a named data frame instead)

  • Bonus 2: all base-r with no dependencies whatsoever

  • Bonus 3: you did say it doesn't matter, but the rows are in order as in vector

  • Related