I would like o filter the rows in data frames in a list based on the values of a vector. Then, in the filtered dataframe I would like to create new rows with NAs for all the values in the vector not contained in the dataframe.
The datarfame look like this:
vector <- c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6")
df1 <- data.frame(province = c("prov1", "prov3", "prov5", "non_prov"),
value = c(23, 56, 93, 46),
value2 = c(25, 57, 83, 67))
df2 <- data.frame(province = c("prov4", "prov6", "prov5", "non_prov"),
value = c(3, 79, 90, 52),
value2 = c(2, 59, 67, NA))
list_df <- list(df1, df2)
Where, the filtering has to be done on the column 'provinces' based on the values from 'vector'.
Then, for for the dataframes that do not contain some of the values from 'vector', I would like to have new rows created for these missing values so that all output dataframes have the same number of rows.
Final dataframes should like this:
df1_outp <- data.frame(province = c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6"),
value = c(23, NA, 56, NA, 93, NA),
value2 = c(25, NA, 57, NA, 83, NA))
df2_outp <- data.frame(province = c("prov1", "prov3", "prov2", "prov4", "prov6", "prov5"),
value = c(NA, NA, NA, 3, 79, 90),
value2 = c(NA, NA, NA, 2, 59, 67))
list_df_outp <- list(df1_outp, df2_outp)
Rows that do not have one the values in 'vector' (e.g., 'non_prov') get removed whereas new rows are created for the values in 'vector' that are not in the dataframes.
As can be seen in df2_outp, the order on the 'prov' values in the column 'province' column does not matter.
Thanks!
CodePudding user response:
library(tidyverse)
vector <- c("prov1", "prov2", "prov3", "prov4", "prov5", "prov6")
df1 <- data.frame(province = c("prov1", "prov3", "prov5", "non_prov"),
value = c(23, 56, 93, 46),
value2 = c(25, 57, 83, 67))
df2 <- data.frame(province = c("prov4", "prov6", "prov5", "non_prov"),
value = c(3, 79, 90, 52),
value2 = c(2, 59, 67, NA))
list_df <- list(df1, df2)
First I've created a data frame with all the values in vector with NA for value
and value2
tibble(province = vector,
value = NA,
value2 = NA) -> vector_df
vector_df
# A tibble: 6 × 3
province value value2
<chr> <lgl> <lgl>
1 prov1 NA NA
2 prov2 NA NA
3 prov3 NA NA
4 prov4 NA NA
5 prov5 NA NA
6 prov6 NA NA
Then I filtered the province
colummn in the data frame with vector
value provided. After that I row binded the vector_df
. Now all I have to do is figure out the province which has more than 1 value in the datframe. If the province has more than 1 value I will remove the dummy NA rows otherwise I'll retain the NA rows.
list_df %>%
map(~ .x %>% filter(province %in% vector) %>%
bind_rows(vector_df) %>%
left_join(count(., province)) %>%
filter(!(is.na(value) & is.na(value2) & n > 1)) %>%
select(-n))
[[1]]
province value value2
1 prov1 23 25
2 prov3 56 57
3 prov5 93 83
4 prov2 NA NA
5 prov4 NA NA
6 prov6 NA NA
[[2]]
province value value2
1 prov4 3 2
2 prov6 79 59
3 prov5 90 67
4 prov1 NA NA
5 prov2 NA NA
6 prov3 NA NA
CodePudding user response:
Create a data.frame with all the rows you want
data.frame(province=vector)
Merge this with the data frame you do have, setting
all.x=TRUE
(so every row from point 1 is retained, and filled withNA
if necessary)merge(data.frame(province=vector), df1, all.x=TRUE)
Done!
> merge(data.frame(province=vector), df1, all.x=TRUE)
province value value2
1 prov1 23 25
2 prov2 NA NA
3 prov3 56 57
4 prov4 NA NA
5 prov5 93 83
6 prov6 NA NA
Bonus 1: you can trivially loop this with
lapply
lapply(list_df, function(df) merge(data.frame(province=vector), df, all.x=TRUE))
(if you have a lot of data frames you want to apply this to, you will probably want to avoid re-building the vector data frame anonymously each time but create it as a named data frame instead)
Bonus 2: all
base-r
with no dependencies whatsoeverBonus 3: you did say it doesn't matter, but the rows are in order as in
vector