Home > Software design >  Is there a command to merge data sets based on observations of a variable?
Is there a command to merge data sets based on observations of a variable?

Time:12-29

I'm trying to merge two data sets, depending on the observations of the data sets.

In other words, I have two data sets both containing year and state. However, the two data sets consists each of one additional variable, X1 for df1 and X2 for df2. That is, I'm trying to merge the two dataset if a state is observed to a have minimum of 5 observation for both X1 and X2, such that the all observations for that state is included, even when both X1 and X2 are NA values.

Is there a way to merge the data sets to only merge states in which both X1 and X2 has minimum 5 observations? Such that, the new dataset has observations for all the years of the states that both X1 and X5 has a minimum of 5 observations in. While rest are excluded.

I have tried to use experiment with inner_join(df1,df2) with no success, as it only merges the year and state that both specific dataset has individual observations.

An reproducible example of the merge effect (for simplicity, I have used if 2 observations are non NA, the state is included)

df1 = read.table(
  text =
    "State Year X1
A 1 NA 
A 2 NA 
A 3 5 
A 4 NA 
B 1 NA 
B 2 NA 
B 3 4 
B 4 3", header = TRUE)

df2 = read.table(
  text =
    "State Year X2
A 1 NA 
A 2 5 
A 3 7 
A 4 NA 
B 1 NA 
B 2 2 
B 3 5 
B 4 7", header = TRUE)

newdf = read.table(
  text =
    "State Year X1 X2
B 1 NA NA 
B 2 NA 2
B 3 4 5
B 4 3 7", header = TRUE)

Here, newdf neglect the state A as the df1 only have one observation for that state, while all years are included for the state B (even the first year when both X1 and X2 are NA) as both X1 and X2 has minimum of 2 non-NA observations for that state. (recall, for simplicity here the minimum observation is 2 not 5)

CodePudding user response:

You need to do further filtering after merging.

library(dplyr)

inner_join(df1, df2, by = c("State", "Year")) %>%
  group_by(State) %>%
  filter(if_all(X1:X2, ~ sum(!is.na(.x)) >= 2)) %>%
  ungroup()

# # A tibble: 4 × 4
#   State  Year X1    X2   
#   <chr> <int> <chr> <chr>
# 1 B         1 NA    NA   
# 2 B         2 NA    2    
# 3 B         3 4     5    
# 4 B         4 3     7
  • Related