Home > OS >  Merging rows based on multiple conditions
Merging rows based on multiple conditions

Time:02-23

I am trying to merge 3 rows in to 1 based on multiple conditions. The dataframe is created as follows:

region <- c("Europe", "Asia", "Europe", "Africa","Europe")
enterprise <- c(1, 1, 2, 3, 3)
q1 <- c(NA, 0, NA, 1, 0)
q2 <- c(0, 1, 1, NA, NA)
q3 <- c(NA, 1, NA, 0, NA)
q4 <- c(NA, 0, 1, 0, 0)
q5 <- c(0,NA,1,NA,1)

df <- data.frame(region, enterprise, q1, q2, q3, q4,q5)

So, I'm working with survey data and I'm trying to merge rows corresponding to region 'Europe' in to one row. Conditions are as follows:

  1. For any given question, the most complete source should be used. For example, if enterprise 2 and 3 (For Europe region) have NA for q1 (haven't answered the question) but enterprise 1 have an answer (either 0 or 1) then answer of enterprise 1 should be considered.
  2. If all 3 enterprises haven't answered (NA) then NA.
  3. In case more than 1 enterprises per region give complete answers (0 or 1) then the following enterprise hierarchy should be followed in order to chose the enterprise and its answer to include in the final merged row. Enterprise 1 has highest level in hierarchy followed by enterprise 2 and 3. For example, enterprise 1 hasn't answered the question (NA) but 2 and 3 did. Then answer of enterprise 2 should be considered.
  4. If all 3 enterprises have answered, then hierarchical order should be followed again.

I have tried using

merged = coalesce(df[1,],df[3,],df[5,]) 

But couldn't get it to work. I don't know how to pass conditions especially the one with hierarchies. My desired output is as follows:

  region q1 q2 q3 q4 q5
1   Asia  0  1  1  0 NA
2 Africa  1 NA  0  0 NA
3 Europe  0  0 NA  1  0

Thank you very much for any help, beforehand.

CodePudding user response:

How about this:

library(tidyr)
library(dplyr)

region <- c("Europe", "Asia", "Europe", "Africa","Europe")
enterprise <- c(1, 1, 2, 3, 3)
q1 <- c(1, 0, NA, 1, NA)
q2 <- c(0, 1, 1, NA, NA)
q3 <- c(NA, 1, NA, 0, NA)
q4 <- c(1, 0, 1, 0, 0)

df <- data.frame(region, enterprise, q1, q2, q3, q4)

df %>% 
  pivot_longer(q1:q4, names_to="q", values_to="vals") %>% 
  group_by(region, q) %>% 
  mutate(ind = min(enterprise[which(!is.na(vals))])) %>% 
  filter(enterprise == ind) %>% 
  pivot_wider(values_from = "vals", names_from = "q") %>% 
  select(region, q1, q2, q3, q4)

#> # A tibble: 3 × 5
#> # Groups:   region [3]
#>   region    q1    q2    q3    q4
#>   <chr>  <dbl> <dbl> <dbl> <dbl>
#> 1 Europe     1     0    NA     1
#> 2 Asia       0     1     1     0
#> 3 Africa     1    NA     0     0

Created on 2022-02-22 by the reprex package (v2.0.1)

CodePudding user response:

An approach using fill

library(dplyr)
library(tidyr)

df %>% 
  group_by(region) %>% 
  fill(q1:q5, .direction="updown") %>% 
  arrange(enterprise) %>% 
  summarise(across(q1:q5, ~ .x[1]))
# A tibble: 3 × 6
  region    q1    q2    q3    q4    q5
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 Africa     1    NA     0     0    NA
2 Asia       0     1     1     0    NA
3 Europe     0     0    NA     1     0
  • Related