How to collapse grouped information spread across different columns in R-CodePudding

I have an issue that should have a very straight forward solution but I can't seem to find it. Basically I have observations nested within district*county but their values of candidate are wrong and the information for a single candidate is spread through multiple columns. The original dataset has the same problem but instead of only four non.grouping variables it has 500 .

But here it's what it looks like now:

current_df
   county district candidate  name votes party
1    E100        1         1  john    NA  <NA>
2    E100        1         2  <NA>   100  <NA>
3    E100        1         3  <NA>    NA     D
4    E101        2         4 maria    NA  <NA>
5    E101        2         5  <NA>   200  <NA>
6    E101        2         6  <NA>    NA     I
7    E202        1         7   tom    NA  <NA>
8    E202        1         8  <NA>    78  <NA>
9    E202        1         9  <NA>    NA     R
10   E204        3        10  jose    NA  <NA>
11   E204        3        11  <NA>   120  <NA>
12   E204        3        12  <NA>    NA  <NA>

And this is what I'd want to have:

target_df <- data.frame(county = c("E100",  "E101","E202", "E204"),
                         district= c(1, 2, 1, 3),
                         candidate = c(1, 1, 2, 1),
                         name = c("john","maria",
                                  "tom", "jose"),
                         votes = c(100,
                                   200,
                                   78,
                                   120),
                         party = c( "D",
                                    "I",
                                    "R",
                                    NA))

Again, if there is a solution that doesn't build on the fact that there are only 4 variables other than county and district that's ideal because I have many in many dataframe. For reference, candidate is just the ordered indicator of the candidate within each district.

Also, some rows, as is the case of that with candidate == "jose" in target_df have NA for their party value. This is just a function of the data's context.

I've tried way to many things and can't still find something doing it well. Any pointers very welcome, thanks!

CodePudding user response：

Here is another option (also ignoring the issue with candidate. You can just summarize each group by taking the first instance of non-NA

library(tidyverse)

current_df |>
  group_by(county, district) |>
  summarise(across(everything(), ~first(.[!is.na(.)])))
#> # A tibble: 4 x 6
#> # Groups:   county [4]
#>   county district candidate name  votes party
#>   <chr>     <dbl>     <dbl> <chr> <dbl> <chr>
#> 1 E100          1         1 john    100 D    
#> 2 E101          2         4 maria   200 I    
#> 3 E202          1         7 tom      78 R    
#> 4 E204          3        10 jose    120 R

or another option. Fill the NA by group and select first instance:

current_df %>%
  group_by(county, district)|>
  fill(everything(), .direction = "updown") |>
  filter(row_number() == 1)
#> # A tibble: 4 x 6
#> # Groups:   county, district [4]
#>   county district candidate name  votes party
#>   <chr>     <dbl>     <dbl> <chr> <dbl> <chr>
#> 1 E100          1         1 john    100 D    
#> 2 E101          2         4 maria   200 I    
#> 3 E202          1         7 tom      78 R    
#> 4 E204          3        10 jose    120 R

CodePudding user response：

Partial solution until the column candidate is fully explained

library(tidyverse) 

df %>% 
  mutate(across(everything(), as.character)) %>% 
  select(-candidate) %>% 
  pivot_longer(-c(1:2)) %>% 
  drop_na() %>% 
  pivot_wider(names_from = name, 
              values_from = value)

# A tibble: 4 × 5
  county district name  votes party
  <chr>  <chr>    <chr> <chr> <chr>
1 E100   1        john  100   D    
2 E101   2        maria 200   I    
3 E202   1        tom   78    R    
4 E204   3        jose  120   R