I have an issue that should have a very straight forward solution but I can't seem to find it. Basically I have observations nested within district*county but their values of candidate are wrong and the information for a single candidate is spread through multiple columns. The original dataset has the same problem but instead of only four non.grouping variables it has 500 .
But here it's what it looks like now:
current_df
county district candidate name votes party
1 E100 1 1 john NA <NA>
2 E100 1 2 <NA> 100 <NA>
3 E100 1 3 <NA> NA D
4 E101 2 4 maria NA <NA>
5 E101 2 5 <NA> 200 <NA>
6 E101 2 6 <NA> NA I
7 E202 1 7 tom NA <NA>
8 E202 1 8 <NA> 78 <NA>
9 E202 1 9 <NA> NA R
10 E204 3 10 jose NA <NA>
11 E204 3 11 <NA> 120 <NA>
12 E204 3 12 <NA> NA <NA>
And this is what I'd want to have:
target_df <- data.frame(county = c("E100", "E101","E202", "E204"),
district= c(1, 2, 1, 3),
candidate = c(1, 1, 2, 1),
name = c("john","maria",
"tom", "jose"),
votes = c(100,
200,
78,
120),
party = c( "D",
"I",
"R",
NA))
Again, if there is a solution that doesn't build on the fact that there are only 4 variables other than county and district that's ideal because I have many in many dataframe. For reference, candidate
is just the ordered indicator of the candidate within each district.
Also, some rows, as is the case of that with candidate == "jose"
in target_df
have NA for their party value. This is just a function of the data's context.
I've tried way to many things and can't still find something doing it well. Any pointers very welcome, thanks!
CodePudding user response:
Here is another option (also ignoring the issue with candidate
. You can just summarize each group by taking the first instance of non-NA
library(tidyverse)
current_df |>
group_by(county, district) |>
summarise(across(everything(), ~first(.[!is.na(.)])))
#> # A tibble: 4 x 6
#> # Groups: county [4]
#> county district candidate name votes party
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 E100 1 1 john 100 D
#> 2 E101 2 4 maria 200 I
#> 3 E202 1 7 tom 78 R
#> 4 E204 3 10 jose 120 R
or another option. Fill the NA by group and select first instance:
current_df %>%
group_by(county, district)|>
fill(everything(), .direction = "updown") |>
filter(row_number() == 1)
#> # A tibble: 4 x 6
#> # Groups: county, district [4]
#> county district candidate name votes party
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 E100 1 1 john 100 D
#> 2 E101 2 4 maria 200 I
#> 3 E202 1 7 tom 78 R
#> 4 E204 3 10 jose 120 R
CodePudding user response:
Partial solution until the column candidate
is fully explained
library(tidyverse)
df %>%
mutate(across(everything(), as.character)) %>%
select(-candidate) %>%
pivot_longer(-c(1:2)) %>%
drop_na() %>%
pivot_wider(names_from = name,
values_from = value)
# A tibble: 4 × 5
county district name votes party
<chr> <chr> <chr> <chr> <chr>
1 E100 1 john 100 D
2 E101 2 maria 200 I
3 E202 1 tom 78 R
4 E204 3 jose 120 R