Home > Back-end >  Aggregating data with missing values in R
Aggregating data with missing values in R

Time:01-24

I am working to extend the following example based on the question Using a dataframe to input values to a function and aggregating the outputs

I've added a third team (Mary Bill), but am now having some problems because the "team3" contains a player that doesn't have any data, so the output is just "Mary"

The other issue I have is that the output of "team1" moves from "Mary Frank" in the input in "teams" to "Frank Mary" in the output (that is, the order is around the other way).

This makes it difficult to reconcile the output back to the teams.

I can make team1 and team2 work if I use regex and the start and finish of the teams$V2, but that doesn't help with missing data (ie. Bill).

Modified input:

input = structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), V2 = c("Frank", "Mary", "John"), V3 = c("Sydney", "Sydney", "Sydney"), V4 = c(55, 76, 14)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))

teams = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Mary   Frank","Mary   John", "Mary   Bill")), class = "data.frame", row.names = c(NA, -3L))

result = function(data, combination)
{
    purrr::map_df(lapply(combination,function(x){paste(x,collapse = '|')}), function(x){
    df = data[grepl(x,data$V2),] %>% group_by(V3) %>% summarize(V1= paste(V1[1]), V2= paste(V2,collapse = '   '), V4= sum(V4))
    data = df[,c(2,3,1,4)]
    return(data)
  }
  )
}

apply(teams, 1, function(x) {
  y <- list(str_split(x[2], ' \\  ', simplify = T))
  result(input, y)
}) %>% do.call("rbind", .)

Output

structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), 
    V2 = c("Frank   Mary", "Mary   John", "Mary"), V3 = c("Sydney", 
    "Sydney", "Sydney"), V4 = c(131, 90, 76)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

Can anyone suggest how I can get to my desired output - either keeps the team1, team2 in the same order so I can join them back afterwards; or joins the team name as part of the input?

output1 = structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), 
               V2 = c("Mary   Frank","Mary   John", "Mary   Bill"), V3 = c("Sydney", 
                                                                 "Sydney", "Sydney"), V4 = c(131, 90, 76)), row.names = c(NA, 
                                                                                                                         -3L), class = c("tbl_df", "tbl", "data.frame"))

output2 = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Team_2022", "Team_2022", "Team_2022"), 
                         V3 = c("Frank   Mary", "John   Mary","Mary   Bill"), V4 = c("Sydney", 
                                                                                     "Sydney", "Sydney"), V5 = c(131, 90, 76)), row.names = c(NA, 
                                                                                                                                             -3L), class = c("tbl_df", "tbl", "data.frame"))

Or even (because I could at least tie the team3 back to missing values):

output3 = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Team_2022", "Team_2022", "Team_2022"), 
                         V3 = c("Frank   Mary", "John   Mary","Mary"), V4 = c("Sydney",                                                                                    "Sydney", "Sydney"), V5 = c(69, 90, 14)), row.names = c(NA, 
                                                                                                                                             -3L), class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response:

library(dplyr); library(tidyr)
teams %>%
  separate_rows(V2) %>%
  left_join(input, by = c("V2" = "V2")) %>%
  replace_na(list(V4 = 0)) %>%
  group_by(V1.x) %>% fill(V1.y, V3) %>%
  summarize(V2 = paste(V2, collapse = "   "),
            V3 = first(V3),
            V4 = sum(V4))

Result

# A tibble: 3 × 4
  V1.x  V2           V3        V4
  <chr> <chr>        <chr>  <dbl>
1 team1 Mary   Frank Sydney   131
2 team2 Mary   John  Sydney    90
3 team3 Mary   Bill  Sydney    76
  • Related