Home > Software engineering >  grouping by column variables and appending a new variable based on condition
grouping by column variables and appending a new variable based on condition

Time:11-19

Say I have a dataframe-

Column A Column B
id1 blue
id1 red
id1 grey
id2 red
id3 red
id3 grey

I would like this output-

Column A Column B
id1 all.mixed
id2 red
id3 red.grey

I tried this- table1 <- mydf %>% group_by(ColA, ColB) %>% count(ColB) and came to this-

ColA ColB                n
  <chr>   <chr>           <int>
1 id1    blue              1
2 id1    red        1
3 id1   grey             1
4 id2   red       1
5 id3   red     1
6 id3  grey1     1

But I am kind of lost after this. I thought of group_by and summing up the rows in the colB, but then if I have a situation such that-

Column A Column B
id5 grey
id5 grey

Then what do i do?

CodePudding user response:

It's not clear how you are determining the values for column B in the output (e.g. "all.mixed"), but let's assume there is a calcMagicValue function. The general flow would be something like this:

calcMagicValue <- function(vals){
  
  # psuedo code
  if(all(possibleVals %in% vals)){
    "all.mixed"
  } else if( hasRedAndGrey){
    "red.grey"
  } else if(onlyRed) {
    "red"
  } else {
    #...other conditions...
  }
}

myDf |> 
  group_by(ColA) |> 
  summarize(newColB = calcMagicValue(ColB))

After grouping by, summarize will iterate through the unique values in the grouped columns, and pass the matching rows the functions called (in this case calcMagicValue). In this case, your result will then have one row for each unique combination/values in the grouped columns(s).

CodePudding user response:

A slightly different approach using group_by summarise which uses an indicator to replace the all.mixed cases in an additional mutate step:

library(dplyr)

dat %>%
  group_by(Column.A) |> 
  summarise(all.mixed = all(unique(dat$Column.B) %in% Column.B), 
            Column.B = paste(Column.B, collapse = "."), .groups = "drop") |> 
  mutate(Column.B = if_else(all.mixed, "all.mixed", Column.B)) |> 
  select(-all.mixed)
#> # A tibble: 3 × 2
#>   Column.A Column.B 
#>   <chr>    <chr>    
#> 1 id1      all.mixed
#> 2 id2      red      
#> 3 id3      red.grey

DATA

dat <- data.frame(
          Column.A = c("id1", "id1", "id1", "id2", "id3", "id3"),
          Column.B = c("blue", "red", "grey", "red", "red", "grey")
)
  • Related