Combine duplicates but not all of them

Time:02-26

I have a dataframe like the one below. I want to combine the duplicates in the 'activity' column, except for the ones called 'selection', and sum their values in the 'duration' column. I am doing this in R. I tried using aggregate(), but I couldn't find a way to leave the 'selection' rows unaggregated.

 # df - I used dput so you can have my df
test <- structure(list(activity = c("selection", "selection", "selection", 
"other", "inspection", "assignment", "inspection", "inspection", 
"inspection", "inspection"), workers = c("worker 1", "worker 1", 
"worker 1", "worker 34", "worker 6", "worker 5", "worker 2", 
"worker 2", "worker 2", "worker 2"), start_time = structure(c(1645396200, 
1645396200, 1645396200, 1645394352, 1645394155, 1645394100, 1645390080, 
1645476480, 1645562880, 1645649280), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), status = c("passed", "passed", "passed", "passed", 
"passed", "passed", "passed", "passed", "passed", "passed"), 
    duration = c(8.98333333333333, 9.69027777777778, 9.20555555555556, 
    0.557222222222222, 2.24527777777778, 1.61666666666667, 2.12166666666667, 
    1.32638888888889, 2.59861111111111, 0.765555555555556)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
  
test 
# A tibble: 10 x 5
       activity   workers   start_time          status duration
       <chr>      <chr>     <dttm>              <chr>     <dbl>
     1 selection  worker 1  2022-02-20 22:30:00 passed    8.98 
     2 selection  worker 1  2022-02-20 22:30:00 passed    9.69 
     3 selection  worker 1  2022-02-20 22:30:00 passed    9.21 
     4 other      worker 34 2022-02-20 21:59:12 passed    0.557
     5 inspection worker 6  2022-02-20 21:55:55 passed    2.25 
     6 assignment worker 5  2022-02-20 21:55:00 passed    1.62 
     7 inspection worker 2  2022-02-20 20:48:00 passed    2.12 
     8 inspection worker 2  2022-02-21 20:48:00 passed    1.33 
     9 inspection worker 2  2022-02-22 20:48:00 passed    2.60 
    10 inspection worker 2  2022-02-23 20:48:00 passed    0.766

CodePudding user response:

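I'm not sure I fully understand what you are looking for, but I'll give it a try!

Using the dplyr library, you could set the 'selection' rows aside, sum 'duration' per activity for the remaining rows, and then bind the 'selection' rows back on unchanged. Note that 'workers', 'start_time' and 'status' are taken from the first row of each group: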

Reprex

  • Code
library(dplyr)

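test %>%
  # set the 'selection' rows aside; they must not be combined
  filter(activity != "selection") %>%
  group_by(activity) %>%
  # one row per activity: keep the first worker/start_time/status, sum duration
  summarise(workers = workers[1],
            start_time = start_time[1],
            status = status[1],
            duration = sum(duration)) %>%
  # append the untouched 'selection' rows
  bind_rows(test %>% filter(activity == "selection"))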
  • Output
#> # A tibble: 6 x 5
#>   activity   workers   start_time          status duration
#>   <chr>      <chr>     <dttm>              <chr>     <dbl>
#> 1 assignment worker 5  2022-02-20 21:55:00 passed    1.62 
#> 2 inspection worker 6  2022-02-20 21:55:55 passed    9.06 
#> 3 other      worker 34 2022-02-20 21:59:12 passed    0.557
#> 4 selection  worker 1  2022-02-20 22:30:00 passed    8.98 
#> 5 selection  worker 1  2022-02-20 22:30:00 passed    9.69 
#> 6 selection  worker 1  2022-02-20 22:30:00 passed    9.21

Created on 2022-02-25 by the reprex package (v2.0.1)
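
Since the question mentions aggregate(), here is a minimal base R sketch of the same split, sum and recombine idea. It is only an illustration under some assumptions: `df` is a cut-down, hypothetical copy of the question's data with just the 'activity' and 'duration' columns (durations rounded), and `is_sel`, `summed` and `result` are names made up for this example. The other columns would need their own rule (for instance, keeping the first value) before they could be carried along.

```r
# Cut-down, hypothetical copy of the question's data:
# only the two columns that matter for the aggregation, durations rounded.
df <- data.frame(
  activity = c("selection", "selection", "selection", "other",
               "inspection", "assignment", "inspection"),
  duration = c(8.98, 9.69, 9.21, 0.56, 2.25, 1.62, 2.12)
)

# Logical index of the rows that must stay as they are.
is_sel <- df$activity == "selection"

# Sum duration per activity for every row that is NOT 'selection'.
summed <- aggregate(duration ~ activity, data = df[!is_sel, ], FUN = sum)

# Put the untouched 'selection' rows back underneath the summed rows.
result <- rbind(summed, df[is_sel, c("activity", "duration")])
rownames(result) <- NULL

result
```

The key point is that aggregate() never sees the 'selection' rows, so they cannot be collapsed; they are simply appended again afterwards with rbind().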
