'Grouping' rows based on unique values in columns

Time:08-25

Hope you can help me.

I have the following df:

structure(list(Donorcode = c("406A001", "406A002", "406A003", 
"406A004", "406A003", "406A008", "406A009", "406A007"), Doos = c(1, 
1, 1, 1, 2, 2, 2, 2), `Leeftijd T0` = c(70, 73, 79, 75, 70, 73, 
79, 75), Instituut = c("Spaarne ziekenhuis", "Spaarne ziekenhuis", 
"Spaarne ziekenhuis", "RIVM", "RIVM", "RIVM", "RIVM", "Spaarne ziekenhuis"
), Datum = structure(c(1567468800, 1567555200, 1567900800, 1567468800, 
1567468800, 1567555200, 1567987200, 1568246400), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))

I wish to make 4 groups of this data where each group has each of the values from the column 'Doos'.

My desired output would look like this:

  Donorcode  Doos `Leeftijd T0` Instituut          Datum              
  <chr>     <dbl>         <dbl> <chr>              <dttm>             
1 406A001       1            70 Spaarne ziekenhuis 2019-09-03 00:00:00
2 406A003       2            70 RIVM               2019-09-03 00:00:00
3 406A003       1            79 Spaarne ziekenhuis 2019-09-08 00:00:00
4 406A009       2            79 RIVM               2019-09-09 00:00:00
5 406A004       1            75 RIVM               2019-09-03 00:00:00
6 406A008       2            73 RIVM               2019-09-04 00:00:00
7 406A002       1            73 Spaarne ziekenhuis 2019-09-04 00:00:00
8 406A007       2            75 Spaarne ziekenhuis 2019-09-12 00:00:00

I've seen many posts about grouping and then summarizing, but I don't need to summarize, and the group_by function from dplyr doesn't seem to work for me. This is the output I get:

dplyr::group_by(df, Doos, Instituut)
# A tibble: 8 × 5
# Groups:   Doos, Instituut [4]
  Donorcode  Doos `Leeftijd T0` Instituut          Datum              
  <chr>     <dbl>         <dbl> <chr>              <dttm>             
1 406A001       1            70 Spaarne ziekenhuis 2019-09-03 00:00:00
2 406A002       1            73 Spaarne ziekenhuis 2019-09-04 00:00:00
3 406A003       1            79 Spaarne ziekenhuis 2019-09-08 00:00:00
4 406A004       1            75 RIVM               2019-09-03 00:00:00
5 406A003       2            70 RIVM               2019-09-03 00:00:00
6 406A008       2            73 RIVM               2019-09-04 00:00:00
7 406A009       2            79 RIVM               2019-09-09 00:00:00
8 406A007       2            75 Spaarne ziekenhuis 2019-09-12 00:00:00

Could someone please help? If it's possible, I would like a function that could group by multiple columns at a time (so that I can also include the Instituut column for the grouping).

I hope anyone can help me!

Thanks so much

CodePudding user response:

Maybe you want something like this, where you reorder your data frame to follow a given sequence. The Doos column currently has the sequence ("given_seq") 1 1 1 1 2 2 2 2, and you want the order ("seq_order") 1 2 1 2 1 2 1 2. You can use the following code to reorder the data frame to match that sequence:

given_seq <- as.vector(df$Doos)
seq_order <- rep(1:2, 4)
df[order(given_seq),][order(order(seq_order)),]
#>   Donorcode Doos Leeftijd T0          Instituut      Datum
#> 1   406A001    1          70 Spaarne ziekenhuis 2019-09-03
#> 5   406A003    2          70               RIVM 2019-09-03
#> 2   406A002    1          73 Spaarne ziekenhuis 2019-09-04
#> 6   406A008    2          73               RIVM 2019-09-04
#> 3   406A003    1          79 Spaarne ziekenhuis 2019-09-08
#> 7   406A009    2          79               RIVM 2019-09-09
#> 4   406A004    1          75               RIVM 2019-09-03
#> 8   406A007    2          75 Spaarne ziekenhuis 2019-09-12

Created on 2022-08-24 with reprex v2.0.2
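The same interleaving can also be sketched without hard-coding the target sequence, by numbering the rows within each Doos value and sorting on that number. This is only a base-R sketch under assumptions: the column name slot is made up for illustration, the Datum and Leeftijd T0 columns are left out for brevity, and it assumes every Doos value occurs equally often.

```r
# Rebuild part of the question's data as a plain data.frame
df <- data.frame(
  Donorcode = c("406A001", "406A002", "406A003", "406A004",
                "406A003", "406A008", "406A009", "406A007"),
  Doos = c(1, 1, 1, 1, 2, 2, 2, 2),
  Instituut = c("Spaarne ziekenhuis", "Spaarne ziekenhuis",
                "Spaarne ziekenhuis", "RIVM", "RIVM", "RIVM", "RIVM",
                "Spaarne ziekenhuis")
)

# Position of each row within its own Doos value: 1 2 3 4 1 2 3 4
# ("slot" is a hypothetical helper column, not part of the original data)
df$slot <- ave(seq_along(df$Doos), df$Doos, FUN = seq_along)

# Sort by slot first, then by Doos, so each slot holds one row per Doos value
interleaved <- df[order(df$slot, df$Doos), ]
interleaved$Doos
```

The slot column then doubles as a group label: each of the four slots contains exactly one row with Doos 1 and one with Doos 2. It does not take Instituut into account.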

CodePudding user response:

You are looking for a kind of "anti-clustering", and the anticlust package exists for exactly that. Check whether this works for you.

To anti-cluster on 'Doos' and 'Instituut', both columns first need to be numeric. We can get that by using transform to convert Instituut with as.factor/as.numeric, and then using subset to select the two columns. (Here dat is the data frame from the question.)

library(anticlust)

dat$group <- anticlustering(
  subset(transform(dat, Instituut2=as.numeric(as.factor(Instituut))), 
         select=c(Doos, Instituut2)),
  K=4,
  objective="variance",
  method="local-maximum"
)

We can assess the result better when it's ordered.

dat[order(dat$group), ]
#   Donorcode Doos Leeftijd T0          Instituut      Datum group
# 3   406A003    1          79 Spaarne ziekenhuis 2019-09-08     1
# 5   406A003    2          70               RIVM 2019-09-03     1
# 2   406A002    1          73 Spaarne ziekenhuis 2019-09-04     2
# 7   406A009    2          79               RIVM 2019-09-09     2
# 1   406A001    1          70 Spaarne ziekenhuis 2019-09-03     3
# 6   406A008    2          73               RIVM 2019-09-04     3
# 4   406A004    1          75               RIVM 2019-09-03     4
# 8   406A007    2          75 Spaarne ziekenhuis 2019-09-12     4

or make a table.

with(dat, table(Doos, Instituut, group))
# , , group = 1
# 
# Instituut
# Doos RIVM Spaarne ziekenhuis
#    1    0                  1
#    2    1                  0
# 
# , , group = 2
# 
# Instituut
# Doos RIVM Spaarne ziekenhuis
#    1    0                  1
#    2    1                  0
# 
# , , group = 3
# 
# Instituut
# Doos RIVM Spaarne ziekenhuis
#    1    0                  1
#    2    1                  0
# 
# , , group = 4
# 
# Instituut
# Doos RIVM Spaarne ziekenhuis
#    1    1                  0
#    2    0                  1
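If reading the printed table by eye gets tedious, the same check can be done programmatically. The sketch below is only an illustration: it does not call anticlust, and the group labels are copied by hand from the ordered output above (in original row order), so a fresh run of anticlustering may number the groups differently.

```r
# Doos and Instituut from the question, plus the group labels shown above
dat <- data.frame(
  Doos = c(1, 1, 1, 1, 2, 2, 2, 2),
  Instituut = c("Spaarne ziekenhuis", "Spaarne ziekenhuis",
                "Spaarne ziekenhuis", "RIVM", "RIVM", "RIVM", "RIVM",
                "Spaarne ziekenhuis"),
  # labels copied from the posted output, rows 1 to 8
  group = c(3, 2, 1, 4, 1, 3, 2, 4)
)

# Cross-tabulate group against each variable; a balanced split has
# exactly one row per combination
doos_balanced <- all(table(dat$group, dat$Doos) == 1)
instituut_balanced <- all(table(dat$group, dat$Instituut) == 1)

doos_balanced
instituut_balanced
```

Both values are TRUE for the labels shown, meaning every group contains each Doos value and each Instituut exactly once.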