Home > Software engineering >  data.table, filter >= median per group and keep two lowest
data.table, filter >= median per group and keep two lowest

Time:05-02

Situation & Goal

I'm having a large table that looks like (simplified):

|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A       |Y      |    50|
|A       |Z      |    60|
|A       |ZZZZ   |    80|
|A       |XX     |    90|
|A       |X      |   100|
|B       |XYXY   |    15|
|B       |XXX    |    50|
|B       |YY     |    60|
|B       |ZZZ    |   150|
|B       |ZZ     |   400|

Now I want to filter each group (MainCat) and keep only the two lowest values (Value) that are equal/greater than median:

|MainCat |SubCat | Value|Comment               |
|:-------|:------|-----:|:---------------------|
|A       |Y      |    50|-                     |
|A       |Z      |    60|-                     |
|A       |ZZZZ   |    80|Median, First to keep |
|A       |XX     |    90|Second to keep        |
|A       |X      |   100|-                     |
|B       |XYXY   |    15|-                     |
|B       |XXX    |    50|-                     |
|B       |YY     |    60|Median, First to keep |
|B       |ZZZ    |   150|Second to keep        |
|B       |ZZ     |   400|-                     |

Expected result:

|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A       |ZZZZ   |    80|
|A       |XX     |    90|
|B       |YY     |    60|
|B       |ZZZ    |   150|

My (failed) attempt

I tried df2[Value >= df2[MainCat==MainCat, median(Value, na.rm=TRUE)]] but this calculates a Median for all values, without grouping. Can somebody help? As performance is key, I prefer a data.table solution if possible. Thank you very much.


MWE

Base data:

df2 = structure(list(MainCat = c("A", "A", "A", "A", "A", "B", "B", 
                                 "B", "B", "B"), SubCat = c("Y", "Z", "ZZZZ", "XX", "X", "XYXY", 
                                                "XXX", "YY", "ZZZ", "ZZ"), Value = c(50, 60, 80, 90, 100, 15, 
                                                             50, 60, 150, 400)), row.names = c(NA, -10L), class = c("data.table", 
                                                                "data.frame"))

Result:

data.table(MainCat=c("A","A","B","B"),
                 SubCat=c("ZZZZ", "XX", "YY", "ZZZ"),
                 Value=c(80,90,60,150))

CodePudding user response:

Do a group by 'MainCat', get the row index (.I) after creating the logical expression with the median 'Value', extract the index ($V1), subset the data, order by the 'MainCat', 'Value', get the first two rows with head, grouped by 'MainCat'

library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)],.(MainCat)]$V1
    ][order(MainCat, Value), head(.SD, 2), MainCat]

-output

   MainCat SubCat Value
    <char> <char> <num>
1:       A   ZZZZ    80
2:       A     XX    90
3:       B     YY    60
4:       B    ZZZ   150
  • Related