Situation & Goal
I'm having a large table that looks like (simplified):
|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A |Y | 50|
|A |Z | 60|
|A |ZZZZ | 80|
|A |XX | 90|
|A |X | 100|
|B |XYXY | 15|
|B |XXX | 50|
|B |YY | 60|
|B |ZZZ | 150|
|B |ZZ | 400|
Now I want to filter each group (MainCat
) and keep only the two lowest values (Value
) that are equal/greater than median:
|MainCat |SubCat | Value|Comment |
|:-------|:------|-----:|:---------------------|
|A |Y | 50|- |
|A |Z | 60|- |
|A |ZZZZ | 80|Median, First to keep |
|A |XX | 90|Second to keep |
|A |X | 100|- |
|B |XYXY | 15|- |
|B |XXX | 50|- |
|B |YY | 60|Median, First to keep |
|B |ZZZ | 150|Second to keep |
|B |ZZ | 400|- |
Expected result:
|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A |ZZZZ | 80|
|A |XX | 90|
|B |YY | 60|
|B |ZZZ | 150|
My (failed) attempt
I tried df2[Value >= df2[MainCat==MainCat, median(Value, na.rm=TRUE)]]
but this calculates a Median for all values, without grouping. Can somebody help? As performance is key, I prefer a data.table
solution if possible. Thank you very much.
MWE
Base data:
df2 = structure(list(MainCat = c("A", "A", "A", "A", "A", "B", "B",
"B", "B", "B"), SubCat = c("Y", "Z", "ZZZZ", "XX", "X", "XYXY",
"XXX", "YY", "ZZZ", "ZZ"), Value = c(50, 60, 80, 90, 100, 15,
50, 60, 150, 400)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Result:
data.table(MainCat=c("A","A","B","B"),
SubCat=c("ZZZZ", "XX", "YY", "ZZZ"),
Value=c(80,90,60,150))
CodePudding user response:
Do a group by 'MainCat', get the row index (.I
) after creating the logical expression with the median
'Value', extract the index ($V1
), subset the data, order
by the 'MainCat', 'Value', get the first two rows with head
, grouped by 'MainCat'
library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)],.(MainCat)]$V1
][order(MainCat, Value), head(.SD, 2), MainCat]
-output
MainCat SubCat Value
<char> <char> <num>
1: A ZZZZ 80
2: A XX 90
3: B YY 60
4: B ZZZ 150