Home > Blockchain >  How to subset a dataframe in R based on multiple conjoined conditions
How to subset a dataframe in R based on multiple conjoined conditions

Time:06-05

Let's say I have the data frame df below:

df <- data.frame(School_ID = c(3334, 3111, 3334, 3500, 3800, 3111, 2000, 5000),
                 Phase = c("Phase1", "Phase1", "Phase2", "Phase1", "Phase2",
                           "Phase2", "Phase1", "Phase2"),
                 MeasureAT1 = c(5, 3, NA, 4, 4, 1, NA, 1),
                 MeasureAT2 = c(NA, NA, 3, 1, 5, 2, NA, 3),
                 MeasureBT1 = c(2, 3, 4, NA, 4, NA, 1, 1),
                 MeasureBT2 = c(NA, 2, 4, 1, NA, 2, NA, 3))

Created on 2022-06-04 by the reprex package (v2.0.1)

What I am trying to achieve is to subset this data frame so that the data frame I am left with meets the following conditions:

1 - I can remove the row associated with School ID == 3334 and Phase = Phase1 (in this case row 1), while keeping the row with the results of that same School ID 3334 but referring to Phase2 on the Phase column (in this case row 3);

2 - I can remove the row associated with School ID 3111 and Phase2 (in this case row 6), while keeping the row with the results of that same School ID 3111 but referring to Phase1 on the Phase column (in this case row 2);

3 - I keep all other rows intact and their respective columns (Measures included);

4 - The solution is not based on row number. The reason is that in the actual data frame there are over 30000 rows and it will be impractical to remove about 8000 of them using row numbers;

I have tried subsetting and filtering but once I set Phase1 in the filter the Phase 2 options are immediately filtered out and vice versa, and in this case what I am aiming for is a filter that only removes a row if two conditions are simultaneously met. If the right school ID is met but the Phase is wrong, the row stays, if the right school ID is met and the corresponding value for Phase is also met, the column goes.

Any help you can provide would be most welcome.

CodePudding user response:

Logical operations can be combined in the second argument to subset:

subset(df , !( (School_ID == 3334 & Phase == "Phase1") | 
               (School_ID == 3111 & Phase == "Phase2")) )  #negation of both
  School_ID  Phase MeasureAT1 MeasureAT2 MeasureBT1 MeasureBT2
2      3111 Phase1          3         NA          3          2
3      3334 Phase2         NA          3          4          4
4      3500 Phase1          4          1         NA          1
5      3800 Phase2          4          5          4         NA
7      2000 Phase1         NA         NA          1         NA
8      5000 Phase2          1          3          1          3

The logical argument {Not ( A Or B)} could also have been {Not A & Not B}:

             ! (School_ID == 3334 & Phase == "Phase1") &
              ! (School_ID == 3111 & Phase == "Phase2")
  • Related