Home > Blockchain >  Filter a dataframe to keep only rows that are increasingly smaller than the first row
Filter a dataframe to keep only rows that are increasingly smaller than the first row

Time:09-11

I have a dataframe, which is sorted in descending order by the abc column (and needs to remain that way). So, beginning with the first row (i.e., where ID is 1259 and value is 4054), I want to keep the next row that has a lower value in the value column, which would be row 2 (i.e., value is 3902). Then, based on that value in row 2 (3902), I would then want to select the next lowest value row (i.e., 3801 in row 4). Then, based on that value in row 4 (3801), I would select the next lowest value row (i.e., rows 5-7 with a value of 1314), and so on. I also want to retain rows that are selected that have duplicate values (e.g., 1314 value for rows 5-7).

Data

# A tibble: 42 × 3
      ID   abc value
   <int> <dbl> <dbl>
 1  1259  4530 4054.
 2  1617  4415 3902.
 3  2441  4250 5286.
 4   539  4161 3801.
 5  1471  4012 1314.
 6   687  3934 1314.
 7  2436  3900 1314.
 8  1295  3591 3558.
 9  1738  3493 1188.
10  1747  3366 3902.
# … with 32 more rows

df <- structure(list(ID = c(1259L, 1617L, 2441L, 539L, 1471L, 687L, 
2436L, 1295L, 1738L, 1747L, 672L, 2187L, 2214L, 738L, 1784L, 
726L, 1712L, 2961L, 137L, 1188L, 1824L, 2856L, 953L, 2579L, 2010L, 
484L, 2769L, 1736L, 2994L, 1572L, 586L, 2462L, 2180L, 1404L, 
1472L, 1007L, 3000L, 545L, 2098L, 2029L, 2557L, 1623L), abc = c(4530, 
4415, 4250, 4161, 4012, 3934, 3900, 3591, 3493, 3366, 3248, 3172, 
3172, 3004, 3004, 2939, 2853, 2782, 2541, 2439, 2371, 2116, 2005, 
1999, 1838, 1749, 1611, 1573, 1416, 1301, 1177, 1074, 975, 805, 
729, 680, 538, 494, 309, 297, 155, 0), value = c(4053.69672793727, 
3902.10003271411, 5286.47420784735, 3801.25160651053, 1314.10007572174, 
1314.10007572174, 1314.10007572174, 3557.56126600696, 1188.20000553131, 
3902.10003271411, 1543.90001773834, 1543.90001773834, 1543.90001773834, 
1228.10004138947, 1228.10004138947, 1618.80007553101, 1618.80007553101, 
2787.70004177094, 2877.32907307533, 1289.59995269775, 2640.10003471375, 
1618.80007553101, 3557.56126600696, 3902.10003271411, 1228.10004138947, 
1329.40004730225, 1228.10004138947, 3557.56126600696, 1618.80007553101, 
1206.79998779297, 2117.7806730578, 1206.79998779297, 1206.79998779297, 
3260.02910244849, 1191.90001773834, 1191.90001773834, 1029.70001125336, 
1336.000041008, 1091.09995365143, 1234.89997673035, 1164.4999704361, 
1191.90001773834)), row.names = c(NA, -42L), class = c("tbl_df", 
"tbl", "data.frame"))

Expected Output

     ID   abc value
  <int> <dbl> <dbl>
1  1259  4530 4054.
2  1617  4415 3902.
3   539  4161 3801.
4  1471  4012 1314.
5   687  3934 1314.
6  2436  3900 1314.
7  1738  3493 1188.
8  3000   538 1030.

I know I could do something like this answer to find the next lowest value for each row, but I am unsure how to go through and filter. Since this can skip several rows, using lag and lead did not work for me. I would like to be able to do this with tidyverse or data.table or the apply family. I would like to avoid using a for loop, unless that's the only way. I have a larger dataset, so I am looking for an efficient way to do this.

CodePudding user response:

library(dplyr)
filter(df, value == cummin(value))


# A tibble: 8 × 3
     ID   abc value
  <int> <dbl> <dbl>
1  1259  4530 4054.
2  1617  4415 3902.
3   539  4161 3801.
4  1471  4012 1314.
5   687  3934 1314.
6  2436  3900 1314.
7  1738  3493 1188.
8  3000   538 1030.
  • Related