Filter/select rows by comparing to previous rows when using DataFrames.jl?

I have a DataFrame with 659 rows and 2 columns, sorted by its Low column. Its first 20 rows are shown below:

julia> size(dfl)
(659, 2)

julia> first(dfl, 20)
20×2 DataFrame
 Row │ Date        Low
     │ Date…       Float64
─────┼──────────────────────
   1 │ 2010-05-06  0.708333
   2 │ 2010-07-01  0.717292
   3 │ 2010-08-27  0.764583
   4 │ 2010-08-31  0.776146
   5 │ 2010-08-25  0.783125
   6 │ 2010-05-25  0.808333
   7 │ 2010-06-08  0.820938
   8 │ 2010-07-20  0.82375
   9 │ 2010-05-21  0.824792
  10 │ 2010-08-16  0.842188
  11 │ 2010-08-12  0.849688
  12 │ 2010-02-25  0.871979
  13 │ 2010-02-23  0.879896
  14 │ 2010-07-30  0.890729
  15 │ 2010-06-01  0.916667
  16 │ 2010-08-06  0.949271
  17 │ 2010-09-10  0.949792
  18 │ 2010-03-04  0.969375
  19 │ 2010-05-17  0.9875
  20 │ 2010-03-09  1.0349

What I'd like to do is keep only the rows whose dates are strictly increasing from top to bottom, that is, drop every row whose date is not later than the date of the previously kept row. Applied to the first 20 rows above, the output should be the following:

julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
 Row │ Date        Low
     │ Date…       Float64
─────┼──────────────────────
   1 │ 2010-05-06  0.708333
   2 │ 2010-07-01  0.717292
   3 │ 2010-08-27  0.764583
   4 │ 2010-08-31  0.776146
   5 │ 2010-09-10  0.949792

Is there some high-level way to achieve this using Julia and DataFrames.jl?

I should also note that I originally prototyped the solution in Python using Pandas. Because it was just a PoC, I didn't bother figuring out how to achieve this with Pandas either (assuming it's even possible). Instead, I used a Python for loop to iterate over the rows of the dataframe, appending only the rows whose dates are greater than the last date in the growing list.

I'm now trying to write this better in Julia, and looked into the filter and subset methods in DataFrames.jl. Intuitively, filter doesn't seem like it would work, since the user-supplied function can only access the contents of the row it is passed; subset might be feasible since it has access to the entire column. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then I guess I'll just have to stick with a for loop here too.

Answer:

In the end you need a for loop for this task, because every value has to be visited.
In Julia, loops are fast, so writing your own for loop does not hurt performance.
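A minimal sketch of such a loop, assuming the Date column contains no missing values and uses ordinary 1-based indexing. The helper name increasing_mask and the sample dates are made up for illustration. It builds a Bool mask in a single pass, keeping each row whose date is strictly later than the last kept date:

```julia
using Dates

# Hypothetical helper (not part of DataFrames.jl): returns a mask that is true
# for every element strictly greater than the last element kept so far.
function increasing_mask(dates::AbstractVector)
    keep = falses(length(dates))      # one flag per row, all false initially
    isempty(dates) && return keep
    last_kept = first(dates)
    keep[1] = true                    # the first row is always kept
    for i in 2:length(dates)
        if dates[i] > last_kept
            keep[i] = true
            last_kept = dates[i]
        end
    end
    return keep
end

# Made-up sample dates, not the asker's data.
sample_dates = Date.(["2010-05-06", "2010-07-01", "2010-06-08", "2010-08-27"])
mask = increasing_mask(sample_dates)
```

With the table from the question, the mask would be applied as dfl[increasing_mask(dfl.Date), :]. Passing the column vector into a function, rather than indexing dfl.Date inside the loop, lets Julia compile the loop for the concrete element type.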
If you are looking for something that is relatively short to type (it will be slower than a custom for loop, since it performs the operation in several passes), you can use, for example:

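using Dates  # diff on a Date column yields Day periods, which cannot be compared with the integer 0
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> Day(0), true), :]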
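To see why the short form works, here is the same expression split into its passes on a small made-up vector of dates (not the asker's data). The comparison uses Day(0) because diff on dates returns Day periods rather than integers:

```julia
using Dates

# Made-up dates for illustration, in the order they would appear after sorting by Low.
dates = Date.(["2010-05-06", "2010-07-01", "2010-06-08", "2010-08-27", "2010-08-27"])

running_max = accumulate(max, dates)       # latest date seen up to and including each row
steps = diff(running_max)                  # Day periods; one element shorter than dates
mask = pushfirst!(steps .> Day(0), true)   # first row always kept; others only when the maximum grows

kept = dates[mask]
```

A row survives only when it raises the running maximum, so repeated dates are dropped as well. Since subset passes the whole column to the supplied function, the same mask should also work in the form subset(dfl, :Date => d -> pushfirst!(diff(accumulate(max, d)) .> Day(0), true)), assuming the column has no missing values.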