Filter/select rows by comparing to previous rows when using DataFrames.jl?

Time:10-18

I have a 659×2 DataFrame that is sorted by its Low column. Its first 20 rows are shown below:

julia> size(dfl)
(659, 2)

julia> first(dfl, 20)
20×2 DataFrame
 Row │ Date        Low
     │ Date…       Float64
─────┼──────────────────────
   1 │ 2010-05-06  0.708333
   2 │ 2010-07-01  0.717292
   3 │ 2010-08-27  0.764583
   4 │ 2010-08-31  0.776146
   5 │ 2010-08-25  0.783125
   6 │ 2010-05-25  0.808333
   7 │ 2010-06-08  0.820938
   8 │ 2010-07-20  0.82375
   9 │ 2010-05-21  0.824792
  10 │ 2010-08-16  0.842188
  11 │ 2010-08-12  0.849688
  12 │ 2010-02-25  0.871979
  13 │ 2010-02-23  0.879896
  14 │ 2010-07-30  0.890729
  15 │ 2010-06-01  0.916667
  16 │ 2010-08-06  0.949271
  17 │ 2010-09-10  0.949792
  18 │ 2010-03-04  0.969375
  19 │ 2010-05-17  0.9875
  20 │ 2010-03-09  1.0349

What I'd like to do is keep only the rows whose dates are strictly increasing, that is, drop every row whose date is not later than the date of the last row kept. Applied to the first 20 rows above, the output should be the following:

julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
 Row │ Date        Low
     │ Date…       Float64
─────┼──────────────────────
   1 │ 2010-05-06  0.708333
   2 │ 2010-07-01  0.717292
   3 │ 2010-08-27  0.764583
   4 │ 2010-08-31  0.776146
   5 │ 2010-09-10  0.949792

Is there some high-level way to achieve this using Julia and DataFrames.jl?

I should also note that I originally prototyped the solution in Python using Pandas. Because it was just a PoC, I didn't bother figuring out how to achieve this with Pandas itself (assuming it's even possible). Instead, I used a plain Python for loop to iterate over the rows of the dataframe and appended a row to a growing list only if its date was greater than the date of the last row in that list.

I'm now trying to write this better in Julia, and have looked into the filter and subset methods in DataFrames.jl. Intuitively, filter doesn't seem like it would work, since the user-supplied function can only access the contents of the row it is passed; subset might be feasible, since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If it isn't, I guess I'll just have to stick with a for loop here too.

CodePudding user response:

  1. In the end, this task needs a for loop (every value has to be visited).
  2. In Julia, loops are fast, so writing your own for loop does not hurt performance.
  3. If you are looking for something that is relatively short to type (it will be slower than a custom for loop, because it performs the operation in several passes), you can use e.g. the following (diff on a Date column yields Day values, so the comparison is against Day(0), which requires `using Dates`):
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> Day(0), true), :]
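For reference, the single-pass loop mentioned in points 1 and 2 could look roughly like the sketch below. This is only an illustration under a few assumptions: `increasing_mask` is a hypothetical helper name (it is not part of DataFrames.jl), "increasing" is taken to mean strictly later than the last kept date, and the function works on a plain vector of dates so that it only needs the `Dates` standard library. The sample dates are the 20 rows from the question.

```julia
using Dates

# Hypothetical helper (not part of DataFrames.jl): returns a Boolean mask that
# is true for the first element and for every later element that is strictly
# greater than the last element kept so far.
function increasing_mask(dates::AbstractVector{<:Dates.TimeType})
    mask = falses(length(dates))
    isempty(dates) && return mask
    last_kept = first(dates)
    mask[1] = true
    for (i, d) in enumerate(dates)
        if d > last_kept
            mask[i] = true
            last_kept = d
        end
    end
    return mask
end

# The Date column of the first 20 rows shown in the question, in the same order.
dates = Date.([
    "2010-05-06", "2010-07-01", "2010-08-27", "2010-08-31", "2010-08-25",
    "2010-05-25", "2010-06-08", "2010-07-20", "2010-05-21", "2010-08-16",
    "2010-08-12", "2010-02-25", "2010-02-23", "2010-07-30", "2010-06-01",
    "2010-08-06", "2010-09-10", "2010-03-04", "2010-05-17", "2010-03-09",
])

mask = increasing_mask(dates)
kept = dates[mask]
```

Assuming dfl is the DataFrame from the question, the same mask should select the wanted rows via Boolean row indexing, i.e. dfl[increasing_mask(dfl.Date), :].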