I have a DataFrame that is 659 x 2 in size and is sorted by its Low column. Its first 20 rows are shown below:
julia> size(dfl)
(659, 2)
julia> first(dfl, 20)
20×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-08-25 0.783125
6 │ 2010-05-25 0.808333
7 │ 2010-06-08 0.820938
8 │ 2010-07-20 0.82375
9 │ 2010-05-21 0.824792
10 │ 2010-08-16 0.842188
11 │ 2010-08-12 0.849688
12 │ 2010-02-25 0.871979
13 │ 2010-02-23 0.879896
14 │ 2010-07-30 0.890729
15 │ 2010-06-01 0.916667
16 │ 2010-08-06 0.949271
17 │ 2010-09-10 0.949792
18 │ 2010-03-04 0.969375
19 │ 2010-05-17 0.9875
20 │ 2010-03-09 1.0349
What I'd like to do is to filter out all rows in this dataframe such that only rows with monotonically increasing dates remain. So if applied to the first 20 rows above, I'd like the output to be the following:
julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-09-10 0.949792
Is there some high-level way to achieve this using Julia and DataFrames.jl?
I should also note that I originally prototyped the solution in Python using Pandas, and because it was just a proof of concept I didn't bother trying to figure out how to achieve this with Pandas either (assuming it's even possible). Instead, I just used a Python for loop to iterate over each row of the dataframe and appended only the rows whose dates are greater than the last date in the growing list.
I'm now trying to write this better in Julia, and have looked into the filter and subset methods in DataFrames.jl. Intuitively, filter doesn't seem like it would work, since the user-supplied filter function can only access the contents of the row passed to it; subset might be feasible, since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then I guess I'll just have to stick with a for loop here too.
Answer:
- You need to use a for loop for this task in the end (you have to loop over all values).
- In Julia loops are fast, so using your own for loop does not hinder performance.
- If you are looking for something that is relatively short to type (but slower than a custom for loop, as it performs the operation in several passes), you can use e.g.:
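dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> Day(0), true), :]
Note that Day(0) (from the Dates standard library) is used rather than a plain 0, because subtracting two dates yields a Day period, which cannot be compared with an integer.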
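For comparison, the custom for loop mentioned in the first two points could look roughly like the sketch below. The function name increasing_date_mask is hypothetical (it is not part of DataFrames.jl), and the dates vector simply re-types the 20 dates shown in the question. The function only needs the date column and returns a Boolean mask, so with the question's data one would presumably index with dfl[increasing_date_mask(dfl.Date), :].

```julia
using Dates

# Hypothetical helper (not a DataFrames.jl function): returns a mask that is
# true wherever a date is strictly greater than every date before it.
function increasing_date_mask(dates)
    mask = falses(length(dates))
    isempty(dates) && return mask
    latest = first(dates)   # latest date kept so far
    mask[1] = true          # the first row is always kept
    for (i, d) in enumerate(dates)
        if d > latest
            mask[i] = true
            latest = d
        end
    end
    return mask
end

# The first 20 dates from the question, in their Low-sorted order
dates = Date.([
    "2010-05-06", "2010-07-01", "2010-08-27", "2010-08-31", "2010-08-25",
    "2010-05-25", "2010-06-08", "2010-07-20", "2010-05-21", "2010-08-16",
    "2010-08-12", "2010-02-25", "2010-02-23", "2010-07-30", "2010-06-01",
    "2010-08-06", "2010-09-10", "2010-03-04", "2010-05-17", "2010-03-09",
])

mask = increasing_date_mask(dates)
kept = dates[mask]   # the dates of the rows that survive the filter
```

This version makes a single pass and allocates only the mask, whereas the one-liner builds an intermediate vector for each of accumulate, diff, and the comparison.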