I have a long list of dates (starting from 1942-1-1 00:00:00 to 2012-12-31 24:00:00). These are associated with some amounts respectively (see below). Is there a way to first filter all amounts for one day separately, and then add them up together?
For example for 1942-01-01, how to find all values (amounts) that occur in this day (from time 0 to 24) and then sum them together?
time amount
DateTime Float64
1942-01-01T00:00:00 7.0
1942-01-02T00:00:00 0.2
1942-01-03T00:00:00 2.1
1942-01-04T00:00:00 3.0
:
2012-12-31T23:00:00 4.0
2012-12-31T24:00:00 0.0
df = CSV.read(path, DataFrame)
for i in 1:24
filter(r ->hour(r.time) == i, df)
end
CodePudding user response:
There are many approaches you could use (and maybe some other commenters will propose alternatives). Here let me show you how to achieve what you want without any filtering:
julia> df = DataFrame(time=[DateTime(2020, 1, rand(1:2), rand(0:23)) for _ in 1:100], amount=rand(100))
100×2 DataFrame
Row │ time amount
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325
2 │ 2020-01-02T02:00:00 0.376917
3 │ 2020-01-02T09:00:00 0.11849
4 │ 2020-01-02T04:00:00 0.462997
⋮ │ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604
98 │ 2020-01-01T13:00:00 0.179414
99 │ 2020-01-01T15:00:00 0.552547
100 │ 2020-01-01T02:00:00 0.769066
92 rows omitted
julia> transform!(df, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour)
100×4 DataFrame
Row │ time amount date hour
│ DateTime Float64 Date Int64
─────┼───────────────────────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325 2020-01-02 16
2 │ 2020-01-02T02:00:00 0.376917 2020-01-02 2
3 │ 2020-01-02T09:00:00 0.11849 2020-01-02 9
4 │ 2020-01-02T04:00:00 0.462997 2020-01-02 4
⋮ │ ⋮ ⋮ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604 2020-01-02 18
98 │ 2020-01-01T13:00:00 0.179414 2020-01-01 13
99 │ 2020-01-01T15:00:00 0.552547 2020-01-01 15
100 │ 2020-01-01T02:00:00 0.769066 2020-01-01 2
92 rows omitted
julia> unstack(df, :hour, :date, :amount, combine=sum, fill=0)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 16 1.06636 0.949414
2 │ 2 0.990913 1.43032
3 │ 9 0.183206 3.16363
4 │ 4 1.24055 0.57196
⋮ │ ⋮ ⋮ ⋮
21 │ 10 0.0 0.492397
22 │ 14 0.393438 0.0
23 │ 21 0.0 0.487992
24 │ 8 0.848852 0.0
16 rows omitted
The final result is a data frame that gives you aggregates for all hours (in rows) for all days (in columns). The data is presented in the order of their appearance, so you might want to sort the result by hour
:
julia> res = sort!(unstack(df, :hour, :date, :amount, combine=sum, fill=0), :hour)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 0 1.99143 0.150979
2 │ 1 1.25939 0.860835
3 │ 2 0.990913 1.43032
4 │ 3 3.83337 2.33696
⋮ │ ⋮ ⋮ ⋮
21 │ 20 1.73576 1.93323
22 │ 21 0.0 0.487992
23 │ 22 1.52546 0.651938
24 │ 23 1.03808 0.0
16 rows omitted
Now you can extract information for a specific day just by extracting a column corresponding to it, e.g.:
julia> res."2020-01-02"
24-element Vector{Float64}:
1.991425180864845
1.2593855803084226
0.9909134301068651
3.833369559458414
1.2405519797178841
1.4494215475119732
⋮
2.4509665509554157
0.0
1.7357636571508785
0.0
1.525457178008634
1.0380772820126043
For the amount of data you have there should be no problem with getting all the results in one shot (in this example I pre-sorted the source data frame on day and hour to make the final table nicely ordered both by rows and columns):
julia> @time big = DataFrame(time=[DateTime(rand(1942:2012), rand(1:12), rand(1:28), rand(0:23)) for _ in 1:10^7], amount=rand(10^7));
0.413495 seconds (99.39 k allocations: 310.149 MiB, 3.75% gc time, 5.54% compilation time)
julia> @time sort!(transform!(big, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour), [:date, :hour]);
5.049808 seconds (1.03 M allocations: 1.167 GiB, 0.81% gc time)
julia> @time unstack(big, :hour, :date, :amount, combine=sum, fill=0)
1.342251 seconds (21.58 M allocations: 673.052 MiB, 13.63% gc time)
24×23857 DataFrame
Row │ hour 1942-01-01 1942-01-02 1942-01-03 1942-01-04 1942-01-05 1942-01-06 1942-01-07 1942-01-08 1942-01-09 1942-01-10 1942-01-11 194 ⋯
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Flo ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 9.19054 8.00765 6.99379 9.63979 6.5088 11.6281 12.4928 6.86322 11.4453 12.6505 10.0583 1 ⋯
2 │ 1 8.78977 8.32879 6.29344 12.0815 9.83297 8.24592 10.349 10.1213 6.51192 6.1523 8.38962
3 │ 2 5.51566 9.97157 12.1064 8.28468 11.1929 8.274 8.25525 7.88186 4.65225 7.44625 6.62251 1
4 │ 3 7.25526 13.1635 4.75877 9.77418 11.5427 6.30625 6.2512 8.06394 8.77394 12.5935 9.09008
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
21 │ 20 8.46999 9.99227 11.1116 14.5478 11.8379 7.38414 11.0567 6.17652 10.6811 9.059 9.77321 ⋯
22 │ 21 7.02998 10.0908 5.5182 8.8145 9.81238 10.8413 8.65648 12.6846 12.1116 8.75566 11.2892 1
23 │ 22 9.17824 13.2115 10.589 9.87813 10.7258 7.97428 12.8137 10.3456 8.37605 9.54897 7.24197
24 │ 23 13.0214 10.2333 9.08972 11.8678 7.36996 9.80802 11.0031 6.0818 11.7789 4.3467 7.49586
23845 columns and 16 rows omitted
EDIT
Here is an example how you can use filter
. I assume we work on a big
data frame created above and want information for 1942-02-03 only. I am also using Chain.jl to nicely chain the performed operations:
julia> @chain big begin
filter(:date => ==(Date("1942-02-03")), _)
groupby(:hour, sort=true)
combine(:amount => sum)
end
24×2 DataFrame
Row │ hour amount_sum
│ Int64 Float64
─────┼───────────────────
1 │ 0 6.22427
2 │ 1 8.33195
3 │ 2 9.26992
4 │ 3 13.7858
⋮ │ ⋮ ⋮
21 │ 20 6.59938
22 │ 21 6.07788
23 │ 22 6.68741
24 │ 23 7.59147
16 rows omitted
(if anything is unclear please comment)