how to filter columns in dataframes?-CodePudding

I have a long list of dates (starting from 1942-1-1 00:00:00 to 2012-12-31 24:00:00). These are associated with some amounts respectively (see below). Is there a way to first filter all amounts for one day separately, and then add them up together?

For example for 1942-01-01, how to find all values (amounts) that occur in this day (from time 0 to 24) and then sum them together?

    time                    amount
        DateTime            Float64
    1942-01-01T00:00:00     7.0
    1942-01-02T00:00:00     0.2
    1942-01-03T00:00:00     2.1
    1942-01-04T00:00:00     3.0
        :
    2012-12-31T23:00:00     4.0
    2012-12-31T24:00:00     0.0

df = CSV.read(path, DataFrame)
for i in 1:24
  filter(r ->hour(r.time) == i, df)
end

CodePudding user response：

There are many approaches you could use (and maybe some other commenters will propose alternatives). Here let me show you how to achieve what you want without any filtering:

julia> df = DataFrame(time=[DateTime(2020, 1, rand(1:2), rand(0:23)) for _ in 1:100], amount=rand(100))
100×2 DataFrame
 Row │ time                 amount
     │ DateTime             Float64
─────┼────────────────────────────────
   1 │ 2020-01-02T16:00:00  0.29325
   2 │ 2020-01-02T02:00:00  0.376917
   3 │ 2020-01-02T09:00:00  0.11849
   4 │ 2020-01-02T04:00:00  0.462997
  ⋮  │          ⋮               ⋮
  97 │ 2020-01-02T18:00:00  0.750604
  98 │ 2020-01-01T13:00:00  0.179414
  99 │ 2020-01-01T15:00:00  0.552547
 100 │ 2020-01-01T02:00:00  0.769066
                       92 rows omitted

julia> transform!(df, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour)
100×4 DataFrame
 Row │ time                 amount     date        hour
     │ DateTime             Float64    Date        Int64
─────┼───────────────────────────────────────────────────
   1 │ 2020-01-02T16:00:00  0.29325    2020-01-02     16
   2 │ 2020-01-02T02:00:00  0.376917   2020-01-02      2
   3 │ 2020-01-02T09:00:00  0.11849    2020-01-02      9
   4 │ 2020-01-02T04:00:00  0.462997   2020-01-02      4
  ⋮  │          ⋮               ⋮          ⋮         ⋮
  97 │ 2020-01-02T18:00:00  0.750604   2020-01-02     18
  98 │ 2020-01-01T13:00:00  0.179414   2020-01-01     13
  99 │ 2020-01-01T15:00:00  0.552547   2020-01-01     15
 100 │ 2020-01-01T02:00:00  0.769066   2020-01-01      2
                                          92 rows omitted

julia> unstack(df, :hour, :date, :amount, combine=sum, fill=0)
24×3 DataFrame
 Row │ hour   2020-01-02  2020-01-01
     │ Int64  Float64     Float64
─────┼───────────────────────────────
   1 │    16    1.06636     0.949414
   2 │     2    0.990913    1.43032
   3 │     9    0.183206    3.16363
   4 │     4    1.24055     0.57196
  ⋮  │   ⋮        ⋮           ⋮
  21 │    10    0.0         0.492397
  22 │    14    0.393438    0.0
  23 │    21    0.0         0.487992
  24 │     8    0.848852    0.0
                      16 rows omitted

The final result is a data frame that gives you aggregates for all hours (in rows) for all days (in columns). The data is presented in the order of their appearance, so you might want to sort the result by hour:

julia> res = sort!(unstack(df, :hour, :date, :amount, combine=sum, fill=0), :hour)
24×3 DataFrame
 Row │ hour   2020-01-02  2020-01-01
     │ Int64  Float64     Float64
─────┼───────────────────────────────
   1 │     0    1.99143     0.150979
   2 │     1    1.25939     0.860835
   3 │     2    0.990913    1.43032
   4 │     3    3.83337     2.33696
  ⋮  │   ⋮        ⋮           ⋮
  21 │    20    1.73576     1.93323
  22 │    21    0.0         0.487992
  23 │    22    1.52546     0.651938
  24 │    23    1.03808     0.0
                      16 rows omitted

Now you can extract information for a specific day just by extracting a column corresponding to it, e.g.:

julia> res."2020-01-02"
24-element Vector{Float64}:
 1.991425180864845
 1.2593855803084226
 0.9909134301068651
 3.833369559458414
 1.2405519797178841
 1.4494215475119732
 ⋮
 2.4509665509554157
 0.0
 1.7357636571508785
 0.0
 1.525457178008634
 1.0380772820126043

For the amount of data you have there should be no problem with getting all the results in one shot (in this example I pre-sorted the source data frame on day and hour to make the final table nicely ordered both by rows and columns):

julia> @time big = DataFrame(time=[DateTime(rand(1942:2012), rand(1:12), rand(1:28), rand(0:23)) for _ in 1:10^7], amount=rand(10^7));
  0.413495 seconds (99.39 k allocations: 310.149 MiB, 3.75% gc time, 5.54% compilation time)

julia> @time sort!(transform!(big, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour), [:date, :hour]);
  5.049808 seconds (1.03 M allocations: 1.167 GiB, 0.81% gc time)

julia> @time unstack(big, :hour, :date, :amount, combine=sum, fill=0)
  1.342251 seconds (21.58 M allocations: 673.052 MiB, 13.63% gc time)
24×23857 DataFrame
 Row │ hour   1942-01-01  1942-01-02  1942-01-03  1942-01-04  1942-01-05  1942-01-06  1942-01-07  1942-01-08  1942-01-09  1942-01-10  1942-01-11  194 ⋯
     │ Int64  Float64     Float64     Float64     Float64     Float64     Float64     Float64     Float64     Float64     Float64     Float64     Flo ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     0     9.19054     8.00765     6.99379     9.63979     6.5088     11.6281     12.4928      6.86322    11.4453     12.6505     10.0583     1 ⋯
   2 │     1     8.78977     8.32879     6.29344    12.0815      9.83297     8.24592    10.349      10.1213      6.51192     6.1523      8.38962
   3 │     2     5.51566     9.97157    12.1064      8.28468    11.1929      8.274       8.25525     7.88186     4.65225     7.44625     6.62251    1
   4 │     3     7.25526    13.1635      4.75877     9.77418    11.5427      6.30625     6.2512      8.06394     8.77394    12.5935      9.09008
  ⋮  │   ⋮        ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋮           ⋱
  21 │    20     8.46999     9.99227    11.1116     14.5478     11.8379      7.38414    11.0567      6.17652    10.6811      9.059       9.77321      ⋯
  22 │    21     7.02998    10.0908      5.5182      8.8145      9.81238    10.8413      8.65648    12.6846     12.1116      8.75566    11.2892     1
  23 │    22     9.17824    13.2115     10.589       9.87813    10.7258      7.97428    12.8137     10.3456      8.37605     9.54897     7.24197
  24 │    23    13.0214     10.2333      9.08972    11.8678      7.36996     9.80802    11.0031      6.0818     11.7789      4.3467      7.49586
                                                                                                                      23845 columns and 16 rows omitted

EDIT

Here is an example how you can use filter. I assume we work on a big data frame created above and want information for 1942-02-03 only. I am also using Chain.jl to nicely chain the performed operations:

julia> @chain big begin
           filter(:date => ==(Date("1942-02-03")), _)
           groupby(:hour, sort=true)
           combine(:amount => sum)
       end
24×2 DataFrame
 Row │ hour   amount_sum
     │ Int64  Float64
─────┼───────────────────
   1 │     0     6.22427
   2 │     1     8.33195
   3 │     2     9.26992
   4 │     3    13.7858
  ⋮  │   ⋮        ⋮
  21 │    20     6.59938
  22 │    21     6.07788
  23 │    22     6.68741
  24 │    23     7.59147
          16 rows omitted

(if anything is unclear please comment)