Home > Back-end >  Quartils calculation in Julia DataFrame
Quartils calculation in Julia DataFrame

Time:10-08

I have this dataframe

Seller     Product     Price
  1           A         10.00
  2           A         5.0
  3           A         7.50
  4           A         3.75

What I want as Output:

Seller     Product     Price    Quartil_1  Quartil_2  Quartil_3   Quartil_4  Rank_Price
  1           A         10.00      2.5       5.0        7.5         10         4
  2           A         5.0        2.5       5.0        7.5         10         2
  3           A         7.50       2.5       5.0        7.5         10         3
  4           A         3.35       2.5       5.0        7.5         10         1

So, my dataframe will have several sellers selling the same price for different prices. I need to calculate de 4 quartils of prices, considering all sellers for he same product and then classify each seller's price in this quartils, like the example above.

I know how to do it in python, using groupby, quartiles and rank but I am pretty new to Julia.

If someone can give a hand, please, i appreciate.

CodePudding user response:

Your input and output tables do not match. Also note that there are normally three quartiles defined Q1, Q2, and Q3 (see e.g. https://en.wikipedia.org/wiki/Quartile).

This is what I assume you want:

julia> using DataFrames, CategoricalArrays, DataFramesMeta

julia> df = DataFrame(seller=repeat(1:10, outer=2), product=repeat(["A", "B"], inner=10), price=1.5:1.0:20.5)
20×3 DataFrame
 Row │ seller  product  price
     │ Int64   String   Float64
─────┼──────────────────────────
   11  A            1.5
   22  A            2.5
   33  A            3.5
   44  A            4.5
   55  A            5.5
   66  A            6.5
   77  A            7.5
   88  A            8.5
   99  A            9.5
  1010  A           10.5
  111  B           11.5
  122  B           12.5
  133  B           13.5
  144  B           14.5
  155  B           15.5
  166  B           16.5
  177  B           17.5
  188  B           18.5
  199  B           19.5
  2010  B           20.5

julia> @transform groupby(df, :product) begin
       :quantile_1 = quantile(:price, 0.25)
       :quantile_2 = quantile(:price, 0.5)
       :quantile_3 = quantile(:price, 0.75)
       :rank_price = levelcode.(cut(:price, 4))
       end
20×7 DataFrame
 Row │ seller  product  price    quantile_1  quantile_2  quantile_3  rank_price
     │ Int64   String   Float64  Float64     Float64     Float64     Int64
─────┼──────────────────────────────────────────────────────────────────────────
   11  A            1.5        3.75         6.0        8.25           1
   22  A            2.5        3.75         6.0        8.25           1
   33  A            3.5        3.75         6.0        8.25           1
   44  A            4.5        3.75         6.0        8.25           2
   55  A            5.5        3.75         6.0        8.25           2
   66  A            6.5        3.75         6.0        8.25           3
   77  A            7.5        3.75         6.0        8.25           3
   88  A            8.5        3.75         6.0        8.25           4
   99  A            9.5        3.75         6.0        8.25           4
  1010  A           10.5        3.75         6.0        8.25           4
  111  B           11.5       13.75        16.0       18.25           1
  122  B           12.5       13.75        16.0       18.25           1
  133  B           13.5       13.75        16.0       18.25           1
  144  B           14.5       13.75        16.0       18.25           2
  155  B           15.5       13.75        16.0       18.25           2
  166  B           16.5       13.75        16.0       18.25           3
  177  B           17.5       13.75        16.0       18.25           3
  188  B           18.5       13.75        16.0       18.25           4
  199  B           19.5       13.75        16.0       18.25           4
  2010  B           20.5       13.75        16.0       18.25           4

or (maybe a bit more direct method but requiring an extra package)

julia> using StatsBase

julia> res = @transform groupby(df, :product) begin
       :quantile_1 = quantile(:price, 0.25)
       :quantile_2 = quantile(:price, 0.5)
       :quantile_3 = quantile(:price, 0.75)
       :rank_price = denserank(cut(:price, 4))
       end
20×7 DataFrame
 Row │ seller  product  price    quantile_1  quantile_2  quantile_3  rank_price
     │ Int64   String   Float64  Float64     Float64     Float64     Int64
─────┼──────────────────────────────────────────────────────────────────────────
   11  A            1.5        3.75         6.0        8.25           1
   22  A            2.5        3.75         6.0        8.25           1
   33  A            3.5        3.75         6.0        8.25           1
   44  A            4.5        3.75         6.0        8.25           2
   55  A            5.5        3.75         6.0        8.25           2
   66  A            6.5        3.75         6.0        8.25           3
   77  A            7.5        3.75         6.0        8.25           3
   88  A            8.5        3.75         6.0        8.25           4
   99  A            9.5        3.75         6.0        8.25           4
  1010  A           10.5        3.75         6.0        8.25           4
  111  B           11.5       13.75        16.0       18.25           1
  122  B           12.5       13.75        16.0       18.25           1
  133  B           13.5       13.75        16.0       18.25           1
  144  B           14.5       13.75        16.0       18.25           2
  155  B           15.5       13.75        16.0       18.25           2
  166  B           16.5       13.75        16.0       18.25           3
  177  B           17.5       13.75        16.0       18.25           3
  188  B           18.5       13.75        16.0       18.25           4
  199  B           19.5       13.75        16.0       18.25           4
  2010  B           20.5       13.75        16.0       18.25           4

I am using DataFramesMeta.jl as then the code is probably a bit easier to read than vanilla DataFrames.jl code.

  • Related