Row Number per Group in DataFrame

Time:02-22

I have a Julia DataFrame

using DataFrames
df = DataFrame(a = [1,1,1,2,2,2,2], b = 1:7)

7×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      4
   5 │     2      5
   6 │     2      6
   7 │     2      7

and want to create a new column that contains the row number within each group. It should look like this:

7×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     1      2      2
   3 │     1      3      3
   4 │     2      4      1
   5 │     2      5      2
   6 │     2      6      3
   7 │     2      7      4

I am open to any solution, but I am especially looking for a DataFramesMeta solution that works well with the Chain package. R's dplyr has a simple function named row_number() that does this. I feel like there must be something similar in Julia.

Answer:

Group by :a and call eachindex on a column inside @transform:

julia> using DataFrames, DataFramesMeta

julia> df = DataFrame(a = [1,1,1,2,2,2,2], b = 1:7)
7×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      4
   5 │     2      5
   6 │     2      6
   7 │     2      7

julia> @chain df begin
           groupby(:a)
           @transform(:c = eachindex(:b))
       end
7×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     1      2      2
   3 │     1      3      3
   4 │     2      4      1
   5 │     2      5      2
   6 │     2      6      3
   7 │     2      7      4
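
Since the question is open to any solution, the same result can probably be reached without DataFramesMeta. The following is a minimal sketch, assuming only DataFrames.jl is loaded; the variable name result is chosen here for illustration and does not come from the answer above.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 1, 2, 2, 2, 2], b = 1:7)

# Group by :a, then number the rows inside each group.
# eachindex(b) yields 1:n for a group with n rows, and
# transform keeps the rows in their original order.
result = transform(groupby(df, :a), :b => eachindex => :c)
```

Any column of the group can be passed in place of :b, because only its length is used.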

In the upcoming DataFrames.jl 1.4 release it will be even simpler; see https://github.com/JuliaData/DataFrames.jl/pull/3001.

(The difference is that you will not have to pass a column name such as :b; you will be able to write :c = $eachindex instead.)
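
For comparison, here is a sketch of the column-free form in plain DataFrames.jl. It assumes DataFrames.jl 1.4 or newer is installed; the variable name numbered is illustrative only.

```julia
using DataFrames  # assumption: version 1.4 or newer

df = DataFrame(a = [1, 1, 1, 2, 2, 2, 2], b = 1:7)

# With no source column given, eachindex numbers the rows of each group.
numbered = transform(groupby(df, :a), eachindex => :c)
```

On older DataFrames.jl versions this form is not expected to work, so the variant that passes a column explicitly remains the safer choice.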
