How to get a sequence number or row number from a sub dataframe in Julia-CodePudding

I have a large Julia dataframe which I can group on a number of columns.

Having done this I need to number the rows within the groups. So the dataframe is -

    Row │ seq    TOID              Lane     SectionLabel    Chainage  
        │ Int64  String31          String3  String15        Float64   
────────┼─────────────────────────────────────────────────────────────
      1 │     1  4000000015522725  CL1      A0006/210/08/U      0.0   
      2 │     1  4000000015522725  CL1      A0006/210/08/U     10.02  
      3 │     1  4000000015522725  CL1      A0006/210/08/U     20.03  
      4 │     1  4000000015522725  CL1      A0006/210/08/U     30.05  
      5 │     1  4000000015522725  CL1      A0006/210/08/U     40.07  
      6 │     1  4000000015522725  CL1      A0006/210/08/U     50.08  
      7 │     1  4000000015522725  CL1      A0006/210/08/U     60.1
      8 |     1  4000000015522726  CL1      A0006/210/08/R     0
      9 |     1  4000000015522726  CL1      A0006/210/08/R     9.8
   ⋮    │   ⋮           ⋮             ⋮           ⋮            ⋮          
 123058 │     1  4000000015531585  CL1      C5111/005/34/R   2191.99  
 123059 │     1  4000000015531585  CL1      C5111/005/34/R   2202.0   
 123060 │     1  4000000015531585  CL1      C5111/005/34/R   2212.0   
 123061 │     1  4000000015531585  CL1      C5111/005/34/R   2222.01 
 123062 │     1  4000000015531585  CL1      C5111/005/34/R   2232.02  
 123063 │     1  4000000015531585  CL1      C5111/005/34/R   2242.03

I need the 'seq' field to reflect the number of the row within the group (subdataframe), i.e. 1 to 7 for the first 7 rows. Then restart from 1 for the second group and so on for further groups.

the lines used to create the above extract are

grouped_currentDF = combine(groupby(currentDF,[:TOID, :Lane, :SectionLabel, :Chainage]), 
                                    sdf -> sort(sdf,[:TOID, :Lane, :SectionLabel, :Chainage]), 
                                    #[:TOID, :Lane, :SectionLabel, :Chainage] => eachindex => :s, 
                                    nrow => :seq)

(the commented out part just adds another column of type cartesian index and didn't help)

and

extract = select(grouped_currentDF, [:seq, :TOID, :Lane, :SectionLabel, :Chainage])

Help and pointers are appreciated.

CodePudding user response：

Here is an example with one method to achieve this:

julia> df = DataFrame(a = rand(1:4,10), b = rand(10))                                               [1/1932]
10×2 DataFrame                                                                                              
 Row │ a      b         
     │ Int64  Float64   
─────┼──────────────────
   1 │     2  0.72928
   2 │     4  0.216733
   3 │     3  0.795951
   4 │     3  0.0701898
...

julia> gdf = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (3 rows): a = 1
 Row │ a      b        
     │ Int64  Float64  
─────┼─────────────────
   1 │     1  0.279315
   2 │     1  0.495083
   3 │     1  0.274184
...

Now, to add the index within the group:

julia> df.igi = map(let c = zeros(Int, length(gdf)); i -> c[i]  = 1 ; end, groupindices(gdf));

julia> gdf
GroupedDataFrame with 4 groups based on key: a
First Group (3 rows): a = 1
 Row │ a      b         igi   
     │ Int64  Float64   Int64 
─────┼────────────────────────
   1 │     1  0.279315      1
   2 │     1  0.495083      2
   3 │     1  0.274184      3
...

This method uses a vector captured in a closure to keep count of how many of rows of each group were encountered. The number of groups is known by length(gdf) and the group of each row is obtained with groupindices(gdf).

Waiting to see the other methods...

CodePudding user response：

Roy, in your code there is a right function that should be used - it is eachindex (but for some reason you commented it out).

Here is how you can use it (using data created by Dan):

julia> df = DataFrame(a = rand(1:4,10), b = rand(10))
10×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼─────────────────
   1 │     4  0.919583
   2 │     4  0.97129
   3 │     4  0.341484
   4 │     4  0.650949
   5 │     3  0.742711
   6 │     1  0.466313
   7 │     2  0.315033
   8 │     4  0.144105
   9 │     2  0.759096
  10 │     2  0.603755

julia> gdf = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
 Row │ a      b
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.466313
⋮
Last Group (5 rows): a = 4
 Row │ a      b
     │ Int64  Float64
─────┼─────────────────
   1 │     4  0.919583
   2 │     4  0.97129
   3 │     4  0.341484
   4 │     4  0.650949
   5 │     4  0.144105

julia> transform!(gdf, eachindex => :igi)
10×3 DataFrame
 Row │ a      b         igi
     │ Int64  Float64   Int64
─────┼────────────────────────
   1 │     4  0.919583      1
   2 │     4  0.97129       2
   3 │     4  0.341484      3
   4 │     4  0.650949      4
   5 │     3  0.742711      1
   6 │     1  0.466313      1
   7 │     2  0.315033      1
   8 │     4  0.144105      5
   9 │     2  0.759096      2
  10 │     2  0.603755      3