I have a large Julia dataframe which I can group on a number of columns.
Having done this I need to number the rows within the groups. So the dataframe is -
Row │ seq TOID Lane SectionLabel Chainage
│ Int64 String31 String3 String15 Float64
────────┼─────────────────────────────────────────────────────────────
1 │ 1 4000000015522725 CL1 A0006/210/08/U 0.0
2 │ 1 4000000015522725 CL1 A0006/210/08/U 10.02
3 │ 1 4000000015522725 CL1 A0006/210/08/U 20.03
4 │ 1 4000000015522725 CL1 A0006/210/08/U 30.05
5 │ 1 4000000015522725 CL1 A0006/210/08/U 40.07
6 │ 1 4000000015522725 CL1 A0006/210/08/U 50.08
7 │ 1 4000000015522725 CL1 A0006/210/08/U 60.1
8 | 1 4000000015522726 CL1 A0006/210/08/R 0
9 | 1 4000000015522726 CL1 A0006/210/08/R 9.8
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
123058 │ 1 4000000015531585 CL1 C5111/005/34/R 2191.99
123059 │ 1 4000000015531585 CL1 C5111/005/34/R 2202.0
123060 │ 1 4000000015531585 CL1 C5111/005/34/R 2212.0
123061 │ 1 4000000015531585 CL1 C5111/005/34/R 2222.01
123062 │ 1 4000000015531585 CL1 C5111/005/34/R 2232.02
123063 │ 1 4000000015531585 CL1 C5111/005/34/R 2242.03
I need the 'seq' field to reflect the number of the row within the group (subdataframe), i.e. 1 to 7 for the first 7 rows. Then restart from 1 for the second group and so on for further groups.
the lines used to create the above extract are
grouped_currentDF = combine(groupby(currentDF,[:TOID, :Lane, :SectionLabel, :Chainage]),
sdf -> sort(sdf,[:TOID, :Lane, :SectionLabel, :Chainage]),
#[:TOID, :Lane, :SectionLabel, :Chainage] => eachindex => :s,
nrow => :seq)
(the commented out part just adds another column of type cartesian index and didn't help)
and
extract = select(grouped_currentDF, [:seq, :TOID, :Lane, :SectionLabel, :Chainage])
Help and pointers are appreciated.
CodePudding user response:
Here is an example with one method to achieve this:
julia> df = DataFrame(a = rand(1:4,10), b = rand(10)) [1/1932]
10×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼──────────────────
1 │ 2 0.72928
2 │ 4 0.216733
3 │ 3 0.795951
4 │ 3 0.0701898
...
julia> gdf = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (3 rows): a = 1
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.279315
2 │ 1 0.495083
3 │ 1 0.274184
...
Now, to add the index within the group:
julia> df.igi = map(let c = zeros(Int, length(gdf)); i -> c[i] = 1 ; end, groupindices(gdf));
julia> gdf
GroupedDataFrame with 4 groups based on key: a
First Group (3 rows): a = 1
Row │ a b igi
│ Int64 Float64 Int64
─────┼────────────────────────
1 │ 1 0.279315 1
2 │ 1 0.495083 2
3 │ 1 0.274184 3
...
This method uses a vector captured in a closure to keep count of how many of rows of each group were encountered. The number of groups is known by length(gdf)
and the group of each row is obtained with groupindices(gdf)
.
Waiting to see the other methods...
CodePudding user response:
Roy, in your code there is a right function that should be used - it is eachindex
(but for some reason you commented it out).
Here is how you can use it (using data created by Dan):
julia> df = DataFrame(a = rand(1:4,10), b = rand(10))
10×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 4 0.919583
2 │ 4 0.97129
3 │ 4 0.341484
4 │ 4 0.650949
5 │ 3 0.742711
6 │ 1 0.466313
7 │ 2 0.315033
8 │ 4 0.144105
9 │ 2 0.759096
10 │ 2 0.603755
julia> gdf = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.466313
⋮
Last Group (5 rows): a = 4
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 4 0.919583
2 │ 4 0.97129
3 │ 4 0.341484
4 │ 4 0.650949
5 │ 4 0.144105
julia> transform!(gdf, eachindex => :igi)
10×3 DataFrame
Row │ a b igi
│ Int64 Float64 Int64
─────┼────────────────────────
1 │ 4 0.919583 1
2 │ 4 0.97129 2
3 │ 4 0.341484 3
4 │ 4 0.650949 4
5 │ 3 0.742711 1
6 │ 1 0.466313 1
7 │ 2 0.315033 1
8 │ 4 0.144105 5
9 │ 2 0.759096 2
10 │ 2 0.603755 3