I am looking for a way to get the n largest values from a column of my Julia data frame, something like pd.Series.nlargest(n=5, keep='first') in Python.
In more detail, let's say I have a Julia data frame such as:
df = DataFrame(Data1 = rand(5), Data2 = rand(5));
 Row │ Data1     Data2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.125824  0.841358
   2 │ 0.612905  0.337965
   3 │ 0.210736  0.66849
   4 │ 0.172203  0.377226
   5 │ 0.898269  0.448477
How can I get the n largest values from the column Data1?
For n = 3, this is my expected output:
5 0.898269
2 0.612905
3 0.210736
CodePudding user response:
Here is an efficient way to do it. First, to subset rows of a data frame:
julia> df = DataFrame(Data1 = rand(10), Data2 = rand(10));
julia> df[partialsortperm(df.Data1, 1:3, rev=true), :] # if you need a data frame with top 3 rows
3×2 DataFrame
 Row │ Data1     Data2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.959456  0.628431
   2 │ 0.856696  0.144034
   3 │ 0.824744  0.996384
julia> df[partialsortperm(df.Data1, 3, rev=true), :] # if you need only the row holding the 3rd largest value
DataFrameRow
 Row │ Data1     Data2
     │ Float64   Float64
─────┼────────────────────
   4 │ 0.824744  0.996384
Both operations are efficient: a partial sort does only as much work as is needed to obtain the required values.
If you do not need whole rows of the data frame but only the values from a single column, the following is enough:
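As a rough sanity check of what a partial sort returns, the sketch below compares partialsort with a full sort followed by slicing. It uses only Base Julia, and the vector v is made-up illustration data, not taken from the data frame above.

```julia
# Hypothetical illustration data (not from the question or the answer)
v = [0.3, 0.9, 0.1, 0.7, 0.5, 0.2]

# Top 3 values via a partial sort; partialsort works on a copy, so v is unchanged
top3_partial = partialsort(v, 1:3, rev=true)

# The same values via a full sort followed by slicing
top3_full = sort(v, rev=true)[1:3]

# Both approaches should give the same three values
println(top3_partial == top3_full)
```

The full sort orders every element, while the partial sort only has to guarantee the order of the first three positions, which is where the saving comes from on long columns.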
julia> partialsort(df.Data1, 1:3, rev=true) # top 3 values
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
0.959456038630526
0.856695598334831
0.8247444664227905
julia> partialsort(df.Data1, 3, rev=true) # 3rd largest value
0.8247444664227905
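To tie this back to the expected output in the question (row number plus value), here is a minimal sketch in Base Julia alone. The vector data1 is copied by hand from the question's table, and the names n, idx and vals are only illustrative.

```julia
# Data1 values copied from the question's table (rows 1 to 5)
data1 = [0.125824, 0.612905, 0.210736, 0.172203, 0.898269]

n = 3

# Original row numbers of the n largest values, largest first
idx = partialsortperm(data1, 1:n, rev=true)

# The corresponding values, in the same order
vals = data1[idx]

# Print each row number next to its value
for (i, x) in zip(idx, vals)
    println(i, " ", x)
end
```

For this data idx comes out as 5, 2, 3, which matches the rows listed in the question. With a real data frame the same indices can be passed to df[idx, :] as shown above.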