Say I have the following df
:
using DataFrames
df = DataFrame(categ = ["a", "a", "a", "b"], numer = [1, 2, 3, 4])
#4×2 DataFrame
# Row │ categ numer
# │ String Int64
#─────┼───────────────
# 1 │ a 1
# 2 │ a 2
# 3 │ a 3
# 4 │ b 4
Now I want to perform a frequency encoding on the categorical variables (note that I just provided a minimal example and the number of categorical vars may reach hundreds) and the expected result is:
4×2 DataFrame
Row │ categ numer
│ Float64 Int64
─────┼────────────────
1 │ 0.75 1
2 │ 0.75 2
3 │ 0.75 3
4 │ 0.25 4
I achieved it using:
select(
df,
:categ => ByRow(x->count(==(x), df.categ)/size(df, 1))=>:categ,
:numer)
It works, But it's not generic:
- I performed the operation just on the
:categ
var. A workaround can be passingnames(df, String)
. - I used
df.categ
in thecount
function and it's not generic. What if I had 100 categorical vars? - I manually kept
:numer
intact, and it's not generic. What if I had 100 numerical vars?
Update:
I tried using names(df, Not(String))
for the third problem. But it doesn't work. I know that I can say names(df, Int64)
, but I don't want to limit it to Int64
only. I might had Complex
vars as well.
CodePudding user response:
This will be efficient and generic:
julia> df = DataFrame(categ = ["a", "a", "a", "b"], categ2 = ["a", "b", "b", "b"], numer = [1, 2, 3, 4])
4×3 DataFrame
Row │ categ categ2 numer
│ String String Int64
─────┼───────────────────────
1 │ a a 1
2 │ a b 2
3 │ a b 3
4 │ b b 4
julia> transform(df, names(df, String) .=> (x -> select(groupby(DataFrame(x=x), :x), proprow, keepkeys=false).proprow), renamecols=false)
4×3 DataFrame
Row │ categ categ2 numer
│ Float64 Float64 Int64
─────┼─────────────────────────
1 │ 0.75 0.25 1
2 │ 0.75 0.75 2
3 │ 0.75 0.75 3
4 │ 0.25 0.75 4
If the function looks too long for you then do:
julia> freqrow(x) = select(groupby(DataFrame(x=x), :x), proprow, keepkeys=false).proprow
freqrow (generic function with 1 method)
julia> transform(df, names(df, String) .=> freqrow, renamecols=false)
4×3 DataFrame
Row │ categ categ2 numer
│ Float64 Float64 Int64
─────┼─────────────────────────
1 │ 0.75 0.25 1
2 │ 0.75 0.75 2
3 │ 0.75 0.75 3
4 │ 0.25 0.75 4
Alternatively you could do:
julia> df2 = copy(df);
julia> for col in names(df2, String)
df2 = transform(groupby(df2, col), proprow => col, copycols=false, keepkeys=false)
end
julia> df2
4×3 DataFrame
Row │ categ categ2 numer
│ Float64 Float64 Int64
─────┼─────────────────────────
1 │ 0.75 0.25 1
2 │ 0.75 0.75 2
3 │ 0.75 0.75 3
4 │ 0.25 0.75 4
which maybe is easier to digest.