Home > Back-end >  How to apply a function on specific columns of a dataframe using `select`?
How to apply a function on specific columns of a dataframe using `select`?

Time:01-12

Say I have the following df:

using DataFrames

df = DataFrame(categ = ["a", "a", "a", "b"], numer = [1, 2, 3, 4])

#4×2 DataFrame
# Row │ categ   numer
#     │ String  Int64
#─────┼───────────────
#   1 │ a           1
#   2 │ a           2
#   3 │ a           3
#   4 │ b           4

Now I want to perform a frequency encoding on the categorical variables (note that I just provided a minimal example and the number of categorical vars may reach hundreds) and the expected result is:

 4×2 DataFrame
  Row │ categ    numer
      │ Float64  Int64
 ─────┼────────────────
    1 │    0.75      1
    2 │    0.75      2
    3 │    0.75      3
    4 │    0.25      4

I achieved it using:

select(
  df,
  :categ => ByRow(x->count(==(x), df.categ)/size(df, 1))=>:categ,
  :numer)

It works, But it's not generic:

  1. I performed the operation just on the :categ var. A workaround can be passing names(df, String).
  2. I used df.categ in the count function and it's not generic. What if I had 100 categorical vars?
  3. I manually kept :numer intact, and it's not generic. What if I had 100 numerical vars?

Update:
I tried using names(df, Not(String)) for the third problem. But it doesn't work. I know that I can say names(df, Int64), but I don't want to limit it to Int64 only. I might had Complex vars as well.

CodePudding user response:

This will be efficient and generic:

julia> df = DataFrame(categ = ["a", "a", "a", "b"], categ2 = ["a", "b", "b", "b"], numer = [1, 2, 3, 4])
4×3 DataFrame
 Row │ categ   categ2  numer
     │ String  String  Int64
─────┼───────────────────────
   1 │ a       a           1
   2 │ a       b           2
   3 │ a       b           3
   4 │ b       b           4

julia> transform(df, names(df, String) .=> (x -> select(groupby(DataFrame(x=x), :x), proprow, keepkeys=false).proprow), renamecols=false)
4×3 DataFrame
 Row │ categ    categ2   numer
     │ Float64  Float64  Int64
─────┼─────────────────────────
   1 │    0.75     0.25      1
   2 │    0.75     0.75      2
   3 │    0.75     0.75      3
   4 │    0.25     0.75      4

If the function looks too long for you then do:

julia> freqrow(x) = select(groupby(DataFrame(x=x), :x), proprow, keepkeys=false).proprow
freqrow (generic function with 1 method)

julia> transform(df, names(df, String) .=> freqrow, renamecols=false)
4×3 DataFrame
 Row │ categ    categ2   numer
     │ Float64  Float64  Int64
─────┼─────────────────────────
   1 │    0.75     0.25      1
   2 │    0.75     0.75      2
   3 │    0.75     0.75      3
   4 │    0.25     0.75      4

Alternatively you could do:

julia> df2 = copy(df);

julia> for col in names(df2, String)
           df2 = transform(groupby(df2, col), proprow => col, copycols=false, keepkeys=false)
       end

julia> df2
4×3 DataFrame
 Row │ categ    categ2   numer
     │ Float64  Float64  Int64
─────┼─────────────────────────
   1 │    0.75     0.25      1
   2 │    0.75     0.75      2
   3 │    0.75     0.75      3
   4 │    0.25     0.75      4

which maybe is easier to digest.

  • Related