Home > Software design >  Replace missing values with mean in each column in dataframe Julia
Replace missing values with mean in each column in dataframe Julia

Time:12-25

I would like to impute the missing values with their mean value per column in a dataframe. Here is some sample data:

using DataFrames
using Statistics
df = DataFrame(x = [1, missing, 3], y = [missing, 2, 3], z = [4, 4,  missing])

3×3 DataFrame
 Row │ x        y        z       
     │ Int64?   Int64?   Int64?  
─────┼───────────────────────────
   1 │       1  missing        4
   2 │ missing        2        4
   3 │       3        3  missing 

I know you could use replace! with mean and skipmissing to impute a single column like this:

replace!(df.x, missing => mean(skipmissing(df.x)))

3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

3×3 DataFrame
 Row │ x       y        z       
     │ Int64?  Int64?   Int64?  
─────┼──────────────────────────
   1 │      1  missing        4
   2 │      2        2        4
   3 │      3        3  missing

But using eachcol to impute each column like this doesn't work:

replace!(eachcol(df), missing => mean(skipmissing(eachcol(df))))

MethodError: no method matching _replace!(::Base.var"#new#352"{Tuple{Pair{Missing, Vector{Missing}}}}, ::DataFrames.DataFrameColumns{DataFrame}, ::DataFrames.DataFrameColumns{DataFrame}, ::Int64)

So I was wondering if anyone knows how to replace the missing values in each column with their mean value in Julia?

CodePudding user response:

So, replace! is incorrect. To see why note that:

julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2

julia> replace!(x, missing => mean(skipmissing(x)))
ERROR: InexactError: Int64(1.5)

You need the following (assuming you want an in-place operation):

julia> transform!(df, All() .=> (x -> replace(x, missing => mean(skipmissing(x)))) => identity)
3×3 DataFrame
 Row │ x        y        z
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     1.0      2.5      4.0
   2 │     2.0      2.0      4.0
   3 │     3.0      3.0      4.0

Note the promotion of column eltypes.

This is a general approach. An alternative that is valid in your case as you want to replace all columns would be:

julia> mapcols!(df) do x
           replace(x, missing => mean(skipmissing(x)))
       end
3×3 DataFrame
 Row │ x        y        z
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     1.0      2.5      4.0
   2 │     2.0      2.0      4.0
   3 │     3.0      3.0      4.0

Yet another way to do it would be:

julia> df[!, :] .= coalesce.(df, permutedims((mean∘skipmissing).(eachcol(df))))
3×3 DataFrame
 Row │ x      y        z
     │ Int64  Float64  Float64
─────┼─────────────────────────
   1 │     1      2.5      4.0
   2 │     2      2.0      4.0
   3 │     3      3.0      4.0

Note a difference. Broadcasting does not change eltype of column :x as it is not needed because the mean happens to be integer.

CodePudding user response:

If your data was already Float64 (eg. by running map(c -> df[!,c] = convert.(Union{Float64,Missing}, df[:,c]), names(df));) than a another convenient option is using Impute:

julia> Impute.substitute(df)
3×3 DataFrame
 Row │ x         y         z
     │ Float64?  Float64?  Float64?
─────┼──────────────────────────────
   1 │      1.0       2.5       4.0
   2 │      2.0       2.0       4.0
   3 │      3.0       3.0       4.0
  • Related