How to deal with missing values in the ifelse function in Julia

Time:12-29

I am using Julia and have a DataFrame with 42 values, 2 of which are missing.

These values are prices that range from 0.23 to 0.3.

I am trying to create a new column that tells whether each price is cheap or expensive, using an ifelse statement.

The ifelse should be:

df.x_category=ifelse.(df.x .< mean(df.x),"cheap", "expensive")

But I get the following error:

ERROR: TypeError: non-boolean (Missing) used in boolean context

Is there a way to skip those missing values?

I tried with:

df.x_category=ifelse.(skipmissing(df.x) .< mean(skipmissing(df.x)),"cheap", "expensive")

but get this error:

ERROR: ArgumentError: New columns must have the same length as old columns

I can't just delete missing observations.

How can I do this?

Thanks in advance!

CodePudding user response:

ifelse can handle only 2 values, and you need to handle 3. Assuming that you have

df = DataFrame(x=rand([0.23,0.3,missing], 10))

then mean(df.x) yields missing, since some of the values are missing. You need to use mean(skipmissing(df.x)) instead.
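
As a small illustration of that point, here is a minimal sketch on a plain vector (the name `prices` and its values are made up for this example, not taken from the question's data):

```julia
using Statistics

# Hypothetical toy prices with two missing entries
prices = [0.23, 0.30, missing, 0.25, missing, 0.28]

# A single missing value propagates through the mean
naive_mean = mean(prices)               # missing

# skipmissing iterates over the non-missing values only
clean_mean = mean(skipmissing(prices))  # mean of 0.23, 0.30, 0.25, 0.28
```

Comparing any price against `naive_mean` would again give missing, which is why the comparison threshold has to be computed with skipmissing.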

Hence the code could be:

julia> map(x -> ismissing(x) ? missing : ifelse(x,"cheap", "expensive"), df.x .< mean(skipmissing(df.x)))
10-element Vector{Union{Missing, String}}:
 missing
 missing
 "cheap"
 missing
 "expensive"
 missing
 missing
 missing
 "cheap"
 "cheap"

Here I have combined ifelse with map to handle the missing values. There are other ways, but each one requires nesting some conditional function.
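
One possible alternative, sketched here under the assumption that a plain vector stands in for the column (`prices`, `present` and `categories` are illustrative names), is to start with an all-missing result and fill in only the positions that have a price:

```julia
using Statistics

# Hypothetical toy prices standing in for df.x
prices = [0.23, 0.30, missing, 0.25, missing, 0.28]
avg = mean(skipmissing(prices))

# Start with every category missing, then fill only the rows that have a price
categories = Vector{Union{Missing, String}}(missing, length(prices))
present = .!ismissing.(prices)
categories[present] .= ifelse.(prices[present] .< avg, "cheap", "expensive")
```

The result has the same length as the input, so with a data frame it could be assigned to a new column in the same way as the map result.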

CodePudding user response:

You can try something like this, using toy data.

  • First get your string values from ifelse into a vector.
  • Then prepare the string vector by converting it to a Union of strings and missing to hold missing values.
  • Finally put the missing values into the vector.

julia> using DataFrames, Random

julia> vec = ifelse.(df.d[ismissing.(df.d) .== false] .> 0.5,"higher","lower")
40-element Vector{String}:
 "higher"
 "lower"
 "lower"
etc...

julia> vec = convert(Vector{Union{Missing,String}}, vec)
40-element Vector{Union{Missing, String}}

julia> for i in findall(ismissing.(df.d)) insert!(vec, i, missing) end

julia> df.x = vec

julia> df
42×2 DataFrame
 Row │ d                x
     │ Float64?         String?
─────┼──────────────────────────
   1 │       0.533183   higher
   2 │       0.454029   lower
   3 │       0.0176868  lower
   4 │       0.172933   lower
   5 │       0.958926   higher
   6 │       0.973566   higher
   7 │       0.30387    lower
   8 │       0.176909   lower
   9 │       0.956916   higher
  10 │       0.584284   higher
  11 │       0.937466   higher
  12 │ missing          missing
  13 │       0.422956   lower
etc...
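
The same three steps can be checked on a much smaller vector. This is only a sketch with made-up values (`d` and `labels` are illustrative names), showing that inserting at the original indices of the missing entries, in ascending order, restores the alignment:

```julia
# Hypothetical toy data with missing values at positions 2 and 5
d = [0.6, missing, 0.2, 0.9, missing, 0.4]

# Step 1: label only the non-missing values
labels = ifelse.(d[.!ismissing.(d)] .> 0.5, "higher", "lower")

# Step 2: widen the element type so the vector can hold missing
labels = convert(Vector{Union{Missing, String}}, labels)

# Step 3: put missing back at each original position, in ascending order
for i in findall(ismissing, d)
    insert!(labels, i, missing)
end
```

Because findall returns the indices in ascending order, each insert! shifts the later labels one place to the right, so they end up next to the values they were computed from.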

Data

julia> Random.seed!(42)
MersenneTwister(42)

julia> data = Random.rand(42)
42-element Vector{Float64}:
 0.5331830160438613
 0.4540291355871424
etc...

julia> data = convert(Vector{Union{Missing,Float64}}, data)
42-element Vector{Union{Missing, Float64}}

julia> data[[12,34]] .= missing
2-element view(::Vector{Union{Missing, Float64}}, [12, 34]) with eltype Union{Missing, Float64}:
 missing
 missing

julia> df = DataFrame(d=data)

CodePudding user response:

I would do it with a function that returns cheap, expensive or missing:

using Statistics
data = ifelse.(rand(Bool,100),missing,100*rand(100)) # generate toy data with missing values
meandata = mean(skipmissing(data)) # mean of the non-missing data

function category_select(x)
  ismissing(x) && return missing  # short-circuit: return early for missing input
  return ifelse(x<meandata,"cheap","expensive") # the return keyword is optional here
end

# the same logic as a one-liner with nested ternary operators
category_select2(x) = ismissing(x) ? missing : (x < meandata ? "cheap" : "expensive")

# broadcast over the values
x_category = category_select.(data)
x_category = category_select2.(data)

Now, what is happening? There are two things to know about the ifelse function:

  1. It is an ordinary function, so both branch arguments are evaluated before the call. If either branch can throw an error, the call will error. Take this example:
maybelog(x) = ifelse(x<0,zero(x),log(x)) #ifelse
maybelog2(x) = begin if x<0; zero(x);else;log(x);end end #full if expression
maybelog3(x) = x<0 ? zero(x) : log(x) #ternary operator

maybelog fails with x = -1, whereas maybelog2 and maybelog3 do not.

  2. The first argument must always be a Bool. In your initial expression, the result of df.x .< mean(df.x) can be true, false or missing, so ifelse also fails there.

In your modified expression, skipmissing(df.x) has a different length than df.x, because it does not include the missing values present in x. The result is a vector shorter than the number of rows in your data frame.
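
Both failure modes from the question can be reproduced without a data frame. The following is a minimal sketch on a made-up vector (`prices`, `err` and `shorter` are illustrative names):

```julia
using Statistics

# Hypothetical toy prices with two missing entries
prices = [0.23, 0.30, missing, 0.25, missing, 0.28]

# A missing condition is not a Bool, so ifelse throws a TypeError
err = try
    ifelse(missing, "cheap", "expensive")
catch e
    e
end

# Broadcasting over skipmissing drops the missing entries, so the result is shorter
avg = mean(skipmissing(prices))
shorter = ifelse.(skipmissing(prices) .< avg, "cheap", "expensive")
```

Here `shorter` has 4 elements while `prices` has 6, which mirrors the length mismatch reported when assigning the new column.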
