How to deal with missing values in the ifelse function in Julia

I am using Julia and I have a DataFrame with 42 values, 2 of which are missing.

These values are prices ranging from 0.23 to 0.3.

I am trying to create a new column that says whether each price is cheap or expensive, using an ifelse statement.

The ifelse should be:

df.x_category=ifelse.(df.x .< mean(df.x),"cheap", "expensive")

But I get the following error:

ERROR: TypeError: non-boolean (Missing) used in boolean context

Is there a way to skip those missing values?

I tried with:

df.x_category=ifelse.(skipmissing(df.x) .< mean(skipmissing(df.x)),"cheap", "expensive")

But I get this error:

ERROR: ArgumentError: New columns must have the same length as old columns

I can't just delete missing observations.

How can I do this?

Thanks in advance!

CodePudding user response：

ifelse handles only two cases (true and false), and you need to handle three. Assuming that you have

df = DataFrame(x=rand([0.23,0.3,missing], 10))

then mean(df.x) yields missing, since some of the values are missing. You need to use mean(skipmissing(df.x)) instead.
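As a small illustration on a made-up three-element vector (the name `prices` and its values are assumptions, not the question's data), this sketch shows how missing propagates through mean and through a broadcast comparison:

```julia
using Statistics

# Hypothetical toy data, not the asker's DataFrame
prices = [0.23, 0.3, missing]

m_all  = mean(prices)               # missing, because one element is missing
m_skip = mean(skipmissing(prices))  # about 0.265, computed from the two real values
flags  = prices .< 0.25             # true, false, missing: three possible outcomes
```

The third outcome in `flags` is what ifelse cannot accept as a condition.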

Hence the code could be:

julia> map(x -> ismissing(x) ? missing : ifelse(x,"cheap", "expensive"), df.x .< mean(skipmissing(df.x)))
10-element Vector{Union{Missing, String}}:
 missing
 missing
 "cheap"
 missing
 "expensive"
 missing
 missing
 missing
 "cheap"
 "cheap"

Here I have combined ifelse with map to handle the missing values. There are other ways, but each one requires nesting some conditional function.

CodePudding user response：

You can try something like this, using toy data.

First get your string values from ifelse into a vector.
Then prepare the string vector by converting it to a Union of strings and missing to hold missing values.
Finally put the missing values into the vector.

julia> using DataFrames, Random 

julia> vec = ifelse.(df.d[ismissing.(df.d) .== false] .> 0.5,"higher","lower")
40-element Vector{String}:
 "higher"
 "lower"
 "lower"
etc...

julia> vec = convert(Vector{Union{Missing,String}}, vec)
40-element Vector{Union{Missing, String}}

julia> for i in findall(ismissing.(df.d)) insert!(vec, i, missing) end

julia> df.x = vec

julia> df
42×2 DataFrame
 Row │ d                x
     │ Float64?         String?
─────┼──────────────────────────
   1 │       0.533183   higher
   2 │       0.454029   lower
   3 │       0.0176868  lower
   4 │       0.172933   lower
   5 │       0.958926   higher
   6 │       0.973566   higher
   7 │       0.30387    lower
   8 │       0.176909   lower
   9 │       0.956916   higher
  10 │       0.584284   higher
  11 │       0.937466   higher
  12 │ missing          missing
  13 │       0.422956   lower
etc...

Data

julia> Random.seed!(42)
MersenneTwister(42)

julia> data = Random.rand(42)
42-element Vector{Float64}:
 0.5331830160438613
 0.4540291355871424
etc...

julia> data = convert(Vector{Union{Missing,Float64}}, data)
42-element Vector{Union{Missing, Float64}}

julia> data[[12,34]] .= missing
2-element view(::Vector{Union{Missing, Float64}}, [12, 34]) with eltype Union{Missing, Float64}:
 missing
 missing

julia> df = DataFrame(d=data)
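
A variation on the same idea, sketched here on a plain vector rather than a DataFrame column: preallocate the result as all missing and fill only the non-missing positions through a boolean mask, which avoids the insert! loop. The names `d`, `mask`, and `labels` and the sample values are illustrative assumptions.

```julia
# Hypothetical toy data standing in for df.d
d = [0.53, 0.45, missing, 0.95, missing, 0.17]

# true where a real value is present
mask = .!ismissing.(d)

# Start with a column of missings that can also hold strings
labels = Vector{Union{Missing,String}}(missing, length(d))

# Classify only the non-missing entries and write them back in place
labels[mask] = ifelse.(d[mask] .> 0.5, "higher", "lower")
```

Because `labels` has the same length as `d`, it could be assigned directly as a new column.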

CodePudding user response：

I would do it with a function that returns "cheap", "expensive", or missing:

using Statistics
data = ifelse.(rand(Bool,100),missing,100*rand(100)) # generate data with randomly placed missing values
meandata = mean(skipmissing(data)) # mean of the non-missing data

function category_select(x)
  ismissing(x) && return missing  # short-circuit operator: return early for missing
  return ifelse(x<meandata,"cheap","expensive")
end

category_select2(x) = ismissing(x) ? missing : (x < meandata ? "cheap" : "expensive") # parentheses are optional

# broadcast over the values
x_category = category_select.(data)
x_category = category_select2.(data)

Now, what is happening? There are two things to know about the ifelse function:

It is an ordinary function, so both branch values are evaluated before the call; if one branch can error, it will error. Take this example:

maybelog(x) = ifelse(x<0,zero(x),log(x)) #ifelse
maybelog2(x) = begin if x<0; zero(x); else log(x); end end #full if expression
maybelog3(x) = x<0 ? zero(x) : log(x) #ternary operator

maybelog fails with x = -1, whereas maybelog2 and maybelog3 do not.

The first argument must be a Bool. In your initial expression, the result of df.x .< mean(df.x) can be true, false, or missing, so ifelse fails there too.

In your modified expression, skipmissing(df.x) has a different length than df.x, because it does not count the missing values in df.x. The result is a vector shorter than the number of rows in your DataFrame.
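
To make the length mismatch concrete, here is a minimal sketch on a made-up five-element vector (the names `x` and `kept` are assumptions for illustration):

```julia
# Hypothetical toy data with two missing entries
x = [0.23, missing, 0.3, missing, 0.25]

# Materialize the skipmissing iterator to see how many values survive
kept = collect(skipmissing(x))

n_all  = length(x)     # 5
n_kept = length(kept)  # 3
```

Any column built from `kept` has three entries, so it cannot be assigned to a five-row table.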