I am using Julia and I have a DataFrame column with 42 values, of which 2 are missing.
These values are prices that range from 0.23 to 0.3.
I am trying to create a new column that says whether each price is cheap or expensive, using an ifelse
statement.
The ifelse should be:
df.x_category=ifelse.(df.x .< mean(df.x),"cheap", "expensive")
but I get the following error:
ERROR: TypeError: non-boolean (Missing) used in boolean context
Is there a way to skip those missing values?
I tried with:
df.x_category=ifelse.(skipmissing(df.x) .< mean(skipmissing(df.x)),"cheap", "expensive")
but I get this error:
ERROR: ArgumentError: New columns must have the same length as old columns
I can't just delete missing observations.
How can I do this?
Thanks in advance!
CodePudding user response:
ifelse accepts only a Bool condition, so it can distinguish two cases, but you need to handle three (true, false, and missing).
Assuming that you have
df = DataFrame(x=rand([0.23,0.3,missing], 10))
then mean(df.x) yields missing, since some of the values are missing. You need to use mean(skipmissing(df.x)) instead.
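As a rough illustration of the point above, here is a minimal sketch using only the Statistics standard library. The vector x is a hypothetical stand-in for df.x, not the asker's real data.

```julia
using Statistics

# Hypothetical stand-in for df.x: prices with two missing entries.
x = [0.23, 0.3, missing, 0.25, missing, 0.29]

m_plain = mean(x)               # missing, because missing propagates through the sum
m_skip  = mean(skipmissing(x))  # mean of the four observed prices only

# Comparing x itself against a number keeps the length of x and gives
# true, false, or missing for each element.
below = x .< m_skip
```

The comparison result has one entry per row, but some entries are missing, which is why it cannot be passed straight to ifelse.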
Hence the code could be:
julia> map(x -> ismissing(x) ? missing : ifelse(x,"cheap", "expensive"), df.x .< mean(skipmissing(df.x)))
10-element Vector{Union{Missing, String}}:
missing
missing
"cheap"
missing
"expensive"
missing
missing
missing
"cheap"
"cheap"
Here I have combined ifelse with map to handle the missing values. There are other ways, but each one requires nesting some conditional function.
CodePudding user response:
You can try something like this, using toy data.
- First, get your string values from ifelse into a vector.
- Then prepare the string vector by converting it to a Union of String and Missing so it can hold missing values.
- Finally, put the missing values into the vector.
julia> using DataFrames, Random
julia> vec = ifelse.(df.d[ismissing.(df.d) .== false] .> 0.5,"higher","lower")
40-element Vector{String}:
"higher"
"lower"
"lower"
etc...
julia> vec = convert(Vector{Union{Missing,String}}, vec)
40-element Vector{Union{Missing, String}}
julia> for i in findall(ismissing.(df.d)) insert!(vec, i, missing) end
julia> df.x = vec
julia> df
42×2 DataFrame
 Row │ d                x
     │ Float64?         String?
─────┼──────────────────────────
   1 │       0.533183   higher
   2 │       0.454029   lower
   3 │       0.0176868  lower
   4 │       0.172933   lower
   5 │       0.958926   higher
   6 │       0.973566   higher
   7 │       0.30387    lower
   8 │       0.176909   lower
   9 │       0.956916   higher
  10 │       0.584284   higher
  11 │       0.937466   higher
  12 │ missing          missing
  13 │       0.422956   lower
etc...
Data
julia> Random.seed!(42)
MersenneTwister(42)
julia> data = Random.rand(42)
42-element Vector{Float64}:
0.5331830160438613
0.4540291355871424
etc...
julia> data = convert(Vector{Union{Missing,Float64}}, data)
42-element Vector{Union{Missing, Float64}}
julia> data[[12,34]] .= missing
2-element view(::Vector{Union{Missing, Float64}}, [12, 34]) with eltype Union{Missing, Float64}:
missing
missing
julia> df = DataFrame(d=data)
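A possible variation on the three steps in this answer, sketched under the assumption that a plain vector d stands in for df.d: instead of building a shorter vector and re-inserting the missing values, start from a vector that is missing everywhere and fill in only the observed positions. The names d, labels, and observed are illustrative only.

```julia
# Hypothetical stand-in for df.d: six values, two of them missing.
d = [0.53, 0.45, missing, 0.17, missing, 0.96]

# Start with a vector that is missing everywhere and can also hold strings.
labels = Vector{Union{Missing,String}}(missing, length(d))

# Positions of the observed (non-missing) values.
observed = findall(!ismissing, d)

# Fill only those positions; d[observed] contains no missing values,
# so each comparison is a plain Bool that ifelse accepts.
labels[observed] .= ifelse.(d[observed] .> 0.5, "higher", "lower")
```

Because labels already has the same length as d, it could be assigned to a new column directly, and the result does not depend on the order in which missing values are put back.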
CodePudding user response:
I would do it with a function that returns cheap, expensive, or missing:
using Statistics
data = ifelse.(rand(Bool,100), missing, 100*rand(100)) # toy data with randomly placed missing values
meandata = mean(skipmissing(data)) # mean of the non-missing data
function category_select(x)
    ismissing(x) && return missing # short-circuit operator: return early for missing values
    return ifelse(x < meandata, "cheap", "expensive") # x is a number here, so the condition is a Bool
end
# same logic with nested ternary operators (the inner parentheses are optional)
category_select2(x) = ismissing(x) ? missing : (x < meandata ? "cheap" : "expensive")
# broadcast over the values
x_category = category_select.(data)
x_category = category_select2.(data)
Now, what is happening? There are two things to know about the ifelse function:
- It is a regular function, so both branches are evaluated before it is called; if either branch can error, the call will error. Take this example:
maybelog(x) = ifelse(x<0,zero(x),log(x)) #ifelse
maybelog2(x) = if x<0; zero(x); else; log(x); end #full if expression
maybelog3(x) = x<0 ? zero(x) : log(x) #ternary operator
maybelog fails with x = -1, whereas maybelog2 and maybelog3 do not.
- The first argument is always a Bool. In your initial expression, the result of df.x .< mean(df.x) can be true, false, or missing, so ifelse also fails there.
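Both points in the list above can be checked with a small sketch that needs no packages. The function names eager and lazy are illustrative, and the try/catch blocks are only there so the snippet runs to the end instead of stopping at the first error.

```julia
# ifelse is an ordinary function, so both result arguments are computed
# before the call; the ternary operator evaluates only the chosen branch.
eager(x) = ifelse(x < 0, zero(x), log(x))
lazy(x)  = x < 0 ? zero(x) : log(x)

lazy_result = lazy(-1.0)  # log is never called for negative input

# log(-1.0) throws a DomainError before ifelse can pick a branch.
eager_failed = try
    eager(-1.0)
    false
catch err
    err isa DomainError
end

# The condition must be a Bool; missing is rejected with a TypeError.
missing_failed = try
    ifelse(missing, "cheap", "expensive")
    false
catch err
    err isa TypeError
end
```

The second failure is the same kind of error as in the question: the condition handed to ifelse was missing rather than true or false.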
In your modified expression, the length of skipmissing(df.x) is different from the length of df.x, as the first one doesn't count the missing values present in x, resulting in a vector smaller than the number of rows in your DataFrame.