In an iris
dataframe I replaced some values by missing
.
I would like to replace the missing
values in the column petal_length
by the petal_length
mean per species
. The code below does work (means before and after replacing values are equal), however I suspect there must be a more efficient way to do this which does not loop through every row while only in some rows values are missing. Also, creating a dictionary is probably not necessary in a more optimised solution. Any suggestions for optimising?
using CSV
using DataFrames
using Random
using Statistics
using StatsBase
download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)
Random.seed!(20_000)
for i in 1:100
iris[rand(1:nrow(iris)), rand(1:4)] = missing
end
Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing
mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)
for row in eachrow(iris)
if ismissing(row.petal_length)
row.petal_length = mean_per_species_dict[row.species]
end
end
CodePudding user response:
Is this what you want (using DataFramesMeta.jl):
julia> @chain iris begin
groupby(:species)
@transform(:petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))
end
(an in-place variant would use @transform!
)
CodePudding user response:
I think I found a more optimised way by using groupby
.
for group in groupby(iris, :species)
group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end