I have the following dataframe called df:
df = DataFrame(i=1:5,
x=[missing, missing, missing, missing, missing],
y=[missing, missing, 1, 3, 6])
5×3 DataFrame
Row │ i x y
│ Int64 Missing Int64?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
3 │ 3 missing 1
4 │ 4 missing 3
5 │ 5 missing 6
I would like to remove the columns where all values are missing. In this case it should remove column x because it has only all missing
values. with dropmissing
it removes all rows, but that's not what I want. So I was wondering if anyone knows how to remove only columns where all values are missing in a dataframe Julia?
CodePudding user response:
- A mediocre answer would be:
df1 = DataFrame()
foreach(
x->all(ismissing, df[!, x]) ? nothing : df1[!, x] = df[!, x],
propertynames(df)
)
df
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
- But a slightly better one would be using the slicing approach:
df[:, map(x->!all(ismissing, df[!, x]), propertynames(df))]
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
# OR
df[!, map(x->!all(ismissing, x), eachcol(df))]
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
#Or
df[!, Not(names(df, all.(ismissing, eachcol(df))))]
# I omitted the result to prevent this answer from becoming extensively lengthy.
#Or
df[!, Not(all.(ismissing, eachcol(df)))]
- I almost forgot the
deleteat!
function:
deleteat!(permutedims(df), all.(ismissing, eachcol(df))) |> permutedims
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
- You can use the
select!
function, as Dan noted:
select!(df, [k for (k,v) in pairs(eachcol(df)) if !all(ismissing, v)])
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
CodePudding user response:
The names
functions accepts a type as an input to select columns of a specific type, so I would do:
julia> select(df, Not(names(df, Missing)))
5×2 DataFrame
Row │ i y
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 missing
3 │ 3 1
4 │ 4 3
5 │ 5 6
Without benchmarking this I would guess that it is also significantly faster, as it doesn't have to check each element of each column but as far as I know simply queries the type information for each column readily available in the DataFrame:
julia> dump(df)
DataFrame
columns: Array{AbstractVector}((3,))
1: Array{Int64}((5,)) [1, 2, 3, 4, 5]
2: Array{Missing}((5,))
1: Missing missing
2: Missing missing
3: Missing missing
4: Missing missing
5: Missing missing
3: Array{Union{Missing, Int64}}((5,))
The downside of this approach is that it relies on the type information to be correct, which might not be the case after a transformation:
julia> df2 = df[1:2, :]
2×3 DataFrame
Row │ i x y
│ Int64 Missing Int64?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
This can be fixed by calling identity
to narrow column types, but this is again potentially expensive:
julia> identity.(df2)
2×3 DataFrame
Row │ i x y
│ Int64 Missing Missing
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
So I'd say if you're creating a DataFrame from scratch, such as reading it in via XLSX.jl (as people loooove putting empty columns in their Excel sheet) or are creating whole columns in your workflow, names(df, Not(Missing))
is the way to go, while for analysis on subsets of DataFrames it's only guaranteed to work when using identity
so that the other approaches mentioned which check every cell are viable alternatives.