Home > Blockchain >  Remove columns with all values missing in dataframe Julia
Remove columns with all values missing in dataframe Julia

Time:12-22

I have the following dataframe called df:

df = DataFrame(i=1:5,
               x=[missing, missing, missing, missing, missing],
               y=[missing, missing, 1, 3, 6])
5×3 DataFrame
 Row │ i      x        y       
     │ Int64  Missing  Int64?  
─────┼─────────────────────────
   1 │     1  missing  missing 
   2 │     2  missing  missing 
   3 │     3  missing        1
   4 │     4  missing        3
   5 │     5  missing        6

I would like to remove the columns where all values are missing. In this case it should remove column x because it has only all missing values. with dropmissing it removes all rows, but that's not what I want. So I was wondering if anyone knows how to remove only columns where all values are missing in a dataframe Julia?

CodePudding user response:

  1. A mediocre answer would be:
df1 = DataFrame()
foreach(
  x->all(ismissing, df[!, x]) ? nothing : df1[!, x] = df[!, x],
  propertynames(df)
)

df
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
  1. But a slightly better one would be using the slicing approach:
df[:, map(x->!all(ismissing, df[!, x]), propertynames(df))]
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6

# OR
df[!, map(x->!all(ismissing, x), eachcol(df))]
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6

#Or
df[!, Not(names(df, all.(ismissing, eachcol(df))))]
# I omitted the result to prevent this answer from becoming extensively lengthy.

#Or
df[!, Not(all.(ismissing, eachcol(df)))]
  1. I almost forgot the deleteat! function:
deleteat!(permutedims(df), all.(ismissing, eachcol(df))) |> permutedims

# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
  1. You can use the select! function, as Dan noted:
select!(df, [k for (k,v) in pairs(eachcol(df)) if !all(ismissing, v)])
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6

CodePudding user response:

The names functions accepts a type as an input to select columns of a specific type, so I would do:

julia> select(df, Not(names(df, Missing)))
5×2 DataFrame
 Row │ i      y       
     │ Int64  Int64?  
─────┼────────────────
   1 │     1  missing 
   2 │     2  missing 
   3 │     3        1
   4 │     4        3
   5 │     5        6

Without benchmarking this I would guess that it is also significantly faster, as it doesn't have to check each element of each column but as far as I know simply queries the type information for each column readily available in the DataFrame:

julia> dump(df)
DataFrame
  columns: Array{AbstractVector}((3,))
    1: Array{Int64}((5,)) [1, 2, 3, 4, 5]
    2: Array{Missing}((5,))
      1: Missing missing
      2: Missing missing
      3: Missing missing
      4: Missing missing
      5: Missing missing
    3: Array{Union{Missing, Int64}}((5,))

The downside of this approach is that it relies on the type information to be correct, which might not be the case after a transformation:

julia> df2 = df[1:2, :]
2×3 DataFrame
 Row │ i      x        y       
     │ Int64  Missing  Int64?  
─────┼─────────────────────────
   1 │     1  missing  missing 
   2 │     2  missing  missing

This can be fixed by calling identity to narrow column types, but this is again potentially expensive:

julia> identity.(df2)
2×3 DataFrame
 Row │ i      x        y       
     │ Int64  Missing  Missing 
─────┼─────────────────────────
   1 │     1  missing  missing 
   2 │     2  missing  missing 

So I'd say if you're creating a DataFrame from scratch, such as reading it in via XLSX.jl (as people loooove putting empty columns in their Excel sheet) or are creating whole columns in your workflow, names(df, Not(Missing)) is the way to go, while for analysis on subsets of DataFrames it's only guaranteed to work when using identity so that the other approaches mentioned which check every cell are viable alternatives.

  • Related