How to purge missing values from a DataFrame in Julia?

Time:01-04

Suppose I have the following DataFrame:

using DataFrames
df = DataFrame(
  g=["a","b","a","c",missing,missing,missing,missing],
  a=[1,2,3,4,missing,missing,missing,missing],
  Column1=[missing,missing,missing,missing,false,false,false,true],
  Column2=[missing,missing,missing,missing,false,true,true,true],
  Column3=[missing,missing,missing,missing,true,true,false,false],
)
# 8×5 DataFrame
#  Row │ g        a        Column1  Column2  Column3
#      │ String?  Int64?   Bool?    Bool?    Bool?
# ─────┼─────────────────────────────────────────────
#    1 │ a              1  missing  missing  missing
#    2 │ b              2  missing  missing  missing
#    3 │ a              3  missing  missing  missing
#    4 │ c              4  missing  missing  missing
#    5 │ missing  missing    false    false     true
#    6 │ missing  missing    false     true     true
#    7 │ missing  missing    false     true    false
#    8 │ missing  missing     true     true    false

I want to convert it to this:

# 4×5 DataFrame
#  Row │ g        a        Column1  Column2  Column3
#      │ String?  Int64?   Bool?    Bool?    Bool?
# ─────┼─────────────────────────────────────────────
#    1 │ a              1    false    false     true
#    2 │ b              2    false     true     true
#    3 │ a              3    false     true    false
#    4 │ c              4     true     true    false

I tried:

DataFrame(collect.(skipmissing.(eachcol(df))), names(df))

But I don't think this is optimal, since it relies on the collect function. Is there a better way to do it?

Answer:

For me a natural way to do it would be:

julia> mapcols(x -> filter(!ismissing, x), df)
4×5 DataFrame
 Row │ g        a       Column1  Column2  Column3
     │ String?  Int64?  Bool?    Bool?    Bool?
─────┼────────────────────────────────────────────
   1 │ a             1    false    false     true
   2 │ b             2    false     true     true
   3 │ a             3    false     true    false
   4 │ c             4     true     true    false

However, this assumes that the number of non-missing values is the same in every column (which I guess is the case in this exercise, right?).
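If that assumption might not hold, one option is to check it before filtering. The following is only a sketch: compact_columns is a hypothetical helper name, not part of DataFrames.jl, and it simply wraps the mapcols call above with an explicit check so that a mismatch produces a clear error message.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): drop missing values
# column by column, but only if every column keeps the same number of values.
function compact_columns(df::AbstractDataFrame)
    # Number of non-missing entries in each column
    counts = [count(!ismissing, col) for col in eachcol(df)]
    if length(unique(counts)) > 1
        throw(ArgumentError("columns have differing numbers of non-missing values: $counts"))
    end
    # Same approach as above: filter each column independently
    return mapcols(col -> filter(!ismissing, col), df)
end
```

Calling compact_columns(df) on the DataFrame from the question should give the same 4×5 result as the mapcols call above.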

skipmissing is designed for cases where the user wants a non-copying iterable that skips missing values, which is not the case here.
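To illustrate the difference on a plain vector, here is a small sketch using only Base Julia (the variable names are made up for the example). skipmissing wraps the vector lazily, which suits reductions, while filter and collect both allocate a new vector:

```julia
x = [1, missing, 3, missing]

itr = skipmissing(x)          # lazy wrapper around x; no new vector is allocated
total = sum(itr)              # reductions can consume the iterator directly

kept = filter(!ismissing, x)  # new vector; element type still allows missing
strict = collect(itr)         # new vector; element type narrowed to Int
```

Since a DataFrame column has to be a materialized vector anyway, a copy is unavoidable in the question's use case, so filter expresses the intent more directly.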
