Home > Software design >  Remove rows with all missing values for columns that start with certain name in dataframe Julia
Remove rows with all missing values for columns that start with certain name in dataframe Julia

Time:01-04

I have the following dataframe:

using DataFrames

df = DataFrame(
           group = ["A", "A", "A", "B", "B", "B"],
           V1 = [1, missing, missing, 3, missing, missing],
           V2 = [missing, missing, missing, 2, missing, missing],
           V3 = [missing, missing, 4, missing, 1, missing],
           Z1 = [3, missing, missing, 3, missing, missing],
           Z2 = [3, 1, 5, 2, missing, 3],
           Z3 = [missing, missing, 2, missing, missing, missing])

6×7 DataFrame
 Row │ group   V1       V2       V3       Z1       Z2       Z3      
     │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?  
─────┼──────────────────────────────────────────────────────────────
   1 │ A             1  missing  missing        3        3  missing 
   2 │ A       missing  missing  missing  missing        1  missing 
   3 │ A       missing  missing        4  missing        5        2
   4 │ B             3        2  missing        3        2  missing 
   5 │ B       missing  missing        1  missing  missing  missing 
   6 │ B       missing  missing  missing  missing        3  missing 

I would like to remove the rows with all values missing, but only where the columns start with "V" in their column names. This means that row 2 and 6 should be removed because they have all values missing across the columns that start with "V". The desired output should look like this:

4×7 DataFrame
 Row │ group   V1       V2       V3       Z1       Z2       Z3      
     │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?  
─────┼──────────────────────────────────────────────────────────────
   1 │ A             1  missing  missing        3        3  missing 
   2 │ A       missing  missing        4  missing        5        2
   3 │ B             3        2  missing        3        2  missing 
   4 │ B       missing  missing        1  missing  missing  missing 

So I was wondering if anyone knows how to remove rows where all values are missing across columns that start with certain column name in a dataframe Julia?

CodePudding user response:

You can use the deleteat! function to drop the rows of the given data frame with the given indexes:

deleteat!(df, all.(ismissing, eachrow(df[!, r"V"])))
# 4×7 DataFrame
#  Row │ group   V1       V2       V3       Z1       Z2       Z3
#      │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
# ─────┼──────────────────────────────────────────────────────────────
#    1 │ A             1  missing  missing        3        3  missing
#    2 │ A       missing  missing        4  missing        5        2
#    3 │ B             3        2  missing        3        2  missing
#    4 │ B       missing  missing        1  missing  missing  missing

Another way is following this approach (slicing by a mask):

mask = map(x->!all(ismissing, x), eachrow(df[!, r"V.*"]))
df[mask, :]
# 4×7 DataFrame
#  Row │ group   V1       V2       V3       Z1       Z2       Z3
#      │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
# ─────┼──────────────────────────────────────────────────────────────
#    1 │ A             1  missing  missing        3        3  missing
#    2 │ A       missing  missing        4  missing        5        2
#    3 │ B             3        2  missing        3        2  missing
#    4 │ B       missing  missing        1  missing  missing  missing

# Or
mask = broadcast(~, all.(ismissing, eachrow(df[!, r"V"])))
df[mask, :]

# Or
df[Not(all.(ismissing, eachrow(df[!, r"V"]))), :]

The r"V.*" is a RegEx that is allowed for indexing by DataFrames.jl. Its interpretation:

  • V: Starts with the V letter.
  • .: Any char can appear.
  • .*: Any char(s) indefinitely can appear.

The pattern could be r"^V" which catches any sequence of chars that starts with the V letter, or even an r"V" could be enough.
Following this approach, another way is to create a mask DataFrame:

maskdf = select(df, AsTable(r"V") => ByRow(x-> !all(ismissing, x)) => :mask)
# 6×1 DataFrame
#  Row │ mask
#      │ Bool
# ─────┼───────
#    1 │  true
#    2 │ false
#    3 │  true
#    4 │  true
#    5 │  true
#    6 │ false

df[maskdf.mask, :]
# returns the desired result.
  • Related