I have the following dataframe:
using DataFrames
df = DataFrame(
group = ["A", "A", "A", "B", "B", "B"],
V1 = [1, missing, missing, 3, missing, missing],
V2 = [missing, missing, missing, 2, missing, missing],
V3 = [missing, missing, 4, missing, 1, missing],
Z1 = [3, missing, missing, 3, missing, missing],
Z2 = [3, 1, 5, 2, missing, 3],
Z3 = [missing, missing, 2, missing, missing, missing])
6×7 DataFrame
Row │ group V1 V2 V3 Z1 Z2 Z3
│ String Int64? Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────
1 │ A 1 missing missing 3 3 missing
2 │ A missing missing missing missing 1 missing
3 │ A missing missing 4 missing 5 2
4 │ B 3 2 missing 3 2 missing
5 │ B missing missing 1 missing missing missing
6 │ B missing missing missing missing 3 missing
I would like to remove the rows with all values missing, but only where the columns start with "V" in their column names. This means that row 2 and 6 should be removed because they have all values missing across the columns that start with "V". The desired output should look like this:
4×7 DataFrame
Row │ group V1 V2 V3 Z1 Z2 Z3
│ String Int64? Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────
1 │ A 1 missing missing 3 3 missing
2 │ A missing missing 4 missing 5 2
3 │ B 3 2 missing 3 2 missing
4 │ B missing missing 1 missing missing missing
So I was wondering if anyone knows how to remove rows where all values are missing across columns that start with certain column name in a dataframe Julia
?
CodePudding user response:
You can use the deleteat!
function to drop the rows of the given data frame with the given indexes:
deleteat!(df, all.(ismissing, eachrow(df[!, r"V"])))
# 4×7 DataFrame
# Row │ group V1 V2 V3 Z1 Z2 Z3
# │ String Int64? Int64? Int64? Int64? Int64? Int64?
# ─────┼──────────────────────────────────────────────────────────────
# 1 │ A 1 missing missing 3 3 missing
# 2 │ A missing missing 4 missing 5 2
# 3 │ B 3 2 missing 3 2 missing
# 4 │ B missing missing 1 missing missing missing
Another way is following this approach (slicing by a mask):
mask = map(x->!all(ismissing, x), eachrow(df[!, r"V.*"]))
df[mask, :]
# 4×7 DataFrame
# Row │ group V1 V2 V3 Z1 Z2 Z3
# │ String Int64? Int64? Int64? Int64? Int64? Int64?
# ─────┼──────────────────────────────────────────────────────────────
# 1 │ A 1 missing missing 3 3 missing
# 2 │ A missing missing 4 missing 5 2
# 3 │ B 3 2 missing 3 2 missing
# 4 │ B missing missing 1 missing missing missing
# Or
mask = broadcast(~, all.(ismissing, eachrow(df[!, r"V"])))
df[mask, :]
# Or
df[Not(all.(ismissing, eachrow(df[!, r"V"]))), :]
The r"V.*"
is a RegEx that is allowed for indexing by DataFrames.jl. Its interpretation:
V
: Starts with the V letter..
: Any char can appear..*
: Any char(s) indefinitely can appear.
The pattern could be r"^V"
which catches any sequence of chars that starts with the V
letter, or even an r"V"
could be enough.
Following this approach, another way is to create a mask DataFrame:
maskdf = select(df, AsTable(r"V") => ByRow(x-> !all(ismissing, x)) => :mask)
# 6×1 DataFrame
# Row │ mask
# │ Bool
# ─────┼───────
# 1 │ true
# 2 │ false
# 3 │ true
# 4 │ true
# 5 │ true
# 6 │ false
df[maskdf.mask, :]
# returns the desired result.