Home > other >  Merge rows with NaN per group in Julia
Merge rows with NaN per group in Julia

Time:01-03

I have the following looking dataframe:

using DataFrames

df = DataFrame(
    group = ["A", "A", "A", "B", "B", "B"],
    V1 = [1, NaN, NaN, 3, NaN, NaN],
    V2 = [NaN, 4, NaN, 2, NaN, NaN],
    V3 = [NaN, NaN, 4, NaN, 1, NaN])

Output:

6×4 DataFrame
 Row │ group   V1       V2       V3      
     │ String  Float64  Float64  Float64 
─────┼───────────────────────────────────
   1 │ A           1.0    NaN      NaN
   2 │ A         NaN        4.0    NaN
   3 │ A         NaN      NaN        4.0
   4 │ B           3.0      2.0    NaN
   5 │ B         NaN      NaN        1.0
   6 │ B         NaN      NaN      NaN

As you can see per group there are some values. I would like to merge the rows per group so we actually remove the NaN values. Here is the desired output:

2×4 DataFrame
 Row │ group   V1     V2     V3    
     │ String  Int64  Int64  Int64 
─────┼─────────────────────────────
   1 │ A           1      4      4
   2 │ B           3      2      1

As you can see the rows are now merged into one row per group without any NaN. So I was wondering if anyone knows how to merge multiple rows with some values to remove the NaNs like above in a dataframe Julia?

CodePudding user response:

julia> combine(groupby(df, :group), Not(:group) .=> x -> filter(!isequal(NaN), x), renamecols=false)
2×4 DataFrame
 Row │ group   V1       V2       V3
     │ String  Float64  Float64  Float64
─────┼───────────────────────────────────
   1 │ A           1.0      4.0      4.0
   2 │ B           3.0      2.0      1.0

Note that the output is not what you asked for as your source columns contain floats, while you asked for integers. You could get it like this:

julia> combine(groupby(df, :group), Not(:group) .=> x -> Int.(filter(!isequal(NaN), x)), renamecols=false)
2×4 DataFrame
 Row │ group   V1     V2     V3
     │ String  Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ A           1      4      4
   2 │ B           3      2      1

Finally note that you should not use NaN to indicate missing data in your analyses. missing is an intended way to do it.

  • Related