Replacing a column in a DF from another DF in Julia-CodePudding

Suppose I have two DataFrames df1 and df2 as follows

df1 = DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a          32
   3 │ a           3
   4 │ b          22
   5 │ b           5
   6 │ b           4
   7 │ c           6
   8 │ c           7
   9 │ c           8
  10 │ d           4
  11 │ d           3

df2 = DataFrame(id=["a", "a", "b", "b", "b", "c", "c", "c"],
    var=[1, 1, 2, 2, 2, 6, 6, 6])

8×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ b           2
   4 │ b           2
   5 │ b           2
   6 │ c           6
   7 │ c           6
   8 │ c           6

The objective is to replace the var column in df1 for each id with the value of var from df2 for each corresponding id, only for those id which exist in both df2 and df1

So the desired outcome will look like:

DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
    var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])

11×2 DataFrame
 Row │ id      var   
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

Tried the following but they don't work

for d1 in groupby(df1, :id)
    replace!(d1.var .= [d2.var for d1 in groupby(df2, :id)])
end

#or

[[d1.var = d2.var for d2 in groupby(df2, :id)] for d1 in groupby(df1, :id)]

Will appreciate any help. Thanks!

CodePudding user response：

I would do it like this:

julia> leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
11×3 DataFrame
 Row │ id      var    var_1
     │ String  Int64  Int64?
─────┼────────────────────────
   1 │ a           1        1
   2 │ a          32        1
   3 │ a           3        1
   4 │ b          22        2
   5 │ b           5        2
   6 │ b           4        2
   7 │ c           6        6
   8 │ c           7        6
   9 │ c           8        6
  10 │ d           4  missing
  11 │ d           3  missing

julia> select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3

Note that the problem with your data is that df2 has multiple rows for the same unique :id so I run unique on it before joining (also note that number of values per group in df1 and df2 is not the same).

There are other ways to do it (please comment if you would like to see them), but they would involve more code (iteration through groups etc.). The solution I propose relies on the functions that are part of DataFrames.jl API.

CodePudding user response：

I believe @BogumiłKamiński answer is the way to go if you decide to use what DataFrames.jl API has to offer. If you want to try something longer but way faster, however, you might want to try the following:

D = Dict(df2[!, :id] .=> df2[!, :var])

for (i, v) in enumerate(df1[!, :id])
    if v in keys(D)
        df1[!, :var][i] = D[v]
    end 
end

Julia > df1
11×2 DataFrame
 Row │ id      var
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ a           1
   4 │ b           2
   5 │ b           2
   6 │ b           2
   7 │ c           6
   8 │ c           6
   9 │ c           6
  10 │ d           4
  11 │ d           3