Suppose I have two DataFrames df1
and df2
as follows
df1 = DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 32
3 │ a 3
4 │ b 22
5 │ b 5
6 │ b 4
7 │ c 6
8 │ c 7
9 │ c 8
10 │ d 4
11 │ d 3
df2 = DataFrame(id=["a", "a", "b", "b", "b", "c", "c", "c"],
var=[1, 1, 2, 2, 2, 6, 6, 6])
8×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ b 2
4 │ b 2
5 │ b 2
6 │ c 6
7 │ c 6
8 │ c 6
The objective is to replace the var
column in df1 for each id
with the value of var
from df2
for each corresponding id
, only for those id
which exist in both df2
and df1
So the desired outcome will look like:
DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3
Tried the following but they don't work
for d1 in groupby(df1, :id)
replace!(d1.var .= [d2.var for d1 in groupby(df2, :id)])
end
#or
[[d1.var = d2.var for d2 in groupby(df2, :id)] for d1 in groupby(df1, :id)]
Will appreciate any help. Thanks!
CodePudding user response:
I would do it like this:
julia> leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
11×3 DataFrame
Row │ id var var_1
│ String Int64 Int64?
─────┼────────────────────────
1 │ a 1 1
2 │ a 32 1
3 │ a 3 1
4 │ b 22 2
5 │ b 5 2
6 │ b 4 2
7 │ c 6 6
8 │ c 7 6
9 │ c 8 6
10 │ d 4 missing
11 │ d 3 missing
julia> select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3
Note that the problem with your data is that df2
has multiple rows for the same unique :id
so I run unique
on it before joining (also note that number of values per group in df1
and df2
is not the same).
There are other ways to do it (please comment if you would like to see them), but they would involve more code (iteration through groups etc.). The solution I propose relies on the functions that are part of DataFrames.jl API.
CodePudding user response:
I believe @BogumiłKamiński answer is the way to go if you decide to use what DataFrames.jl API has to offer. If you want to try something longer but way faster, however, you might want to try the following:
D = Dict(df2[!, :id] .=> df2[!, :var])
for (i, v) in enumerate(df1[!, :id])
if v in keys(D)
df1[!, :var][i] = D[v]
end
end
Julia > df1
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3