I'm trying to merge many DataFrames, one per date. If a user does not appear in a given date's DataFrame, I want to keep that user's identifying columns (e.g. user name) and set the numeric columns to 0 for that date.
import pandas as pd

df1 = pd.DataFrame({'user': ['A', 'B'],
                    'dt': ['2016-01-01', '2016-01-01'],
                    'userID': ['xxxa', 'yyyb'],
                    'val': [11, 22],
                    'val2': [111, 222]})
df2 = pd.DataFrame({'user': ['A', 'A', 'C'],
                    'dt': ['2016-02-13', '2016-02-13', '2016-02-13'],
                    'userID': ['xxxa', 'kkka', 'jjjc'],
                    'val': [33, 44, 55],
                    'val2': [333, 444, 555]})
DataFrame 1 on a certain date:
           dt user userID  val  val2  val3...
0  2016-01-01    A   xxxa   11  ...
1  2016-01-01    B   yyyb   22  ...
DataFrame 2 on another date:
           dt user userID  val  val2  val3...
0  2016-02-13    A   xxxa   33  ...
1  2016-02-13    A   kkka   44  ...
2  2016-02-13    C   jjjc   55  ...
Desired merged result:
           dt user userID  val  val2  val3...
0  2016-01-01    A   xxxa   11  ...
1  2016-02-13    A   xxxa   33  ...
2  2016-01-01    A   kkka    0  ...
3  2016-02-13    A   kkka   44  ...
4  2016-01-01    B   yyyb   22  ...
5  2016-02-13    B   yyyb    0  ...
6  2016-01-01    C   jjjc    0  ...
7  2016-02-13    C   jjjc   55  ...
CodePudding user response:
You could use concat, pivot and fillna to fill in the missing dates with 0 for each "user" and "userID" pair, then stack the dates (level=1) to get the data back into the desired shape. The remaining steps are cosmetic: reset the index, reorder the columns, and sort by user.
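out = (pd.concat((df1, df2))
         # one row per (userID, user), one column per (value column, date);
         # pivot arguments are passed by keyword, as pandas 2.0+ requires
         .pivot(index=['userID', 'user'], columns='dt', values=['val', 'val2'])
         .fillna(0)        # (user, date) combinations that were absent become 0
         .stack(level=1)   # move the dates from the columns back into rows
         .reset_index()
         [['dt', 'user', 'userID', 'val', 'val2']]
         .sort_values('user')
         .reset_index(drop=True))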
Output:
           dt user userID   val   val2
0  2016-01-01    A   kkka   0.0    0.0
1  2016-02-13    A   kkka  44.0  444.0
2  2016-01-01    A   xxxa  11.0  111.0
3  2016-02-13    A   xxxa  33.0  333.0
4  2016-01-01    B   yyyb  22.0  222.0
5  2016-02-13    B   yyyb   0.0    0.0
6  2016-01-01    C   jjjc   0.0    0.0
7  2016-02-13    C   jjjc  55.0  555.0
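
The values above come out as floats because pivot introduces NaN for the missing combinations before fillna replaces them. If integer columns matter, or there are more than two DataFrames, one possible alternative is to build every (user, userID, dt) combination explicitly and reindex with fill_value=0. This is only a sketch: the helper name fill_missing_dates and its parameters are made up for illustration, it assumes each (user, userID, dt) combination occurs at most once, and it assumes every non-key column is numeric, since fill_value=0 is applied to all of them.

```python
import pandas as pd

df1 = pd.DataFrame({'user': ['A', 'B'],
                    'dt': ['2016-01-01', '2016-01-01'],
                    'userID': ['xxxa', 'yyyb'],
                    'val': [11, 22],
                    'val2': [111, 222]})
df2 = pd.DataFrame({'user': ['A', 'A', 'C'],
                    'dt': ['2016-02-13', '2016-02-13', '2016-02-13'],
                    'userID': ['xxxa', 'kkka', 'jjjc'],
                    'val': [33, 44, 55],
                    'val2': [333, 444, 555]})


def fill_missing_dates(frames, id_cols=('user', 'userID'), date_col='dt'):
    """Hypothetical helper: one row per (id, date) pair, absent pairs filled with 0."""
    combined = pd.concat(frames, ignore_index=True)
    keys = [*id_cols, date_col]

    # Every (user, userID) pair and every date seen in any of the frames.
    ids = combined[list(id_cols)].drop_duplicates()
    dates = combined[date_col].unique()

    # Full grid of (user, userID, dt) combinations.
    full_index = pd.MultiIndex.from_tuples(
        [(*id_vals, d)
         for id_vals in ids.itertuples(index=False, name=None)
         for d in dates],
        names=keys,
    )

    # Assumes the key combinations are unique; rows missing from the data
    # get 0 in every non-key column.
    return (combined.set_index(keys)
                    .reindex(full_index, fill_value=0)
                    .reset_index()
                    .sort_values(keys)
                    .reset_index(drop=True))


filled = fill_missing_dates([df1, df2])[['dt', 'user', 'userID', 'val', 'val2']]
print(filled)
```

Any number of per-date DataFrames can be passed in the list. Sorting by all three key columns makes the row order fully determined rather than dependent on the sort algorithm.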