Combining DataFrames and filling 0s for missing data

Time:02-28

I'm trying to merge many DataFrames, one per date. If a user doesn't appear in a given date's DataFrame, I want to add a row for that date that keeps the identifying columns (e.g. user name, userID) and sets the numeric columns to 0.

import pandas as pd

df1 = pd.DataFrame({'user': ['A', 'B'],
                  'dt': ['2016-01-01', '2016-01-01'],
                  'userID': ['xxxa', 'yyyb'],
                  'val': [11, 22],
                  'val2': [111, 222]})

df2 = pd.DataFrame({'user': ['A', 'A', 'C'],
                  'dt': ['2016-02-13', '2016-02-13', '2016-02-13'],
                  'userID': ['xxxa', 'kkka', 'jjjc'],
                  'val': [33, 44, 55],
                  'val2': [333, 444, 555]})

DataFrame 1 on a certain date:

            dt  user    userID  val   val2   val3...
0   2016-01-01     A    xxxa    11   ...
1   2016-01-01     B    yyyb    22   ...

DataFrame 2 on another date:

            dt  user    userID  val   val2   val3...
0   2016-02-13     A    xxxa    33   ...
1   2016-02-13     A    kkka    44   ...
2   2016-02-13     C    jjjc    55   ...

Desired merged result:

            dt  user    userID  val   val2   val3...
0   2016-01-01     A    xxxa    11    ...
1   2016-02-13     A    xxxa    33    ...
2   2016-01-01     A    kkka    0     ...
3   2016-02-13     A    kkka    44    ...
4   2016-01-01     B    yyyb    22    ...
5   2016-02-13     B    yyyb    0     ...
6   2016-01-01     C    jjjc    0     ...
7   2016-02-13     C    jjjc    55    ...

Answer:

You could use concat + pivot + fillna to fill in the missing dates for each "user" and "userID" pair, then stack the dates (level=1) to bring the data back into the long shape. A few cosmetic steps (column order, sorting, resetting the index) then give the desired output:

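out = (pd.concat((df1, df2))
       # wide layout: one row per (userID, user), one column per (value, date)
       .pivot(index=['userID','user'], columns='dt', values=['val','val2'])
       .fillna(0)        # dates with no data for a user become 0
       .stack(level=1)   # move the dates back from columns into rows
       .reset_index()
       [['dt','user','userID','val','val2']]
       .sort_values(['user','userID','dt'])
       .reset_index(drop=True))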

Output:

           dt user userID   val   val2
0  2016-01-01    A   kkka   0.0    0.0
1  2016-02-13    A   kkka  44.0  444.0
2  2016-01-01    A   xxxa  11.0  111.0
3  2016-02-13    A   xxxa  33.0  333.0
4  2016-01-01    B   yyyb  22.0  222.0
5  2016-02-13    B   yyyb   0.0    0.0
6  2016-01-01    C   jjjc   0.0    0.0
7  2016-02-13    C   jjjc  55.0  555.0
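
Note that the values above come out as floats, because the pivot step introduces NaN before fillna runs. If integer columns are wanted, as in the desired result in the question, one possible alternative is sketched below. It is not part of the answer above: it builds every (user, userID, dt) combination with a cross join, left-joins the observed values, and casts back to integers. It assumes pandas 1.2 or later (for how='cross') and that each (user, userID, dt) combination occurs at most once; the names combined, users, dates, grid, value_cols and filled are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'user': ['A', 'B'],
                    'dt': ['2016-01-01', '2016-01-01'],
                    'userID': ['xxxa', 'yyyb'],
                    'val': [11, 22],
                    'val2': [111, 222]})

df2 = pd.DataFrame({'user': ['A', 'A', 'C'],
                    'dt': ['2016-02-13', '2016-02-13', '2016-02-13'],
                    'userID': ['xxxa', 'kkka', 'jjjc'],
                    'val': [33, 44, 55],
                    'val2': [333, 444, 555]})

combined = pd.concat([df1, df2], ignore_index=True)

# Every (user, userID) pair seen on any date, and every date seen.
users = combined[['user', 'userID']].drop_duplicates()
dates = combined[['dt']].drop_duplicates()

# Cross join: one row per (user, userID, dt) combination.
grid = users.merge(dates, how='cross')

# Numeric columns to zero-fill (assumed here to be just val and val2).
value_cols = ['val', 'val2']

# Left-join the observed values; combinations without data get NaN,
# which is replaced by 0 before casting back to integers.
filled = (grid.merge(combined, on=['user', 'userID', 'dt'], how='left')
              .fillna({c: 0 for c in value_cols})
              .astype({c: 'int64' for c in value_cols})
              .sort_values(['user', 'userID', 'dt'])
              .reset_index(drop=True)
              [['dt', 'user', 'userID', 'val', 'val2']])

print(filled)
```

With more than two dates, the same sketch should apply after concatenating all of the per-date DataFrames into combined.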