Home > Blockchain >  Combine data from lines in a pandas df where client_id is duplicated
Combine data from lines in a pandas df where client_id is duplicated

Time:11-05

I have combined 2 dataframes and I want to make sure that the user id is not repeated and all values are written in 1 line, and not duplicated. This is a ready-made dataframe, but I need to edit it and remove repetitions. I attach the input data:

rzd = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115],
    'rzd_revenue': [1093, 2810, 10283, 5774, 981]})
rzd

auto = pd.DataFrame({
    'client_id': [113, 114, 115, 116, 117],
    'auto_revenue': [57483, 83, 912, 4834, 98]})
auto

air = pd.DataFrame({
    'client_id': [115, 116, 117, 118],
    'air_revenue': [81, 4, 13, 173]})
air

client_base = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115, 116, 117, 118],
    'address': ['Комсомольская 4', 'Энтузиастов 8а', 'Левобережная 1а', 'Мира 14', 'ЗЖБИиДК 1', 
                'Строителей 18', 'Панфиловская 33', 'Мастеркова 4']})
client_base
frames = [rzd, auto, air]
result = pd.concat(frames)
result

full_result = result.merge(client_base)
full_result
    client_id  rzd_revenue  auto_revenue  air_revenue          address
0         111       1093.0           NaN          NaN  Комсомольская 4
1         112       2810.0           NaN          NaN   Энтузиастов 8а
2         113      10283.0           NaN          NaN  Левобережная 1а
3         113          NaN       57483.0          NaN  Левобережная 1а
4         114       5774.0           NaN          NaN          Мира 14
5         114          NaN          83.0          NaN          Мира 14
6         115        981.0           NaN          NaN        ЗЖБИиДК 1
7         115          NaN         912.0          NaN        ЗЖБИиДК 1
8         115          NaN           NaN         81.0        ЗЖБИиДК 1
9         116          NaN        4834.0          NaN    Строителей 18
10        116          NaN           NaN          4.0    Строителей 18
11        117          NaN          98.0          NaN  Панфиловская 33
12        117          NaN           NaN         13.0  Панфиловская 33
13        118          NaN           NaN        173.0     Мастеркова 4

The numbers are different documentation, by type merge, concat, and so on, but I did not find clear articles on editing.

As a result, it should look like this:

   client_id  rzd_revenue  auto_revenue  air_revenue          address
0        111       1093.0           0.0          0.0  Комсомольская 4
1        112       2810.0           0.0          0.0   Энтузиастов 8а
2        113      10283.0       57483.0          0.0  Левобережная 1а
3        114       5774.0          83.0          0.0          Мира 14
4        115        981.0         912.0         81.0        ЗЖБИиДК 1

CodePudding user response:

If possible, you should prefer joining so that there isn't a duplicate row issue at all. However, if you're just given full_result and not the initial tables, this can be accomplished using a groupby expression:

full_result.groupby("client_id").first().reset_index()

For each client_id, it takes the first non-null value of each column, then resets the index so client_id is just another column again.

CodePudding user response:

You should be joining. You can use concat() to do that, just need to set the index first:

frames = [df.set_index('client_id') for df in [rzd, auto, air]]
result = pd.concat(frames, axis=1)
result
           rzd_revenue  auto_revenue  air_revenue
client_id                                        
111             1093.0           NaN          NaN
112             2810.0           NaN          NaN
113            10283.0       57483.0          NaN
114             5774.0          83.0          NaN
115              981.0         912.0         81.0
116                NaN        4834.0          4.0
117                NaN          98.0         13.0
118                NaN           NaN        173.0

Then to connect client_base, you can do something similar. I'll also replace the NaNs now.

full_result = result.join(client_base.set_index('client_id')).fillna(0)
full_result
           rzd_revenue  auto_revenue  air_revenue          address
client_id                                                         
111             1093.0           0.0          0.0  Комсомольская 4
112             2810.0           0.0          0.0   Энтузиастов 8а
113            10283.0       57483.0          0.0  Левобережная 1а
114             5774.0          83.0          0.0          Мира 14
115              981.0         912.0         81.0        ЗЖБИиДК 1
116                0.0        4834.0          4.0    Строителей 18
117                0.0          98.0         13.0  Панфиловская 33
118                0.0           0.0        173.0     Мастеркова 4

At this point you could .reset_index() if you wanted.


More info:

  • Related