I have combined 2 dataframes and I want to make sure that the user id is not repeated and all values are written in 1 line, and not duplicated. This is a ready-made dataframe, but I need to edit it and remove repetitions. I attach the input data:
rzd = pd.DataFrame({
'client_id': [111, 112, 113, 114, 115],
'rzd_revenue': [1093, 2810, 10283, 5774, 981]})
rzd
auto = pd.DataFrame({
'client_id': [113, 114, 115, 116, 117],
'auto_revenue': [57483, 83, 912, 4834, 98]})
auto
air = pd.DataFrame({
'client_id': [115, 116, 117, 118],
'air_revenue': [81, 4, 13, 173]})
air
client_base = pd.DataFrame({
'client_id': [111, 112, 113, 114, 115, 116, 117, 118],
'address': ['Комсомольская 4', 'Энтузиастов 8а', 'Левобережная 1а', 'Мира 14', 'ЗЖБИиДК 1',
'Строителей 18', 'Панфиловская 33', 'Мастеркова 4']})
client_base
frames = [rzd, auto, air]
result = pd.concat(frames)
result
full_result = result.merge(client_base)
full_result
client_id rzd_revenue auto_revenue air_revenue address
0 111 1093.0 NaN NaN Комсомольская 4
1 112 2810.0 NaN NaN Энтузиастов 8а
2 113 10283.0 NaN NaN Левобережная 1а
3 113 NaN 57483.0 NaN Левобережная 1а
4 114 5774.0 NaN NaN Мира 14
5 114 NaN 83.0 NaN Мира 14
6 115 981.0 NaN NaN ЗЖБИиДК 1
7 115 NaN 912.0 NaN ЗЖБИиДК 1
8 115 NaN NaN 81.0 ЗЖБИиДК 1
9 116 NaN 4834.0 NaN Строителей 18
10 116 NaN NaN 4.0 Строителей 18
11 117 NaN 98.0 NaN Панфиловская 33
12 117 NaN NaN 13.0 Панфиловская 33
13 118 NaN NaN 173.0 Мастеркова 4
The numbers are different documentation, by type merge, concat, and so on, but I did not find clear articles on editing.
As a result, it should look like this:
client_id rzd_revenue auto_revenue air_revenue address
0 111 1093.0 0.0 0.0 Комсомольская 4
1 112 2810.0 0.0 0.0 Энтузиастов 8а
2 113 10283.0 57483.0 0.0 Левобережная 1а
3 114 5774.0 83.0 0.0 Мира 14
4 115 981.0 912.0 81.0 ЗЖБИиДК 1
CodePudding user response:
If possible, you should prefer joining so that there isn't a duplicate row issue at all. However, if you're just given full_result
and not the initial tables, this can be accomplished using a groupby
expression:
full_result.groupby("client_id").first().reset_index()
For each client_id
, it takes the first non-null value of each column, then resets the index so client_id
is just another column again.
CodePudding user response:
You should be joining. You can use concat()
to do that, just need to set the index first:
frames = [df.set_index('client_id') for df in [rzd, auto, air]]
result = pd.concat(frames, axis=1)
result
rzd_revenue auto_revenue air_revenue
client_id
111 1093.0 NaN NaN
112 2810.0 NaN NaN
113 10283.0 57483.0 NaN
114 5774.0 83.0 NaN
115 981.0 912.0 81.0
116 NaN 4834.0 4.0
117 NaN 98.0 13.0
118 NaN NaN 173.0
Then to connect client_base
, you can do something similar. I'll also replace the NaNs now.
full_result = result.join(client_base.set_index('client_id')).fillna(0)
full_result
rzd_revenue auto_revenue air_revenue address
client_id
111 1093.0 0.0 0.0 Комсомольская 4
112 2810.0 0.0 0.0 Энтузиастов 8а
113 10283.0 57483.0 0.0 Левобережная 1а
114 5774.0 83.0 0.0 Мира 14
115 981.0 912.0 81.0 ЗЖБИиДК 1
116 0.0 4834.0 4.0 Строителей 18
117 0.0 98.0 13.0 Панфиловская 33
118 0.0 0.0 173.0 Мастеркова 4
At this point you could .reset_index()
if you wanted.
More info: