I have a pandas dataframe in the following format.
| id | name | last_name | address | x | y | x_list | y_list|
| -- | ------- | --------- | ------- | ------- | - | --------- | ----- |
| 1 | 'John' | 'Smith' | 'add_1' | 'one' | 1 | ['one'] | [1] |
| 2 | 'Tom' | 'Davis' | 'add_2' | 'two' | 2 | ['two'] | [2] |
| 3 | 'John' | 'Smith' | 'add_1' | 'three' | 3 | ['three'] | [3] |
| 4 | 'Tom' | 'Davis' | 'add_2' | 'four' | 4 | ['four'] | [4] |
| 5 | 'Susan' | 'Jones' | 'add_1' | 'one' | 1 | ['one'] | [1] |
I have no idea how to approach this problem. I need this output:
| id | name | last_name | address | x_list | y_list |
| -- | ------- | ---------- | ------- | ---------------- | ------ |
| 1 | 'John' | 'Smith' | 'add_1' | ['one', 'three'] | [1, 3] |
| 2 | 'Tom' | 'Davis' | 'add_2' | ['two', 'four'] | [2, 4] |
| 3 | 'Susan' | 'Jones' | 'add_1' | ['one'] | [1] |
Basically, I need to return a new DataFrame, or modify the existing one so the columns with the same name, last_name, and address have their x_list and y_list merged. Can anyone help me how to do this in pandas? This needs to be done on a dataframe of about 58 000 rows.
CodePudding user response:
then use following code:
df.groupby(['name', 'last_name', 'address'])[['x_list', 'y_list']].sum().reset_index()
output:
name last_name address x_list y_list
0 John Smith add_1 [ one , three ] [ 1 , 3 ]
1 Susan Jones add_1 [ one ] [ 1 ]
2 Tom Davis add_2 [ two , four ] [ 2 , 4 ]
CodePudding user response:
From what I could see, your x_list
and y_list
columns are redundant data that contain x
and y
variables in list form. If my observation is right, there is no issue with dropping those two variables. You can use a groupby
with list
aggregation.
Assuming the id
variable is your index, it will go something like this
df.groupby(['name','last_name','address'], as_index=False).agg(list)