Home > Net >  How to remove rows with duplicate values in some columns, while keeping values that are different?
How to remove rows with duplicate values in some columns, while keeping values that are different?

Time:11-09

I have a pandas dataframe in the following format.

| id | name    | last_name | address | x       | y | x_list    | y_list|
| -- | ------- | --------- | ------- | ------- | - | --------- | ----- |
| 1  | 'John'  | 'Smith'   | 'add_1' | 'one'   | 1 | ['one']   | [1]   |
| 2  | 'Tom'   | 'Davis'   | 'add_2' | 'two'   | 2 | ['two']   | [2]   |
| 3  | 'John'  | 'Smith'   | 'add_1' | 'three' | 3 | ['three'] | [3]   |
| 4  | 'Tom'   | 'Davis'   | 'add_2' | 'four'  | 4 | ['four']  | [4]   |
| 5  | 'Susan' | 'Jones'   | 'add_1' | 'one'   | 1 | ['one']   | [1]   |

I have no idea how to approach this problem. I need this output:

| id | name    | last_name  | address | x_list           | y_list |
| -- | ------- | ---------- | ------- | ---------------- | ------ |
| 1  | 'John'  | 'Smith'    | 'add_1' | ['one', 'three'] | [1, 3] |
| 2  | 'Tom'   | 'Davis'    | 'add_2' | ['two', 'four']  | [2, 4] |
| 3  | 'Susan' | 'Jones'    | 'add_1' | ['one']          | [1]    |

Basically, I need to return a new DataFrame, or modify the existing one so the columns with the same name, last_name, and address have their x_list and y_list merged. Can anyone help me how to do this in pandas? This needs to be done on a dataframe of about 58 000 rows.

CodePudding user response:

then use following code:

df.groupby(['name', 'last_name', 'address'])[['x_list', 'y_list']].sum().reset_index()

output:

    name    last_name   address x_list          y_list
0   John    Smith       add_1   [ one , three ] [ 1 , 3 ]
1   Susan   Jones       add_1   [ one ]         [ 1 ]
2   Tom     Davis       add_2   [ two , four ]  [ 2 , 4 ]

CodePudding user response:

From what I could see, your x_list and y_list columns are redundant data that contain x and y variables in list form. If my observation is right, there is no issue with dropping those two variables. You can use a groupby with list aggregation.

Assuming the id variable is your index, it will go something like this

df.groupby(['name','last_name','address'], as_index=False).agg(list)

  • Related