Home > Blockchain >  Combine duplicate rows in Pandas
Combine duplicate rows in Pandas

Time:11-05

I got a dataframe where some rows contains almost duplicate values. I'd like to combine these rows as much as possible to reduce the row numbers. Let's say I got following dataframe:

One Two Three
A B C
B B B
C A B

In this example I'd like the output to be:

One Two Three
AB B CB
C A B

The real dataframe got a hundred thousand rows with ten columns.

The csv from a datafram-sample:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C

To easier show how desired outcome woud look like:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC

I've tried some code but I end up in real long code snippets which I doubt could be the best and most natural solution. Any suggestions?

CodePudding user response:

You can use something like this:

df = df.groupby(['Two'])['One','Three'].apply(''.join).reset_index()

If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.

Also this other post may help: pandas - Merge nearly duplicate rows based on column value

CodePudding user response:

If your data are all characters you can end up with this solution and collapse everything to one single row:

import pandas as pd

data = pd.read_csv("path/to/data")
collapsed = data.astype(str).sum().applymap(lambda x: ''.join(set(x)))

Check this answer on how to get unique characters in a string.

  • Related