Combine duplicate rows in Pandas-CodePudding

I got a dataframe where some rows contains almost duplicate values. I'd like to combine these rows as much as possible to reduce the row numbers. Let's say I got following dataframe:

One	Two	Three
A	B	C
B	B	B
C	A	B

In this example I'd like the output to be:

One	Two	Three
AB	B	CB
C	A	B

The real dataframe got a hundred thousand rows with ten columns.

The csv from a datafram-sample:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C

To easier show how desired outcome woud look like:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC

I've tried some code but I end up in real long code snippets which I doubt could be the best and most natural solution. Any suggestions?

CodePudding user response：

You can use something like this:

df = df.groupby(['Two'])['One','Three'].apply(''.join).reset_index()

If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.

Also this other post may help: pandas - Merge nearly duplicate rows based on column value

CodePudding user response：

If your data are all characters you can end up with this solution and collapse everything to one single row:

import pandas as pd

data = pd.read_csv("path/to/data")
collapsed = data.astype(str).sum().applymap(lambda x: ''.join(set(x)))

Check this answer on how to get unique characters in a string.