I have a df with 14 columns and 20,000 rows. I would like to create a two column dataframe that represents each unique pairing for the data entries within each single row. Example:
#sample df:
data = {'first': ['red', 'blue', 'yellow'],
'second': ['blue', 'pink', 'orange'],
'third': ['green', 'grey', None]}
df = pd.DataFrame(data)
df
first second third
0 red blue green
1 blue pink grey
2 yellow orange None
for this df input my desired output would be:
pairA pairB
0 red blue
1 red green
2 blue green
3 blue pink
4 blue grey
5 pink grey
6 yellow orange
7 yellow None
8 orange None
I have tried to use itertools product, but have only made that work column-wise for two columns. I believe a for loop would take way too long with the size of the data. Is there a pandas way to do this?
CodePudding user response:
Try this:
from itertools import combinations
combo = df.apply(lambda row: list(combinations(row, 2)), axis=1).explode().to_list()
pd.DataFrame(combo, columns=["pairA", "pairB"])
CodePudding user response:
You have many ways,
if order does not matter you can use:
from itertools import combinations
pd.DataFrame([c for x in df.agg(set, axis=1) for c in combinations(x, r=2)],
columns=['pairA', 'pairB'])
If order is important:
from itertools import combinations
pd.DataFrame([c for x in df.apply(lambda r: tuple(pd.Series.unique(r)), axis=1)
for c in combinations(x, r=2)],
columns=['pairA', 'pairB'])
output:
pairA pairB
0 red blue
1 red green
2 blue green
3 blue pink
4 blue grey
5 pink grey
6 yellow orange
7 yellow None
8 orange None