Home > OS >  How to join a dataframe with itself on a column and retain the unique pairs which are in the correct
How to join a dataframe with itself on a column and retain the unique pairs which are in the correct

Time:07-31

I have a dataframe df with column Items which contain item names in alphabetical order.

Items
-----
Apple
Ball
Bar
Cat

I want to join the data frame with itself to get two columns Item_x and Item_y such that in each row, the pairs of items (x,y) is unique in the sense that order in irrelevant i.e. the pair (Apple, Ball) will be considered a duplicate of the pair (Ball, Apple). So I only need to retain (Apple, Ball) because here the items are in alphabetical order and (Ball, Apple) is unwanted and must be deleted.

pd.merge(df,df, on='Items', how='outer')

does not work because it gives extra unwanted pairs such as (Apple, Apple) and (Ball, Apple)

Question: How to join data frame with itself on a column and retain only the rows with unique values the two columns which are in the correct alphabetical order?

CodePudding user response:

You're asking for combinations~

from itertools import combinations
out = pd.DataFrame(combinations(df.Items.unique(), 2), columns=['Item_x', 'Item_y'])
print(out)

Output:

  Item_x Item_y
0  Apple   Ball
1  Apple    Bar
2  Apple    Cat
3   Ball    Bar
4   Ball    Cat
5    Bar    Cat

CodePudding user response:

Another option is to compute the cross join and filter:

orig = pd.DataFrame.from_dict(
  {"Items": ["Apple", "Ball", "Bar", "Cat"]}
)

prod = pd.merge(orig, orig, how='cross')
new_df = pd.DataFrame()

l = len(orig.index)
rem_ind = [j l*i for i in range(l) for j in range(i 1)]
combination = prod.loc[~prod.index.isin(rem_ind)]
print(combination)

Output:

   Items_x Items_y
1    Apple    Ball
2    Apple     Bar
3    Apple     Cat
6     Ball     Bar
7     Ball     Cat
11     Bar     Cat
  • Related