I have a dataframe df
with column Items which contain item names in alphabetical order.
Items
-----
Apple
Ball
Bar
Cat
I want to join the data frame with itself to get two columns Item_x
and Item_y
such that in each row, the pairs of items (x,y) is unique in the sense that order in irrelevant i.e. the pair (Apple, Ball) will be considered a duplicate of the pair (Ball, Apple). So I only need to retain (Apple, Ball) because here the items are in alphabetical order and (Ball, Apple) is unwanted and must be deleted.
pd.merge(df,df, on='Items', how='outer')
does not work because it gives extra unwanted pairs such as (Apple, Apple)
and (Ball, Apple)
Question: How to join data frame with itself on a column and retain only the rows with unique values the two columns which are in the correct alphabetical order?
CodePudding user response:
You're asking for combinations~
from itertools import combinations
out = pd.DataFrame(combinations(df.Items.unique(), 2), columns=['Item_x', 'Item_y'])
print(out)
Output:
Item_x Item_y
0 Apple Ball
1 Apple Bar
2 Apple Cat
3 Ball Bar
4 Ball Cat
5 Bar Cat
CodePudding user response:
Another option is to compute the cross join and filter:
orig = pd.DataFrame.from_dict(
{"Items": ["Apple", "Ball", "Bar", "Cat"]}
)
prod = pd.merge(orig, orig, how='cross')
new_df = pd.DataFrame()
l = len(orig.index)
rem_ind = [j l*i for i in range(l) for j in range(i 1)]
combination = prod.loc[~prod.index.isin(rem_ind)]
print(combination)
Output:
Items_x Items_y
1 Apple Ball
2 Apple Bar
3 Apple Cat
6 Ball Bar
7 Ball Cat
11 Bar Cat