Find the same value in large Dataframe without using Cartesian product-CodePudding

I have a large dataframe, let's say 200,000 records.

The dataset looks like:

      company  id
0       A     123
1       A     124
2       A     135
3       B     124
...
199,997 T     124
199,998 T     632
199,999 T     135

I would like to find all the pairs which have the same id, excluding the company itself. For example, the result should be like:

   company_x company_y id_x  id_y
0      A         B      124  124
1      A         T      124  124
2      A         T      135  135
...

I know I can simply achieve it by using Cartesian product:

df['a']=1
df2=pd.merge(df,df,how='left',on='a').drop('a',axis=1)
result=df2[(df2['company_x']!=df2['company_y'])&(df2[id_x]==df2[id_y])]

But the problem is the dataframe is too large, using Cartesian product is not a good idea here, which needs a lot of memory.

Is there any better solution?

CodePudding user response：

I think instead of doing a cross merge, you should merge only on id column which would substantially reduce the size of the intermediate dataframe

df.merge(df, on='id').query('company_x != company_y')

   company_x   id company_y
2          A  124         B
3          A  124         T
4          B  124         A
6          B  124         T
7          T  124         A
8          T  124         B
11         A  135         T
12         T  135         A

CodePudding user response：

My solution will work significantly slower than "ordinary" merge, but it will produce the result, even for a big source DataFrame, what looks to be the main obstacle in your case.

The idea is to:

Decompose the problem into groups with equal id.
Perform merge solely for this group.
To avoid "reversed" pairs ("A / B" and "B / A"), filter the output by less than condition.

To do it, define a group processing function:

def grpProc(grp):
    if grp.index.size == 1:
        return None
    wrk = pd.merge(grp, grp, on='id')
    return wrk[wrk.company_x < wrk.company_y].iloc[:, [0,2]]

Then, to get the expected result, run:

result = df.groupby('id').apply(grpProc).droplevel(1).reset_index()

The result, for your sample data, is:

    id company_x company_y
0  124         A         B
1  124         A         T
2  124         B         T
3  135         A         T