I have a large DataFrame, say 200,000 records.
The dataset looks like:
company id
0 A 123
1 A 124
2 A 135
3 B 124
...
199,997 T 124
199,998 T 632
199,999 T 135
I would like to find all the pairs which have the same id, excluding the company itself. For example, the result should be like:
company_x company_y id_x id_y
0 A B 124 124
1 A T 124 124
2 A T 135 135
...
I know I can simply achieve it by using Cartesian product:
df['a'] = 1
df2 = pd.merge(df, df, how='left', on='a').drop('a', axis=1)
result = df2[(df2['company_x'] != df2['company_y']) & (df2['id_x'] == df2['id_y'])]
But the problem is that the DataFrame is too large for this: a full Cartesian product needs far too much memory.
Is there any better solution?
CodePudding user response:
I think instead of doing a cross merge, you should merge only on the id column, which would substantially reduce the size of the intermediate dataframe:
df.merge(df, on='id').query('company_x != company_y')
company_x id company_y
2 A 124 B
3 A 124 T
4 B 124 A
6 B 124 T
7 T 124 A
8 T 124 B
11 A 135 T
12 T 135 A
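For reference, here is this merge-on-id approach as a self-contained sketch, run on a small made-up subset shaped like the question's data (the values below are assumptions, not the asker's real records):

```python
import pandas as pd

# Hypothetical small subset of the question's data
df = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B', 'T', 'T', 'T'],
    'id':      [123, 124, 135, 124, 124, 632, 135],
})

# Self-merge on 'id' only: pandas suffixes the duplicated
# 'company' column as company_x / company_y automatically.
# Then drop the self-pairs where a company matches itself.
pairs = df.merge(df, on='id').query('company_x != company_y')
print(pairs)
```

Note that this keeps both orderings of each pair (A/B and B/A), exactly as in the output above; the intermediate size is bounded by the sum of squared group sizes per id, not by the full 200,000² cross product.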
CodePudding user response:
My solution will run significantly slower than an "ordinary" merge, but it will produce the result even for a big source DataFrame, which seems to be the main obstacle in your case.
The idea is to:
- Decompose the problem into groups with equal id.
- Perform merge solely for this group.
- To avoid "reversed" pairs ("A / B" and "B / A"), filter the output with a less-than condition on the company names.
To do it, define a group processing function:
def grpProc(grp):
    # A group with a single row cannot form any pair
    if grp.index.size == 1:
        return None
    # Merge the group with itself (all pairs within this id)
    wrk = pd.merge(grp, grp, on='id')
    # Keep each unordered pair once: drops self-pairs and reversed duplicates
    return wrk[wrk.company_x < wrk.company_y].iloc[:, [0, 2]]
Then, to get the expected result, run:
result = df.groupby('id').apply(grpProc).droplevel(1).reset_index()
The result, for your sample data, is:
id company_x company_y
0 124 A B
1 124 A T
2 124 B T
3 135 A T
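Putting the pieces together, the groupby approach can be sketched end to end on the same made-up sample used above (assumed data, not the asker's real records; on recent pandas versions `groupby(...).apply` may emit a deprecation warning about operating on the grouping column, but the result is the same):

```python
import pandas as pd

# Hypothetical small subset of the question's data
df = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B', 'T', 'T', 'T'],
    'id':      [123, 124, 135, 124, 124, 632, 135],
})

def grpProc(grp):
    # A group with a single row cannot form any pair
    if grp.index.size == 1:
        return None
    # Merge the group with itself (all pairs within this id)
    wrk = pd.merge(grp, grp, on='id')
    # Keep each unordered pair once
    return wrk[wrk.company_x < wrk.company_y].iloc[:, [0, 2]]

# apply() silently skips groups that return None (the singletons);
# droplevel(1) removes the per-group row index, leaving id as the index
result = df.groupby('id').apply(grpProc).droplevel(1).reset_index()
print(result)
```

Only the groups for ids 124 and 135 contain more than one company, so only they contribute rows, and each pair appears exactly once in alphabetical order.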