I have two large data frames that look like this (each>10 GB):
df1
Identifier Position Source Location
1 AY1:2301 87 ch1 14
2 BC1U:4010 105 ch1 14
3 AC44:1230 90 ch1 15
4 AJC:93410 83 ch1 16
5 ABYY:0001 101 ch1 16
6 ABC:01 42 ch1 16
7 HH:A9CX 413 ch1 17
8 LK:9310 2 ch1 17
9 JFNE:3410 132 ch1 18
10 MKASDL:11 14 ch1 18
11 MKDFA:9401 18 ch1 18
12 MKASDL1:011 184 ch2 50
13 LKOC:AMC02 18 ch2 50
14 POI:1100 900 ch2 53
15 MCJE:09HA 11 ch2 53
16 ABYCI:1123 15 ch2 53
17 MNKA:410 1 ch2 53
df2
Identifier Position Source Location
1 AY1:2301 87 ch1 14
2 BC1U:4010 105 ch1 14
3 AC44:1230 90 ch1 15
4 ABC:01 42 ch1 16
5 ABYY:0001 101 ch1 16
6 AJC:93410 83 ch1 16
7 HH:A9CX 413 ch1 17
8 LK:9310 2 ch1 17
9 MKASDL:11 14 ch1 18
10 JFNE:3410 132 ch1 18
11 MKDFA:9401 18 ch1 18
12 MKASDL1:011 184 ch2 50
13 LKOC:AMC02 18 ch2 50
14 MNKA:410 1 ch2 53
15 POI:1100 900 ch2 53
16 ABYCI:1123 15 ch2 53
17 MCJE:09HA 11 ch2 53
I want to do something similar to a "diff" but at the 'group' level (df.groupby(['Source', 'Location'])
)
I want to extract the original "row numbers" when the order of rows are 'swapped' within the same "Source/Location group".
The whole row should of course match in terms of content.
But I have no idea how to go about this. I can only think of writing a for loop which would be extremely inefficient when my original dataset has millions of rows.
Expected result:
Group_Source:Location df1.index df2.index
ch1:16 4 6
ch1:16 6 4
ch1:18 9 10
ch1:18 10 9
ch2:53 14 15
ch2:53 15 17
ch2:53 17 14
Assumptions:
- Both dataframes have the same number of rows
- Both dataframes are identical (only order of rows are swapped, so if both are sorted by Source, then Location and then Position and then Identifier, then they will be exactly identical)
- 'Swapped' rows always match exactly in terms of content in all columns
CodePudding user response:
Considering df1 & df2 as per your example, this should do the job:
>>> (df1
.reset_index()
.merge(df2.reset_index(), on=['Source','Location','Identifier'])
.groupby(['Source','Location','index_x'], as_index=False)[['index_y']].first()
.query('index_x != index_y'))
Source Location index_x index_y
ch1 16 4 6
ch1 16 6 4
ch1 18 9 10
ch1 18 10 9
ch2 53 14 15
ch2 53 15 17
ch2 53 17 14