Home > OS >  Compare 'order' of rows within 'groups' of two separate dataframes and find the
Compare 'order' of rows within 'groups' of two separate dataframes and find the

Time:05-31

I have two large data frames that look like this (each>10 GB):

df1

     Identifier  Position Source  Location
1      AY1:2301        87    ch1        14
2     BC1U:4010       105    ch1        14
3     AC44:1230        90    ch1        15
4     AJC:93410        83    ch1        16
5     ABYY:0001       101    ch1        16
6        ABC:01        42    ch1        16
7       HH:A9CX       413    ch1        17
8       LK:9310         2    ch1        17
9     JFNE:3410       132    ch1        18
10    MKASDL:11        14    ch1        18
11   MKDFA:9401        18    ch1        18
12  MKASDL1:011       184    ch2        50
13   LKOC:AMC02        18    ch2        50
14     POI:1100       900    ch2        53
15    MCJE:09HA        11    ch2        53
16   ABYCI:1123        15    ch2        53
17     MNKA:410         1    ch2        53

df2

     Identifier  Position Source  Location
1      AY1:2301        87    ch1        14
2     BC1U:4010       105    ch1        14
3     AC44:1230        90    ch1        15
4        ABC:01        42    ch1        16
5     ABYY:0001       101    ch1        16
6     AJC:93410        83    ch1        16
7       HH:A9CX       413    ch1        17
8       LK:9310         2    ch1        17
9     MKASDL:11        14    ch1        18
10    JFNE:3410       132    ch1        18
11   MKDFA:9401        18    ch1        18
12  MKASDL1:011       184    ch2        50
13   LKOC:AMC02        18    ch2        50
14     MNKA:410         1    ch2        53
15     POI:1100       900    ch2        53
16   ABYCI:1123        15    ch2        53
17    MCJE:09HA        11    ch2        53

I want to do something similar to a "diff" but at the 'group' level (df.groupby(['Source', 'Location']))

I want to extract the original "row numbers" when the order of rows are 'swapped' within the same "Source/Location group".

The whole row should of course match in terms of content.

But I have no idea how to go about this. I can only think of writing a for loop which would be extremely inefficient when my original dataset has millions of rows.

Expected result:

Group_Source:Location  df1.index  df2.index

ch1:16                         4          6
ch1:16                         6          4
ch1:18                         9         10
ch1:18                        10          9
ch2:53                        14         15
ch2:53                        15         17
ch2:53                        17         14

Assumptions:

  • Both dataframes have the same number of rows
  • Both dataframes are identical (only order of rows are swapped, so if both are sorted by Source, then Location and then Position and then Identifier, then they will be exactly identical)
  • 'Swapped' rows always match exactly in terms of content in all columns

CodePudding user response:

Considering df1 & df2 as per your example, this should do the job:

>>> (df1
.reset_index()
.merge(df2.reset_index(), on=['Source','Location','Identifier'])
.groupby(['Source','Location','index_x'], as_index=False)[['index_y']].first()
.query('index_x != index_y'))


Source  Location    index_x index_y
             ch1    16  4   6
             ch1    16  6   4
             ch1    18  9   10
             ch1    18  10  9
             ch2    53  14  15
             ch2    53  15  17
             ch2    53  17  14
  • Related