Take row pairs in one pandas dataframe and iterate through another dataframe-CodePudding

I would like to iterate through row pairs of df_a, comparing firstname1 to lastname1 and firstname2 to lastname2. So for each row pair (Ex. firstname1 & lastname1), I would then like to iterate through df_b and identify non-overlapping coordinate ranges (coordinate_start, coordinate_end) for those names, for example the coordinate ranges assigned for firstname1 do not overlap with any of the coordinate ranges for lastname1.

Starting with df_a & df_b:

a = {'ID_a': ['firstname1', 'firstname2'], 'ID_b': ['lastname1', 'lastname2']}

    
df_a = pd.DataFrame(a)



b = {'coordinate_start' :[1,6,20,35,51,1,7,15,40,51,70,85,91,70,80,94], 'coordinate_end':[5,15,27,50,55,5,14,19,47,55,78,90,93,78,84,100],
 'name': ['firstname1', 'firstname1','firstname1', 'firstname1','firstname1',
 'lastname1','lastname1','lastname1','lastname1','lastname1',
 'firstname2','firstname2','firstname2',
 'lastname2','lastname2', 'lastname2'
 ]}

df_b = pd.DataFrame(b)

I would like to return df_c, which contains the non-overlapping coordinates and the name it's associated with:

c = {'unique_coordinate_start': [20,85,80,91,94],
'unique_coordinate_end': [27,90,84,93,100],
'name': ['firstname1','firstname2', 'lastname2','firstname2','lastname2']}

df_c = pd.DataFrame(c)

CodePudding user response：

In order to compare existing coordinates easily, I created an additional dataframe (df) whose index represents the coordinates.

Code:

target_index = []
col_cs, col_ce = 'coordinate_start', 'coordinate_end'
for _, (fn, ln) in df_a.iterrows():
    df_b_subset = df_b[df_b.name.isin([fn, ln])]
    co_min, co_max = df_b_subset.agg({col_cs: 'min',col_ce: 'max'})

    df = pd.DataFrame(index=range(co_min, co_max 1), columns=[fn, ln]).fillna(-1)

    for i, (co_s, co_e, col) in df_b_subset.iterrows():
        df.loc[co_s:co_e, col] = i

    for col in [fn, ln]:
        df_agg = df.groupby(col).max()
        target_index  = df_agg[df_agg.values<0].index.tolist()

mask = df_b.index.isin(target_index)
rename_table = {col_cs: f'unique_{col_cs}', col_ce: f'unique_{col_ce}'}
df_c = df_b[mask].rename(columns=rename_table).reset_index(drop=True)

Output:

	unique_coordinate_start	unique_coordinate_end	name
0	20	27	firstname1
1	85	90	firstname2
2	91	93	firstname2
3	80	84	lastname2
4	94	100	lastname2