I would like to iterate through row pairs of df_a, comparing firstname1 to lastname1 and firstname2 to lastname2. So for each row pair (Ex. firstname1 & lastname1), I would then like to iterate through df_b and identify non-overlapping coordinate ranges (coordinate_start, coordinate_end) for those names, for example the coordinate ranges assigned for firstname1 do not overlap with any of the coordinate ranges for lastname1.
Starting with df_a & df_b:
a = {'ID_a': ['firstname1', 'firstname2'], 'ID_b': ['lastname1', 'lastname2']}
df_a = pd.DataFrame(a)
b = {'coordinate_start' :[1,6,20,35,51,1,7,15,40,51,70,85,91,70,80,94], 'coordinate_end':[5,15,27,50,55,5,14,19,47,55,78,90,93,78,84,100],
'name': ['firstname1', 'firstname1','firstname1', 'firstname1','firstname1',
'lastname1','lastname1','lastname1','lastname1','lastname1',
'firstname2','firstname2','firstname2',
'lastname2','lastname2', 'lastname2'
]}
df_b = pd.DataFrame(b)
I would like to return df_c, which contains the non-overlapping coordinates and the name it's associated with:
c = {'unique_coordinate_start': [20,85,80,91,94],
'unique_coordinate_end': [27,90,84,93,100],
'name': ['firstname1','firstname2', 'lastname2','firstname2','lastname2']}
df_c = pd.DataFrame(c)
CodePudding user response:
In order to compare existing coordinates easily, I created an additional dataframe (df
) whose index represents the coordinates.
Code:
target_index = []
col_cs, col_ce = 'coordinate_start', 'coordinate_end'
for _, (fn, ln) in df_a.iterrows():
df_b_subset = df_b[df_b.name.isin([fn, ln])]
co_min, co_max = df_b_subset.agg({col_cs: 'min',col_ce: 'max'})
df = pd.DataFrame(index=range(co_min, co_max 1), columns=[fn, ln]).fillna(-1)
for i, (co_s, co_e, col) in df_b_subset.iterrows():
df.loc[co_s:co_e, col] = i
for col in [fn, ln]:
df_agg = df.groupby(col).max()
target_index = df_agg[df_agg.values<0].index.tolist()
mask = df_b.index.isin(target_index)
rename_table = {col_cs: f'unique_{col_cs}', col_ce: f'unique_{col_ce}'}
df_c = df_b[mask].rename(columns=rename_table).reset_index(drop=True)
Output:
unique_coordinate_start | unique_coordinate_end | name | |
---|---|---|---|
0 | 20 | 27 | firstname1 |
1 | 85 | 90 | firstname2 |
2 | 91 | 93 | firstname2 |
3 | 80 | 84 | lastname2 |
4 | 94 | 100 | lastname2 |