I have two datasets (df1
and df2
) of values with a certain range (Start
and End
) in both of them.
I would like to annotate the first one (df1
) with values from column Num
of the corresponding overlapping range of values (Start
/End
) on df2
.
Example: The first row in df1
ranges from 0
-2300000
, since 2300000
is lower than the End
in first row in df2
and the whole range 0
-2300000
is overlapping with the range of 62920
-121705338
, it would be annotated with Num
3
. Similarly, also the row 2 of df1
has range 2300000
-5400000
overlapping with the range of 62920
-121705338
, row 2 would also be annotated with Num
3
.
However, in the case of the last row of df1
, the range contains two rows from df2
, so there needs to output the sum in Num
in the last two rows of df2
.
The desired output would be df3
df1.head()
|Start |End |Tag |
|---------|---------|-------|
|0 |2300000 |gneg45 |
|2300000 |5400000 |gpos25 |
|143541857|200000000|gneg34 |
df2.head()
| Start | End | Num |
|---------|---------|--------|
|62920 |121705338| 3 |
|143541857|147901334| 2 |
|147901760|151020217| 5 |
df3 =
|Start |End |Num |
|---------|---------|-------|
|0 |2300000 |3 |
|2300000 |5400000 |3 |
|143541857|200000000|7 |
I tried pandas merge
creating a key and query
based on a range of columns, but nothing really worked.
Thanks in advance!!
CodePudding user response:
From your description, you are looking for overlapping range in df1
and df2
in order for df1
to take the Num
from df2
.
To formulate the condition of overlapping range condition, let's illustrate as follows the condition of non-overlapping range:
Either:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
or:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
This non-overlapping range condition can be formulated as:
Either (df1.End
< df2.Start
) or (df1.Start
> df2.End
)
Therefore, the overlapping range condition is the negation of the above conditions, that is:
~ ((df1.End
< df2.Start
) | (df1.Start
> df2.End
))
which is equivalent to:
(df1.End
>= df2.Start
) & (df1.Start
<= df2.End
)
Solution 1: Simple Solution for small dataset
Step 1: For small dataset, you can cross join df1
and df2
by .merge()
, then filter by the overlapping condition using .query()
, as follows:
df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross'
, you can use:
df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
Intermediate result:
print(df3)
Start End Num
0 0 2300000 3
3 2300000 5400000 3
7 143541857 200000000 2
8 143541857 200000000 5
Step 2: Sum up the Num
values for same range (same Start
and End
) by .groupby()
and .sum()
:
df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Solution 2: Numpy Solution for large dataset
For large dataset and performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:
Step 1:
d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()
# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))
df3 = pd.DataFrame(
np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns '_df2')
)
Intermediate result:
print(df3)
Start End Tag Start_df2 End_df2 Num_df2
0 0 2300000 gneg45 62920 121705338 3
1 2300000 5400000 gpos25 62920 121705338 3
2 143541857 200000000 gneg34 143541857 147901334 2
3 143541857 200000000 gneg34 147901760 151020217 5
Step 2: Sum up the Num
values for same range (same Start
and End
) by .groupby()
and .sum()
:
df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7