I have two datasets (df1 and df2) of values with a certain range (Start and End) in both of them.
I would like to annotate the first one (df1) with the range of values (Start/End) on df2.
Example: The first row in df1 ranges from 0-2300000, this since 2300000 is lower than the End in first row in df2 would be annotated with Num 3, and also the row 2 of df1.
However, in the case of the last row of df1, the range contains two rows from df2, so there needs to output the sum in Num in the two rows of df2.
The desired output would be df3
df1.head()
|Start |End |Tag |
|---------|---------|-------|
|0 |2300000 |gneg45 |
|2300000 |5400000 |gpos25 |
|143541857|200000000|gneg34 |
df2.head()
| Start | End | Num |
|---------|---------|--------|
|62920 |121705338| 3 |
|143541857|147901334| 2 |
|147901760|151020217| 5 |
df3 =
|Start |End |Num |
|---------|---------|-------|
|0 |2300000 |3 |
|2300000 |5400000 |3 |
|143541857|200000000|7 |
I tried panda merge creating a key and query based on a range of columns, but nothing really worked.
Thanks in advance!!
CodePudding user response:
From your description, you are looking for overlapping range in df1
and df2
in order for df1
to take the Num
from df2
.
To formulate the condition of overlapping range condition, let's illustrate as follows the condition of non-overlapping range:
Either:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
or:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
This non-overlapping range condition can be formulated as:
Either (df1.End
< df2.Start
) or (df1.Start
> df2.End
)
Therefore, the overlapping range condition is the negation of the above conditions, that is:
~ ((df1.End
< df2.Start
) | (df1.Start
> df2.End
))
which is equivalent to:
(df1.End
>= df2.Start
) & (df1.Start
<= df2.End
)
Solution 1: Simple Solution for small dataset
Step 1: For small dataset, you can cross join df1
and df2
by .merge()
, then filter by the overlapping condition using .query()
, as follows:
df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross'
, you can use:
df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
Intermediate result:
print(df3)
Start End Num
0 0 2300000 3
3 2300000 5400000 3
7 143541857 200000000 2
8 143541857 200000000 5
Step 2: Sum up the Num
values for same range (same Start
and End
) by .groupby()
and .sum()
:
df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Solution 2: Numpy Solution for large dataset
For large dataset and performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:
Step 1:
d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()
# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))
df3 = pd.DataFrame(
np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns '_df2')
)
Intermediate result:
print(df3)
Start End Tag Start_df2 End_df2 Num_df2
0 0 2300000 gneg45 62920 121705338 3
1 2300000 5400000 gpos25 62920 121705338 3
2 143541857 200000000 gneg34 143541857 147901334 2
3 143541857 200000000 gneg34 147901760 151020217 5
Step 2: Sum up the Num
values for same range (same Start
and End
) by .groupby()
and .sum()
:
df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7