I have the following problem. I have a dataframe which looks like this:
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is the following:
For each row in Dataframe1, I want to check whether any index value in Dataframe2 falls within range(start, end) of that row. The outcome should be stored in a new column, "condition".
Since I may have to deal with large amounts of data, I tried using numpy.select.
Like this:
range_start = df1.start
range_end = df1.end
condition = [
df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension, but that didn't work either. Everything I tried fails because I am dealing with Series (range_start, range_end). I think there has to be a way to make this work.
I already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
CodePudding user response:
Use DataFrame.drop_duplicates to remove duplicates in both the columns and the index, create all combinations with DataFrame.merge using a cross join, and finally test for at least one match per pair with GroupBy.any:
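# Cross join each unique (start, end) pair with each unique index value of df2
df3 = (df1.drop_duplicates(['start','end'])
          .merge(df2.index.drop_duplicates().to_frame(), how='cross'))
# Column 0 holds the df2 index values (the index is unnamed)
df3['condition'] = df3[0].between(df3.start, df3.end)
# Reduce to one flag per pair and attach it to the original rows
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print(df3)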
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
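To try the approach above end to end, here is a minimal self-contained sketch. The `data` values are placeholders (the question elides them), the fourth row of `df1` is added purely for illustration to show a non-matching range, and the column name `idx` is a hypothetical label chosen here instead of the default `0`.

```python
import pandas as pd

# Sample frames rebuilt from the question; "data" values are placeholders
df1 = pd.DataFrame({"start": [0, 3, 8, 12], "end": [2, 7, 9, 15]})  # last row: illustrative, no match
df2 = pd.DataFrame({"data": ["a", "b", "c", "d"]}, index=[1, 4, 8, 11])

# Cross join every unique (start, end) pair with every unique index value
pairs = df1.drop_duplicates(["start", "end"]).merge(
    df2.index.drop_duplicates().to_frame(name="idx"), how="cross"
)

# Row-wise test: both sides now share the same labels, so no alignment error
pairs["condition"] = pairs["idx"].between(pairs["start"], pairs["end"])

# One flag per (start, end) pair, joined back onto the original rows
result = df1.join(
    pairs.groupby(["start", "end"])["condition"].any(), on=["start", "end"]
)
print(result)
```

Note that Series.between includes both endpoints by default, whereas Python's range(start, end) excludes the end. If a half-open interval is intended, the `inclusive` argument of `between` (for example `inclusive="left"` in recent pandas versions) may be what you want.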
If all pairs in df1 are unique, it is possible to use:
# With unique pairs, no de-duplication or join back onto df1 is needed
df3 = df1.merge(df2.index.to_frame(), how='cross')
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print(df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
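A cross join materializes len(df1) * len(df2) rows, which may become expensive for the large data mentioned in the question. As a possible alternative (a different technique, not part of the answer above), a sorted copy of the df2 index can be searched with numpy.searchsorted. This is only a sketch, assuming numeric index values and inclusive bounds as in `between`; the `data` values are placeholders.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question; "data" values are placeholders
df1 = pd.DataFrame({"start": [0, 3, 8], "end": [2, 7, 9]})
df2 = pd.DataFrame({"data": ["a", "b", "c", "d"]}, index=[1, 4, 8, 11])

# searchsorted needs sorted input
idx = np.sort(df2.index.to_numpy())

# lo: number of index values strictly below start
# hi: number of index values less than or equal to end
lo = np.searchsorted(idx, df1["start"].to_numpy(), side="left")
hi = np.searchsorted(idx, df1["end"].to_numpy(), side="right")

# At least one index value lies in [start, end] exactly when hi > lo
df1["condition"] = hi > lo
print(df1)
```

This does one sort plus two binary searches per row of df1, so memory use stays proportional to the inputs rather than to their product.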