Compare Values of 2 dataframes conditionally

Time:07-22

I have the following problem. I have a dataframe that looks like this.

Dataframe1

   start  end
0      0    2
1      3    7
2      8    9

and another dataframe which looks like this.

Dataframe2

   data
1   ...
4   ...
8   ...
11  ...

What I am trying to achieve is the following:

For each row in Dataframe1, I want to check whether any index value in Dataframe2 falls within the range (start, end) of that row. I want to store the outcome in a new column named "condition".

Since I may have to deal with large amounts of data, I tried using numpy.select.

Like this:

range_start = df1.start
range_end = df1.end

condition = [
    df2.index.to_series().between(range_start, range_end)
]

choice = ["True"]

df1["condition"] = np.select(condition, choice, default=0)

This gives me an error:

ValueError: Can only compare identically-labeled Series objects

I also tried a list comprehension, but that didn't work either. Everything I tried fails because I am dealing with Series (range_start, range_end) rather than single values. I think there has to be a way to make this work.

I already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.

So maybe you can help me out here.

Thank you!

expected output:

   start  end  condition
0      0    2       True
1      3    7       True
2      8    9       True

CodePudding user response:

Use DataFrame.drop_duplicates to remove duplicate (start, end) pairs and Index.drop_duplicates to remove duplicate index values, create all combinations with DataFrame.merge using a cross join, and finally test for at least one match per pair with GroupBy.any:

df3 = (df1.drop_duplicates(['start','end'])
          .merge(df2.index.drop_duplicates().to_frame(), how='cross'))

df3['condition'] = df3[0].between(df3.start, df3.end)

df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print (df3)
   start  end  condition
0      0    2       True
1      3    7       True
2      8    9       True
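
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frames from the question. The "data" values are placeholders because the question elides them, and the names candidates, pairs and result are only illustrative. It also shows where df3[0] comes from: an unnamed index turned into a frame with to_frame() gets a column labelled 0. As far as I can tell, the original attempt raised the ValueError because Series.between compares element by element against other Series, and the labels of df2's index (1, 4, 8, 11) do not match the labels of df1 (0, 1, 2); after the cross join every row carries its own start and end, so the comparison lines up.

```python
import pandas as pd

# Frames rebuilt from the question; the "data" values are placeholders.
df1 = pd.DataFrame({"start": [0, 3, 8], "end": [2, 7, 9]})
df2 = pd.DataFrame({"data": ["..."] * 4}, index=[1, 4, 8, 11])

# An unnamed index becomes a column labelled 0 after to_frame().
candidates = df2.index.drop_duplicates().to_frame()
print(candidates.columns.tolist())

# Cross join: every unique (start, end) pair is paired with every index value.
pairs = df1.drop_duplicates(["start", "end"]).merge(candidates, how="cross")

# All three Series now share the same row labels, so between() can compare them.
pairs["condition"] = pairs[0].between(pairs["start"], pairs["end"])

# Reduce to one flag per pair and attach it to the original rows.
flags = pairs.groupby(["start", "end"])["condition"].any()
result = df1.join(flags, on=["start", "end"])
print(result)
```

Note that how='cross' needs pandas 1.2 or newer, and that the intermediate frame has len(df1) * len(df2) rows, which can get large.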

If all pairs in df1 are unique, it is possible to use:

df3 = (df1.merge(df2.index.to_frame(), how='cross'))

df3['condition'] = df3[0].between(df3.start, df3.end)

df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
   start  end  condition
0      0    2       True
1      3    7       True
2      8    9       True
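
Since the question mentions large amounts of data, here is an alternative sketch that avoids building all combinations. This is not the approach from the answer above but a swapped-in technique based on numpy.searchsorted: sort the index of df2 once, find for each start the first index value that is >= start, and check whether that value is <= end. It assumes df2 is not empty, that both bounds are inclusive (as Series.between does by default), and that the index values are numeric; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the question; the "data" values are placeholders.
df1 = pd.DataFrame({"start": [0, 3, 8], "end": [2, 7, 9]})
df2 = pd.DataFrame({"data": ["..."] * 4}, index=[1, 4, 8, 11])

# Sorted index values of df2 (assumption: df2 has at least one row).
idx = np.sort(df2.index.to_numpy())

starts = df1["start"].to_numpy()
ends = df1["end"].to_numpy()

# Position of the first index value that is >= start, for every row of df1.
pos = np.searchsorted(idx, starts, side="left")

# pos == len(idx) means no index value is >= start for that row.
has_candidate = pos < len(idx)

# Clip positions so the lookup below never goes out of bounds.
safe_pos = np.minimum(pos, len(idx) - 1)

# The row matches if that smallest candidate is still <= end.
df1["condition"] = has_candidate & (idx[safe_pos] <= ends)
print(df1)
```

The work here is one sort plus a binary search per row of df1, instead of one comparison per (row, index value) combination.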