Home > Mobile >  Pandas: how to select dataframe rows with a Series of values that must be matched?
Pandas: how to select dataframe rows with a Series of values that must be matched?

Time:09-22

Given a selector Series, that is a Series treated as a query of a DataFrame in that the selector Series's index is a subset of the rows of a DataFrame to be queried, and whose values are to be matched in the query, currently, I'm doing the following:

import pandas as pd

df = pd.DataFrame([{'a':1,'b':2,'c':3},{'a':11,'b':22,'c':33}, {'a':1,'b':22,'c':333}])
selector_series = pd.Series({'a':1,'b':22})
desired_df = df[pd.DataFrame([df[x]==selector_series[x] for x in selector_series.index.intersection(df.columns)]).all()]
print(desired_df)

Output:

   a   b    c
2  1  22  333

However, I suspect this kind of thing is a common enough operation that there is a more "Pandasic" (as opposed to Pythonic) way of doing the same thing. Is there?

CodePudding user response:

A more natural approach for pandas would be not to use loops but rather the eq method for comparison on the common indices:

idx = df.columns.intersection(selector_series.index)
df.loc[df[idx].eq(selector_series.loc[idx]).all(1)]

CodePudding user response:

Pandas comparison methods like eq perform an alignment operation. This is typically used to align for "outer" comparison:

Take this modified example

df = pd.DataFrame({'a': [1, 11, 1], 'b': [2, 22, 22], 'c': [3, 33, 333]})
selector_series = pd.Series({'a': 1, 'b': 22, 'd': 156})
df.eq(selector_series)
       a      b      c      d
0   True  False  False  False
1  False   True  False  False
2   True   True  False  False

Notice how the union of columns between the Series and the DataFrame are compared. In this case, however, we do not want this since c (only being in the DataFrame) and d (only being in the Series) can only ever be False.


We can instead, explicitly align to override the default how with 'inner'.

Here is a helper method:

def intersection_eq(df_: pd.DataFrame, s_: pd.Series) -> pd.DataFrame:
    aligned_df_, aligned_s_ = df_.align(s_, join='inner', axis=1, copy=False)
    return aligned_df_.eq(aligned_s_)

Now this operation returns only the common columns in both or the intersection.

intersection_eq(df, selector_series)

       a      b
0   True  False
1  False   True
2   True   True

Now we can filter as typical with all on axis=1:

df[intersection_eq(df, selector_series).all(axis=1)]

   a   b    c
2  1  22  333

Naturally, we could also add this to our DataFrame definition, or create a SubClass of DataFrame to allow for more natural syntax:

def intersection_eq(df_: pd.DataFrame, s_: pd.Series) -> pd.DataFrame:
    aligned_df_, aligned_s_ = df_.align(s_, join='inner', axis=1, copy=False)
    return aligned_df_.eq(aligned_s_)


pd.DataFrame.intersection_eq = intersection_eq

Which allows for usage like:

df[df.intersection_eq(selector_series).all(axis=1)]

   a   b    c
2  1  22  333
  • Related