The goal is to filter a DataFrame on a dynamic number of columns with their respective individual values. To achieve this, I've created a filter mask from a dictionary which I should be able to use each time.
However this filter mask becomes a string and therefore provides a 'KeyError'. Some example of how my logic works.
import pandas as pd
# Create a list of dictionaries with the data for each row
data = [{'col1': 1, 'col2': 'a', 'col3': True, 'col4': 1.0},
{'col1': 2, 'col2': 'b', 'col3': False, 'col4': 2.0},
{'col1': 1, 'col2': 'c', 'col3': True, 'col4': 3.0},
{'col1': 2, 'col2': 'd', 'col3': False, 'col4': 4.0},
{'col1': 1, 'col2': 'e', 'col3': True, 'col4': 5.0}]
df = pd.DataFrame(data)
filter_dict = {'col1': 1, 'col3': True,}
def create_filter_query_for_df(filter_dict):
query = ""
for i, (column, values) in enumerate(filter_dict.items()):
if i > 0:
query = " & "
if isinstance(values,float) or isinstance(values,int):
query = f"(data['{column}'] == {values})"
else:
query = f"(data['{column}'] == '{values}')"
return query
df[create_filter_query_for_df(filter_dict)]
Result is:
KeyError: "(data['col1'] == 1) & (data['col3'] == True)"
The issue is that the create_filter_query_for_df()
will return a string and it should be boolean variable. If you would make the mask as following:
mask1 = "(data['col1'] == 1) & (data['col3'] == True)" # the same error is returned;
# However if you format as below, it provides a success
mask2 = (data['col1'] == 1) & (data['col3'] == True)
The type of mask1 will be str. The type of mask2 will be boolean.
However, I can't use bool(mask1) because then I can't use it anymore as filter condition. I'm quite stuck so need some help here.
Apologies if I took a completely wrong approach in trying to get to the filter, it seemed quite a suitable solution to me.
Thanks in advance!
CodePudding user response:
The result of filtering based on mask2
is as follows:
mask2 = (df['col1'] == 1) & (df['col3'] == True)
df[mask2]
col1 col2 col3 col4
0 1 a True 1.0
2 1 c True 3.0
4 1 e True 5.0
To reach the same result with a string, we can use df.query
like so:
df.query('(col1 == 1) & (col3 == True)')
col1 col2 col3 col4
0 1 a True 1.0
2 1 c True 3.0
4 1 e True 5.0
Note that the required syntax is actually a bit different. So, let's simplify your function to end up with the string that we need:
def create_filter_query_for_df(filter_dict):
list_pairs = [f"({col} == {val})" for col, val in filter_dict.items()]
query = ' & '.join(list_pairs)
# '(col1 == 1) & (col3 == True)'
return query
df.query(create_filter_query_for_df(filter_dict))
col1 col2 col3 col4
0 1 a True 1.0
2 1 c True 3.0
4 1 e True 5.0
ALternative approach
Incidentially, if you are only using the & operator, another way to approach this problem could be as follows:
- Use a list comprehension to create two
pd.Series
and use them as input forpd.concat
withaxis
parameter set to1
. - Chain
df.all
withaxis
parameter again set to1
to evaluate if all values for each row in the resulting temporarydf
equalTrue
). - The result is a single
pd.Series
with booleans that we can use to filter thedf
.
my_mask = (pd.concat([df[k].eq(v) for k, v in filter_dict.items()],
axis=1)
.all(axis=1))
df[my_mask]
col1 col2 col3 col4
0 1 a True 1.0
2 1 c True 3.0
4 1 e True 5.0
Of course, this approach may not be ideal (or: function at all) if your actual requirements are a bit more complex.