def filter_dataframe(dataframe, column, numbers, strings=None):
number_query = f"({column} >= {numbers[0]} and {column} <= {numbers[1]})"
if strings is not None:
single_string_query_list = []
for string in strings:
single_string_query = f"({column} == '{string}')"
single_string_query_list.append(single_string_query)
if len(strings) == 1:
string_query = f"{' or '.join(single_string_query_list)}"
else:
string_query = f"({' or '.join(single_string_query_list)})"
full_query = number_query " or " string_query
else:
full_query = number_query
dataframe = dataframe.query(full_query)
return dataframe
I am writing a function filter_dataframe
that allows you to filter a specific dataframe by the values in a specific column. In particular, a column that contains both strings a numbers. The numbers
parameter should be a list in the form [min value, max value]
and the strings
parameter should be a list of the string values you want. For example, let's say that I want to filter the below dataframe based on the values inside the "Numbers" column
Numbers
0 1
1 2
2 3
3 4
4 5
5 6
6 "example 1"
7 7
8 "example 2"
9 "example 3"
10 8
which is created by the below code
numbers = [1, 2, 3, 4, 5, 6, "example 1", 7, "example 2", "example 3", 8]
df = pd.DataFrame(numbers, columns=["Numbers"])
and I want only the rows that have a number between 1 and 6 inclusive or contain the string "example 1" or "example 2". Assuming that filter_dataframe
works, then we would execute
filter_dataframe(dataframe = df, column = "Numbers", numbers = [1,6], strings = ["example 1", "example 2"])
Essentially what happens inside filter_dataframe
is it generates a pandas query (which is stored in the local variable full_query
) which looks something like "(Numbers >= 1 and Numbers <= 6) or ((Numbers == 'example 1') or (Numbers == 'example 2'))"
.
Obviously, because there are both strings and integers in the column, we will run into the error TypeError: '>=' not supported between instances of 'str' and 'int'
. I know this function isn't greatly written, but are there any ideas on how work around this error?
CodePudding user response:
Create a temp column, with numeric type.
Set it equal to your query column, or NaN
where
you find a string.
Now your .query(...)
is a disjunct between
a fast range query on the numeric column,
and str
set membership on the original column.
CodePudding user response:
You don't have to use query
, boolean indexing should be easier:
def filter_dataframe(dataframe, column, numbers, strings=None):
if strings is None:
strings = []
return dataframe[pd.to_numeric(dataframe[column], errors='coerce')
.between(*numbers)
| dataframe[columns].isin(strings)]