I started learning pandas recently and I can't figure out why people use filter masks when there is a query method. They seem exactly the same to me, but query is more convenient to use, at least for me.
I have done a couple of comparisons between them.
CodePudding user response:
In pandas, .filter() is a method that subsets a DataFrame by the labels of its index or columns. It takes a list of labels (items), a substring (like), or a regular expression (regex) and returns a new DataFrame with only the rows or columns whose labels match. It never looks at the values stored in the data.
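To make the label-based behaviour concrete, here is a small sketch. The DataFrame, its column names, and its row labels are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical data: three columns, rows labelled by fruit name.
df = pd.DataFrame(
    {"price_usd": [10, 20], "price_eur": [9, 18], "qty": [1, 2]},
    index=["apple", "banana"],
)

by_items = df.filter(items=["price_usd", "qty"])  # exact column names
by_like = df.filter(like="price")                 # column names containing "price"
by_regex = df.filter(regex="_eur$")               # column names matching a regex
rows_like = df.filter(like="an", axis=0)          # row labels containing "an"

print(list(by_like.columns))
print(list(rows_like.index))
```

In every call the match is made against the names of the columns or rows, not against the numbers inside them.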
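For example, suppose we have a DataFrame with two columns and five rows, and we want to keep only the rows where the value in the 'A' column is greater than 0. .filter() cannot do this, because it matches label names and never looks at the values: df.filter(like='A', axis=0) would search the row labels 0 to 4 for the substring 'A' and return an empty DataFrame. Selecting rows by value is done with a boolean mask instead:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, -1, 0], 'B': [2, 3, 4, 5, 6]})
mask = df['A'] > 0  # boolean Series, True where A is positive
df_filtered = df[mask]
print(df_filtered)
This returns a new DataFrame with only the rows where the value in the 'A' column is greater than 0:
   A  B
0  1  2
1  2  3
2  3  4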
On the other hand, .query() is a method that can be used to subset a DataFrame using a boolean expression. It takes a string that represents the expression and returns a new DataFrame with only the rows that satisfy the expression.
Using the same example as above, we could use the .query() method to achieve the same result like this:
df_filtered = df.query('A > 0')
print(df_filtered)
This would also return a new DataFrame with only the rows where the value in the 'A' column is greater than 0:
   A  B
0  1  2
1  2  3
2  3  4
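Both .filter() and .query() subset a DataFrame, but they answer different questions. .filter() selects rows or columns by their labels, while .query() selects rows by a boolean expression over the values. A boolean mask such as df[df['A'] > 0] does the same job as df.query('A > 0'); the two differ mainly in syntax.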
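As a rough check that a boolean mask and .query() select the same rows, here is a minimal sketch on the DataFrame from above. The variable name threshold and the extra condition on 'B' are assumptions added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, -1, 0], 'B': [2, 3, 4, 5, 6]})
threshold = 0  # hypothetical local variable used by both versions

# Boolean mask: conditions are combined with & and each needs parentheses.
masked = df[(df['A'] > threshold) & (df['B'] < 4)]

# query: the same condition as a string; @ refers to a local Python variable.
queried = df.query('A > @threshold and B < 4')

print(masked.equals(queried))
```

The mask is an ordinary boolean Series, so it can be stored, combined with other masks, or passed to .loc, whereas .query() keeps the condition in a single readable string. Which one to use is largely a matter of taste.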