I have a pandas DataFrame and I want to filter/select columns based on the elements of an input list. So, for example, I have something like:
import pandas as pd

filters = ['category', 'name']
# I am just trying to select the columns which would match as follows:
data = {'category_name_courses': ["Spark", "PySpark", "Python", "pandas"],
        'category_name_area': ["cloud", "cloud", "prog", "ds"],
        'some_other_column': [0, 0, 0, 0]}
x = pd.DataFrame(data)
selections = list()
for col in x.columns:
    if ('name' in col) and ('category' in col):
        selections.append(col)
In my case, this if condition (or some other way of selecting the columns) should be built by 'AND'-ing together every element of the input list, rather than hard-coding 'name' and 'category'.
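For reference, once selections is built I would subset the frame with it, e.g. (continuing from the snippet above):

x[selections]   # expected to keep category_name_courses and category_name_area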
CodePudding user response:
IIUC, do you want this? Build a regex that requires every filter term to be present (one lookahead per term, so the terms are effectively AND-ed) and pass it to DataFrame.filter:

reg_str = ''.join(f'(?=.*{f})' for f in filters)   # '(?=.*category)(?=.*name)'
x.filter(regex=reg_str)
Output:
  category_name_courses category_name_area
0                 Spark              cloud
1               PySpark              cloud
2                Python               prog
3                pandas                 ds
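As an alternative sketch, if you would rather avoid regular expressions, you can build one boolean mask per filter term and AND the masks together:

import numpy as np

mask = np.logical_and.reduce([x.columns.str.contains(f) for f in filters])
x.loc[:, mask]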
CodePudding user response:
Your edit shows that you want to filter columns based on their name.
Simply use:
filters = ['category', 'name']
for col in x.columns:
    if all(f in col for f in filters):
        print(col)
Output:
category_name_courses
category_name_area
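To go one step further and actually select the matching columns, a short sketch building on the loop above:

keep = [col for col in x.columns if all(f in col for f in filters)]
x[keep]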
older answer: filtering values
You can do almost what you suggested:
x = pd.DataFrame([['flow', 'x', 'category'], ['x', 'x', 'flow']])
for col in x.columns:
    if ('flow' in x[col].values) and ('category' in x[col].values):
        # Do something with this column...
        print(f'column "{col}" matches')
Using a list of matches:
filters = ['category', 'flow']
for col in x.columns:
    if all(f in x[col].values for f in filters):
        # Do something with this column...
        print(f'column "{col}" matches')
Or, more efficiently, using a set:
filters = {'category', 'flow'}
for col in x.columns:
    if set(x[col]) >= filters:
        # Do something with this column...
        print(f'column "{col}" matches')
Example:
column "2" matches
CodePudding user response:
Try this (if I understand your problem correctly): you have a list of conditions, and you want to make sure that the column name includes all of them.
filters = ['category', 'name']
# x is some pandas DataFrame
# this is equivalent to what you were trying to build:
for col in x.columns:
    boo = True
    for condition in filters:
        if condition not in col:
            boo = False
            break
    if boo:
        # do something with the matching column
        print(col)
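The flag-and-break pattern above is exactly what the built-in all() condenses into one line; a minimal equivalent sketch:

for col in x.columns:
    if all(condition in col for condition in filters):
        print(col)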
CodePudding user response:
I did not really understand the question. I assume you want to find all column names which contain any of the values in the given list.
A pandas DataFrame has loads of methods and functionality. You may be familiar with the dt accessor, which lets you apply datetime functionality to a given column; in the same way you can use the str accessor for string functionality.
So you can check whether a string occurs in each value (row) of a Series (or Index):
a_series.str.contains("string")
It returns a boolean mask that is True where the condition is met.
You can use | (regex "or") to check for multiple strings at once:
a_series.str.contains("string|something")
So the line should look like:
x.columns.str.contains("|".join(filters))
Now that you have the mask, you can apply it:
x.columns[x.columns.str.contains("|".join(filters))]
Since you have the column names, you can access the column data itself:
x[x.columns[x.columns.str.contains("|".join(filters))]]
Output:
  category_name_courses category_name_area
0                 Spark              cloud
1               PySpark              cloud
2                Python               prog
3                pandas                 ds
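Note that joining with | gives "any of the terms" semantics. If, as in the question, every term must be present, a sketch of the same idea with AND semantics (one regex lookahead per term) and .loc to avoid the double indexing:

all_mask = x.columns.str.contains("".join(f"(?=.*{f})" for f in filters))
x.loc[:, all_mask]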