I have a dataframe that looks like this: foo = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [6,7,8]})
and a list of columns list_of_columns = ['a','b']
of foo
.
The list_of_columns
are dynamically selected by the user, so it can be ['a','b']
but it can also be ['a','c']
or ['c']
or ['a','b','c']
etc
I would like for every column
in the list_of_columns
to create (nested) for
loops and query the dataframe in the following way:
In case list_of_columns = ['a','b']
the the loop would be like this:
for a in foo.a.unique():
for b in foo.b.unique():
print(foo.query(f'a=={a} and b=={b}'))
In case list_of_columns = ['a']
the the loop would be like this:
for a in foo.a.unique():
print(foo.query(f'a=={a}'))
In case list_of_columns = ['a','b','c']
the the loop would be like this:
for a in foo.a.unique():
for b in foo.b.unique():
for c in foo.c.unique():
print(foo.query(f'a=={a} and b=={b} and c=={c}'))
Is there a way to programmatically achieve that in python ?
CodePudding user response:
One approach using itertools.product
, to handle the "nested" loops:
import pandas as pd
from itertools import product
foo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [6, 7, 8]})
list_of_columns = ['a', 'b']
for p in product(*(foo[col].unique() for col in list_of_columns)):
query = " and ".join(f"{c}=={u}" for c, u in zip(list_of_columns, p))
print(foo.query(query))
print("--")
Output
a b c
0 1 4 6
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
a b c
1 2 5 7
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
Empty DataFrame
Columns: [a, b, c]
Index: []
--
a b c
2 3 6 8
--
CodePudding user response:
Essentially, it looks like you want to loop over the unique combinations?
But as you then query for valid ones, you obtain a lot of empty DataFrames. If you do not need those, a much simpler and more efficient version would be:
for _,g in foo.groupby(list_of_columns):
print('---')
print(g)
output:
---
a b c
0 1 4 6
---
a b c
1 2 5 7
---
a b c
2 3 6 8
In comparison, the output of your nested loop:
---
a b c
0 1 4 6
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
a b c
1 2 5 7
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
Empty DataFrame
Columns: [a, b, c]
Index: []
---
a b c
2 3 6 8