I have a dataframe that looks like this:
import pandas as pd

data = pd.DataFrame({"event": ["A", "B", "C", "A", "A", "E", "P", "S", "A", "Y", "A"]})
data.head(15)
event
0 A
1 B
2 C
3 A
4 A
5 E
6 P
7 S
8 A
9 Y
10 A
I want to break this dataframe into 5 smaller dataframes, splitting whenever the event "A" is found. In this case, the five dataframes I want to create would look like this:
1) event
0 A
1 B
2 C
2) event
0 A
3) event
0 A
1 E
2 P
3 S
4) event
0 A
1 Y
5) event
0 A
Is there an elegant way to do this with pandas, and also with PySpark?
CodePudding user response:
With pandas, use groupby with a helper grouper built from data['event'].eq('A').cumsum():
dfs = [g for _,g in data.groupby(data['event'].eq('A').cumsum())]
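This works because the helper grouper increments by one at each "A", so every segment that starts with an "A" gets its own group id:

data['event'].eq('A').cumsum()
0     1
1     1
2     1
3     2
4     3
5     3
6     3
7     3
8     4
9     4
10    5
Name: event, dtype: int64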
To get a fresh 0-based index in each piece, add a reset_index:
dfs = [g.reset_index(drop=True)
for _,g in data.groupby(data['event'].eq('A').cumsum())]
output (without reset_index):
[ event
0 A
1 B
2 C,
event
3 A,
event
4 A
5 E
6 P
7 S,
event
8 A
9 Y,
event
10 A]
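For PySpark, here is a minimal sketch of the same cumulative-sum idea. One caveat: Spark DataFrames have no intrinsic row order, so an explicit ordering column is assumed; monotonically_increasing_id is used below as a stand-in, which only matches input order when the data sits in a single partition. The names sdf, grp, and idx are illustrative, not from the original.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("A",), ("B",), ("C",), ("A",), ("A",),
     ("E",), ("P",), ("S",), ("A",), ("Y",), ("A",)],
    ["event"],
)

# attach an ordering column, then build the same cumulative-sum grouper
sdf = sdf.withColumn("idx", F.monotonically_increasing_id())
w = Window.orderBy("idx")
sdf = sdf.withColumn(
    "grp", F.sum(F.when(F.col("event") == "A", 1).otherwise(0)).over(w)
)

# one small DataFrame per group id
grp_ids = [r["grp"] for r in sdf.select("grp").distinct().collect()]
sdfs = [sdf.filter(F.col("grp") == g).drop("idx", "grp") for g in grp_ids]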