I have a DataFrame that I want to split into multiple groups. Each group should be a consecutive run of rows where the column difference equals 1. When a row breaks the run, skip it, find the next row with difference equal to 1, and start a new group.
For example this:
id difference
0 001 1
1 001 1
2 001 1
3 001 1
4 001 1
5 001 1
6 001 2
7 001 2
8 001 1
9 001 1
10 001 1
11 001 1
12 001 4
13 001 1
14 001 1
15 001 1
16 001 1
17 001 1
18 001 1
19 001 1
will produce 3 dfs: the first from 0 to 5 (inclusive), the second from 8 to 11, and the third from 13 to 19.
Right now I do it this way (I am new to pandas). Is there a more efficient way of doing it?
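For reference, a self-contained sketch reconstructing the sample frame from the table above (the id column is assumed to be a string so the leading zeros survive):

```python
import pandas as pd

# Rebuild the example data shown above.
df = pd.DataFrame({
    "id": ["001"] * 20,
    "difference": [1, 1, 1, 1, 1, 1, 2, 2, 1, 1,
                   1, 1, 4, 1, 1, 1, 1, 1, 1, 1],
})

# The expected runs of difference == 1 are rows 0-5, 8-11, and 13-19.
print(list(df.index[df["difference"] == 1]))
```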
grouped = df.groupby((df['difference'] != df['difference'].shift()).cumsum())
for group_id, group in grouped:
    if 0.9 < group['difference'].iloc[0] < 1.1 and len(group.index) > 1:
        # do stuff...
CodePudding user response:
Given your splitting condition, use cumsum to create pseudo-groups for groupby. Then use loc to ignore rows that violate the condition and extract the groups in a dict comprehension:
condition = df.difference != 1
dfs = {key: data for key, data in df.loc[~condition].groupby(condition.cumsum())}
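If it helps to verify, here is the same approach run end to end on the sample data (the frame is rebuilt so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["001"] * 20,
    "difference": [1, 1, 1, 1, 1, 1, 2, 2, 1, 1,
                   1, 1, 4, 1, 1, 1, 1, 1, 1, 1],
})

# True where a row breaks the run; its cumsum is constant inside each
# run of difference == 1, which makes it a valid groupby key.
condition = df.difference != 1
dfs = {key: data for key, data in df.loc[~condition].groupby(condition.cumsum())}

# Three groups, keyed by how many "bad" rows precede each run.
for key, group in dfs.items():
    print(key, list(group.index))
```

The keys (0, 2, 3) are not consecutive because they count the breaking rows seen so far; only the grouping matters, not the key values.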
Note that if you want to include id as a splitting condition as well, just add it to the groupby and unpack accordingly:
dfs = {key: data for (_, key), data in df.loc[~condition].groupby(['id', condition.cumsum()])}
# ^^^^^^^^ ^^^^
Output:
{0:
id difference
0 001 1
1 001 1
2 001 1
3 001 1
4 001 1
5 001 1,
2:
id difference
8 001 1
9 001 1
10 001 1
11 001 1,
3:
id difference
13 001 1
14 001 1
15 001 1
16 001 1
17 001 1
18 001 1
19 001 1}