I have thousands of recordings. I want to group the dataframe by setting starting column value(1st column) being 'B' till ending value being 'B' but ending 'B' not included. I want to work on data which is between 'B' and till 'B'. How can I do that using pandas?
B,0,15000.000000,716.881652,-0.065916
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409
B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406
B,2,17000.000000,716.930297,-0.084369
Expected output:
df1
:
B,0,15000.000000,716.881652,-0.065916
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409
df2
:
B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406
CodePudding user response:
import pandas as pd
from io import StringIO
data = """
B,0,15000.000000,716.881652,-0.065916
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409
B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406
B,2,17000.000000,716.930297,-0.084369
"""
df = pd.read_csv(StringIO(data), sep=',', header=None)
index = pd.Series(df[df[0] == "B"].index)
pd.concat([index, index.shift(-1).fillna(-1)], axis=1).astype(int).apply(lambda x: df[slice(x[0], x[1])], axis=1)
CodePudding user response:
You can slice it like that:
df = df.reset_index()
x = df.iloc[:,0]] == 'B'
df.loc[x[x].index[0]:x[x].index[1]-1]
CodePudding user response:
You can find the indices, then store chunks of the df with the right boundaries to a dict (for example).
Input data looks like this:
key val1 val2 val3 val4
0 B 0 15000.0 716.881652 -0.065916
1 K 0 -33.0 1.000000 4030.000000
2 K 1 -16.0 2.000000 4028.000000
3 K 2 -18.0 12.000000 4036.000000
4 K 3 -14.0 -3.000000 4054.000000
5 P 0 -452.0 4089.000000 329.000000
6 P 1 -428.0 4082.000000 427.000000
7 P 2 -382.0 4078.000000 518.000000
8 P 3 -363.0 4052.000000 545.000000
9 P 4 -347.0 4064.000000 508.000000
10 K 4 -2.0 17.000000 4048.000000
11 K 5 -18.0 12.000000 4048.000000
12 P 5 -373.0 4068.000000 409.000000
13 B 1 16000.0 715.443420 -0.067204
14 K 6 -16.0 -2.000000 4054.000000
15 P 6 -433.0 4082.000000 390.000000
16 K 7 -16.0 4.000000 4036.000000
17 P 7 -457.0 4104.000000 406.000000
18 B 2 17000.0 716.930297 -0.084369
# in my df first column is named 0 where I search for "B"
# you need to replace that to whatever name your columns has there
boundaries = df[df['key'] == 'B'].index.tolist()
print(f"{boundaries=}")
# [0, 13, 18]
out = {}
for idx, (start,end) in enumerate(zip(boundaries[:-1], boundaries[1:]), 1):
print(f"{idx=} , {start=} , {end=}")
out[f"df{idx}"] = df.loc[start:end-1]
print(out)
Output:
boundaries=[0, 13, 18]
idx=1 , start=0 , end=13
idx=2 , start=13 , end=18
{'df1': key val1 val2 val3 val4
0 B 0 15000.0 716.881652 -0.065916
1 K 0 -33.0 1.000000 4030.000000
2 K 1 -16.0 2.000000 4028.000000
3 K 2 -18.0 12.000000 4036.000000
4 K 3 -14.0 -3.000000 4054.000000
5 P 0 -452.0 4089.000000 329.000000
6 P 1 -428.0 4082.000000 427.000000
7 P 2 -382.0 4078.000000 518.000000
8 P 3 -363.0 4052.000000 545.000000
9 P 4 -347.0 4064.000000 508.000000
10 K 4 -2.0 17.000000 4048.000000
11 K 5 -18.0 12.000000 4048.000000
12 P 5 -373.0 4068.000000 409.000000,
'df2': key val1 val2 val3 val4
13 B 1 16000.0 715.44342 -0.067204
14 K 6 -16.0 -2.00000 4054.000000
15 P 6 -433.0 4082.00000 390.000000
16 K 7 -16.0 4.00000 4036.000000
17 P 7 -457.0 4104.00000 406.000000}
Explanation:
in boundaries you select the first column and search for all the rows in your df, where the value is B
. With index.tolist()
you only get the indices of the rows where this condition is true.
The logic behind the next step is loop through this list of indices with zip(). Both arguments of zip loop through the same list, but end
is shifted by one. Like that you can access those list elements pairwise.
Then you filter your dataframe with loc to get the rows between start and end.
enumerate() is used to get an increasing counter every loop (starting at 1, like I defined in the 2nd argument of it) to create a different key every loop and store the new dataframe with that key.
CodePudding user response:
You can simply look up the (numerical) indices where the column contains 'B'
, then split your df
into a list subdfs
accordingly:
ix = df.reset_index().index[df[0] == 'B'].tolist()
subdfs = [df.iloc[a:b] for a, b in zip(ix, ix[1:] [None])]
>>> len(subdfs)
3
Explanation
The first bit gets us the numerical indices where column 0
contains 'B'
:
ix = df.reset_index().index[df[0] == 'B'].tolist()
>>> ix
[0, 13, 18]
It is important to use .iloc
for the splitting, not .loc
, so that df.iloc[a:b]
includes row a
but excludes row b
. We also add a None
for the last subdf so that one goes from the last index of 'B'
til the end of the df
:
subdfs = [df.iloc[a:b] for a, b in zip(ix, ix[1:] [None])]
>>> subdfs[0]
0 1 2 3 4
0 B 0 15000.0 716.881652 -0.065916
1 K 0 -33.0 1.000000 4030.000000
2 K 1 -16.0 2.000000 4028.000000
3 K 2 -18.0 12.000000 4036.000000
4 K 3 -14.0 -3.000000 4054.000000
5 P 0 -452.0 4089.000000 329.000000
6 P 1 -428.0 4082.000000 427.000000
7 P 2 -382.0 4078.000000 518.000000
8 P 3 -363.0 4052.000000 545.000000
9 P 4 -347.0 4064.000000 508.000000
10 K 4 -2.0 17.000000 4048.000000
11 K 5 -18.0 12.000000 4048.000000
12 P 5 -373.0 4068.000000 409.000000
>>> subdfs[-1]
0 1 2 3 4
18 B 2 17000.0 716.930297 -0.084369
Note: This also works if df
has an Index that is not a simple RangeIndex
:
# test for 'a', 'b', ... index
from string import ascii_lowercase
df = df.set_axis(list(ascii_lowercase[:df.shape[0]]))
# repeat code above
>>> subdfs[0]
0 1 2 3 4
a B 0 15000.0 716.881652 -0.065916
b K 0 -33.0 1.000000 4030.000000
c K 1 -16.0 2.000000 4028.000000
d K 2 -18.0 12.000000 4036.000000
e K 3 -14.0 -3.000000 4054.000000
f P 0 -452.0 4089.000000 329.000000
g P 1 -428.0 4082.000000 427.000000
h P 2 -382.0 4078.000000 518.000000
i P 3 -363.0 4052.000000 545.000000
j P 4 -347.0 4064.000000 508.000000
k K 4 -2.0 17.000000 4048.000000
l K 5 -18.0 12.000000 4048.000000
m P 5 -373.0 4068.000000 409.000000