Group dataframe using start and end column value-CodePudding

I have thousands of recordings. I want to group the dataframe by setting starting column value(1st column) being 'B' till ending value being 'B' but ending 'B' not included. I want to work on data which is between 'B' and till 'B'. How can I do that using pandas?

B,0,15000.000000,716.881652,-0.065916 
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409
B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406
B,2,17000.000000,716.930297,-0.084369

Expected output:

df1:

B,0,15000.000000,716.881652,-0.065916 
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409

df2:

B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406

CodePudding user response：

import pandas as pd
from io import StringIO
data = """
B,0,15000.000000,716.881652,-0.065916 
K,0,-33,1,4030
K,1,-16,2,4028
K,2,-18,12,4036
K,3,-14,-3,4054
P,0,-452,4089,329
P,1,-428,4082,427
P,2,-382,4078,518
P,3,-363,4052,545
P,4,-347,4064,508
K,4,-2,17,4048
K,5,-18,12,4048
P,5,-373,4068,409
B,1,16000.000000,715.443420,-0.067204
K,6,-16,-2,4054
P,6,-433,4082,390
K,7,-16,4,4036
P,7,-457,4104,406
B,2,17000.000000,716.930297,-0.084369
"""
df = pd.read_csv(StringIO(data), sep=',', header=None)
index = pd.Series(df[df[0] == "B"].index)
pd.concat([index, index.shift(-1).fillna(-1)], axis=1).astype(int).apply(lambda x: df[slice(x[0], x[1])], axis=1)

CodePudding user response：

You can slice it like that:

df = df.reset_index()
x = df.iloc[:,0]] == 'B'
df.loc[x[x].index[0]:x[x].index[1]-1]

CodePudding user response：

You can find the indices, then store chunks of the df with the right boundaries to a dict (for example).

Input data looks like this:

   key  val1     val2         val3         val4
0    B     0  15000.0   716.881652    -0.065916
1    K     0    -33.0     1.000000  4030.000000
2    K     1    -16.0     2.000000  4028.000000
3    K     2    -18.0    12.000000  4036.000000
4    K     3    -14.0    -3.000000  4054.000000
5    P     0   -452.0  4089.000000   329.000000
6    P     1   -428.0  4082.000000   427.000000
7    P     2   -382.0  4078.000000   518.000000
8    P     3   -363.0  4052.000000   545.000000
9    P     4   -347.0  4064.000000   508.000000
10   K     4     -2.0    17.000000  4048.000000
11   K     5    -18.0    12.000000  4048.000000
12   P     5   -373.0  4068.000000   409.000000
13   B     1  16000.0   715.443420    -0.067204
14   K     6    -16.0    -2.000000  4054.000000
15   P     6   -433.0  4082.000000   390.000000
16   K     7    -16.0     4.000000  4036.000000
17   P     7   -457.0  4104.000000   406.000000
18   B     2  17000.0   716.930297    -0.084369

# in my df first column is named 0 where I search for "B"
# you need to replace that to whatever name your columns has there
boundaries = df[df['key'] == 'B'].index.tolist()
print(f"{boundaries=}")
# [0, 13, 18]

out = {}
for idx, (start,end) in enumerate(zip(boundaries[:-1], boundaries[1:]), 1):
    print(f"{idx=} , {start=} , {end=}")
    out[f"df{idx}"] = df.loc[start:end-1]
    
print(out)

Output:

boundaries=[0, 13, 18]
idx=1 , start=0 , end=13
idx=2 , start=13 , end=18

{'df1':    key  val1     val2         val3         val4
0    B     0  15000.0   716.881652    -0.065916
1    K     0    -33.0     1.000000  4030.000000
2    K     1    -16.0     2.000000  4028.000000
3    K     2    -18.0    12.000000  4036.000000
4    K     3    -14.0    -3.000000  4054.000000
5    P     0   -452.0  4089.000000   329.000000
6    P     1   -428.0  4082.000000   427.000000
7    P     2   -382.0  4078.000000   518.000000
8    P     3   -363.0  4052.000000   545.000000
9    P     4   -347.0  4064.000000   508.000000
10   K     4     -2.0    17.000000  4048.000000
11   K     5    -18.0    12.000000  4048.000000
12   P     5   -373.0  4068.000000   409.000000, 
'df2':    key  val1     val2        val3         val4
13   B     1  16000.0   715.44342    -0.067204
14   K     6    -16.0    -2.00000  4054.000000
15   P     6   -433.0  4082.00000   390.000000
16   K     7    -16.0     4.00000  4036.000000
17   P     7   -457.0  4104.00000   406.000000}

Explanation: in boundaries you select the first column and search for all the rows in your df, where the value is B. With index.tolist() you only get the indices of the rows where this condition is true.

The logic behind the next step is loop through this list of indices with zip(). Both arguments of zip loop through the same list, but end is shifted by one. Like that you can access those list elements pairwise. Then you filter your dataframe with loc to get the rows between start and end. enumerate() is used to get an increasing counter every loop (starting at 1, like I defined in the 2nd argument of it) to create a different key every loop and store the new dataframe with that key.

CodePudding user response：

You can simply look up the (numerical) indices where the column contains 'B', then split your df into a list subdfs accordingly:

ix = df.reset_index().index[df[0] == 'B'].tolist()
subdfs = [df.iloc[a:b] for a, b in zip(ix, ix[1:]   [None])]
>>> len(subdfs)
3

Explanation

The first bit gets us the numerical indices where column 0 contains 'B':

ix = df.reset_index().index[df[0] == 'B'].tolist()
>>> ix
[0, 13, 18]

It is important to use .iloc for the splitting, not .loc, so that df.iloc[a:b] includes row a but excludes row b. We also add a None for the last subdf so that one goes from the last index of 'B' til the end of the df:

subdfs = [df.iloc[a:b] for a, b in zip(ix, ix[1:]   [None])]
>>> subdfs[0]
    0  1        2            3            4
0   B  0  15000.0   716.881652    -0.065916
1   K  0    -33.0     1.000000  4030.000000
2   K  1    -16.0     2.000000  4028.000000
3   K  2    -18.0    12.000000  4036.000000
4   K  3    -14.0    -3.000000  4054.000000
5   P  0   -452.0  4089.000000   329.000000
6   P  1   -428.0  4082.000000   427.000000
7   P  2   -382.0  4078.000000   518.000000
8   P  3   -363.0  4052.000000   545.000000
9   P  4   -347.0  4064.000000   508.000000
10  K  4     -2.0    17.000000  4048.000000
11  K  5    -18.0    12.000000  4048.000000
12  P  5   -373.0  4068.000000   409.000000

>>> subdfs[-1]
    0  1        2           3         4
18  B  2  17000.0  716.930297 -0.084369

Note: This also works if df has an Index that is not a simple RangeIndex:

# test for 'a', 'b', ... index
from string import ascii_lowercase

df = df.set_axis(list(ascii_lowercase[:df.shape[0]]))

# repeat code above

>>> subdfs[0]
   0  1        2            3            4
a  B  0  15000.0   716.881652    -0.065916
b  K  0    -33.0     1.000000  4030.000000
c  K  1    -16.0     2.000000  4028.000000
d  K  2    -18.0    12.000000  4036.000000
e  K  3    -14.0    -3.000000  4054.000000
f  P  0   -452.0  4089.000000   329.000000
g  P  1   -428.0  4082.000000   427.000000
h  P  2   -382.0  4078.000000   518.000000
i  P  3   -363.0  4052.000000   545.000000
j  P  4   -347.0  4064.000000   508.000000
k  K  4     -2.0    17.000000  4048.000000
l  K  5    -18.0    12.000000  4048.000000
m  P  5   -373.0  4068.000000   409.000000