Pandas extracting out rows to separate dataframes-CodePudding

Given this sample dataframe I am trying to extract all rows in which the Code column has a value starting with 'HCC' and all rows that follow until the next 'HCC' into separate dataframes:

Code        Description
HCC001      <empty string>
A1          Description of A1
B1          Description of B1
HCC002      <empty string>
A2          Description of A2
B2          Description of B2
C2          Description of C2
HCC003      <empty string>
....

I have tried itterrows, just looping through the columns as well.

The expected output is: Dataframe 1

Code        Description
HCC001      <empty string>
A1          Description of A1
B1          Description of B1

Dataframe 2

Code        Description
HCC002      <empty string>
A2          Description of A2
B2          Description of B2
C2          Description of C2

....

This did not work: hcc_lst = [] #check each row in the first column for 'HCC' and make a separate dataframe

for i in range(len(icd_data)):
    if 'HCC' in icd_data.iloc[i,0]:
        hcc_lst.append(icd_data.iloc[i,0])

I am sure that something with itterrows should work I just cannot find this precise answer.

Thanks for taking a look.

CodePudding user response：

Create groups and use locals() to create new python variables dynamically (strongly discouraged, maybe your should use a dictionary to store sub dataframes):

for i, subdf in df.groupby(df['Code'].str.startswith('HCC').cumsum()):
    locals()[f"df{i}"] = subdf

Output:

>>> df1
     Code        Description
0  HCC001     <empty string>
1      A1  Description of A1
2      B1  Description of B1

>>> df2
     Code        Description
3  HCC002     <empty string>
4      A2  Description of A2
5      B2  Description of B2
6      C2  Description of C2

>>> df3
     Code     Description
7  HCC003  <empty string>

CodePudding user response：

You can do that using eq for checking rows starting with "HCC" and cumsum() to segregate the DataFrames likewise:

for k, v in df.groupby(df['Code'].str[:3].eq('HCC').cumsum()):
    print(f"DataFrame {k}")
    print(v)
    print('\n')