Python - looping through rows and concating rows until a certain value is encountered-CodePudding

I am getting myself very confused over a problem I am encountering with a short python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.

import pandas as pd


#this function will take a raw AGS file (saved as a CSV) and convert to a
#dataframe.
#it will take the AGS CSV and print the top 5 header lines  
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    #print(raw_df.head())
    return raw_df
    
import_df = AGS_raw('test.csv')

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df

I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells as the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.

When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.

EDIT: raw text of the CSV

**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56

The test CSV looks like this:

enter image description here

CodePudding user response：

The reason that "nothing happens" is likely b/c of the conditions you're using in if and elif.

Neither "**PROJ" == True nor "**ABBR" == True will ever be True because neither "**PROJ" nor "**ABBR" are equal to True. Your code is equivalent to:

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df

Which is the same as:

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df

You also always return from inside the loop and df_new_row isn't used for anything, so it's equivalent to:

def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df

CodePudding user response：

Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.

import pandas as pd

df = pd.read_csv('ags.csv', header=None)

# Drop rows which consist of all NaN (Not a Number) / missing values.
# Reset index order from 0 to the end of dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)

# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))

# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k],0]
    dfs[table_name] = df.iloc[idx[k] 1:idx[k 1]].reset_index(drop=True)

# Print the titles and tables.
for k,v in dfs.items():
    print(k)
    print(v)
# **PROJ
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0
# **ABBR
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0

# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0

# You can rename column names with for example this code:
dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns', inplace=True)
print(dfs['**PROJ'])
#   data1  data2
# 0     1   32.0
# 1     1   76.0
# 2    32   56.0