I am getting myself very confused over a problem I am encountering with a short python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd
#this function will take a raw AGS file (saved as a CSV) and convert to a
#dataframe.
#it will take the AGS CSV and print the top 5 header lines
def AGS_raw(file_loc):
raw_df = pd.read_csv(file_loc)
#print(raw_df.head())
return raw_df
import_df = AGS_raw('test.csv')
def AGS_snip(raw_df):
for i in raw_df.iterrows():
df_new_row = pd.DataFrame(i)
cut_df = pd.DataFrame(raw_df)
if "**PROJ" == True:
cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
elif "**ABBR" == True:
break
print(raw_df)
return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells as the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df)
, nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: raw text of the CSV
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The test CSV looks like this:
CodePudding user response:
The reason that "nothing happens" is likely b/c of the conditions you're using in if
and elif
.
Neither "**PROJ" == True
nor "**ABBR" == True
will ever be True
because neither "**PROJ"
nor "**ABBR"
are equal to True
. Your code is equivalent to:
def AGS_snip(raw_df):
for i in raw_df.iterrows():
df_new_row = pd.DataFrame(i)
cut_df = pd.DataFrame(raw_df)
if False:
cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
elif False:
break
print(raw_df)
return cut_df
Which is the same as:
def AGS_snip(raw_df):
for i in raw_df.iterrows():
df_new_row = pd.DataFrame(i)
cut_df = pd.DataFrame(raw_df)
print(raw_df)
return cut_df
You also always return
from inside the loop and df_new_row
isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
first_row = next(raw_df.iterrows(), None)
if first_row:
cut_df = pd.DataFrame(raw_df)
print(raw_df)
return cut_df
CodePudding user response:
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd
df = pd.read_csv('ags.csv', header=None)
# Drop rows which consist of all NaN (Not a Number) / missing values.
# Reset index order from 0 to the end of dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)
# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))
# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
table_name = df.iloc[idx[k],0]
dfs[table_name] = df.iloc[idx[k] 1:idx[k 1]].reset_index(drop=True)
# Print the titles and tables.
for k,v in dfs.items():
print(k)
print(v)
# **PROJ
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# **ABBR
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# You can rename column names with for example this code:
dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns', inplace=True)
print(dfs['**PROJ'])
# data1 data2
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0