I'm trying to build one large DataFrame from a bunch of smaller DataFrames. I've looked at multiple sites, and they all say to create an empty DataFrame and use the pd.concat()
method. I did that, but when I print inside my for loop the data still comes out sectioned by individual DataFrame, and when I print my (now supposedly filled) DataFrame afterwards, I get an empty DF.
Note: All tables have the same structure
import camelot
import pandas as pd

# GETS 73 tables
pdf_data = camelot.read_pdf('GMCC services 8517404.pdf', flavor='lattice', pages='1', encoding='utf-8')

all_data = pd.DataFrame()

for info in pdf_data:
    # GET TABLE AS DATAFRAME
    df = info.df
    # concatenate all tables in one dataframe
    master = pd.concat([all_data, df.reset_index(drop=True)])
    # CHANGE COLUMN NAMES FOR REAL DATA
    master = master.rename({0: 'product_no', 1: 'description', 2: 'quantity', 3: 'net_price', 4: 'total'}, axis='columns')
    # GET SITE CODE
    site = master.iloc[-1, 0]
    # DROP FIRST ROW OF EVERY TABLE
    master = master.reindex(master.index.drop(0)).reset_index(drop=True)
    # CREATE COLUMN SITE WITH SITE CODE
    master['site'] = site
    # DROP LAST ROW OF EVERY TABLE
    master.drop(index=master.index[-1],
                axis=0,
                inplace=True)
    print(master)

# services = all_data.groupby('description')
# unique_services = services.apply(lambda x: x['net_price'].unique())
# print(unique_services)
What I'm currently getting:
product_no | description | quantity | net_price | site |
0 11223 | Gen desc. 1 | 5 | 19 | USCA |
1 22334 | Gen desc. 2 | 11 | 15 | USCA |
product_no | description | quantity | net_price | site |
0 55667 | Gen desc. 5 | 10 | 15 | USNY |
1 22334 | Gen desc. 2 | 3 | 11 | USNY |
What I want:
product_no | description | quantity | net_price | site |
0 11223 | Gen desc. 1 | 5 | 19 | USCA |
1 22334 | Gen desc. 2 | 11 | 15 | USCA |
2 55667 | Gen desc. 5 | 10 | 15 | USNY |
3 22334 | Gen desc. 2 | 3 | 11 | USNY |
CodePudding user response:
Try:
all_data = pd.concat([t.df for t in pdf_data], ignore_index=True)
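ignore_index=True is what gives you the continuous 0, 1, 2, ... index shown in your desired output. As for why your original loop never fills up: pd.concat returns a new DataFrame, and all_data is never reassigned, so it stays empty. A minimal sketch of that fix, reusing the names from your code:

all_data = pd.DataFrame()
for info in pdf_data:
    df = info.df
    # concat returns a new DataFrame, so assign it back or all_data stays empty
    all_data = pd.concat([all_data, df.reset_index(drop=True)], ignore_index=True)

Repeated concatenation inside a loop gets slow as the frame grows, though, which is why collecting the pieces first is usually preferable.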
If you want to loop and modify your data, you can collect them all in a list and use concat
after the loop:
list_of_df = list()
for table in pdf_data:
    # rename the positional columns to the real column names
    df = table.df.rename({0: 'product_no', 1: 'description', 2: 'quantity', 3: 'net_price', 4: 'total'}, axis=1)
    # the site code sits in the first cell of the last row
    df["site"] = df.iat[-1, 0]
    # drop the header row and the site-code row of every table
    df = df[1:-1]
    list_of_df.append(df)
master = pd.concat(list_of_df, ignore_index=True)
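If you then want the unique net prices per description (the part you had commented out in your question), that should now work on master. A minimal sketch, assuming the columns are named as above:

# unique net_price values per description
# note: camelot extracts everything as strings, so convert net_price first if you need numbers
unique_services = master.groupby('description')['net_price'].unique()
print(unique_services)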