I am trying to stack multiple workbooks by sheet in python. Currently my folder contains over 50 individual workbooks which are separated by date and usually contain up to 3 sheets, although not always. The sheets include "Pay", "Tablet", "SIMO".
I want to try to stack all data from the "Tablet" sheet into a new workbook and have been using the following code.
**import os
import pandas as pd
path = r"path_location"
files = os.listdir(path)
df = pd.DataFrame()
for file in files:
if file.endswith('.xlsx'):
df = df.append(pd.read_excel(file, sheet_name='Tablet'), ignore_index=True)
df.head()
df.to_csv('Tablet_total.csv')**
however after checking the files I realised this is not pulling from all workbooks that have the sheet "Tablet". I suspect this may be due to the fact that not all workbooks have this sheet but in case I missed anything I'd greatly appreciate some ideas to what I may be doing wrong.
Also, as a final request, the sheet "Tablet" across all workbooks has unnecessary columns in the beginning. I have tried incorporating df.drop(index=df.index[:7], axis=0, inplace=True) into my loop yet this only removes the first 7 rows from the first iteration. Again any support with this would be greatly appreciated.
Many thanks
CodePudding user response:
First, I would check that you do not have any .xls
files or other excel file suffixes with:
import os
path = r"path_location"
files = os.listdir(path)
print({
file.split('.')[1]
for file in files
})
Then, I would check that you don't have any sheets with trailing white spaces or capitalization issues with:
import os
import pandas
path = r"path_location"
files = os.listdir(path)
print({
sheet_name
for file in files
if file.endswith('.xlsx')
for sheet_name in pandas.ExcelFile(file).sheet_names
})
I would use pandas.concat()
with a list comprehension to concatenate the sheets. I would also add a check to ensure that the worksheet as a sheet named 'Tablet'. Finally, if want to not use the first seven columns you should a) do it on each dataframe as it's read in and before it is concatenated together with the other dataframes, and b) first include all the rows and then specify the columns with .iloc[:, 7:]
import os
import pandas
path = r"path_location"
files = os.listdir(path)
df = pandas.concat([
pandas.read_excel(file, sheet_name='Tablet').iloc[:, 7:]
for file in files
if file.endswith('.xlsx') and 'Tablet' in pandas.ExcelFile(file).sheet_names
])
df.head()
CodePudding user response:
- Check if you have excel files with other extensions - xlsm, xlsb and others?
- In order to remove seven rows at each iteration, you need to select a temporary dataframe when reading from excel and delete them from it.
df_tmp = pd.read_excel(file, sheet_name='Tablet')
df_tmp.drop(index=df_tmp.index[:7], axis=0, inplace=True)
Since append is deprecated, use concat() instead. pandas.DataFrame.append
pd.concat([df, df_tmp], ignore_index=True)