Home > Mobile >  Data Extraction from multiple excel files in pandas dataframe
Data Extraction from multiple excel files in pandas dataframe

Time:09-13

I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in the pandas data frame. The structuring of the tabs in each of the excel files is the same. Each tab of the excel file should be a separate data frame. As of now, I have created a list of data frames for each excel file that holds all the data from all the tabs as it is concatenated. But, I'm trying to find a way to access each excel from a data structure and each tab of that excel file as a separate data frame. Below mentioned is the current code. Any improvisation would be appreciated!! Please let me know if anything else is needed.

#Assigning the path to the folder variable
folder = 'specified_path'

#Getting the list of files from the assigned path
excel_files = [file for file in os.listdir(folder)]

list_of_dfs = []
for file in excel_files :
    df = pd.concat(pd.read_excel(folder   "\\"   file, sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)

CodePudding user response:

I would propose to change the line

    df = pd.concat(pd.read_excel(folder   "\\"   file, sheet_name=None), ignore_index=True)

to

    df = pd.concat(pd.read_excel(folder   "\\"   file, sheet_name=None))
    df.index = df.index.get_level_values(0)
    df.reset_index().rename({'index':'Tab'}, axis=1)

CodePudding user response:

To create a separate dataframe for each tab (with duplicated content) in an Excel file, one could iterate over index level 0 values and index with it:

df = pd.concat(pd.read_excel(filename, sheet_name=None))
list_of_dfs = []
for tab in df.index.get_level_values(0).unique():
    tab_df = df.loc[tab]
    list_of_dfs.append(tab_df)

For illustration, here is the dataframe content after reading an Excel file with 3 tabs: full dataframe

After running the above code, here is the content of list_of_dfs:

[        Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4,
         Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4,
         Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4]
  • Related