Pandas concat dataframe per excel file-CodePudding

I have a code that read multiple files inside the directory and every excel file have more than 10 sheet's. After that I need to exclude some sheet's every file's and the others extracted.

I got all data needed, but the problem is every sheet's from the excel created new Dataframe even I used concat so when I save it to json only the last dataframe per file saved instead of whole data.

Here's my code:

excluded_sheet = ['Sheet 2','Sheet 6']
for index, xls_path in enumerate(file_paths):                                                         
      data_file = pd.ExcelFile(xls_path)
      sheets = [ sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet ]         
      for sheet_name in sheets: 
            file = xls_path.rfind(".")
            head, tail = os.path.split(xls_path[1:file])
            df =pd.concat([pd.read_excel(xls_path, sheet_name=sheet_name, header=None)],ignore_index=True)
            df.insert(loc=0, column='sheet name', value=sheet_name)

            pd.DataFrame(df.to_json(f"{json_folder_path}{tail}.json", orient='records',indent=4))

I didn't used sheet_name=None because I need to read sheet name and add to column values.

Data status of my dataFrame:

I got many DF because every sheet create new DF, instead of 2 DF only since I have 2 files inside the directory. Thanks guys for your help.

CodePudding user response：

You can use list comprehension for join all sheetnames to one DataFrame:

...
...
sheets = [ sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet ]

file = xls_path.rfind(".")
head, tail = os.path.split(xls_path[1:file])
dfs = [pd.read_excel(xls_path,sheet_name=sheet_name,header=None) for sheet_name in sheets]
df =pd.concat(dfs,keys=sheets)
df = df.reset_index(level=0, drop=True).rename_axis('sheet name').reset_index()

pd.DataFrame(df.to_json(f"{json_folder_path}{tail}.json", orient='records',indent=4))

Or create helper list dfs with append DataFrames per loop, outside loop use concat:

...
...
sheets = [ sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet ]  

dfs = []
for sheet_name in sheets: 
    file = xls_path.rfind(".")
    head, tail = os.path.split(xls_path[1:file])
    df = pd.read_excel(xls_path, sheet_name=sheet_name, header=None)
    df.insert(loc=0, column='sheet name', value=sheet_name)
    dfs.append(df)

df1 = pd.concat(dfs,ignore_index=True)
pd.DataFrame(df1.to_json(f"{json_folder_path}{tail}.json", orient='records',indent=4))