Home > Blockchain >  How to convert excel file into pandas data frame
How to convert excel file into pandas data frame

Time:01-15

I have 6 excel files each containing 1 to 3 spread sheets. I want to upload them into pandas data frame and then do some data cleaning on them. My question is that how I can convert these excel files into one data frame in pandas?

I tried to open one of the excel files that has multiple sheets into data frame in pandas but couldn’t do that?

CodePudding user response:

You have to use the pd.read_excel. See the answer to your question here:

Using Pandas to pd.read_excel() for multiple worksheets of the same workbook

You can also take a look at the original source:

pandas.read_excel

Also I have an excel file, the first sheet has 17,000 rows and the second sheet has 3,000 rows and the columns of both sheets are the same. The name of the first sheet is "1" and the name of the second sheet is "2". Here I showed how to read each separately and together.

import pandas as pd

df1 = pd.read_excel('file.xlsx',sheet_name = '1')
df2 = pd.read_excel('file.xlsx',sheet_name = '2')

df3 = pd.read_excel('file.xlsx',sheet_name = ['1','2'])
df4 = pd.concat(df3, axis=0, ignore_index=True)

print('df1:',len(df1))
print('df2:',len(df2))
print('df3:',len(df3))
print('df4:',len(df4))

Output:

df1: 17000
df2: 3000
df3: 2
df4: 20000

CodePudding user response:

Using sheet_name=None argument returns a dictionary, which can be iterated using a dictionary comprehension to create a mapping of file names as keys and dataframes as values. Finally run a list comprehension on the mapping values and concat all dfs together.

from pathlib import Path

import pandas as pd


def concat_all_sheets_in_all_files(path: str) -> pd.DataFrame:
    files = list(Path(path).rglob("*.xlsx"))
    dfs_mapping = {Path(file).stem: pd.concat(pd.read_excel(file, sheet_name=None)) for file in files}

    return pd.concat([df for _, df in dfs_mapping.items()]).reset_index(drop=True)


file_path = "path/to/all/excel/files/"

final_df = concat_all_sheets_in_all_files(path=file_path)
print(final_df)
  • Related