Reading multiple excel files into a pandas dataframe, but also storing the file name-CodePudding

I would like to read multiple excel files and store them into a single pandas dataframe, but I would like one of the columns in the dataframe to be the file name. This is because the file name contains the date (this is monthly data) and I need that information. I can't seem to get the filename, but I'm able to get the excel files into a dataframe. Please help.

import os
import pandas as pd
import fsspec

files = os.listdir("C://Users//6J2754897//Downloads//monthlydata")
paths = "C://Users//6J2754897//Downloads//monthlydata"

a = pd.DataFrame([2], index = None)

df = pd.DataFrame()
for file in range(len(files)):
   if files[file].endswith('.xlsx'): 
      df = df.append(pd.read_excel(paths   "//"   files[file], sheet_name = "information", skiprows=7), ignore_index=True)
      df['Month'] = str(files[file])

CodePudding user response：

The order of operations here is incorrect. The line:

df['Month'] = str(files[file])

Is going to overwrite the entire column with the most recent value.

Instead we should only add the value to the current DataFrame:

import os

import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)

df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        # Read in File
        file_df = pd.read_excel(paths   "//"   files[file],
                                sheet_name="information",
                                skiprows=7)
        # Add to just this DataFrame
        file_df['Month'] = str(files[file])
        # Update `df`
        df = df.append(file_df, ignore_index=True)

Alternatively we can use DataFrame.assign to chain the column assignment:

import os

import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)

df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        # Read in File
        df = df.append(
            # Read in File
            pd.read_excel(paths   "//"   files[file],
                          sheet_name="information",
                          skiprows=7)
                .assign(Month=str(files[file])),  # Add to just this DataFrame
            ignore_index=True
        )

For general overall improvements we can use pd.concat with a list comprehension over files. This is done to avoid growing the DataFrame (which can be extremely slow). Pathlib.glob can also help with the ability to select the appropriate files:

from pathlib import Path

import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"

df = pd.concat([
    pd.read_excel(file,
                  sheet_name="information",
                  skiprows=7)
        .assign(Month=file.stem)  # We may also want file.name here
    for file in Path(paths).glob('*.xlsx')
])

Some options for the Month Column are either:

file.stem will give "[t]he final path component, without its suffix".
- 'folder/folder/sample.xlsx' -> 'sample'
file.name will give "the final path component, excluding the drive and root".
- 'folder/folder/sample.xlsx' -> 'sample.xlsx'