How do I bring the filename into the data frame with read

I have a directory of Excel files that will keep growing with weekly snapshots of the same data fields. Each file has a date stamp appended to its name (e.g. "_2021_09_30").

I have figured out how to read all of the Excel files into a pandas DataFrame using the code below:

import os
import pandas as pd

cwd = os.path.abspath('NETWORK DRIVE DIRECTORY')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, pd.read_excel(cwd + "/" + file)], ignore_index=True)
df.head()

Since these files are snapshots of the same data fields, I want to be able to track how the underlying data changes from week to week. So I would like to add/include a column that has the filename so I can incorporate the date stamp in downstream analysis.

Any thoughts? Thank you in advance.

CodePudding user response：

You could add an additional column to the DataFrame before appending it.

Modifying the body of the loop in your code:

temp = pd.read_excel(cwd + "/" + file)
# the stamp "2021_09_30" is the 10 characters just before ".xlsx"
temp['date'] = file[-15:-5]
df = pd.concat([df, temp], ignore_index=True)
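
Slicing by position only works while every file name ends in exactly the same pattern. As a rough sketch, assuming the stamp always looks like the "_2021_09_30" example from the question and sits right before ".xlsx", a regular expression can pull it out and turn it into a real date. The helper name `snapshot_date` is made up for illustration:

```python
import re
from datetime import datetime

# Assumed stamp format, based on the "_2021_09_30" example in the question.
STAMP = re.compile(r"_(\d{4}_\d{2}_\d{2})\.xlsx$")


def snapshot_date(filename):
    """Return the date stamp at the end of the file name, or None if there is none."""
    match = STAMP.search(filename)
    if match is None:
        return None
    # strptime raises ValueError if the digits are not a valid calendar date
    return datetime.strptime(match.group(1), "%Y_%m_%d").date()
```

Inside the loop you could then write `temp['date'] = snapshot_date(file)` instead of slicing the string.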

CodePudding user response：

You can use glob to easily combine xlsx or csv files into one dataframe. You just have to copy-paste your files' absolute path to where it says "/xlsx_path". You can also change read_excel to read_csv if you have csv files.

import pandas as pd
import glob

all_files = glob.glob(r'/xlsx_path' + "/*.xlsx")
file_list = [pd.read_excel(f) for f in all_files]
all_df = pd.concat(file_list, axis=0, ignore_index=True)

Alternatively you can use the one-liner below:

all_df = pd.concat(map(pd.read_excel, glob.glob('/xlsx_path/*.xlsx')))
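
Neither version above keeps track of which file a row came from. A possible way to add that, sketched here with a hypothetical helper `combine_with_filename`, is to tag each frame with `DataFrame.assign` before concatenating. The `reader` parameter is my own addition so that `pd.read_csv` can be swapped in, as mentioned above:

```python
import glob
import os

import pandas as pd


def combine_with_filename(pattern, reader=pd.read_excel):
    """Read every file matching `pattern` and tag each row with its source file name."""
    frames = [
        # assign() returns a copy with the extra 'filename' column added
        reader(path).assign(filename=os.path.basename(path))
        for path in sorted(glob.glob(pattern))
    ]
    # pd.concat raises ValueError if no files matched the pattern
    return pd.concat(frames, ignore_index=True)


# Example call, using the placeholder path from this answer:
# all_df = combine_with_filename('/xlsx_path/*.xlsx')
```

Sorting the paths keeps the snapshots in file-name order, which is also date order as long as the stamp is year_month_day.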

CodePudding user response：

Welcome to StackOverflow! I agree with the comments that it's not exactly clear what you're looking for, so maybe clearing that up will help us be more helpful.

For example, with the filename "A_FILENAME_2020-01-23", do you want to use the name "A_FILENAME", or "A_FILENAME_2020-01-23"? Or are you not sure, because you're trying to think through how to track this downstream?

If the latter, this is how you would add a new column:

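for file in files:
    if file.endswith('.xlsx'):
        tmp = pd.read_excel(cwd + "/" + file)
        tmp['filename'] = file  # record which snapshot each row came from
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        df = pd.concat([df, tmp], ignore_index=True)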

This would let you filter the table by the start of the 'filename' column and compare the rows from each snapshot side by side. Unfortunately, it also means storing a LOT of data.

If you ONLY want to store differences, you could use the .drop_duplicates method to drop rows based on a unique value that tells you whether a row is new, modified, or deleted: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
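
To make that concrete, here is a small sketch with made-up data. The `id` and `status` columns and the file names are assumptions for illustration only. The idea is to compare rows on the data columns and ignore 'filename', since that differs in every snapshot:

```python
import pandas as pd

# Two hypothetical weekly snapshots, already tagged with their file names.
week1 = pd.DataFrame({"id": [1, 2], "status": ["open", "open"],
                      "filename": "data_2021_09_23.xlsx"})
week2 = pd.DataFrame({"id": [1, 2], "status": ["open", "closed"],
                      "filename": "data_2021_09_30.xlsx"})

combined = pd.concat([week1, week2], ignore_index=True)

# Compare on the data columns only, keeping the first snapshot a row appeared in.
data_cols = [c for c in combined.columns if c != "filename"]
changes = combined.drop_duplicates(subset=data_cols, keep="first")
```

Here `changes` keeps both rows from the first week plus the one row that changed in the second. One caveat: a value that changes and later reverts to an earlier state would be dropped when it reverts, because that exact row has been seen before.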

But if you don't have a unique identifier for rows, this becomes a very tough engineering problem. Do you have a unique identifier you can use for your diffing strategy?
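
If you do have one, a minimal sketch of the diff between two snapshots might look like the following. The key column `id` and the data column `status` are hypothetical; the only pandas features used are `DataFrame.merge` with `indicator=True` and `suffixes`:

```python
import pandas as pd

# Hypothetical snapshots sharing a unique 'id' column.
old = pd.DataFrame({"id": [1, 2, 3], "status": ["open", "open", "open"]})
new = pd.DataFrame({"id": [2, 3, 4], "status": ["open", "closed", "open"]})

# An outer merge keeps every id; the '_merge' column says which side it came from.
merged = old.merge(new, on="id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

added = merged.loc[merged["_merge"] == "right_only", "id"].tolist()
deleted = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

# Rows present in both snapshots count as modified when a data column differs.
both = merged[merged["_merge"] == "both"]
modified = both.loc[both["status_old"] != both["status_new"], "id"].tolist()
```

With more than one data column you would compare each `_old`/`_new` pair, but the classification into added, deleted, and modified stays the same.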

CodePudding user response：

I'm not sure exactly what you want, but on the subject of tracking changes: say you have two Excel files. You can find what changed between them as follows:

df1 = pd.read_excel("file-1.xlsx")
df1

values
0   aa
1   bb
2   cc
3   dd
4   ee

df2 = pd.read_excel("file-2.xlsx") 
df2

values
0   aa
1   bb
2   cc
3   ddd
4   e

...and generate a new DataFrame containing the rows that changed between the two files:

df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
new_df = df.groupby(list(df.columns))

diff = [x[0] for x in new_df.groups.values() if len(x) == 1]
# reset the index so the result is numbered from 0, as in the output below
df.reindex(diff).reset_index(drop=True)

Output :

    values
0   dd
1   ddd
2   e
3   ee
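
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement, is `drop_duplicates(keep=False)`, which discards every row that occurs more than once in the combined frame. The frames are built inline here in place of the two Excel files. Note that it would also discard rows that are duplicated within a single file:

```python
import pandas as pd

# Stand-ins for file-1.xlsx and file-2.xlsx from the example above.
df1 = pd.DataFrame({"values": ["aa", "bb", "cc", "dd", "ee"]})
df2 = pd.DataFrame({"values": ["aa", "bb", "cc", "ddd", "e"]})

changed = (
    pd.concat([df1, df2], ignore_index=True)
    .drop_duplicates(keep=False)   # keep only rows that appear exactly once
    .reset_index(drop=True)
)
```

The rows come back in order of appearance (first file, then second) instead of sorted by value.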