I have four directories from which I need to read all the Excel files, perform some transformations, and eventually build one dataframe. I have written the following code block to fetch all the files from the first directory and apply the first-level transformation. Now I need to do the same for my second directory; the easiest way would be to replicate the code I already have, change the directory path, and append the final dataframes from the two code blocks, but I am looking for a more efficient way to do that.
import glob
import os

import pandas as pd

os.chdir(r'C:\Users\A')
allFiles = glob.glob("*.xlsx")  # match the Excel files in this directory
li = []
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame_A = pd.concat(li, axis=0, ignore_index=True)
frame_A.to_excel(r'C:\Users\A\A.xlsx', index=None)
Now I want to read another set of Excel files from C:\Users\B, and the final frame from that location will be frame_B. Eventually I want to merge frame_A and frame_B. How can I achieve all of this in one for loop only? I just need the final dataframe names to stay unique, as shown above: frame_A and frame_B.
CodePudding user response:
Consider list comprehensions to stack the data within each directory, and a defined method to generalize the process, then merge the resulting data frames across directories:
paths = [r"C:\Users\A", r"C:\Users\B"]
def stack_data(path):
allFiles = glob.glob(f"{path}\*.xlsx")
dfs = [
pd.read_excel(file).assign(
source=os.path.basename(file), dir=os.path.basename(path)
) for file in allFiles
]
# VERTICAL APPEND
return pd.concat(dfs, ignore_index=True)
# HORIZONTAL MERGE
master_data = pd.concat([stack_data(p) for p in paths], axis=1)
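If you also want the per-directory frames under distinct names (frame_A, frame_B) and to write each one back out as in the question, a dictionary keyed by the folder name is one minimal sketch (the output file paths below are assumptions):
frames = {os.path.basename(p): stack_data(p) for p in paths}

frame_A = frames["A"]
frame_B = frames["B"]

# optionally write each stacked frame to its own workbook (paths are assumptions)
for name, frame in frames.items():
    frame.to_excel(rf"C:\Users\{name}\{name}.xlsx", index=False)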
If no further steps are needed inside a defined method, you can handle it all in a nested one-liner:
paths = [r"C:\Users\A", r"C:\Users\B"]
master_data = pd.concat( # OUTER HORIZONTAL MERGE
[
pd.concat( # INNER VERTICAL APPEND
[pd.read_excel(file).assign(
source=os.path.basename(file), dir=os.path.basename(path)
) for file in glob.glob(f"{p}\*.xlsx")],
ignore_index=True
) for p in paths
],
axis=1
)
CodePudding user response:
This will loop through all Excel files in a folder and concatenate what it finds in the first 20 rows and columns A:Z into a single dataframe. Just extend the ranges to what you actually need.
# import necessary libraries
import glob
import os

import pandas as pd

# use glob to get all the Excel files in the folder
df_final = pd.DataFrame()
i = 0
for f in glob.glob('C:/Users/ryans/Desktop/*.xls'):
    # read only the first 20 rows and columns A:Z of each workbook
    df = pd.read_excel(f, header=0, skiprows=0, nrows=20, usecols="A:Z")
    # keep track of which file each row came from
    df["filename"] = f
    # print the location and file name
    # print('Location:', f)
    # print('File Name:', f.split("\\")[-1])
    # print the content
    print(df)
    i = i + 1
    print(i)
    df_final = pd.concat([df_final, df], ignore_index=True)

print(df_final)
You can try to find the range dynamically. I found this code online. I didn't test it.
import openpyxl
wb = openpyxl.load_workbook(path1, data_only=True)
wb.save(save_path)
wb.close()
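If the goal is to detect the populated range instead of hard-coding nrows and usecols, a minimal sketch using openpyxl's max_row/max_column could look like the following (untested here, and the workbook path is an assumption):
import openpyxl
import pandas as pd

# hypothetical path to one of the workbooks
path1 = r'C:/Users/ryans/Desktop/example.xlsx'

wb = openpyxl.load_workbook(path1, data_only=True)
ws = wb.active

# openpyxl tracks the last populated row and column of the sheet
n_rows = ws.max_row      # includes the header row, so this is an upper bound
n_cols = ws.max_column
wb.close()

# read only the populated range with pandas
df = pd.read_excel(path1, header=0, nrows=n_rows, usecols=list(range(n_cols)))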
This will loop through folders and subfolders and list all files that end in 'xlsx'.
import os

import pandas as pd

paths = []
for root, dirs, files in os.walk("C:/Users/ryans/Desktop/"):
    for file in files:
        if file.endswith(".xlsx"):
            # print(os.path.join(root, file))
            s = os.path.join(root, file)
            print(s)
            paths.append(s)

all_data = pd.DataFrame()
for f in paths:
    df = pd.read_excel(f)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    all_data = pd.concat([all_data, df], ignore_index=True)
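Concatenating inside the loop copies all_data on every iteration; collecting the frames in a list and calling pd.concat once is the usual faster pattern (a minimal sketch, reusing the paths list built above):
# read every workbook once, then concatenate in a single call
frames = [pd.read_excel(f) for f in paths]
all_data = pd.concat(frames, ignore_index=True)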