I have four directories from which I need to read all the Excel files, perform some transformations, and eventually build one dataframe. I have written the following code block to fetch all the files from the first directory and apply the first-level transformation. Now I need to do the same for my second directory; the easiest way would be to replicate the code I already have, change the directory path, and append the final dataframes from the two code blocks, but I am looking for a more efficient way to do that.
import glob
import os

import pandas as pd

os.chdir(r'C:\Users\A')
allFiles = glob.glob("*.xlsx")  # match the Excel files in this directory
li = []
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame_A = pd.concat(li, axis=0, ignore_index=True)
frame_A.to_excel(r'C:\Users\A\A.xlsx', index=None)
Now I want to read another set of Excel files from C:\Users\B, and the final frame from that location will be frame_B. Eventually I want to merge frame_A and frame_B. How can I achieve all of this in one for loop only? I just need the final dataframe names to stay unique, as shown above: frame_A and frame_B.
CodePudding user response:
Consider list comprehensions to stack the data within each directory, and a defined method to generalize the process, then merge the resulting data frames across directories:
paths = [r"C:\Users\A", r"C:\Users\B"]
def stack_data(path):
allFiles = glob.glob(f"{path}\*.xlsx")
dfs = [
pd.read_excel(file).assign(
source=os.path.basename(file), dir=os.path.basename(path)
) for file in allFiles
]
# VERTICAL APPEND
return pd.concat(dfs, ignore_index=True)
# HORIZONTAL MERGE
master_data = pd.concat([stack_data(p) for p in paths], axis=1)
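If you also want the per-directory frames under distinct names (frame_A, frame_B) and to write each one back out as in the question, a dictionary keyed by the folder name is one minimal sketch (the output file paths below are assumptions):
frames = {os.path.basename(p): stack_data(p) for p in paths}

frame_A = frames["A"]
frame_B = frames["B"]

# optionally write each stacked frame to its own workbook (paths are assumptions)
for name, frame in frames.items():
    frame.to_excel(rf"C:\Users\{name}\{name}.xlsx", index=False)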
If no further steps are needed inside a defined method, you can handle it all in a nested one-liner:
paths = [r"C:\Users\A", r"C:\Users\B"]
master_data = pd.concat( # OUTER HORIZONTAL MERGE
[
pd.concat( # INNER VERTICAL APPEND
[pd.read_excel(file).assign(
source=os.path.basename(file), dir=os.path.basename(path)
) for file in glob.glob(f"{p}\*.xlsx")],
ignore_index=True
) for p in paths
],
axis=1
)
CodePudding user response:
This will loop through all Excel files in a folder and concatenate what it finds in the first 20 rows and columns A:Z into a single dataframe. Just extend the ranges to what you actually need.
# import necessary libraries
import glob
import os

import pandas as pd

# use glob to get all the Excel files in the folder
df_final = pd.DataFrame()
i = 0
for f in glob.glob('C:/Users/ryans/Desktop/*.xls'):
    # read only the first 20 rows and columns A:Z of each workbook
    df = pd.read_excel(f, header=0, skiprows=0, nrows=20, usecols="A:Z")
    # keep track of which file each row came from
    df["filename"] = f
    # print the location and file name
    # print('Location:', f)
    # print('File Name:', f.split("\\")[-1])
    # print the content
    print(df)
    i = i + 1
    print(i)
    df_final = pd.concat([df_final, df], ignore_index=True)

print(df_final)
You can try to find the range dynamically. I found this code online. I didn't test it.
import openpyxl
wb = openpyxl.load_workbook(path1, data_only=True)
wb.save(save_path)
wb.close()
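If the goal is to detect the populated range instead of hard-coding nrows and usecols, a minimal sketch using openpyxl's max_row/max_column could look like the following (untested here, and the workbook path is an assumption):
import openpyxl
import pandas as pd

# hypothetical path to one of the workbooks
path1 = r'C:/Users/ryans/Desktop/example.xlsx'

wb = openpyxl.load_workbook(path1, data_only=True)
ws = wb.active

# openpyxl tracks the last populated row and column of the sheet
n_rows = ws.max_row      # includes the header row, so this is an upper bound
n_cols = ws.max_column
wb.close()

# read only the populated range with pandas
df = pd.read_excel(path1, header=0, nrows=n_rows, usecols=list(range(n_cols)))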
This will loop through folders and subfolders and list all files that end in 'xlsx'.
import os

import pandas as pd

paths = []
for root, dirs, files in os.walk("C:/Users/ryans/Desktop/"):
    for file in files:
        if file.endswith(".xlsx"):
            # print(os.path.join(root, file))
            s = os.path.join(root, file)
            print(s)
            paths.append(s)

all_data = pd.DataFrame()
for f in paths:
    df = pd.read_excel(f)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    all_data = pd.concat([all_data, df], ignore_index=True)
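Concatenating inside the loop copies all_data on every iteration; collecting the frames in a list and calling pd.concat once is the usual faster pattern (a minimal sketch, reusing the paths list built above):
# read every workbook once, then concatenate in a single call
frames = [pd.read_excel(f) for f in paths]
all_data = pd.concat(frames, ignore_index=True)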