im new using python and for a work purpose im here to ask your help.
I have on a same folder 12 excel files month by month that contains columns like: Product_Name, Quantity and Total_Value
So, what i would like to do but I have no idea how to do are:
- Add a month column on each of those files that contain the same date from the file name
- Merge those excel files into a unique file
For example:
January-21.xls:
Product_Name (type:string) | Quantity (type:float) | Total_Value (type:float) | Month (type:Date) |
---|---|---|---|
Product A | 10 | 250 | "File Name" (January-21) |
Product B | 20 | 500 | "File Name" (January-21) |
Product C | 15 | 400 | "File Name" (January-21) |
February-21.xls:
Product_Name (type:string) | Quantity (type:float) | Total_Value (type:float) | Month (type:Date) |
---|---|---|---|
Product A | 40 | 800 | "File Name" (February-21) |
Product B | 25 | 700 | "File Name" (February-21) |
Product C | 30 | 500 | "File Name" (February-21) |
After merge:
Product_Name (type:string) | Quantity (type:float) | Total_Value (type:float) | Month (type:Date) |
---|---|---|---|
Product A | 10 | 250 | "File Name" (January-21) |
Product B | 20 | 500 | "File Name" (January-21) |
Product C | 15 | 400 | "File Name" (January-21) |
Product A | 40 | 800 | "File Name" (February-21) |
Product B | 25 | 700 | "File Name" (February-21) |
Product C | 30 | 500 | "File Name" (February-21) |
Is it possible? Sorry for my bad English, I'm not a native speaker.
I really appreciate your help!
Edit.1
This is how i merge, create a csv file and convert to a dataframe using pandas:
import pandas as pd
import os
path = "/content/drive/MyDrive/Colab_Notebooks/sq_datas"
files = [file for file in os.listdir(path) if not file.startswith('.')] # Ignore hidden files
all_months_data = pd.DataFrame()
for file in files:
current_data = pd.read_excel(path "/" file)
all_months_data = pd.concat([all_months_data, current_data])
all_months_data.to_csv("/content/drive/MyDrive/Colab_Notebooks/sq_datas/all_months.csv", index=False)
So, the main problem to me is create a loop for to add the month column before merge all those files in one.
CodePudding user response:
This is extremely similar to what I do on a daily basis at work. Here's how I would approach your problem:
from pathlib import Path
path = Path("/content/drive/MyDrive/Colab_Notebooks/sq_datas")
all_data = []
for file in path.glob("*.xls"):
# Parse the month from the file's name
# month will be something like "January" and "February"
# year will be something like "20" and "21"
# date will be something like pd.Timestamp("2021-01-01")
month, year = file.stem.split("-")
date = pd.Timestamp(f"{month} 1, 20{year}")
# Read data from the current file
current_data = pd.read_excel(file).assign(Month=date)
# Append the data to the list
all_data.append(current_data)
# Combine all data from the list into a single DataFrame
all_data = pd.concat(all_data)
CodePudding user response:
At a basic level, you first need to read your Excel files, such as with pandas.read_excel:
import pandas as pd
jan21_df = pd.read_excel('January-21.xls')
feb21_df = pd.read_excel('February-21.xls')
You wrote type:Date for the Month column. To add a date column to each dataframe:
jan21_df['Month'] = pd.to_datetime('2021-01-01')
feb21_df['Month'] = pd.to_datetime('2021-02-01')
But if you wanted the file name as string:
jan21_df['Month'] = "File Name (January-21)"
feb21_df['Month'] = "File Name (February-21)"
Then to combine the two dataframes:
combined = pd.concat([jan21_df, feb21_df])
This is a proof of concept. There are ways to automate this further based on the requirements.
EDIT: based on the edit in the OP, minor addition to the loop:
for file in files:
current_data = pd.read_excel(path "/" file)
current_data['Month'] = file
all_months_data = pd.concat([all_months_data, current_data])