Add a month column to each excel file and then merge all files into a .csv-CodePudding

im new using python and for a work purpose im here to ask your help.

I have on a same folder 12 excel files month by month that contains columns like: Product_Name, Quantity and Total_Value

So, what i would like to do but I have no idea how to do are:

Add a month column on each of those files that contain the same date from the file name
Merge those excel files into a unique file

For example:

January-21.xls:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	10	250	"File Name" (January-21)
Product B	20	500	"File Name" (January-21)
Product C	15	400	"File Name" (January-21)

February-21.xls:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	40	800	"File Name" (February-21)
Product B	25	700	"File Name" (February-21)
Product C	30	500	"File Name" (February-21)

After merge:

Product_Name (type:string)	Quantity (type:float)	Total_Value (type:float)	Month (type:Date)
Product A	10	250	"File Name" (January-21)
Product B	20	500	"File Name" (January-21)
Product C	15	400	"File Name" (January-21)
Product A	40	800	"File Name" (February-21)
Product B	25	700	"File Name" (February-21)
Product C	30	500	"File Name" (February-21)

Is it possible? Sorry for my bad English, I'm not a native speaker.

I really appreciate your help!

Edit.1

This is how i merge, create a csv file and convert to a dataframe using pandas:


import pandas as pd
import os

path = "/content/drive/MyDrive/Colab_Notebooks/sq_datas"
files = [file for file in os.listdir(path) if not file.startswith('.')] # Ignore hidden files

all_months_data = pd.DataFrame()

for file in files:
    current_data = pd.read_excel(path "/" file)
    all_months_data = pd.concat([all_months_data, current_data])
    
all_months_data.to_csv("/content/drive/MyDrive/Colab_Notebooks/sq_datas/all_months.csv", index=False)

So, the main problem to me is create a loop for to add the month column before merge all those files in one.

CodePudding user response：

This is extremely similar to what I do on a daily basis at work. Here's how I would approach your problem:

from pathlib import Path

path = Path("/content/drive/MyDrive/Colab_Notebooks/sq_datas")
all_data = []

for file in path.glob("*.xls"):
    # Parse the month from the file's name
    # month will be something like "January" and "February"
    # year will be something like "20" and "21"
    # date will be something like pd.Timestamp("2021-01-01")
    month, year = file.stem.split("-")
    date = pd.Timestamp(f"{month} 1, 20{year}")
    
    # Read data from the current file
    current_data = pd.read_excel(file).assign(Month=date)

    # Append the data to the list
    all_data.append(current_data)

# Combine all data from the list into a single DataFrame
all_data = pd.concat(all_data)

CodePudding user response：

At a basic level, you first need to read your Excel files, such as with pandas.read_excel:

import pandas as pd

jan21_df = pd.read_excel('January-21.xls')
feb21_df = pd.read_excel('February-21.xls')

You wrote type:Date for the Month column. To add a date column to each dataframe:

jan21_df['Month'] = pd.to_datetime('2021-01-01')
feb21_df['Month'] = pd.to_datetime('2021-02-01')

But if you wanted the file name as string:

jan21_df['Month'] = "File Name (January-21)"
feb21_df['Month'] = "File Name (February-21)"

Then to combine the two dataframes:

combined = pd.concat([jan21_df, feb21_df])

This is a proof of concept. There are ways to automate this further based on the requirements.

EDIT: based on the edit in the OP, minor addition to the loop:

for file in files:
    current_data = pd.read_excel(path "/" file)
    current_data['Month'] = file
    all_months_data = pd.concat([all_months_data, current_data])