Home > OS >  grouping data from files into separate dataframes
grouping data from files into separate dataframes

Time:11-16

i am new to coding and trying to apply what I've learned to my work.

I have 7 sales record(by month) files in csv format, which i read by pd.read_csv().

Below is an example of what each file looks like.

Account A Account B Account C
product 1 1 2 3
product 2 1 2 3
product 3 1 2 3
product 4 1 2 3

What i am trying to achieve is to make separate dataframes for each product, with months as index and Account names as columns. Any suggestion on how can i perform this task?

Some people say that using list of list and appending data row by row is better than creating multiple empty data frames and appending data in them. so i created multiple lists for appending data in with the simple code below:

product_name = ['product 1', 'product 2', 'product 3', 'product 4']
for name in product_name:
    name = []

What should be the next steps in drawing data from the original sales record file and how can i append them to the empty list?

Afterward, should i use the code below to transform each list into dataframe?

months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
account = ['Account A', 'Account B', 'Account C']
df = pd.DataFrame (name, columns = account, index = months)

And finally, may i know how can i save the final separated dataframes into separated files?

Thx!

CodePudding user response:

If you're using a DataFrame, you should be using the pandas module, correct? The DataFrame object is not the best choice for what you're trying to do. In the pandas module, another object class that would work for you is the Series, which is very similar to a DataFrame in that it has headers, but is only 1 row of a DataFrame. You can merge Series objects into DataFrames later if they meet certain conditions. If you are set on using DataFrames due to wanting to include the Month as the y axis, then I would recommend naming the Series upon initialization as the Product.

Also, if I'm not mistaken, DataFrames can be initialized with dictionaries, so if you wanted to use a dictionary instead of a list, that may be more useful to you.

Sorry if this wasn't extremely helpful, I'm learning pandas myself at the moment and can't give concrete examples from memory yet. Good luck!

CodePudding user response:

Assuming there are months listed somewhere in the filename you could use a regular expression and a loop to do this. You'd want to read a file to a dataframe, read the month that's in the filename, extract the month and add it to the dataframe, then append each finished dataframe.

Assuming your filenames are something like "Product_Sales_May_2021":

import glob
import re
import pandas as pd
# you can change the path expression to match your file convention for only the files you want
path = r"C:\Users\you\Documents\Spreadsheets\*.csv"
filename = glob.glob(path)

# create a blank data frame to append each frame you read in onto
df0 = pd.DataFrame()

for file in filename:
    df = pd.read_csv(file)
    # index 2 of the split object is the month name given the above naming convention
    month_name = re.split('_', file)[2]
    df['Month'] = month_name
    df0 = pd.concat([df0,df])

Now you will have a single data frame with all products and their amounts per account by month. Assuming you let pandas set indexes from zero by default instead of forcing the product type to be the index you can just use groupby and sort in one statement to write the sorted frame to a new one and go from there, though you will need to add a months list and categorical sort since you're only using months and not a datetime object.

If you have trouble with it you can comment below, but leaving that part for you to write as you'll learn a lot more figuring it out than just pasting it in.

If you need to split the grouped & sorted frame into separate frames for each product you can use this method.

If you want to set the indexes to the month you can also do so using this.

  • Related