I have 21 list pairs (date, number of items), there are 21 types of items. I would like to add all of this data to a pandas dataframe with 23 columns (the date, number of item a, number item b ,...,number of item u, total items). in some cases a day will only have one type of item, on other days there could be item a, b, and f for example.
My though was to create a blank dataframe, then append each list with the date in the first column and the "item number" in a new column for each item then somehow sort the dataframe to match the days. for example:
df=pd.DataFrame(columns='date','itemA','itemB','itemC','itemD','itemE','itemF','itemG','itemH','itemI','itemJ','itemK','itemL','itemM','itemN','itemO','itemP','itemQ','itemR','itemS','itemT','itemU','total')
For instance day jan 1 2020 might have 20 of item a 40 of item c and 5 of item m. I imagine that when first appended this data would be on 3 separate rows with data in column a and b, column a and d, column a and n. would there be a way for the pandas dataframe to recognize that the date in column a for all 3 rows are the same and consolidate the data so that it was on one row with data in column a and b and d and n?
Lastly how could I create the last column of total items/day (columns b-v) into a final column?
CodePudding user response:
If I understand correctly then you're looking for something like this.
if you managed to put your data into a list of dictionaries, as each dictionary represents a day, you can do it as follows.
Note:
The dictionaries don't have to contain all the elements only the one existing in that day.
The items inside the dictionary don't have to be ordered.
import pandas as pd
df_list = [
{'date':'06/06/2022', 'itemA':1, 'itemB':1, 'itemC':1,},
{'date':'06/06/2022', 'itemA':1, 'itemE':5, 'itemF':1,},
{'date':'06/06/2022', 'itemB':2, 'itemF':1, 'itemC':1,},
{'date':'06/06/2022', 'itemD':1, 'itemB':1, 'itemE':1,},
{'date':'06/06/2022', 'itemF':1, 'itemB':4, 'itemD':1,},
{'date':'06/06/2022', 'itemE':6, 'itemC':1, 'itemD':1,},
]
# we sort our columns with the sequence of this list
sorted_columns = ['date','itemA','itemB','itemC','itemD','itemE','itemF']
# we create a dataframe from the list of dictionaries
# we fill the empty items with zeros
df = pd.DataFrame(df_list, columns=sorted_columns).fillna(0)
# convert to integers
df[sorted_columns[1:]] = df[sorted_columns[1:]].applymap(int)
# we make a new column 'Total' that summs all the items in each day
# NOTE: the [1:] is to ignore the first column which has the date
df['Total'] = df.apply(lambda row: sum(row[1:]), axis=1)
output:
date | itemA | itemB | itemC | itemD | itemE | itemF | Total |
---|---|---|---|---|---|---|---|
06/06/2022 | 1 | 1 | 1 | 0 | 0 | 0 | 3 |
06/06/2022 | 1 | 0 | 0 | 0 | 5 | 1 | 7 |
06/06/2022 | 0 | 2 | 1 | 0 | 0 | 1 | 4 |
06/06/2022 | 0 | 1 | 0 | 1 | 1 | 0 | 3 |
06/06/2022 | 0 | 4 | 0 | 1 | 0 | 1 | 6 |
06/06/2022 | 0 | 0 | 1 | 1 | 6 | 0 | 8 |
if you find this answer helpful, kindly accept it.
CodePudding user response:
import pandas as pd
# input data according to this comment
# https://stackoverflow.com/questions/72520487/#comment128113673_72520940
itemAdates = ['1/1/20', '1/2/20', '1/3/20', '1/4/20']
itemAcounts = [4, 10, 3, 6]
itemBdates = ['1/1/20', '1/3/20', '1/4/20']
itemBcounts = [9, 5, 6]
itemCdates = ['1/2/20', '1/3/20', '1/4/20']
itemCcounts = [2, 6, 7]
# parsing the data into 1 big list of (date, item_name, item_count)
data = [
*[(date, 'itemA', item_count) for date, item_count in zip(itemAdates, itemAcounts)],
*[(date, 'itemB', item_count) for date, item_count in zip(itemBdates, itemBcounts)],
*[(date, 'itemC', item_count) for date, item_count in zip(itemCdates, itemCcounts)],
]
# parsing the big list into a dictionary with
# new_data = {date:[('date', date), (item_name, item_count), (item_name, item_count), ...]}
new_data = {}
for date, item_name, item_count in data:
new_data[date] = new_data.get(date, [('date', date)]) [(item_name, item_count)]
# converting the list of tuples into dict and appending it into the df_list
df_list = []
for date_values in new_data.values():
df_list.append(dict(date_values))
# we sort our columns with the sequence of this list
# NOTE: the date must be in the first position
sorted_columns = ['date','itemA','itemB','itemC']
# we create a dataframe from the list of dictionaries
# we fill the empty items with zeros
df = pd.DataFrame(df_list, columns=sorted_columns).fillna(0)
# convert to integers
df[sorted_columns[1:]] = df[sorted_columns[1:]].applymap(int)
# we make a new column 'Total' that summs all the items in each day
# NOTE: the [1:] is to ignore the first column which has the date
df['Total'] = df.apply(lambda row: sum(row[1:]), axis=1)
output:
date | itemA | itemB | itemC | Total |
---|---|---|---|---|
1/1/20 | 4 | 9 | 0 | 13 |
1/2/20 | 10 | 0 | 2 | 12 |
1/3/20 | 3 | 5 | 6 | 14 |
1/4/20 | 6 | 6 | 7 | 19 |