I have a dataframe that looks like below:
------- ---------- ------------- -----------------------------------------------------
| item | category | subcategory | sales_count |
------- ---------- ------------- -----------------------------------------------------
| ItemA | 0 | p | {store1:50,store2:70,store3:90,store4:44,store5:76} |
| ItemB | 0 | p | {store2:22,store3:15,store4:77,store5:0} |
| ItemC | 0 | p | {store1:46,store2:13,store3:9,store4:87,store5:45} |
| ItemD | 0 | q | {store1:88,store2:16,store4:5,store5:2} |
| ItemE | 0 | q | {store1:7,store2:55} |
| ItemF | 1 | t | {store3:25,store4:75,store5:87} |
| ItemG | 1 | t | {store1:32,store3:66,store4:87,store5:0} |
| ItemH | 1 | t | {store1:54,store2:33,store3:12,store4:67,store5:8} |
------- ---------- ------------- -----------------------------------------------------
I would like to generate a new column with the median of sales count across category and subcategory.
i.e., 'median_across_group' value for itemA should be the median of all sales_count in category = 0 & subcategory = p.
How to achieve this groupby and median for dict elements?
------- ---------- ------------- ----------------------------------------------------- ---------------------------------------------
| item | category | subcategory | sales_count | median_across_group |
------- ---------- ------------- ----------------------------------------------------- ---------------------------------------------
| ItemA | 0 | p | {store1:50,store2:70,store3:90,store4:44,store5:76} | <median of category 0, subcategory p items> |
| ItemB | 0 | p | {store2:22,store3:15,store4:77,store5:0} | <median of category 0, subcategory p items> |
| ItemC | 0 | p | {store1:46,store2:13,store3:9,store4:87,store5:45} | <median of category 0, subcategory p items> |
| ItemD | 0 | q | {store1:88,store2:16,store4:5,store5:2} | <median of category 0, subcategory q items> |
| ItemE | 0 | q | {store1:7,store2:55} | <median of category 0, subcategory q items> |
| ItemF | 1 | t | {store3:25,store4:75,store5:87} | <median of category 1, subcategory t items> |
| ItemG | 1 | t | {store1:32,store3:66,store4:87,store5:0} | <median of category 1, subcategory t items> |
| ItemH | 1 | t | {store1:54,store2:33,store3:12,store4:67,store5:8} | <median of category 1, subcategory t items> |
------- ---------- ------------- ----------------------------------------------------- ---------------------------------------------
CodePudding user response:
You might want to try this out:
Code
import pandas as pd
df = pd.DataFrame({
'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH', ],
'category': [0, 0, 0, 0, 0, 1, 1, 1],
'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
'sales_count': [
{'store1':50,'store2':70,'store3':90,'store4':44,'store5':76},
{'store2':22,'store3':15,'store4':77,'store5':0},
{'store1':46,'store2':13,'store3':9,'store4':87,'store5':45},
{'store1':88,'store2':16,'store4':5,'store5':2},
{'store1':7,'store2':55},
{'store3':25,'store4':75,'store5':87},
{'store1':32,'store3':66,'store4':87,'store5':0},
{'store1':54,'store2':33,'store3':12,'store4':67,'store5':8}
]
})
median = {}
for idx, row in df.iterrows():
key_combo = str(row['category']) str(row['subcategory'])
values_list = list(row['sales_count'].values())
median[key_combo] = (
values_list # Add the list if key not present
if key_combo not in median else
median[key_combo] (values_list) # Append the new list if key present
)
for key, values in median.items():
median[key] = sorted(values)[len(values) // 2] # Calculate median and store in dict
def apply_median(x):
return median[str(x.category) str(x.subcategory)]
df['Median'] = df[['category', 'subcategory']].apply(apply_median, axis=1)
print(df)
Output
item category subcategory sales_count Median
0 ItemA 0 p {'store1': 50, 'store2': 70, 'store3': 90, 'st... 46
1 ItemB 0 p {'store2': 22, 'store3': 15, 'store4': 77, 'st... 46
2 ItemC 0 p {'store1': 46, 'store2': 13, 'store3': 9, 'sto... 46
3 ItemD 0 q {'store1': 88, 'store2': 16, 'store4': 5, 'sto... 16
4 ItemE 0 q {'store1': 7, 'store2': 55} 16
5 ItemF 1 t {'store3': 25, 'store4': 75, 'store5': 87} 54
6 ItemG 1 t {'store1': 32, 'store3': 66, 'store4': 87, 'st... 54
7 ItemH 1 t {'store1': 54, 'store2': 33, 'store3': 12, 'st... 54
CodePudding user response:
I found a simpler way,
def get_dict_median(x):
flat_list = [i for k in list(x) for i in k] #Flatten all lists into one single list
return(np.median(flat_list))
df['sales_count_list'] = df['sales_count'].apply(lambda x: list(x.values()))
df['group_median']=df.groupby(['category','subcategory'])['sales_count_list'].transform(get_dict_median)