I am given the following data frame with columns "total_bill," "tip," "sex," "smoker," "day," "time," and "size." Row values for "smoker" can be "Yes" or "No". Row values for "time" can be "Lunch" or "Dinner".
Given data frame:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sunday Dinner 2
1 10.34 1.66 Male No Sunday Dinner 3
2 21.01 3.50 Male No Sunday Dinner 3
3 23.68 3.31 Male No Sunday Dinner 2
4 24.59 3.61 Female No Sunday Dinner 4
I need to create the following:
- Data frame for size (number) of dinners and lunches
- Data frame for the amount of smokers by time (Lunch: Amount of smokers, Dinner: Amount of smokers)
- Combine the two data frames above
Expected output: "?" represents numerical value derived
Records | Smokers | |
---|---|---|
Lunch | ? | ? |
Dinner | ? | ? |
My code:
# Number of Time (Lunch/Dinner) records
df1 = tips_df.groupby('time')['size'].sum()
def myfunction(x):
if x == 'Yes':
return 1
else:
return 0
#Number of Smoker records by Time (Lunch/Dinner)
mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
mydata = mydata.astype({'Smoker_numerical': 'int32'})
mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
result = concat([df1, mydata2], axis=1)
result
My code to determine the amount of smokers by time is outputting the following error message.
KeyError Traceback (most recent call last)
/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_13833/1039123812.py in <module>
8
9 mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
---> 10 mydata = mydata.astype({'Smoker_numerical': 'int32'})
11 mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
12
~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5876 if self.ndim == 1: # i.e. Series
5877 if len(dtype) > 1 or self.name not in dtype:
-> 5878 raise KeyError(
5879 "Only the Series name can be used for "
5880 "the key in Series dtype mappings."
KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'
Is there any way to fix this, or another way to determine the amount of smokers by time? Thanks.
CodePudding user response:
IIUC, use agg
:
result = df.assign(smoker=df['smoker'] == 'Yes').groupby('time', as_index=False) \
.agg(Records=('size', 'sum'), Smokers=('smoker', 'sum'))
print(result)
# Output
time Records Smokers
0 Dinner 14 0
Update
How do I get the smokers by time?
>>> df.groupby('smoker', as_index=False)['time'].count()
smoker time
0 No 5