Python Create data frame from existing data frame-CodePudding

I am given the following data frame with columns "total_bill," "tip," "sex," "smoker," "day," "time," and "size." Row values for "smoker" can be "Yes" or "No". Row values for "time" can be "Lunch" or "Dinner".

Given data frame:

total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sunday Dinner 2
1 10.34 1.66 Male No Sunday Dinner 3
2 21.01 3.50 Male No Sunday Dinner 3
3 23.68 3.31 Male No Sunday Dinner 2
4 24.59 3.61 Female No Sunday Dinner 4

I need to create the following:

Data frame for size (number) of dinners and lunches
Data frame for the amount of smokers by time (Lunch: Amount of smokers, Dinner: Amount of smokers)
Combine the two data frames above

Expected output: "?" represents numerical value derived

	Records	Smokers
Lunch	?	?
Dinner	?	?

My code:

# Number of Time (Lunch/Dinner) records
df1 = tips_df.groupby('time')['size'].sum()

def myfunction(x):
    if x == 'Yes':
        return 1
    else:
        return 0

#Number of Smoker records by Time (Lunch/Dinner)
mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
mydata = mydata.astype({'Smoker_numerical': 'int32'})
mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()

result = concat([df1, mydata2], axis=1)
result

My code to determine the amount of smokers by time is outputting the following error message.

KeyError                                  Traceback (most recent call last)
/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_13833/1039123812.py in <module>
      8 
      9 mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
---> 10 mydata = mydata.astype({'Smoker_numerical': 'int32'})
     11 mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
     12 

~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5876             if self.ndim == 1:  # i.e. Series
   5877                 if len(dtype) > 1 or self.name not in dtype:
-> 5878                     raise KeyError(
   5879                         "Only the Series name can be used for "
   5880                         "the key in Series dtype mappings."

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

Is there any way to fix this, or another way to determine the amount of smokers by time? Thanks.

CodePudding user response：

IIUC, use agg:

result = df.assign(smoker=df['smoker'] == 'Yes').groupby('time', as_index=False) \
           .agg(Records=('size', 'sum'), Smokers=('smoker', 'sum'))
print(result)

# Output
     time  Records  Smokers
0  Dinner       14        0

Update

How do I get the smokers by time?

>>> df.groupby('smoker', as_index=False)['time'].count() 
  smoker  time
0     No     5