Dear pandas DataFrame experts,
I have been using pandas DataFrames to help with re-writing the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).
I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.
An example dataframe might contain the following data:
study_name  request_name  total_dlp  x_ray_system_name
head        head               50.0  All systems
head        head              100.0  All systems
head        NaN               200.0  All systems
blank       NaN                75.0  All systems
blank       NaN               125.0  All systems
blank       head              400.0  All systems
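For anyone wanting to follow along, here is a minimal sketch that reconstructs the example DataFrame (column names and values copied from the table above):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data shown above; NaN marks the missing
# request_name entries.
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})
```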
The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:
df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})
with the following result:
                              total_dlp
                                  count        mean
x_ray_system_name study_name
All systems       blank               3  200.000000
                  head                3  116.666667
I now need to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries plus the single request_name "head" entry.
I would like the results to look something like this:
                        total_dlp
                            count        mean
x_ray_system_name name
All systems       blank         3  200.000000
                  head          4  187.500000
Does anyone know how I can carry out a groupby based on categories in one field or another?
Any help you can offer will be very much appreciated.
Kind regards,
David
CodePudding user response:
Your grouped data is essentially the union of:
- rows where study_name == request_name, kept once
- rows where study_name != request_name, duplicated: one copy for study_name and one for request_name
We can duplicate the data with melt:
(pd.concat([df.query('study_name==request_name') # equal part
.drop('request_name', axis=1), # remove so `melt` doesn't duplicate this data
df.query('study_name!=request_name')]) # not equal part
.melt(['x_ray_system_name','total_dlp']) # melt to duplicate
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Update: editing the above code made me realize we can simplify to:
# mask `request_name` with `NaN` where they equal `study_name`
# so they are ignored when duplicate/mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
.melt(['x_ray_system_name','total_dlp'])
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Output:
x_ray_system_name value
All systems blank 200.0
head 187.5
Name: total_dlp, dtype: float64
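As a self-contained check (the DataFrame construction is rebuilt from the question's sample data), the simplified pipeline reproduces that output:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Blank out request_name where it only repeats study_name, so melt
# doesn't count that row's total_dlp twice; then stack both name
# columns into one "value" column and take the mean per name.
result = (
    df.assign(request_name=df.request_name.mask(df.study_name == df.request_name))
      .melt(["x_ray_system_name", "total_dlp"])
      .groupby(["x_ray_system_name", "value"])["total_dlp"]
      .mean()
)
print(result)  # blank -> 200.0, head -> 187.5
```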
CodePudding user response:
I have a similar approach to that of @QuangHoang, but with a different order of operations. I use the original (range) index to choose which duplicated rows to drop. You can melt, drop_duplicates, dropna, and then groupby:
(df.reset_index()
.melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
.drop_duplicates(['index', 'value'])
.dropna(subset=['value'])
.groupby(["x_ray_system_name", 'value'])
.agg({"total_dlp": ["count", "mean"]})
)
output:
total_dlp
count mean
x_ray_system_name value
All systems blank 3 200.0
head 4 187.5
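For completeness, here is the same approach as a runnable sketch (the sample DataFrame is rebuilt from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

out = (
    df.reset_index()
      # melt makes one row per (original row, name column) pair
      .melt(id_vars=["index", "total_dlp", "x_ray_system_name"])
      # when study_name == request_name, keep only one of the two copies
      .drop_duplicates(["index", "value"])
      # discard the copies whose name entry was NaN
      .dropna(subset=["value"])
      .groupby(["x_ray_system_name", "value"])
      .agg({"total_dlp": ["count", "mean"]})
)
```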