How do I group a pandas DataFrame using one column or another

Dear pandas DataFrame experts,

I have been using pandas DataFrames to help with re-writing the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).

I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.

An example dataframe might contain the following data:

study_name   request_name   total_dlp   x_ray_system_name
      head           head        50.0         All systems
      head           head       100.0         All systems
      head            NaN       200.0         All systems
     blank            NaN        75.0         All systems
     blank            NaN       125.0         All systems
     blank           head       400.0         All systems
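
For reference, the example frame can be built with something like the following (assuming the missing request_name entries are numpy.nan):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})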

The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:

df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})

with the following result:

                                 total_dlp
                                     count         mean
x_ray_system_name   study_name   
All systems         blank                3   200.000000
                    head                 3   116.666667

I now need to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries and also the single request_name "head" entry, i.e. (50.0 + 100.0 + 200.0 + 400.0) / 4 = 187.5.

I would like the results to look something like this:

                                 total_dlp
                                     count         mean
x_ray_system_name   name   
All systems         blank                3   200.000000
                    head                 4   187.500000

Does anyone know how I can carry out a groupby based on categories in one field or another?

Any help you can offer will be very much appreciated.

Kind regards,

David

CodePudding user response:

Your (groupby) data is essentially the union of:

  1. the rows where study_name == request_name, kept once
  2. the rows where study_name != request_name, duplicated: once for study_name and once for request_name

We can duplicate the data with melt:

(pd.concat([df.query('study_name==request_name')    # equal part
              .drop('request_name', axis=1),        # remove so `melt` doesn't duplicate this data
            df.query('study_name!=request_name')])  # not equal part
   .melt(['x_ray_system_name','total_dlp'])         # melt to duplicate
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Update: editing the above code made me realize that we can simplify it to:

# mask `request_name` with `NaN` where they equal `study_name`
# so they are ignored when duplicate/mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp']) 
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Output:

x_ray_system_name  value
All systems        blank    200.0
                   head     187.5
Name: total_dlp, dtype: float64
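
If you also want the count alongside the mean, as in the question, the final .mean() can be swapped for .agg(['count', 'mean']) (my tweak, not part of the original answer):

(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp'])
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].agg(['count', 'mean'])
)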

CodePudding user response:

I have a similar approach to that of @QuangHoang, but with a different order of operations.

I am using the original (range) index here to choose which duplicated rows to drop.

You can melt, drop_duplicates, dropna and groupby:

(df.reset_index()                                       # keep the original row id as a column
   .melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
   .drop_duplicates(['index', 'value'])   # rows with study_name == request_name melt into two identical entries; keep one
   .dropna(subset=['value'])              # discard the NaN request_name entries
   .groupby(["x_ray_system_name", 'value'])
   .agg({"total_dlp": ["count", "mean"]})
)

Output:

                        total_dlp       
                            count   mean
x_ray_system_name value                 
All systems       blank         3  200.0
                  head          4  187.5
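
Not part of the original answer, but if you prefer flat column names, the resulting column MultiIndex can be collapsed; here res is assumed to hold the aggregated result from above:

res.columns = ["_".join(col) for col in res.columns]   # e.g. 'total_dlp_count', 'total_dlp_mean'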