Fit Distribution to Python Pandas DF with Groupby

Time:11-12

I have hourly data in a dataframe 'df' that looks like the sample below. It is large, with shape (3418896, 9). I need to fit the Weibull distribution to the data, but I need the outputs (shape, loc, scale) grouped by 'plant_name', 'month', and 'year'.

       plant_name  business_name business_code maint_region_name  wind_speed_ms            mos_time dataset  month  year
0  MAPLE RIDGE II  UNITED STATES           USA              EAST          10.06 2021-09-22 13:00:00    ERA5      9  2021
1  MAPLE RIDGE II  UNITED STATES           USA              EAST          10.04 2021-09-22 12:00:00    ERA5      9  2021
2  MAPLE RIDGE II  UNITED STATES           USA              EAST           9.84 2021-09-22 11:00:00    ERA5      9  2021
3  MAPLE RIDGE II  UNITED STATES           USA              EAST          10.67 2021-09-22 10:00:00    ERA5      9  2021
4  MAPLE RIDGE II  UNITED STATES           USA              EAST          11.47 2021-09-22 09:00:00    ERA5      9  2021

I need a shape and scale value for each plant_name, month, and year in 'df'. I tried the code below, but it returns a single shape and scale for the whole dataset instead of a separate pair for each plant_name, month, and year:

from scipy.stats import weibull_min

shape, loc, scale = weibull_min.fit(ncData.groupby(['plant_name','month','year']).apply(lambda x:x['wind_speed_ms']), floc=0)

shape
Out[21]: 2.2556719467040596

scale
Out[22]: 7.603953856897537

I don't know how to get separate 'shape' and 'scale' outputs for each 'plant_name', 'month', 'year' group from the groupby. Thank you very much for your time; any suggestion I can try would help.

CodePudding user response:

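This should work. weibull_min.fit returns one set of parameters per call, so it has to be called once per group inside the groupby: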

import pandas as pd
from scipy.stats import weibull_min

# function applied to each ('plant_name','month','year') group
def fit_weibull(g):
    # get wind speed data from the group
    data = g['wind_speed_ms']
    # fit weibull_min to the group wind data
    params = weibull_min.fit(data)
    # Return the fit parameters as a Series (each parameter will correspond to a different column)
    return pd.Series(params, index=['shape', 'loc', 'scale'])

fit_params = ncData.groupby(['plant_name', 'month', 'year']).apply(fit_weibull)
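
If you want to keep floc=0 as in the question, a variant along these lines may help. It is only a sketch on synthetic data: the frame `demo`, the plant labels, and the generated wind speeds are made up for illustration and stand in for the real dataframe. It loops over the groups explicitly and collects one row of parameters per ('plant_name', 'month', 'year') combination.

```python
import numpy as np
import pandas as pd
from scipy.stats import weibull_min

# Synthetic stand-in for the hourly data (hypothetical plants and values).
rng = np.random.default_rng(0)
frames = []
for plant in ["PLANT A", "PLANT B"]:
    for month in (8, 9):
        # Weibull-distributed speeds with shape 2.0 and scale 8.0
        speeds = 8.0 * rng.weibull(2.0, size=500)
        frames.append(pd.DataFrame({
            "plant_name": plant,
            "month": month,
            "year": 2021,
            "wind_speed_ms": speeds,
        }))
demo = pd.concat(frames, ignore_index=True)

# Fit each group separately, holding loc at 0 as in the question.
records = []
for (plant, month, year), g in demo.groupby(["plant_name", "month", "year"]):
    shape, loc, scale = weibull_min.fit(g["wind_speed_ms"].to_numpy(), floc=0)
    records.append({
        "plant_name": plant,
        "month": month,
        "year": year,
        "shape": shape,
        "scale": scale,
    })

# One row per group, with the group keys as ordinary columns.
fit_params = pd.DataFrame(records)
print(fit_params)
```

The result has the group keys as regular columns, so it can be merged back onto the original data or filtered by plant. On the real data, replace `demo` with your own dataframe.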