I have a dataframe that looks like the one below.
import pandas as pd

d = {'location': ['a', 'a', 'b', 'b'], 'value': [1, 5, 3, 7], 'weight': [0.9, 0.1, 0.8, 0.2]}
df = pd.DataFrame(data=d)
df
  location  value  weight
0        a      1     0.9
1        a      5     0.1
2        b      3     0.8
3        b      7     0.2
I currently have code that computes the grouped median, standard deviation, skew, and quantiles for the unweighted data:
df = df[['location', 'value']]
df1 = df.groupby('location').agg(['median', 'skew', 'std']).reset_index()
df2 = df.groupby('location').quantile([0.1, 0.9, 0.25, 0.75, 0.5]).unstack(level=1).reset_index()
dfs = df1.merge(df2, how='left', on='location')
And the result is the following:
  location  value
            median skew       std  0.1  0.9 0.25 0.75  0.5
0        a       3  NaN  2.828427  1.4  4.6  2.0  4.0  3.0
1        b       5  NaN  2.828427  3.4  6.6  4.0  6.0  5.0
I would like to produce exactly the same result DataFrame as the one above, but with weighted statistics computed using the weight column. How can I go about doing this?
CodePudding user response:
You can just pass a dict mapping each column to its list of aggregations:
df1 = df.groupby('location').agg({"value": ['median', 'skew', 'std'],
                                  "weight": ['median', 'skew', 'std']}).reset_index()
CodePudding user response:
Instead of merging two groupby operations, use named aggregation after weighting the values:
- Generate weighted values using assign.
- Aggregate using {output_col: (input_col, agg_function), ...}.
# dict unpacking lets the output columns use names like '0.1' that aren't valid keywords
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**{
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    '0.1': ('weighted', lambda x: x.quantile(0.1)),
    '0.9': ('weighted', lambda x: x.quantile(0.9)),
    '0.25': ('weighted', lambda x: x.quantile(0.25)),
    '0.75': ('weighted', lambda x: x.quantile(0.75)),
    '0.5': ('weighted', lambda x: x.quantile(0.5)),
})
Output:
          median  skew       std   0.1   0.9  0.25  0.75  0.5
location
a            0.7   NaN  0.282843  0.54  0.86  0.60  0.80  0.7
b            1.9   NaN  0.707107  1.50  2.30  1.65  2.15  1.9
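Since named aggregation assigns every output column a flat name up front, the quantiles land in the same frame as the other statistics, and the separate merge step from the question is no longer needed.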
CodePudding user response:
The most accurate way to add weights to observations is to duplicate the observations according to their weights.
Adjusting the weights
Before duplicating, you need to adjust the weights, because they are not integers yet. Here is a function that can help you adjust them.
import numpy as np

def adjust(weights):
    # scale by the power of 10 that clears the decimals, then divide out the common factor
    base = 10 ** max(len(str(i).split(".")[1]) for i in weights)
    scalar = base / np.gcd.reduce((weights * base).astype(int))
    return weights * scalar
If you are curious to know how this function works, you can refer to my question.
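As a quick worked illustration on this question's weights (my addition, not from the linked question): each weight has one decimal digit, so base = 10**1 = 10; multiplying the weights by 10 gives [9, 1, 8, 2], whose GCD is 1, so scalar = 10 / 1 = 10 and the weights become [9, 1, 8, 2].
>>> adjust(pd.Series([0.9, 0.1, 0.8, 0.2])).tolist()
[9.0, 1.0, 8.0, 2.0]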
df = pd.DataFrame({
    "location": ["a", "a", "b", "b"],
    "value": [1, 5, 3, 7],
    "weights": [0.9, 0.1, 0.8, 0.2]
})
df.loc[:, "weights"] = adjust(df["weights"])
Here are the adjusted weights.
>>> df
  location  value  weights
0        a      1      9.0
1        a      5      1.0
2        b      3      8.0
3        b      7      2.0
Duplicating the observations
After adjusting the weights, duplicate each observation according to its weight.
# cast the weights to int so the repeat counts are integers, then drop the helper column
df = df.loc[df.index.repeat(df["weights"].astype(int))] \
       .reset_index(drop=True).drop("weights", axis=1)
You can refer to this excellent answer to understand how it works.
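In short (my own gloss, not from that answer): Index.repeat builds a new index in which each row label appears as many times as its weight, and .loc then selects each row that many times. On the pre-duplication frame above:
>>> df.index.repeat(df["weights"].astype(int)).tolist()
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3]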
Let's count the number of observations after duplication.
>>> df.count()
location 20
value 20
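This matches the sum of the adjusted weights: 9 + 1 + 8 + 2 = 20.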
Performing the statistical operations
Now you can use groupby and aggregate with any statistical operation; the data is now weighted.
df1 = df.groupby("location").agg(["median", "skew", "std"]).reset_index()
df2 = df.groupby("location").quantile([0.1, 0.9, 0.25, 0.75, 0.5]) \
        .unstack(level=1).reset_index()
print(df1.merge(df2, how="left", on="location"))
This gives the following output.
  location  value
            median      skew       std  0.1  0.9 0.25 0.75  0.5
0        a     1.0  3.162278  1.264911  1.0  1.4  1.0  1.0  1.0
1        b     3.0  1.778781  1.686548  3.0  7.0  3.0  3.0  3.0