I have a dataframe that looks like the one below.
import pandas as pd

d = {'location': ['a', 'a', 'b', 'b'], 'value': [1, 5, 3, 7], 'weight': [0.9, 0.1, 0.8, 0.2]}
df = pd.DataFrame(data=d)
df
  location  value  weight
0        a      1     0.9
1        a      5     0.1
2        b      3     0.8
3        b      7     0.2
I currently have code that computes the grouped median, standard deviation, skew, and quantiles for the unweighted data:
df = df[['location', 'value']]
df1 = df.groupby('location').agg(['median', 'skew', 'std']).reset_index()
df2 = df.groupby('location').quantile([0.1, 0.9, 0.25, 0.75, 0.5]).unstack(level=1).reset_index()
dfs = df1.merge(df2, how='left', on='location')
And the result is the following:
  location  value
            median skew       std  0.1  0.9 0.25 0.75  0.5
0        a       3  NaN  2.828427  1.4  4.6  2.0  4.0  3.0
1        b       5  NaN  2.828427  3.4  6.6  4.0  6.0  5.0
I would like to produce exactly the same result DataFrame as the one above, but with weighted statistics computed using the weight column. How can I go about doing this?
CodePudding user response:
You can just pass a dict mapping each column to its list of aggregations:
df1 = df.groupby('location').agg({"value": ['median', 'skew', 'std'],
                                  "weight": ['median', 'skew', 'std']}).reset_index()
CodePudding user response:
Instead of merging two groupby operations, use named aggregation after weighting the values:
- Generate weighted values using assign.
- Aggregate using {output_col: (input_col, agg_function), ...}.
# dict unpacking lets the output columns use names like '0.1' that aren't valid keywords
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**{
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    '0.1': ('weighted', lambda x: x.quantile(0.1)),
    '0.9': ('weighted', lambda x: x.quantile(0.9)),
    '0.25': ('weighted', lambda x: x.quantile(0.25)),
    '0.75': ('weighted', lambda x: x.quantile(0.75)),
    '0.5': ('weighted', lambda x: x.quantile(0.5)),
})
Output:
          median  skew       std   0.1   0.9  0.25  0.75  0.5
location
a            0.7   NaN  0.282843  0.54  0.86  0.60  0.80  0.7
b            1.9   NaN  0.707107  1.50  2.30  1.65  2.15  1.9
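Since named aggregation assigns every output column a flat name up front, the quantiles land in the same frame as the other statistics, and the separate merge step from the question is no longer needed.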
CodePudding user response:
The most accurate way to add weights to observations is to duplicate the observations according to their weights.
Adjusting the weights
Before duplicating, you need to adjust the weights, because they are not integers yet. Here is a function that can help you adjust them.
import numpy as np

def adjust(weights):
    # scale by the power of 10 that clears the decimals, then divide out the common factor
    base = 10 ** max(len(str(i).split(".")[1]) for i in weights)
    scalar = base / np.gcd.reduce((weights * base).astype(int))
    return weights * scalar
If you are curious to know how this function works, you can refer to my question.
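As a quick worked illustration on this question's weights (my addition, not from the linked question): each weight has one decimal digit, so base = 10**1 = 10; multiplying the weights by 10 gives [9, 1, 8, 2], whose GCD is 1, so scalar = 10 / 1 = 10 and the weights become [9, 1, 8, 2].
>>> adjust(pd.Series([0.9, 0.1, 0.8, 0.2])).tolist()
[9.0, 1.0, 8.0, 2.0]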
df = pd.DataFrame({
    "location": ["a", "a", "b", "b"],
    "value": [1, 5, 3, 7],
    "weights": [0.9, 0.1, 0.8, 0.2]
})
df.loc[:, "weights"] = adjust(df["weights"])
Here are the adjusted weights.
>>> df
  location  value  weights
0        a      1      9.0
1        a      5      1.0
2        b      3      8.0
3        b      7      2.0
Duplicating the observations
After adjusting the weights, duplicate each observation according to its weight.
# cast the weights to int so the repeat counts are integers, then drop the helper column
df = df.loc[df.index.repeat(df["weights"].astype(int))] \
       .reset_index(drop=True).drop("weights", axis=1)
You can refer to this excellent answer to understand how it works.
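In short (my own gloss, not from that answer): Index.repeat builds a new index in which each row label appears as many times as its weight, and .loc then selects each row that many times. On the pre-duplication frame above:
>>> df.index.repeat(df["weights"].astype(int)).tolist()
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3]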
Let's count the number of observations after duplication.
>>> df.count()
location 20
value 20
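This matches the sum of the adjusted weights: 9 + 1 + 8 + 2 = 20.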
Performing the statistical operations
Now you can use groupby and aggregate with any statistical operation; the data is now weighted.
df1 = df.groupby("location").agg(["median", "skew", "std"]).reset_index()
df2 = df.groupby("location").quantile([0.1, 0.9, 0.25, 0.75, 0.5]) \
        .unstack(level=1).reset_index()
print(df1.merge(df2, how="left", on="location"))
This gives the following output.
  location  value
            median      skew       std  0.1  0.9 0.25 0.75  0.5
0        a     1.0  3.162278  1.264911  1.0  1.4  1.0  1.0  1.0
1        b     3.0  1.778781  1.686548  3.0  7.0  3.0  3.0  3.0