I am working with an extremely large dataframe. Here is a sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
})
ID
0 A
1 A
2 A
3 X
4 X
5 Y
Now, given the frequency of each value in column 'ID', I want to calculate a weight using the function below and add a column that holds the weight associated with each value in 'ID'.
def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0 / np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples / np.sum(weights_for_samples) * no_of_classes
    return weights_for_samples
freq = df.value_counts()
print(freq)
ID
A 3
X 2
Y 1
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
So, I am looking for an efficient way to get a dataframe like this given the above weights:
ID sample_weight
0 A 0.54545455
1 A 0.54545455
2 A 0.54545455
3 X 0.81818182
4 X 0.81818182
5 Y 1.63636364
CodePudding user response:
You can map the values:
df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))
NB. Calling value_counts on the whole DataFrame returns a Series indexed by a MultiIndex with a single level, hence the need for get_level_values.
As noted by @ScottBoston, a better approach would be to use:
freq = df['ID'].value_counts()
df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))
Output:
ID sample_weight
0 A 0.545455
1 A 0.545455
2 A 0.545455
3 X 0.818182
4 X 0.818182
5 Y 1.636364
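As a sanity check on these numbers, the weights can be reproduced by hand from the counts 3, 2 and 1 shown in the question. The sketch below uses exact fractions from the standard library; the names `counts`, `inverse` and `exact_weights` are only illustrative and are not part of the original code.

```python
from fractions import Fraction

# Counts per class, taken from the value_counts output in the question.
counts = {'A': 3, 'X': 2, 'Y': 1}
power = 1  # integer version of the question's default power=1.

# Step 1: inverse frequency per class -> 1/3, 1/2, 1
inverse = {label: Fraction(1, n ** power) for label, n in counts.items()}

# Step 2: normalise so the weights sum to the number of classes (3 here).
total = sum(inverse.values())  # 1/3 + 1/2 + 1 = 11/6
exact_weights = {
    label: inv / total * len(counts) for label, inv in inverse.items()
}

for label, w in exact_weights.items():
    print(label, w, float(w))
```

The exact values are 6/11, 9/11 and 18/11, which round to 0.545455, 0.818182 and 1.636364 and add up to 3, the number of classes.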
CodePudding user response:
If you rely on duck typing a little more, you can rewrite your function so that it returns the same type it receives.
This saves you from having to explicitly reach back into the .index before calling .map
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""
    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes
# select the column before using `.value_counts()`
# this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts()
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
# A 0.545455
# X 0.818182
# Y 1.636364
# note that now our weights are still a `pd.Series`
# that we can align directly against our `"ID"` column
df['sample_weight'] = df['ID'].map(weights)
print(df)
# ID sample_weight
# 0 A 0.545455
# 1 A 0.545455
# 2 A 0.545455
# 3 X 0.818182
# 4 X 0.818182
# 5 Y 1.636364
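Since the question mentions a very large dataframe, one possible variation skips the label lookup altogether: NumPy can return, for every row, the position of its class together with the per-class counts. This is a sketch of a different technique (`np.unique` with `return_inverse` and `return_counts`), not something taken from the answers above, and it has not been benchmarked here, so whether it beats `.map` on your data is an assumption to verify. The names `labels`, `inverse` and `class_weights` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

def get_weights_inverse_num_of_samples(label_counts, power=1.0):
    """Same formula as in the question, applied to an array of counts."""
    label_counts = np.asarray(label_counts, dtype=float)
    weights = 1.0 / np.power(label_counts, power)
    return weights / weights.sum() * len(label_counts)

# labels[j] is the j-th distinct ID (sorted), counts[j] is how often it
# occurs, and inverse[i] is the position in `labels` of row i's ID.
labels, inverse, counts = np.unique(
    df['ID'].to_numpy(), return_inverse=True, return_counts=True
)
class_weights = get_weights_inverse_num_of_samples(counts)

# Integer-array indexing expands one weight per class to one weight per row.
df['sample_weight'] = class_weights[inverse]
print(df)
```

Because `np.unique` sorts the labels, a column with missing or mixed-type values may need cleaning first; the `.map` approach above is the safer default when that is a concern.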