Python add weights associated with values of a column

I am working with an extremely large dataframe. Here is a sample:

import pandas as pd
import numpy as np
df = pd.DataFrame({ 
'ID': ['A', 'A', 'A', 'X', 'X', 'Y'], 
})
  ID
0  A
1  A
2  A
3  X
4  X
5  Y

Now, given the frequency of each value in column 'ID', I want to calculate a weight using the function below and add a column that holds the weight associated with each value in 'ID'.

def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0/np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples/ np.sum(weights_for_samples)*no_of_classes
    return weights_for_samples

freq = df.value_counts()
print(freq)
ID
A     3
X     2
Y     1

weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
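
For reference, the numbers above can be checked by hand. The following is a minimal sketch, assuming the counts 3, 2, 1 from the sample and power=1; it uses exact fractions (not part of the original function) only to make the arithmetic visible:

```python
from fractions import Fraction

counts = [3, 2, 1]                           # frequencies of A, X, Y in the sample
inverse = [Fraction(1, c) for c in counts]   # 1/3, 1/2, 1
total = sum(inverse)                         # 11/6

# Normalise so the weights sum to the number of classes (3 here).
weights = [w / total * len(counts) for w in inverse]   # 6/11, 9/11, 18/11

print([float(w) for w in weights])
```

The rarest label ('Y') gets the largest weight, and the weights sum to the number of distinct labels.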

So, I am looking for an efficient way to get a dataframe like this given the above weights:

   ID  sample_weight
0  A   0.54545455
1  A   0.54545455
2  A   0.54545455
3  X   0.81818182
4  X   0.81818182
5  Y   1.63636364

CodePudding user response：

You can map the values:

df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))

NB. value_counts called on the whole DataFrame (as in the question) returns a Series whose index is a MultiIndex with a single level, hence the need for get_level_values(0).
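
The difference between the two calls can be inspected directly. This is a small sketch using the sample data from the question; the variable names frame_counts and column_counts are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

frame_counts = df.value_counts()          # DataFrame.value_counts
column_counts = df['ID'].value_counts()   # Series.value_counts

# Compare the index types of the two results.
print(type(frame_counts.index).__name__)
print(type(column_counts.index).__name__)

# Pulling level 0 out of the MultiIndex recovers the plain labels.
labels = list(frame_counts.index.get_level_values(0))
print(labels)
```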

As noted by @ScottBoston, a better approach is to call value_counts on the column itself, which returns a plain Index:

freq = df['ID'].value_counts()

df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))

Output:

  ID  sample_weight
0  A       0.545455
1  A       0.545455
2  A       0.545455
3  X       0.818182
4  X       0.818182
5  Y       1.636364

CodePudding user response：

If you rely on duck typing a little more, you can rewrite your function so that it returns the same type it receives as input.

This saves you from having to reach back into the .index explicitly before calling .map.

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""

    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes

# select the column before using `.value_counts()`
#   this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts() 

weights = get_weights_inverse_num_of_samples(freq)

print(weights)
# A    0.545455
# X    0.818182
# Y    1.636364

# note that now our weights are still a `pd.Series` 
#  that we can align directly against our `"ID"` column

df['sample_weight'] = df['ID'].map(weights)

print(df)
#   ID  sample_weight
# 0  A       0.545455
# 1  A       0.545455
# 2  A       0.545455
# 3  X       0.818182
# 4  X       0.818182
# 5  Y       1.636364
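
As a quick sanity check, the dict-based mapping from the first answer and the Series-based mapping here should produce the same column. A minimal sketch, assuming the sample data; the names weights_numpy and weights_series are hypothetical stand-ins for the two versions of the function:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
freq = df['ID'].value_counts()

def weights_numpy(label_counts, power=1.0):
    # Version from the question: coerces to an ndarray and loses the labels.
    w = 1.0 / np.power(np.array(label_counts), power)
    return w / np.sum(w) * len(label_counts)

def weights_series(label_counts, power=1):
    # Duck-typed version: stays a pd.Series indexed by label.
    w = 1 / (label_counts ** power)
    return w / w.sum() * len(label_counts)

# Map via an explicit label -> weight dict, and via the Series directly.
via_dict = df['ID'].map(dict(zip(freq.index, weights_numpy(freq))))
via_series = df['ID'].map(weights_series(freq))

print(np.allclose(via_dict, via_series))
```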