I am struggling to create a function that first counts the occurrences of each string in a specific column (from row 0 to row n) and then reduces those counts to a single value by taking the mean of the value_counts from the first row to row n.
More precisely, I would like to create a new column ['Mean'] where the value in each row n equals the mean of the value_counts() computed from the first row to the nth row of the column ['Name'].
import pandas as pd
import datetime as dt
data = [["2022-11-1", 'Tom'], ["2022-11-2", 'Mike'], ["2022-11-3", 'Paul'], ["2022-11-4", 'Pauline'], ["2022-11-5", 'Pauline'], ["2022-11-6", 'Mike'], ["2022-11-7", 'Tom'], ["2022-11-8", 'Louise'], ["2022-11-9", 'Tom'], ["2022-11-10", 'Mike'], ["2022-11-11", 'Paul'], ["2022-11-12", 'Pauline'], ["2022-11-13", 'Pauline'], ["2022-11-14", 'Mike'], ["2022-11-15", 'Tom'], ["2022-11-16", 'Louise']]
df = pd.DataFrame(data, columns=['Date', 'Name'])
So for example, the 5th row (index 4) of ['Mean'] should have a value of 1.25, as Pauline has appeared twice by then, so the calculation should be (1 + 1 + 1 + 2)/4 = 1.25.
Thank you,
CodePudding user response:
The logic is unclear, but assuming you want the expanding average count of values, use:
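# factorize maps each name to an integer code, since expanding().apply needs numeric data
df['mean'] = (pd.Series(pd.factorize(df['Name'])[0], index=df.index)
                .expanding()
                # for each window (rows 0..n), average the per-name counts
                .apply(lambda s: s.value_counts().mean())
             )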
Output:
          Date     Name  mean
0    2022-11-1      Tom  1.00
1    2022-11-2     Mike  1.00
2    2022-11-3     Paul  1.00
3    2022-11-4  Pauline  1.00
4    2022-11-5  Pauline  1.25
5    2022-11-6     Mike  1.50
6    2022-11-7      Tom  1.75
7    2022-11-8   Louise  1.60
8    2022-11-9      Tom  1.80
9   2022-11-10     Mike  2.00
10  2022-11-11     Paul  2.20
11  2022-11-12  Pauline  2.40
12  2022-11-13  Pauline  2.60
13  2022-11-14     Mike  2.80
14  2022-11-15      Tom  3.00
15  2022-11-16   Louise  3.20
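
A possible alternative, if the expanding apply turns out to be slow on a large frame: the mean of the value counts over rows 0..n is just the number of rows seen so far divided by the number of distinct names seen so far, so it can be computed without a Python-level lambda. This is only a sketch under the assumption that ['Name'] has no missing values (value_counts() drops NaN by default, which this version does not replicate); the column name mean_fast and the shortened sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

# First eight names from the question's sample data
names = ['Tom', 'Mike', 'Paul', 'Pauline', 'Pauline', 'Mike', 'Tom', 'Louise']
df = pd.DataFrame({'Name': names})

# Number of rows seen so far: 1, 2, ..., n
rows_seen = np.arange(1, len(df) + 1)

# Number of distinct names seen so far: a name is "new" on its first
# occurrence, so the cumulative sum of first occurrences counts them
distinct_seen = (~df['Name'].duplicated()).cumsum()

# Mean of the value counts = total rows / distinct names
df['mean_fast'] = rows_seen / distinct_seen

print(df)
```

On this sample the new column should agree with the expanding version above; it is worth comparing the two on your real data before switching.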