I have a dataframe (df) that looks like this (unknown data for current year months 7-12) and I'm trying to calculate statistics on random selections from the existing data for the missing months below "n" number of times. I need to save the randomly selected numbers for the given data for each month in a new df. In this case, I have known data for year = 2022 and months 1 -6 and I need to randomly choose one of the 3 values from the previous months (7,8,9,10,11,12) and the years given (2019, 2020, 2021 not including 2022).
index month team year val cum_val
0 1 celtics 2019 4 4
1 2 celtics 2019 9 13
2 3 celtics 2019 17 30
3 4 celtics 2019 4 34
4 5 celtics 2019 7 41
5 6 celtics 2019 8 49
6 7 celtics 2019 2 51
7 8 celtics 2019 5 56
8 9 celtics 2019 1 57
9 10 celtics 2019 0 57
10 11 celtics 2019 18 75
11 12 celtics 2019 16 91
12 1 celtics 2020 7 7
13 2 celtics 2020 18 25
14 3 celtics 2020 27 52
15 4 celtics 2020 4 56
16 5 celtics 2020 9 65
17 6 celtics 2020 13 78
18 7 celtics 2020 2 80
19 8 celtics 2020 2 82
20 9 celtics 2020 5 87
21 10 celtics 2020 3 90
22 11 celtics 2020 7 97
23 12 celtics 2020 7 104
24 1 celtics 2021 3 3
25 2 celtics 2021 4 7
26 3 celtics 2021 2 9
27 4 celtics 2021 21 30
28 5 celtics 2021 5 35
29 6 celtics 2021 8 43
30 7 celtics 2021 27 70
31 8 celtics 2021 11 81
32 9 celtics 2021 4 85
33 10 celtics 2021 3 88
34 11 celtics 2021 1 89
35 12 celtics 2021 9 98
36 1 celtics 2022 14 14
37 2 celtics 2022 5 19
38 3 celtics 2022 29 48
39 4 celtics 2022 9 57
40 5 celtics 2022 0 57
41 6 celtics 2022 28 85
42 7 celtics 2022 ? 85
43 8 celtics 2022 ? 85
44 9 celtics 2022 ? 85
45 10 celtics 2022 ? 85
46 11 celtics 2022 ? 85
47 12 celtics 2022 ? 85
With n == 5, an outcome might look like this for the new dataframe:
month n1 n2 n3 n4 n5
7 27 2 27 2 2
8 11 11 2 5 5
9 4 1 4 5 1
10 0 3 0 3 3
11 18 1 1 7 18
12 16 9 9 7 16
Thank you for any help on how to get this going.
CodePudding user response:
try:
df[df.month.isin(
df.loc[df.val.eq('?'), 'month']
)].groupby('month')['val'].apply(
lambda x: x[x.ne('?')].sample(5, replace=True, ignore_index=True)
).unstack(level=-1)
or in one long line
df[df.month.isin(df.loc[df.val.eq('?'), 'month'])].groupby('month')['val'].apply(lambda x: x[x.ne('?')].sample(5, replace=True, ignore_index=True)).unstack(level=-1)
if instead of ?
you have nan
, change the eq('?')
to isna()
and the ne('?')
to notna()
Output:
0 1 2 3 4
month
7 2 2 2 2 2
8 5 5 11 11 2
9 5 1 4 1 5
10 3 3 0 3 0
11 7 7 18 18 1
12 7 9 7 7 7