Home > database >  Select Panda Row Data Randomly N Times to Create New Dataframe
Select Panda Row Data Randomly N Times to Create New Dataframe

Time:08-11

I have a dataframe (df) that looks like this (unknown data for current year months 7-12) and I'm trying to calculate statistics on random selections from the existing data for the missing months below "n" number of times. I need to save the randomly selected numbers for the given data for each month in a new df. In this case, I have known data for year = 2022 and months 1 -6 and I need to randomly choose one of the 3 values from the previous months (7,8,9,10,11,12) and the years given (2019, 2020, 2021 not including 2022).

index   month   team    year    val cum_val
0   1   celtics 2019    4   4
1   2   celtics 2019    9   13
2   3   celtics 2019    17  30
3   4   celtics 2019    4   34
4   5   celtics 2019    7   41
5   6   celtics 2019    8   49
6   7   celtics 2019    2   51
7   8   celtics 2019    5   56
8   9   celtics 2019    1   57
9   10  celtics 2019    0   57
10  11  celtics 2019    18  75
11  12  celtics 2019    16  91
12  1   celtics 2020    7   7
13  2   celtics 2020    18  25
14  3   celtics 2020    27  52
15  4   celtics 2020    4   56
16  5   celtics 2020    9   65
17  6   celtics 2020    13  78
18  7   celtics 2020    2   80
19  8   celtics 2020    2   82
20  9   celtics 2020    5   87
21  10  celtics 2020    3   90
22  11  celtics 2020    7   97
23  12  celtics 2020    7   104
24  1   celtics 2021    3   3
25  2   celtics 2021    4   7
26  3   celtics 2021    2   9
27  4   celtics 2021    21  30
28  5   celtics 2021    5   35
29  6   celtics 2021    8   43
30  7   celtics 2021    27  70
31  8   celtics 2021    11  81
32  9   celtics 2021    4   85
33  10  celtics 2021    3   88
34  11  celtics 2021    1   89
35  12  celtics 2021    9   98
36  1   celtics 2022    14  14
37  2   celtics 2022    5   19
38  3   celtics 2022    29  48
39  4   celtics 2022    9   57
40  5   celtics 2022    0   57
41  6   celtics 2022    28  85
42  7   celtics 2022    ?   85
43  8   celtics 2022    ?   85
44  9   celtics 2022    ?   85
45  10  celtics 2022    ?   85
46  11  celtics 2022    ?   85
47  12  celtics 2022    ?   85

With n == 5, an outcome might look like this for the new dataframe:

month   n1  n2  n3  n4  n5
7       27  2   27  2   2
8       11  11  2   5   5
9        4  1   4   5   1
10       0  3   0   3   3
11      18  1   1   7   18
12      16  9   9   7   16

Thank you for any help on how to get this going.

CodePudding user response:

try:

df[df.month.isin(
    df.loc[df.val.eq('?'), 'month']
)].groupby('month')['val'].apply(
    lambda x: x[x.ne('?')].sample(5, replace=True, ignore_index=True)
).unstack(level=-1)

or in one long line

df[df.month.isin(df.loc[df.val.eq('?'), 'month'])].groupby('month')['val'].apply(lambda x: x[x.ne('?')].sample(5, replace=True, ignore_index=True)).unstack(level=-1)

if instead of ? you have nan, change the eq('?') to isna() and the ne('?') to notna()

Output:

       0  1   2   3  4
month                 
7      2  2   2   2  2
8      5  5  11  11  2
9      5  1   4   1  5
10     3  3   0   3  0
11     7  7  18  18  1
12     7  9   7   7  7
  • Related