Dynamically updating row values based on a condition in pandas-CodePudding

I am running a simulation test where I want to dynamically change some values present in rows for each column based on certain set of conditions

The Problem Statement

My dataset has 400 rows and my first test case is to update 5% of the rows in each column, so 5% of 400 = 20 rows which needs to be updated
These 20 rows should be only updated for the top 5 categories that are present in my dataset. So 4 rows each which needs to be updated

My dataframe looks like this:

A     B     C    D      Category
1     10    3    4      X
4     9     6    9      Y
9     3     7    10     XX
10    1     9    7      YY
10    1     9    7      ZZ
10    1     9    7      YZZ
10    1     9    7      YZZ
10    1     9    7      YYYY
                        ......400 rows

The conditions are:

While updating the rows I would want to make sure that 20 rows (5% of the overall dataset) should be updated only where the top 5 categories are encountered. In my case the top 5 categories are X, Y , XX, YY and ZZ. These rows should be updated to value 7 where the previous value was 1,2,3,4,5,6

The resultant datframe should look like this:

A     B     C    D      Category
7     10    7    7      X
7     9     7    9      Y
9     7     7    10     XX
10    7     9    7      YY
10    7     9    7      ZZ
10    1     9    7      YZZ
10    1     9    7      YZZ
10    1     9    7      YYYY
                        ......400 rows

In the resultant dataframe, there is no impact on the categories which are not the top 5 categories, this case YZZ or YYYY and to demonstrate an example I can't show all the updated rows but for example in the above dataframe, 2 rows have been updated for column A where previous value was <=6 to a new value 7 and similarly the other two rows will get updated to 7 wherever the condition is met.

How can I achieve this?

CodePudding user response：

You can try the following logic:

# get only desired Categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])

# select 20 random rows from the above 
idx = df[m].sample(n=20).index

# replace the 1 ≤ values ≤ 6 by 7
df.loc[idx] = df.loc[idx].mask(df.loc[idx].ge(1)&df.loc[idx].le(6), 7)

If you rather want 4 rows per Category, use this variant for the random sampling:

idx = df[m].groupby('Category').sample(n=4).index