I am running a simulation test where I want to dynamically change some values present in rows for each column based on certain set of conditions
The Problem Statement
- My dataset has 400 rows and my first test case is to update 5% of the rows in each column, so 5% of 400 = 20 rows which needs to be updated
- These 20 rows should be only updated for the top 5 categories that are present in my dataset. So 4 rows each which needs to be updated
My dataframe looks like this:
A B C D Category
1 10 3 4 X
4 9 6 9 Y
9 3 7 10 XX
10 1 9 7 YY
10 1 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
The conditions are:
- While updating the rows I would want to make sure that 20 rows (5% of the overall dataset) should be updated only where the top 5 categories are encountered. In my case the top 5 categories are
X, Y , XX, YY and ZZ
. These rows should be updated to value 7 where the previous value was 1,2,3,4,5,6
The resultant datframe should look like this:
A B C D Category
7 10 7 7 X
7 9 7 9 Y
9 7 7 10 XX
10 7 9 7 YY
10 7 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
In the resultant dataframe, there is no impact on the categories which are not the top 5 categories, this case YZZ or YYYY
and to demonstrate an example I can't show all the updated rows but for example in the above dataframe, 2 rows have been updated for column A where previous value was <=6 to a new value 7 and similarly the other two rows will get updated to 7 wherever the condition is met.
How can I achieve this?
CodePudding user response:
You can try the following logic:
# get only desired Categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
# select 20 random rows from the above
idx = df[m].sample(n=20).index
# replace the 1 ≤ values ≤ 6 by 7
df.loc[idx] = df.loc[idx].mask(df.loc[idx].ge(1)&df.loc[idx].le(6), 7)
If you rather want 4 rows per Category, use this variant for the random sampling:
idx = df[m].groupby('Category').sample(n=4).index