I want to filter the rows based on CPD(object data type) range:
Input:
---- ----------- ---------- -------- ------------ -------------
| | cpd_range | accuracy | n_size | model | frontend_id |
---- ----------- ---------- -------- ------------ -------------
| 0 | >55 | 32.86 | 140 | TFT_no_lag | 30667 |
| 1 | 50-55 | 31.43 | 35 | TFT_no_lag | 30667 |
| 2 | 45-50 | 21.43 | 28 | TFT_no_lag | 30667 |
| 3 | 40-45 | 34.29 | 35 | TFT_no_lag | 30667 |
| 4 | 35-40 | 14.29 | 70 | TFT_no_lag | 30667 |
| 5 | 30-35 | 25.00 | 56 | TFT_no_lag | 30667 |
| 6 | 25-30 | 28.57 | 49 | TFT_no_lag | 30667 |
| 7 | 20-25 | 19.05 | 105 | TFT_no_lag | 30667 |
| 8 | 15-20 | 20.06 | 329 | TFT_no_lag | 30667 |
| 9 | 10-15 | 19.37 | 630 | TFT_no_lag | 30667 |
| 10 | 5-10 | 18.07 | 1981 | TFT_no_lag | 30667 |
| 11 | 0-5 | 26.74 | 9982 | TFT_no_lag | 30667 |
---- ----------- ---------- -------- ------------ -------------
Final Output:
--- ----------- ---------- -------- ------------ -------------
| | cpd_range | accuracy | n_size | model | frontend_id |
--- ----------- ---------- -------- ------------ -------------
| 0 | >55 | 32.86 | 140 | TFT_no_lag | 30667 |
| 1 | 50-55 | 31.43 | 35 | TFT_no_lag | 30667 |
| 2 | 45-50 | 21.43 | 28 | TFT_no_lag | 30667 |
| 3 | 40-45 | 34.29 | 35 | TFT_no_lag | 30667 |
| 4 | 35-40 | 14.29 | 70 | TFT_no_lag | 30667 |
| 5 | 30-35 | 25.00 | 56 | TFT_no_lag | 30667 |
| 6 | 25-30 | 28.57 | 49 | TFT_no_lag | 30667 |
| 7 | 20-25 | 19.05 | 105 | TFT_no_lag | 30667 |
| 8 | 15-20 | 20.06 | 329 | TFT_no_lag | 30667 |
| 9 | 10-15 | 19.37 | 630 | TFT_no_lag | 30667 |
--- ----------- ---------- -------- ------------ -------------
I want to remove the rows with CPD_slot_range < 10, but the column is of object type, so I have to do this through regular expressions, but I'm new to regular expressions, can you please guide for this.
CodePudding user response:
Your cpd_range
is close to being formatted in a way that it could be made into an ordered categorical!
# Fix the one value that won't sort correctly
df.cpd_range = df.cpd_range.replace('5-10', '05-10')
# Make it an ordered categorical!
df.cpd_range = pd.Categorical(df.cpd_range, ordered=True)
# Now we can remove values by selecting only values greater than our cutoff~
df = df[df.cpd_range.gt('05-10')]
print(df)
Output:
cpd_range accuracy n_size model frontend_id
0 >55 32.86 140 TFT_no_lag 30667
1 50-55 31.43 35 TFT_no_lag 30667
2 45-50 21.43 28 TFT_no_lag 30667
3 40-45 34.29 35 TFT_no_lag 30667
4 35-40 14.29 70 TFT_no_lag 30667
5 30-35 25.00 56 TFT_no_lag 30667
6 25-30 28.57 49 TFT_no_lag 30667
7 20-25 19.05 105 TFT_no_lag 30667
8 15-20 20.06 329 TFT_no_lag 30667
9 10-15 19.37 630 TFT_no_lag 30667
# We'll print this out as well, so you can see the dtype and how it's ordered~
>>> df.cpd_range.cat.as_ordered()
0 >55
1 50-55
2 45-50
3 40-45
4 35-40
5 30-35
6 25-30
7 20-25
8 15-20
9 10-15
Name: cpd_range, dtype: category
Categories (12, object): ['0-5' < '05-10' < '10-15' < '15-20' ... '40-45' < '45-50' < '50-55' < '>55']
CodePudding user response:
If possible test last integer in column cpd_range
use Series.str.extract
with \d
for match digits, $
for end of strings, convert to integers and filter in boolean indexing
if greater like 10
:
df = df[df['cpd_range'].str.extract(r'(\d )$', expand=False).astype(int).gt(10)]
print (df)
cpd_range accuracy n_size model frontend_id
0 >55 32.86 140 TFT_no_lag 30667
1 50-55 31.43 35 TFT_no_lag 30667
2 45-50 21.43 28 TFT_no_lag 30667
3 40-45 34.29 35 TFT_no_lag 30667
4 35-40 14.29 70 TFT_no_lag 30667
5 30-35 25.00 56 TFT_no_lag 30667
6 25-30 28.57 49 TFT_no_lag 30667
7 20-25 19.05 105 TFT_no_lag 30667
8 15-20 20.06 329 TFT_no_lag 30667
9 10-15 19.37 630 TFT_no_lag 30667