I am having a csv file , which contains a range of numbers in a column. I also have list of valid range. I need identify the possible range for the matching numbers in csv.
valid_range = ["0-1", "1-2", "0-5", "5-10", "10-15", "15-20", ">20","mixed"]
csv data :
What I have:
What I need:
Anything greater than 20 will have ">20" and mixed will have the same.
df = {'Gender': {0: 'male', 1: 'male', 2: 'male', 3: 'female', 4: 'male', 5: 'male', 6: 'male', 7: 'male', 8: 'male', 9: 'mixed', 10: 'female'}, 'Range': {0: 'mixed', 1: 'mixed', 2: '5-6', 3: '4-5', 4: '5-6', 5: '12-14', 6: '9-10', 7: 'mixed', 8: '17-18', 9: 'mixed', 10: '>20'}}
CodePudding user response:
If you don't have overlap, you can rely on the upper bound.
Then use of a bit or regex magic and pandas.cut
:
# non overlapping ranges
valid_range = ["0-5", "5-10", "10-15", "15-20", ">20", "mixed"]
# define bins (could also be coded as list directly!)
bins = (pd.Series(valid_range[:-2]).str.extract('(\d )$', expand=False)
.astype(int).to_list()
)
bins = [0] bins [float('inf')]
# [0, 5, 10, 15, 20, inf]
df['Valid_Range'] = ( # replace ">x" with upper bound 1
pd.cut(df['Range'].str.replace('>.*', f'{bins[-1] 1}', regex=True)
# extract trailing digits
.str.extract('(\d )$', expand=False).astype(float),
bins=bins, labels=valid_range[:-1])
.values.add_categories(valid_range[-1]) # cut output is Categorical
.fillna(valid_range[-1]) # fill nans that correspond to initial "mixed"
)
output:
Gender Range Valid_Range
0 male mixed mixed
1 male mixed mixed
2 male 5-6 5-10
3 female 4-5 0-5
4 male 5-6 5-10
5 male 12-14 10-15
6 male 9-10 5-10
7 male mixed mixed
8 male 17-18 15-20
9 mixed mixed mixed
10 female >20 >20