In my pandas DataFrame, there is a column labeled Android Ver. I need to strip the trailing non-numeric characters from every value (i.e. the words " and up") so that only the version number remains. If the version has more than two parts (e.g. "x.y.z"), keep only the first two (e.g. "x.y"). For example, "4.1 and up" should become "4.1", "4.5.6 and up" should become "4.5", and "5.6.7" should become "5.6".
It currently looks like this:
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
4 4.4 and up
5 2.3 and up
...
10833 2.2 and up
10834 4.1 and up
10835 4.0 and up
10836 4.1 and up
10837 4.1 and up
But I need it to look like this:
0 4.0
1 4.0
2 4.0
4 4.4
5 2.3
...
10833 2.2
10834 4.1
10835 4.0
10836 4.1
10837 4.1
My code right now is this:
Googleapps_df.replace(r'[.\d a-zA-Z%]', '', regex=True, inplace=True)
But it is not working at all.
What would be the best way to go about doing this?
CodePudding user response:
Try str.extract:
df['Version'] = df['Android Ver'].str.extract(r'^(\d+\.\d+)')
print(df)
# Output
Android Ver Version
0 4.0.3 and up 4.0
1 4.0.3 and up 4.0
2 4.0.3 and up 4.0
4 4.4 and up 4.4
5 2.3 and up 2.3
10833 2.2 and up 2.2
10834 4.1 and up 4.1
10835 4.0 and up 4.0
10836 4.1 and up 4.1
10837 4.1 and up 4.1
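For anyone who wants to try this without the original data, here is a minimal, self-contained sketch of the same str.extract idea. The sample values are made up to mirror the question, and expand=False is an optional addition (not part of the answer above) that makes the call return a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample mirroring the values shown in the question.
df = pd.DataFrame(
    {"Android Ver": ["4.0.3 and up", "4.4 and up", "5.6.7", "2.3 and up"]}
)

# ^        anchor at the start of the string
# (\d+     one or more digits (major version)
# \.\d+)   a literal dot followed by one or more digits (minor version)
# expand=False returns a Series instead of a one-column DataFrame.
df["Version"] = df["Android Ver"].str.extract(r"^(\d+\.\d+)", expand=False)

print(df["Version"].tolist())
```

The extracted values are still strings. If a numeric dtype is needed, pd.to_numeric(df["Version"]) should convert them, but note that a version such as "4.10" would then become the float 4.1.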
CodePudding user response:
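A regex replace approach might be:
df["Version"] = df["Android Ver"].str.replace(r' .*', '', regex=True)
This strips everything from the first space to the end of the string, leaving only the version number. Note that regex=True has to be passed explicitly on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default. Also note that a three-part version such as "4.0.3" is left as "4.0.3", so this alone does not trim the value down to "x.y".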
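If three-part versions should also be cut down to "x.y", one possible variation of the replace approach is to capture the leading "x.y" and replace the whole string with that capture. This is only a sketch under assumptions: the sample values are made up, and "Varies with device" is a hypothetical non-version entry included to show what happens when nothing matches.

```python
import pandas as pd

# Hypothetical sample; the last value stands in for any non-version entry.
df = pd.DataFrame(
    {"Android Ver": ["4.0.3 and up", "4.1 and up", "5.6.7", "Varies with device"]}
)

# Capture the leading "x.y", let .*$ consume the rest of the string,
# then replace the entire match with the captured group (\1).
df["Version"] = df["Android Ver"].str.replace(
    r"^(\d+\.\d+).*$", r"\1", regex=True
)

print(df["Version"].tolist())
```

One difference from str.extract worth keeping in mind: values that do not start with a version number are left unchanged by str.replace, whereas str.extract would turn them into NaN. Which behaviour is preferable depends on how such rows should be handled later.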