I am trying to clean a DataFrame and have hit an annoying snag. When I scrape an HTML wiki table, the values keep their reference numbers/letters, and I am trying to remove them.
This is what that part of the DataFrame currently looks like:
Output:
[1.] 19.5million[1][b]
[2.] 16million[5][6][7][d]
[3.] 13.2million[11][e]
This is what I would like for it to look like:
Output:
[1.] 19.5
[2.] 16
[3.] 13.2
I have tried str.replace and str.strip, but I always end up either removing only the "million" or nothing at all. I feel like I'm missing the right characters to get the result I want.
CodePudding user response:
If there is always a "million" after the number you want, you can use str.split('million')[0] on each string. This gives you the desired output.
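In pandas, the element-wise form is `.str.split(...).str[0]`. Indexing the split result with a plain `[0]` would instead look up the row labelled 0 on a default integer index. A minimal sketch, assuming the column holds plain strings like those in the question; the column name `population` and the sample frame are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data mirroring the question; "population" is an assumed column name.
df = pd.DataFrame(
    {"population": ["19.5million[1][b]", "16million[5][6][7][d]", "13.2million[11][e]"]}
)

# Split each string on "million" and keep the part before it, element-wise.
df["population"] = df["population"].str.split("million").str[0]

print(df["population"].tolist())
```

The values are still strings at this point; converting them with `astype(float)` is a separate step.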
CodePudding user response:
You can parse all the lines using the following regex:
import re
pattern = re.compile(r"(\[\d+\.\]) ([\d\.]+)")
line = "[1.] 19.5million[1][b]"
" ".join(pattern.findall(line)[0])
OUTPUT
'[1.] 19.5'
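To run the same idea over every line rather than a single one, here is a minimal sketch using only the standard `re` module. It assumes each line starts with a "[n.]" label followed by a space, as in the question; the list `lines` and the fallback of keeping non-matching lines unchanged are illustrative choices, not part of the original answer:

```python
import re

# A "[n.]" label, a space, then a run of digits and dots.
pattern = re.compile(r"(\[\d+\.\]) ([\d.]+)")

# Sample lines copied from the question.
lines = [
    "[1.] 19.5million[1][b]",
    "[2.] 16million[5][6][7][d]",
    "[3.] 13.2million[11][e]",
]

cleaned = []
for line in lines:
    match = pattern.search(line)
    # Keep the line unchanged if it does not fit the expected shape (assumed fallback).
    cleaned.append(" ".join(match.groups()) if match else line)

print(cleaned)
```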
CodePudding user response:
Try this:
In [20]: testdata
Out[20]:
['[1.] 19.5million[1][b]',
'[2.] 16million[5][6][7][d]',
'[3.] 13.2million[11][e]']
In [21]: for row in testdata:
    ...:     print(row.split('million')[0])
    ...:
[1.] 19.5
[2.] 16
[3.] 13.2
CodePudding user response:
You can use a regex and apply it to your pandas Series with the string method str.extract:
pattern = r"^(\d+\.?\d*)"
df["foo"] = df["foo"].str.extract(pattern)
Explanation of the regex:
- ^ match the start of the string
- ( start capture group
- \d+ match at least one digit
- \.? optionally match a literal . (the backslash escapes it, since . is a special character in regex)
- \d* match any number of digits
- ) end capture group
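As a worked example of the pattern described above, here is a minimal sketch that assumes each cell begins with the number (for example "19.5million[1][b]", without a leading row label). The column name `foo` follows the answer; the sample frame and the conversion to float are added for illustration:

```python
import pandas as pd

# Hypothetical sample data; assumes each value starts with the number itself.
df = pd.DataFrame(
    {"foo": ["19.5million[1][b]", "16million[5][6][7][d]", "13.2million[11][e]"]}
)

pattern = r"^(\d+\.?\d*)"

# expand=False returns a Series (not a one-column DataFrame) for a single capture group.
# Rows that do not match become NaN, which astype(float) accepts.
df["foo"] = df["foo"].str.extract(pattern, expand=False).astype(float)

print(df["foo"].tolist())
```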