Removing Unwanted Characters after numbers in pandas DataFrame

I am trying to clean a DataFrame and have hit an annoying snag. When I scrape an HTML table from a wiki page, the reference markers (the numbers and letters in square brackets) come along with the values, and I am trying to remove them.

This is what that part of the DataFrame currently looks like:

Output:
[1.] 19.5million[1][b]
[2.] 16million[5][6][7][d]
[3.] 13.2million[11][e]

This is what I would like for it to look like:

Output:
[1.] 19.5
[2.] 16
[3.] 13.2

I have tried str.replace and str.strip, but I always end up either removing only the word "million" or removing nothing at all. I suspect I am not passing the right characters or pattern to get the result I want.

CodePudding user response：

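If the word "million" always follows the number you want, you can split on it and keep the first piece. For a plain Python string that is text.split('million')[0]; for a pandas Series it is series.str.split('million').str[0]. Either one gives you the desired output.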
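As a minimal sketch of the split approach on a pandas column: the DataFrame below and the column names population and population_clean are made up for illustration, and the values are assumed to start with the number, as in the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the values shown in the question.
df = pd.DataFrame(
    {"population": ["19.5million[1][b]", "16million[5][6][7][d]", "13.2million[11][e]"]}
)

# Split each string on "million", keep the part before it, and convert to float.
df["population_clean"] = (
    df["population"].str.split("million").str[0].astype(float)
)

print(df["population_clean"].tolist())  # [19.5, 16.0, 13.2]
```

Note that this only works when every row contains "million"; a row without it would be kept whole and the float conversion would fail.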

CodePudding user response：

You can parse all the lines using the following regex:

import re
pattern = re.compile(r"(\[\d+\.\]) ([\d.]+)")

line = "[1.] 19.5million[1][b]"

" ".join(pattern.findall(line)[0])

OUTPUT

'[1.] 19.5'
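To run the same idea over several lines, something like the following could work. It is only a sketch: the list name lines is hypothetical, and it assumes every line starts with a bracketed index such as "[1.]" followed by a space and the number. Lines that do not match are left unchanged.

```python
import re

# Bracketed index such as "[1.]", a space, then the number (digits and dots).
pattern = re.compile(r"(\[\d+\.\]) ([\d.]+)")

# Hypothetical input, copied from the question.
lines = [
    "[1.] 19.5million[1][b]",
    "[2.] 16million[5][6][7][d]",
    "[3.] 13.2million[11][e]",
]

cleaned = []
for line in lines:
    match = pattern.search(line)
    # Fall back to the original line if the pattern is not found.
    cleaned.append(" ".join(match.groups()) if match else line)

print(cleaned)  # ['[1.] 19.5', '[2.] 16', '[3.] 13.2']
```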

CodePudding user response：

Try this:

In [20]: testdata
Out[20]:
['[1.] 19.5million[1][b]',
 '[2.] 16million[5][6][7][d]',
 '[3.] 13.2million[11][e]']

In [21]: for row in testdata:
    ...:     print(row.split('million')[0])
    ...:
[1.] 19.5
[2.] 16
[3.] 13.2

CodePudding user response：

You can use a regex and apply it to your pandas Series with the string method str.extract. This assumes each cell starts with the number itself:

pattern = r"^(\d+\.?\d*)"

df["foo"] = df["foo"].str.extract(pattern)

Explanation of the regex

^ match the start of the string
( start capture group
\d+ match at least one digit
\.? optionally match a . (the \ is needed to escape it, as . is a special character in regex)
\d* match any number of digits
) end capture group
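Here is a small end-to-end sketch of that approach. The sample DataFrame is hypothetical and reuses the column name foo from above. Passing expand=False makes str.extract return a Series rather than a one-column DataFrame, and astype(float) turns the extracted strings into numbers.

```python
import pandas as pd

# Hypothetical sample data; each cell is assumed to start with the number.
df = pd.DataFrame(
    {"foo": ["19.5million[1][b]", "16million[5][6][7][d]", "13.2million[11][e]"]}
)

# One or more digits, an optional dot, then any further digits, anchored at the start.
pattern = r"^(\d+\.?\d*)"

# expand=False returns a Series because the pattern has a single capture group.
df["foo"] = df["foo"].str.extract(pattern, expand=False).astype(float)

print(df["foo"].tolist())  # [19.5, 16.0, 13.2]
```

Cells that do not start with a digit would come back as NaN, so it may be worth checking for missing values after the extraction.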