Home > Software design >  Failing to remove all $ in strings in a dataframe
Failing to remove all $ in strings in a dataframe

Time:04-29

I'm trying to remove unnecessary values in the strings in my dataframe so I can make them integers and do some math logic on them. The commas, and upward arrows are removed as desired but I can't seem to remove $ from my dataframe. But I have removed it successfully from row two which is a very odd anomoly.

import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/'
              , headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)[0]
del df['Director'],df['Company']

print(df.head(5))

df['Value'] = df['Value'].replace(['↑','$',','],'',regex=True)

print("############################")
print(df.head(5))

Results: Unchanged df:

        Date Code     Value
0  27/4/2022  ESR   ↑$1,075
1  27/4/2022  LNY  ↑126,750
2  26/4/2022  FGX  ↑$13,363
3  26/4/2022  CDM  ↑$25,110
4  26/4/2022  TEK  ↑$35,384
    ############################

Signs removed but failed to remove all dollar signs:

        Date Code   Value
0  27/4/2022  ESR   $1075
1  27/4/2022  LNY  126750
2  26/4/2022  FGX  $13363
3  26/4/2022  CDM  $25110
4  26/4/2022  TEK  $35384

Process finished with exit code 0

CodePudding user response:

You could replace the commas that occur within the number and then extract the whole number:

df["Value"] = df['Value'].str.replace(",","",regex=True).str.extract("(\d )").astype(int)

>>> df
        Date Code   Value
0  27/4/2022  ESR    1075
1  27/4/2022  LNY  126750
2  26/4/2022  FGX   13363
3  26/4/2022  CDM   25110
4  26/4/2022  TEK   35384

Alternatively, replace all the unnecessary characters:

df["Value"] = df['Value'].str.replace('↑|↓|\$|,|', '', regex=True).astype(int)

CodePudding user response:

$ sign has special meaning in regex, you need to escape it

df['Value'] = df['Value'].replace(['↑','\$',','], '', regex=True)
print(df)

        Date Code   Value
0  27/4/2022  ESR    1075
1  27/4/2022  LNY  126750
2  26/4/2022  FGX   13363
3  26/4/2022  CDM   25110
4  26/4/2022  TEK   35384

CodePudding user response:

As you want to work with numbers, the ideal is probably to get rid of everything but the numbers and convert to numeric.

df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\D', '', regex=True))

If you have floats, use r'[^\d\.]' in place of r'\D'. output:

        Date Code   Value
0  27/4/2022  ESR    1075
1  27/4/2022  LNY  126750
2  26/4/2022  FGX   13363
3  26/4/2022  CDM   25110
4  26/4/2022  TEK   35384

CodePudding user response:

Use:

df['Value'] = df['Value'].str.replace('↑|,', '').str.replace('$', '')

In [82]: df.head()
Out[82]: 
        Date Code    Value
0  27/4/2022  ESR    1075
1  27/4/2022  LNY  126750
2  26/4/2022  FGX   13363
3  26/4/2022  CDM   25110
4  26/4/2022  TEK   35384
  • Related