I'm trying to remove unnecessary values in the strings in my dataframe so I can make them integers and do some math logic on them. The commas, and upward arrows are removed as desired but I can't seem to remove $ from my dataframe. But I have removed it successfully from row two which is a very odd anomoly.
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/'
, headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)[0]
del df['Director'],df['Company']
print(df.head(5))
df['Value'] = df['Value'].replace(['↑','$',','],'',regex=True)
print("############################")
print(df.head(5))
Results: Unchanged df:
Date Code Value
0 27/4/2022 ESR ↑$1,075
1 27/4/2022 LNY ↑126,750
2 26/4/2022 FGX ↑$13,363
3 26/4/2022 CDM ↑$25,110
4 26/4/2022 TEK ↑$35,384
############################
Signs removed but failed to remove all dollar signs:
Date Code Value
0 27/4/2022 ESR $1075
1 27/4/2022 LNY 126750
2 26/4/2022 FGX $13363
3 26/4/2022 CDM $25110
4 26/4/2022 TEK $35384
Process finished with exit code 0
CodePudding user response:
You could replace
the commas that occur within the number and then extract the whole number:
df["Value"] = df['Value'].str.replace(",","",regex=True).str.extract("(\d )").astype(int)
>>> df
Date Code Value
0 27/4/2022 ESR 1075
1 27/4/2022 LNY 126750
2 26/4/2022 FGX 13363
3 26/4/2022 CDM 25110
4 26/4/2022 TEK 35384
Alternatively, replace all the unnecessary characters:
df["Value"] = df['Value'].str.replace('↑|↓|\$|,|', '', regex=True).astype(int)
CodePudding user response:
$
sign has special meaning in regex, you need to escape it
df['Value'] = df['Value'].replace(['↑','\$',','], '', regex=True)
print(df)
Date Code Value
0 27/4/2022 ESR 1075
1 27/4/2022 LNY 126750
2 26/4/2022 FGX 13363
3 26/4/2022 CDM 25110
4 26/4/2022 TEK 35384
CodePudding user response:
As you want to work with numbers, the ideal is probably to get rid of everything but the numbers and convert to numeric.
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\D', '', regex=True))
If you have floats, use r'[^\d\.]'
in place of r'\D'
.
output:
Date Code Value
0 27/4/2022 ESR 1075
1 27/4/2022 LNY 126750
2 26/4/2022 FGX 13363
3 26/4/2022 CDM 25110
4 26/4/2022 TEK 35384
CodePudding user response:
Use:
df['Value'] = df['Value'].str.replace('↑|,', '').str.replace('$', '')
In [82]: df.head()
Out[82]:
Date Code Value
0 27/4/2022 ESR 1075
1 27/4/2022 LNY 126750
2 26/4/2022 FGX 13363
3 26/4/2022 CDM 25110
4 26/4/2022 TEK 35384