array([10.703, 63.756, 18.48, 5.313, 14.553, 6.853, 7.7886, 7.3304, 6.083,
9.163, 45.276, 100.716, 4.543, 8.2467, 15.3384, 9.7598, 14.3798,
12.0698, 10.9148, 7.623, 3.8423, 8.4893, 6.9223, 14.0718, 47.0393,
20.79, 15.939, 7.7, 13.629, "'0.7469", 1.1165, "'0.5352",
"'0.3889", "'0.1001", "'0.3388", 1.7402, "'0.5313", 1.3052,
"'0.7893", "'0.5583", 4.8895, 2.3524, 7.6538, "'0.0809", "'0.3504",
5.8366, 1.0126, 3.7884, 7.5576, 3.0107, "'0.0347", 7.6384, 1.3552,
4.2928, 1.7364, 1.0857, "'0.7123", 3.1108, "'0.5968", "'0.5544",
1.6093, "'0.1887", "'0.6661", "'0.6391", 1.3514, "'0.9548"])
Excel automatically adds " ' " to the values below 1. I would like to convert the column into float type,
df['vat_amount']=df['vat_amount'].astype(float)
returns
ValueError: could not convert string to float: "'0.5352"
To erase the " ' " sign before 0, I tried:
df.vat_amount.replace("'0.",",0.")
It returns the same value: '0.5352
How could I properly get rid of the ' sign when the numbers are < 1 ?
All the columns are object type.
As this process would be automated, please suggest only Python solutions, I do not want to go to Excel and replace these values every x period of days.
Could anyone provide the proposed solution via dbfiddle?
CodePudding user response:
The replace function has to be applied for each entry and then written back.
This should work:
df.vat_amount = df.vat_amount.apply(lambda x: x.replace(f"'0.","0."))
Edit: If not all entries in df.vat_amount
are string try this:
df.vat_amount = df.vat_amount.apply(lambda x: x.replace(f"'0.","0.") if type(x) is str else x)
CodePudding user response:
The reason it doesnt work is because replace doesn't work as with a standard string in python. In a pandas dataframe, replace it matches the whole value and not just part of it as I understand it.
I would suggest using a regular expression with the pandas replace function to remove the '
import pandas as pd
df = pd.DataFrame({'vat_amount': [10.703, 63.756, 18.48, 5.313, 14.553, 6.853, 7.7886, 7.3304, 6.083,
9.163, 45.276, 100.716, 4.543, 8.2467, 15.3384, 9.7598, 14.3798,
12.0698, 10.9148, 7.623, 3.8423, 8.4893, 6.9223, 14.0718, 47.0393,
20.79, 15.939, 7.7, 13.629, "'0.7469", 1.1165, "'0.5352",
"'0.3889", "'0.1001", "'0.3388", 1.7402, "'0.5313", 1.3052,
"'0.7893", "'0.5583", 4.8895, 2.3524, 7.6538, "'0.0809", "'0.3504",
5.8366, 1.0126, 3.7884, 7.5576, 3.0107, "'0.0347", 7.6384, 1.3552,
4.2928, 1.7364, 1.0857, "'0.7123", 3.1108, "'0.5968", "'0.5544",
1.6093, "'0.1887", "'0.6661", "'0.6391", 1.3514, "'0.9548"]})
df = df.replace(to_replace="(['])", value="", regex=True) #For whole dataframe
Or if you just want one column:
df.vat_amount = df.vat_amount.replace(to_replace="(['])", value="", regex=True)
This effectively removes the value that was giving you problem by searching only the string values, matching the ', and replacing it by nothing. It is a relatively simple one-line solution.
CodePudding user response:
I believe there is a better solution but temporarily you can use something like:
a=([10.703, 63.756, 18.48, 5.313, 14.553, 6.853, 7.7886, 7.3304, 6.083,
9.163, 45.276, 100.716, 4.543, 8.2467, 15.3384, 9.7598, 14.3798,
12.0698, 10.9148, 7.623, 3.8423, 8.4893, 6.9223, 14.0718, 47.0393,
20.79, 15.939, 7.7, 13.629, "'0.7469", 1.1165, "'0.5352",
"'0.3889", "'0.1001", "'0.3388", 1.7402, "'0.5313", 1.3052,
"'0.7893", "'0.5583", 4.8895, 2.3524, 7.6538, "'0.0809", "'0.3504",
5.8366, 1.0126, 3.7884, 7.5576, 3.0107, "'0.0347", 7.6384, 1.3552,
4.2928, 1.7364, 1.0857, "'0.7123", 3.1108, "'0.5968", "'0.5544",
1.6093, "'0.1887", "'0.6661", "'0.6391", 1.3514, "'0.9548"])
import pandas as pd
df=pd.DataFrame(a,columns=['value'])
for i,j in df.iterrows():
try:
if j['value'].find("'") != -1:
df.loc[i, 'value'] = j['value'].replace("'","")
else:
pass
except:
pass
df['value']=df['value'].astype(float)
simply, check the value for each row, if there is a quote replace it, do nothing otherwise.