I have a simple CSV file named input.csv
as follows:
name,money
Dan,200
Jimmy,xd
Alice,15
Deborah,30
I want to write a Python script that sanitizes the data in the money column:
every value that contains non-numerical characters needs to be replaced with 0.
This is my attempt so far:
import pandas as pd
df = pd.read_csv(
    "./input.csv",
    sep = ","
)
# this line is the problem: it doesn't update on a row by row basis, it updates all rows
df['money'] = df['money'].replace(to_replace=r'[^0-9]', value=0, regex=True)
df.to_csv("./output.csv", index = False)
The problem is that when the script runs, because the invalid money value xd
exists in one of the rows, it changes ALL money values to 0 for ALL rows.
I want it to ONLY change the money value for the second data row (Jimmy), which has the invalid value.
This is what it currently gives:
name,money
Dan,0
Jimmy,0
Alice,0
Deborah,0
But what I need it to give is this:
name,money
Dan,200
Jimmy,0
Alice,15
Deborah,30
What is the problem?
CodePudding user response:
You can use:
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)
The above assumes all valid values are integers. You can leave off the .astype(int)
if you want float values.
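For reference, here is a minimal end-to-end sketch of this approach, using the file names from the question:
import pandas as pd

df = pd.read_csv("./input.csv")

# to_numeric with errors='coerce' turns any value that doesn't parse as a
# number (e.g. "xd") into NaN; fillna(0) then replaces the NaN with 0 and
# astype(int) restores an integer column
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)

df.to_csv("./output.csv", index=False)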
Another option would be to use a converter function in the read_csv
call. Again, this assumes integers. You can use float(x)
in place of int(x)
if you expect float money values:
def convert_to_int(x):
    try:
        return int(x)
    except ValueError:
        return 0

df = pd.read_csv(
    'input.csv',
    converters={'money': convert_to_int}
)
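For the float case mentioned above, the converter is the same idea with float(x); a sketch (convert_to_float is just an illustrative name):
def convert_to_float(x):
    try:
        return float(x)
    except ValueError:
        return 0.0

df = pd.read_csv(
    'input.csv',
    converters={'money': convert_to_float}
)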
CodePudding user response:
A list comprehension could also work for this. Because the bad value forces read_csv to keep every entry in the "money" column as a string, check whether each entry is numeric rather than checking its Python type (given the "money" column has no decimals):
df.money = [int(x) if str(x).isdigit() else 0 for x in df.money]
If you are dealing with decimals, then something like:
df.money = [float(x) if str(x).replace('.', '', 1).isdigit() else 0 for x in df.money]
... will work (note that this simple check doesn't accept negative amounts; see the sketch below for a more robust variant). Just know that pandas will convert the entire "money" column to float (decimals).
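If you need to handle negatives or other numeric formats that the isdigit check misses, a small try/except helper inside the comprehension works too. A sketch, assuming the DataFrame from the question is already loaded (to_number is just an illustrative name):
def to_number(x):
    # float() accepts values like "-30" or "15.5"; anything that
    # doesn't parse falls back to 0
    try:
        return float(x)
    except ValueError:
        return 0

df.money = [to_number(x) for x in df.money]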