I have a dataset of 100 million rows that I need to analyze. I use this function to read the file:
csv2020 = pd.read_csv('filename.txt',
                      sep="\t",
                      error_bad_lines=False,
                      usecols=['field1', 'field2', 'field3', 'field4'],
                      dtype={'field1': int, 'field2': float, 'field3': float, 'field4': float})
But I'm getting an error saying that a value in one of the lines can't be converted to a float:
ValueError: could not convert string to float: 'ORCH'
I would like to omit any lines where this error occurs, but I don't know how to do that other than with the error_bad_lines argument. Help?
Thanks!
CodePudding user response:
Some of the columns you are trying to import as float contain strings (such as 'ORCH') and therefore cannot be converted.
Read the CSV first without the dtype argument and look at the resulting dataframe to see which values are not numeric.
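One way to do that inspection is sketched below. The in-memory sample is a made-up stand-in for 'filename.txt', and the names sample, raw, as_numbers, bad_mask and bad_rows are only illustrative. It reads every column as text so nothing can fail, then lists the rows holding a value that does not parse as a number.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for 'filename.txt'.
sample = (
    "field1\tfield2\tfield3\tfield4\n"
    "1\t0.5\t1.5\t2.5\n"
    "2\tORCH\t3.5\t4.5\n"
    "3\t1.0\t2.0\t3.0\n"
)

# Read every column as a string so that no conversion can raise.
raw = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    usecols=["field1", "field2", "field3", "field4"],
    dtype=str,
)

# Try to parse each column; values that are not numbers become NaN.
as_numbers = raw.apply(pd.to_numeric, errors="coerce")

# A cell is "bad" if it had a value but that value did not parse.
bad_mask = as_numbers.isna() & raw.notna()

# Rows containing at least one bad cell.
bad_rows = raw[bad_mask.any(axis=1)]
print(bad_rows)
```

On a real file of this size it may be more practical to inspect only a sample of rows, for example with the nrows argument of read_csv.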
CodePudding user response:
The error_bad_lines option is not meant for this purpose; it only applies to lines with an incorrect number of fields. Read your file without the dtype option and do the conversion afterwards using pandas.to_numeric with the errors='coerce' option:
df = pd.read_csv(…)
df['field1'] = pd.to_numeric(df['field1'], errors='coerce')
df['field2'] = …
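A fuller sketch of this approach follows, with the extra step of dropping the rows that failed to convert, since the question asks to omit them. The sample data is invented for illustration and stands in for 'filename.txt'; the names cols and sample are assumptions, not part of the original code.

```python
import io
import pandas as pd

cols = ["field1", "field2", "field3", "field4"]

# Hypothetical tab-separated sample standing in for 'filename.txt'.
sample = (
    "field1\tfield2\tfield3\tfield4\n"
    "1\t0.5\t1.5\t2.5\n"
    "2\tORCH\t3.5\t4.5\n"
    "3\t1.0\t2.0\t3.0\n"
)

# Read without dtype so that pandas does not raise on non-numeric values.
df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=cols)

# Convert each column; anything that is not a number becomes NaN.
for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Omit the rows where any of the columns could not be converted.
df = df.dropna(subset=cols)

# field1 was meant to be an integer column; this is safe once NaNs are gone.
df["field1"] = df["field1"].astype(int)

print(df)
```

Note that dropna also removes rows whose values were missing in the file to begin with, not only rows with unparseable strings. For a file with 100 million rows, the same steps could be applied chunk by chunk using the chunksize argument of read_csv, which may help keep memory use down.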