I have a data file that I'm cleaning, and the source uses '--' to indicate missing data. I ultimately need to have this data field be either an integer or float. But I am not sure how to remove the string.
I specified the types in a type_dict statement before importing the csv file. 6 of my 8 variables correctly came in as an integer or float. Of course, the two that are still objects are the ones I need to fix.
I've tried df = df.var.str.replace('--', ''), and I've also tried df.var.fillna(df.var.mode().values[0], inplace=True) (and I wonder if I just need to change the '--' values to '0').
My presumption is that if I can empty those cells in some fashion, I can define the variable as an int/float.
I'm sure I'm missing something really simple, have walked away and come back, but am just not figuring it out.
CodePudding user response:
Try something like this, cleaning the input before it goes into pandas:
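from io import StringIO
import pandas as pd
# Read the raw text and swap the '--' placeholder for '0' before pandas parses it.
# Note: this replaces every '--' in the file, not only whole-cell values.
with open('data.txt', 'r') as file:
    data = StringIO(file.read().replace('--', '0'))
df = pd.read_csv(data)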
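If a real 0 would be misleading for missing data, one alternative is to let pandas treat '--' as missing while it parses, using the na_values argument of pd.read_csv. This is only a minimal sketch: the inline sample text and the column names (id, temp, qty) are made up for illustration and stand in for the real file.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample standing in for data.txt
raw = "id,temp,qty\n1,20.5,3\n2,--,4\n3,22.1,--\n"

# Declare '--' as a missing-value marker so the columns parse as numbers
df = pd.read_csv(StringIO(raw), na_values=["--"])

print(df.dtypes)
print(df)
```

Columns that contain a missing value normally come back as floats, because NaN is a float value.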
CodePudding user response:
OK, we figured out two options to make this work:
Solution 1: df = df.replace(r'^--$', np.nan, regex=True)
Solution 2 (a simplified version of #1, which matches cells that are exactly '--'): df = df.replace('--', np.nan)
Both need import numpy as np.
Both gave the expected output of empty cells when I exported the csv into a spreadsheet. And then when I reimported that intermediate file, I had floats instead of strings as expected.
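The export and reimport step can probably be skipped by converting the columns in memory after the replacement. Here is a rough sketch of that idea, assuming the affected columns hold only numeric strings plus '--'. The frame and the column names a, b and b_int are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: both columns came in as strings because of '--'
df = pd.DataFrame({"a": ["1.5", "--", "3.0"], "b": ["10", "20", "--"]})

# Replace cells that are exactly '--' with NaN (as in solution 2)
df = df.replace("--", np.nan)

# Convert the remaining strings to numbers; missing values stay missing
df["a"] = pd.to_numeric(df["a"])
df["b"] = pd.to_numeric(df["b"])

# Optional: the nullable Int64 dtype keeps whole numbers alongside missing values
df["b_int"] = df["b"].astype("Int64")

print(df.dtypes)
```

If a column might also contain other non-numeric junk, pd.to_numeric(..., errors='coerce') turns anything unparseable into NaN instead of raising an error.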