Replacing a string with NaN or 0

I have a data file that I'm cleaning, and the source uses '--' to indicate missing data. I ultimately need to have this data field be either an integer or float. But I am not sure how to remove the string.

I specified the types in a type_dict statement before importing the csv file. 6 of my 8 variables correctly came in as an integer or float. Of course, the two that are still objects are the ones I need to fix.

I've tried df = df.var.str.replace('--', '') and I've tried df.var.fillna(df.var.mode().values[0], inplace=True) (and I wonder whether I just need to change the '--' values to '0').

My presumption is that if I can empty those cells in some fashion, I can define the variable as an int/float.

I'm sure I'm missing something really simple, have walked away and come back, but am just not figuring it out.

CodePudding user response：

Try something like this, cleaning the input before it goes into pandas:

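from io import StringIO

import pandas as pd

# Read the raw text and swap the '--' placeholder for '0' before parsing.
# Note: this replaces every '--' in the file, not only whole-cell values.
with open('data.txt', 'r') as file:
    data = StringIO(file.read().replace('--', '0'))

# Parse the cleaned text as if it were a file.
df = pd.read_csv(data)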
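
If NaN is preferable to 0, another option is to let pandas treat '--' as a missing-value marker while it parses, which avoids rewriting the raw text. This is only a minimal sketch: the inline CSV and the column names (id, temp, qty) are made up to stand in for the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for the real file; '--' marks missing values.
csv_text = "id,temp,qty\n1,20.5,3\n2,--,4\n3,22.1,--\n"

# na_values tells read_csv to parse '--' as NaN, so the affected
# columns are read as numeric instead of as strings.
df = pd.read_csv(StringIO(csv_text), na_values=["--"])

print(df.dtypes)
```

A column that contains NaN comes in as float, even if the remaining values are whole numbers, because NaN is a float value.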

CodePudding user response：

OK, we figured out two options to make this work:

solution 1 (needs import numpy as np): df = df.replace(r'^--$', np.nan, regex=True)

solution 2 (a simplified version of #1, matching cells whose whole value is '--' without a regex): df = df.replace(r'--', np.nan)

Both produced empty cells, as expected, when I exported the DataFrame to a csv and opened it in a spreadsheet. When I reimported that intermediate file, the columns came in as floats instead of strings, as expected.
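
The round trip through an intermediate file may not be needed: after the replace, the columns can probably be converted in memory. A rough sketch, assuming a small made-up frame (the column names temp and qty are hypothetical) whose values were read as strings:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with made-up columns; both hold strings (not numbers).
df = pd.DataFrame({"temp": ["20.5", "--", "22.1"], "qty": ["3", "4", "--"]})

# Option A: swap the placeholder for NaN, then cast explicitly to float.
cleaned = df.replace("--", np.nan).astype(float)

# Option B: coerce anything non-numeric (including '--') to NaN in one step.
coerced = df.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(coerced.dtypes)
```

Option B is more forgiving, since errors="coerce" turns any unparseable value into NaN, not just '--'. That is convenient, but it can also hide other bad values in the data.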