I am analyzing data from an industrial control system that records sensor data at 1-second intervals to a CSV file. We do analysis weekly, so the CSV file is large (49 columns x ~600k rows). Sometimes (probably due to power outages in the field), the CSV file has corrupt rows. We get rows where the number of columns is right but the data type is wrong. Here is a dummy DF that illustrates the issue in smaller scale:
mydict = {'colFloat': [1.5, 2.3, 'TRUE', 'FALSE', 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 3.2, 1.1, 'TRUE', 'TRUE'], 'colInt': [2, 2, 3.7, 9.9, 1, 4]}
df = pd.DataFrame(data = mydict)
In this example, 'colFloat' should be of dtype float64, 'colBool' should be of dtype bool, 'colInt' should be of dtype int64, but of course the two buggy rows in the middle cause all three columns to be of type object. Later, when I try and do groupbys and integrate sensor readings (for example from 'colFloat'), I get an error because the non-float data in the column cause an issue.
I'd like a method that I can run on the df that will remove the faulty rows and notify me that they exist. The resulting df in my example above would look like this df:
mydictclean = {'colFloat': [1.5, 2.3, 3.5, 7.8], 'colBool': ['FALSE', 'TRUE', 'TRUE', 'TRUE'], 'colInt': [2, 2, 1, 4]}
dfclean = pd.DataFrame(data = mydictclean)
I tried specifying the dtypes using the dtype parameter in the read_csv call, but that just throws an error when the function encounters the bad rows. Thank you for your help!
CodePudding user response:
You can use a set of converter, then drop the rows with NaNs:
types = {'colFloat': 'numeric', 'colInt': 'numeric', 'colBool': 'boolean'}
converter = {'numeric': lambda s: pd.to_numeric(s, errors='coerce'),
'boolean': lambda s: s.str.upper().map({'TRUE': True, 'FALSE': False})
}
dfclean = pd.DataFrame(index=df.index)
for c, t in types.items():
dfclean[c] = converter[t](df[c])
dfclean = dfclean.dropna()
print(dfclean)
output:
colFloat colInt colBool
0 1.5 2.0 False
1 2.3 2.0 True
4 3.5 1.0 True
5 7.8 4.0 True
NB. regarding the int/float, it is not always easy to maintain this with pandas. You can use convert_dtypes
to perform automated (safe) float -> int conversion in case all values have no decimal part.
CodePudding user response:
I would try to catch all type errors AND CONVERT at the same time, in all columns.
See How do I check if a string is a number (float)? for the first function:
def is_number(n):
is_number = True
try:
num = float(n)
# check for "nan" floats
is_number = num == num # or use `math.isnan(num)`
except ValueError:
is_number = False
return is_number
Then your own:
def check_and_convert_int(x):
if x.is_integer():
return int(x)
else:
return np.nan
def check_and_convert_bool(x):
if x in [True, 'True', 'TRUE']:
return True
elif x in [False, 'False', 'FALSE']:
return False
else:
return np.nan
def check_and_convert_float(x):
if is_number(x):
return float(x)
else:
return np.nan
Then I would do :
df['int_col_clean'] = df['int_col'].apply(lambda x: check_and_convert_int(x))
df['bool_col_clean'] = df['bool_col'].apply(lambda x: check_and_convert_bool(x))
df['float_col_clean'] = df['float_col'].apply(lambda x: check_and_convert_float(x))
Inspect the columns that may be dropped using :
df[df.isna().any(axis=1)]
If everything is fine, drop them :
df = df.dropna(subset = ['int_col_clean', 'bool_col_clean', 'float_col_clean'])
CodePudding user response:
I had to change the boolean values because the way they are right now they are strings:
mydict = {'colFloat': [1.5, 2.3,'TRUE', 'FALSE', 3.5, 7.8], 'colBool': [False, True, 3.2, 1.1, True, True], 'colInt': [2, 2, 3.7, 9.9, 1, 4]}
df = pd.DataFrame(data = mydict)
Then you can filter based on the instance type:
df = df[df['colFloat'].apply(isinstance, args=[float])]
df = df[df['colBool'].apply(isinstance, args=[bool])]
df['colInt'] = df['colInt'].astype(int)
df = df[df['colInt'].apply(isinstance, args=[int])]
df
Output:
colFloat colBool colInt
0 1.5 False 2
1 2.3 True 2
4 3.5 True 1
5 7.8 True 4
CodePudding user response:
try this:
df.loc[(df.colFloat.apply(lambda x: type(x) == float)) ]
but your boolcol is string not bool.