Home > Enterprise >  How can I trace a data error using pandas dataframe?
How can I trace a data error using pandas dataframe?

Time:09-28

I'm importing data from a tsv file into a pandas dataframe.

Input = pd.read_csv(input_file, delimiter='\t', header=0, encoding='utf-8')

I want to run some checks on the input data, e.g. check for any additional values outside of the predefined columns, but I cannot find the option to do this with dataframe. Is this at all possible?

Column 1  Column 2   Column 3
dog1      dog2       dog3
kitten1   kitten2    kitten3
horse1    horse2     horse3
mouse1    mouse2     mouse3   **mouse4**

I'm trying to find a function that would report "mouse4" as an error in the original data. I have tried using pd.shape, but this only works for column headers. When trying to display the dataframe data having an error in the data, nothing is being output.

CodePudding user response:

pd.read_csv yields a ParserError you can catch. The cause and position of the error is reported in the error message Expected 3 fields in line 5, saw 4. header=0 is the default behaviour and redundant.

This approach is to read the header and select those columns with the usecols parameter if an error occurs.

import pandas as pd

try:
    df = pd.read_csv('data.tsv', sep='\t')
except pd.errors.ParserError as e:
    print(e)
    header = pd.read_csv('data.tsv', sep='\t', nrows=0)
    df = pd.read_csv('data.tsv', sep='\t', usecols=header.columns)
df

Output

Error tokenizing data. C error: Expected 3 fields in line 5, saw 4

  Column 1 Column 2 Column 3
0     dog1     dog2     dog3
1  kitten1  kitten2  kitten3
2   horse1   horse2   horse3
3   mouse1   mouse2   mouse3

To create the datafile used in this solution

# %%writefile data.tsv
Column 1    Column 2    Column 3
dog1    dog2    dog3
kitten1 kitten2 kitten3
horse1  horse2  horse3
mouse1  mouse2  mouse3  mouse4
  • Related