I'm importing data from a tsv file into a pandas dataframe.
Input = pd.read_csv(input_file, delimiter='\t', header=0, encoding='utf-8')
I want to run some checks on the input data, e.g. check for any additional values outside of the predefined columns, but I cannot find the option to do this with dataframe. Is this at all possible?
Column 1 Column 2 Column 3
dog1 dog2 dog3
kitten1 kitten2 kitten3
horse1 horse2 horse3
mouse1 mouse2 mouse3 **mouse4**
I'm trying to find a function that would report "mouse4" as an error in the original data. I have tried using pd.shape, but this only works for column headers. When trying to display the dataframe data having an error in the data, nothing is being output.
CodePudding user response:
pd.read_csv
yields a ParserError
you can catch. The cause and position of the error is reported in the error message Expected 3 fields in line 5, saw 4. header=0
is the default behaviour and redundant.
This approach is to read the header and select those columns with the usecols
parameter if an error occurs.
import pandas as pd
try:
df = pd.read_csv('data.tsv', sep='\t')
except pd.errors.ParserError as e:
print(e)
header = pd.read_csv('data.tsv', sep='\t', nrows=0)
df = pd.read_csv('data.tsv', sep='\t', usecols=header.columns)
df
Output
Error tokenizing data. C error: Expected 3 fields in line 5, saw 4
Column 1 Column 2 Column 3
0 dog1 dog2 dog3
1 kitten1 kitten2 kitten3
2 horse1 horse2 horse3
3 mouse1 mouse2 mouse3
To create the datafile used in this solution
# %%writefile data.tsv
Column 1 Column 2 Column 3
dog1 dog2 dog3
kitten1 kitten2 kitten3
horse1 horse2 horse3
mouse1 mouse2 mouse3 mouse4