I am running into an issue when reading a file in which some rows contain fewer fields than others. My dataset looks like this:
"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"
I am running the command below to read the dataset:
tabdf = pd.read_csv('test_table.csv',sep='~',header = None)
But it does not throw any error, although I expected it to. The version we are using:
pip show pandas
Name: pandas
Version: 1.0.1
My question is: how can I make the process fail when the data has an incorrect structure?
CodePudding user response:
You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.
Read each full row, unsplit, into a single-column dataframe:
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)
0
0 "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1 "Source1"~"schema2"~"table2"
Count the number of separators in each row:
sep_count = df[0].str.count('~')
sep_count
0 11
1 2
You could then terminate the process if there are bad (short) rows, that is, if the number of unique separator counts is not 1:
sep_count.nunique()
2
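The fail-fast check described above could be sketched roughly as follows. This is only a minimal sketch, not the approach from the answer itself: it swaps read_fwf for Python's standard csv module, so that a '~' inside a quoted field is not counted as a separator. The function name check_field_counts and its expected parameter are made up for illustration.

```python
import csv


def check_field_counts(path, sep="~", expected=None):
    """Raise ValueError if any row's field count differs from `expected`.

    Hypothetical helper: when `expected` is None, the field count of the
    first non-empty row is used as the reference.
    """
    with open(path, newline="") as handle:
        # csv.reader honours double quotes, so a separator inside "..."
        # does not split the field.
        reader = csv.reader(handle, delimiter=sep)
        for row in reader:
            if not row:
                continue  # skip blank lines
            if expected is None:
                expected = len(row)
            elif len(row) != expected:
                raise ValueError(
                    f"{path}: line {reader.line_num} has {len(row)} fields, "
                    f"expected {expected}"
                )
    return expected
```

Calling check_field_counts('test_table.csv') before pd.read_csv would raise a ValueError on the sample data, because the second row has 3 fields instead of 12. Passing expected=12 also catches the case where the first row is the short one.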
Or just read the good rows:
good_rows = sep_count.eq(11)  # if you know what the separator count should be
# or, if you know the file has at least one good row:
good_rows = sep_count.eq(sep_count.max())
df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows]
print(df)
Result:
         0        1       2           3          4      5     6    7    8      9     10    11
0  Source1  schema1  table1  modifiedon  timestamp  STAGE  15.0  NaN  NaN  False  False  True