Pandas DataFrame not throwing an error for rows containing fewer fields

Time:09-16

I am facing an issue when reading rows that contain fewer fields than expected. My dataset looks like this:

"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"

I am running the following command to read the dataset:

import pandas as pd
tabdf = pd.read_csv('test_table.csv', sep='~', header=None)

But it does not throw any error, though it is supposed to. The version we are using:

pip show pandas
Name: pandas
Version: 1.0.1

My question is how to make the process fail if we receive data with an incorrect structure.

CodePudding user response:

You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.

Read full rows into a dataframe

df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)

                                                   0
0  "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1                       "Source1"~"schema2"~"table2"

Count number of separators

sep_count = df[0].str.count('~')
sep_count

0    11
1     2

Maybe just terminate the process if there are bad (short) rows

That is, if the number of unique separator counts is not 1.

sep_count.nunique()
2
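
The check above can be turned into a hard failure. `read_csv` pads short rows with NaN instead of raising, so the error has to be raised explicitly. The following is only a minimal sketch: the function name `check_separator_counts` is hypothetical, and it reads the raw lines with `open()` rather than `read_fwf`, which is a substitution made here to avoid any type inference on the lines. It counts raw separators, so a `~` inside a quoted field would also be counted.

```python
import re

import pandas as pd


def check_separator_counts(path, sep="~"):
    """Return the per-row separator count, or raise ValueError if rows disagree.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with open(path, encoding="utf-8") as fh:
        # Keep every non-blank line whole, without its line ending.
        lines = pd.Series(
            [ln.rstrip("\r\n") for ln in fh if ln.strip()], dtype=object
        )

    # str.count takes a regex pattern, so escape the separator.
    sep_count = lines.str.count(re.escape(sep))

    # Exactly one distinct count means every row has the same number of fields.
    if sep_count.nunique() != 1:
        bad = sep_count[sep_count != sep_count.max()]
        raise ValueError(
            f"{path}: inconsistent field counts; "
            f"short rows (1-based, blank lines ignored): {(bad.index + 1).tolist()}"
        )
    return int(sep_count.iloc[0])
```

Calling `check_separator_counts('test_table.csv')` before `pd.read_csv` would stop the process on the sample data, because row 2 has 2 separators while row 1 has 11. An empty file also fails the check, since it has no distinct count at all.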

Or just read the good rows

good_rows = sep_count.eq(11)    # if you know what the separator count should be. Or ...
good_rows = sep_count.eq(sep_count.max())    # if you know you have at least 1 good row

df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows] 
print(df)

Result

         0        1       2           3          4      5     6   7   8      9     10    11
0  Source1  schema1  table1  modifiedon  timestamp  STAGE  15.0 NaN NaN  False  False  True 
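
If quoted fields might themselves contain `~`, counting raw separators can miscount. A possible alternative, sketched here with the standard-library `csv` module rather than pandas, is to count parsed fields per row and fail on the first mismatch. The function name `validate_field_counts` and the `expected` parameter are hypothetical; the value 12 in the usage note follows from the 11 separators in the good row above.

```python
import csv


def validate_field_counts(path, expected, sep="~"):
    """Raise ValueError on the first row whose field count differs from `expected`.

    Hypothetical helper: returns the number of rows checked when all are valid.
    """
    checked = 0
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter=sep)
        for row in reader:
            if not row:  # csv.reader yields [] for blank lines; skip them
                continue
            if len(row) != expected:
                raise ValueError(
                    f"{path}: line {reader.line_num} has {len(row)} fields, "
                    f"expected {expected}"
                )
            checked += 1
    return checked
```

Running `validate_field_counts('test_table.csv', expected=12)` before `pd.read_csv(...)` would raise on line 2 of the sample data, so the load never starts with a malformed file.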