Convert .dat file into DataFrame when there is extra whitespace at the end of rows

I am trying to convert a .dat file into a pandas DataFrame. The problem is that the .dat file contains extra whitespace at the end of each row, so the file is not read properly as a DataFrame. The .dat file looks like this:

~ H H H H H H H H ~ ~
~ H H H H H H H H ~ ~
~ H H H H H H T T ~ ~

There are 11 columns separated by whitespace, but there is also whitespace after the 11th column, at the end of each row. As a result, when I open the file as a pandas DataFrame with 11 column names, the values are shifted: the first column ends up as the index and the last column contains only NaN.

import pandas as pd

file = "mydata.dat"
colnames = ['res76','res77','res78','res79','res80','res81','res82','res83','res84','res85','res86']
df = pd.read_csv(file, sep=' ', names=colnames)
df

And the dataframe looks like this:

  res76 res77 res78 res79 res80 res81 res82 res83 res84 res85 res86
~  H     H     H     H     H     H     H     H     ~      ~     NaN
~  H     H     H     H     H     H     H     H     ~      ~     NaN
~  H     H     H     H     H     H     T     T     ~      ~     NaN

I assume this is a consequence of the extra whitespace at the end of each row in the .dat file. However, I'm not sure whether there is a way to deal with it using pandas. For example, can I somehow skip the trailing whitespace? I'd be thankful for any suggestions.

CodePudding user response：

Given your input format, it's better to use read_fwf instead of read_csv:

df = pd.read_fwf('mydata.dat', names=colnames)

  res76 res77 res78 res79 res80 res81 res82 res83 res84 res85 res86
0     ~     H     H     H     H     H     H     H     H     ~     ~
1     ~     H     H     H     H     H     H     H     H     ~     ~
2     ~     H     H     H     H     H     H     T     T     ~     ~
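To try this without the original file, here is a minimal sketch that feeds the three sample rows to read_fwf from memory. The `raw` string is a hypothetical stand-in for mydata.dat, with one trailing space added to each row as the question describes. It relies on read_fwf inferring the column boundaries from the character positions, which should hold as long as every value sits at the same offset in each row.

```python
import io

import pandas as pd

# Hypothetical stand-in for mydata.dat: each row ends with one extra space.
raw = (
    "~ H H H H H H H H ~ ~ \n"
    "~ H H H H H H H H ~ ~ \n"
    "~ H H H H H H T T ~ ~ \n"
)
colnames = [f"res{i}" for i in range(76, 87)]  # res76 ... res86

# Column boundaries are inferred from the fixed character positions,
# so the trailing space does not produce an extra field.
df = pd.read_fwf(io.StringIO(raw), names=colnames)
print(df)
```

If the values in a real file have varying widths, the inferred boundaries may be wrong, and a whitespace-based separator is probably the safer choice.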

CodePudding user response：

You can drop the last column from the DataFrame:

df.drop(df.columns[-1], axis=1, inplace=True)

Or you can loop through the file and remove the trailing spaces (although this is not a clean solution):

with open("mydata.dat") as mydatafile, open("parsed.dat", "w") as parsed_file:
    for line in mydatafile:
        # Remove trailing whitespace (including the newline), then restore the newline.
        parsed_file.write(line.rstrip() + "\n")
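Both ideas can also be sketched in memory, without writing a second file. One caveat, judging by the question's output: when 11 names are passed for rows that split into 12 fields, the leading ~ lands in the index position, so dropping the last column afterwards would leave the values shifted. The sketch below therefore reads without names first and labels the columns only after dropping the empty one. The `raw` string is a hypothetical stand-in for mydata.dat.

```python
import io

import pandas as pd

# Hypothetical stand-in for mydata.dat: each row ends with one extra space.
raw = (
    "~ H H H H H H H H ~ ~ \n"
    "~ H H H H H H H H ~ ~ \n"
    "~ H H H H H H T T ~ ~ \n"
)
colnames = [f"res{i}" for i in range(76, 87)]  # res76 ... res86

# Option 1: read without names so the empty trailing field becomes its own
# all-NaN column, drop that column, then label what remains.
dropped = pd.read_csv(io.StringIO(raw), sep=" ", header=None).iloc[:, :-1]
dropped.columns = colnames

# Option 2: strip trailing whitespace from every line before parsing.
cleaned = "\n".join(line.rstrip() for line in io.StringIO(raw))
stripped = pd.read_csv(io.StringIO(cleaned), sep=" ", names=colnames)
```

Both frames should contain the same 3 x 11 table. Option 2 keeps the parsing call identical to the one in the question.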

CodePudding user response：

Three ways to solve this problem with pandas:

read_csv:

>>> pd.read_csv(file, sep=r'\s+', engine='python', names=colnames)

  res76 res77 res78 res79 res80 res81 res82 res83 res84 res85 res86
0     ~     H     H     H     H     H     H     H     H     ~     ~
1     ~     H     H     H     H     H     H     H     H     ~     ~
2     ~     H     H     H     H     H     H     T     T     ~     ~

read_fwf:

>>> pd.read_fwf(file, names=colnames)
  res76 res77 res78 res79 res80 res81 res82 res83 res84 res85 res86
0     ~     H     H     H     H     H     H     H     H     ~     ~
1     ~     H     H     H     H     H     H     H     H     ~     ~
2     ~     H     H     H     H     H     H     T     T     ~     ~

read_table:

>>> pd.read_table(file, sep=r'\s+', names=colnames)
  res76 res77 res78 res79 res80 res81 res82 res83 res84 res85 res86
0     ~     H     H     H     H     H     H     H     H     ~     ~
1     ~     H     H     H     H     H     H     H     H     ~     ~
2     ~     H     H     H     H     H     H     T     T     ~     ~
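As a quick end-to-end check of the whitespace-separator approach, the sketch below writes the sample rows to a temporary file and reads them back with read_csv and read_table. The temporary path and the `rows` list are assumptions made for illustration. The separator is written as a raw string so that `\s` is not treated as a string escape, and the default engine is used here, since pandas documents `\s+` as a separator it handles without the Python engine.

```python
import os
import tempfile

import pandas as pd

colnames = [f"res{i}" for i in range(76, 87)]  # res76 ... res86

# Hypothetical file contents: each row ends with one extra space.
rows = [
    "~ H H H H H H H H ~ ~ ",
    "~ H H H H H H H H ~ ~ ",
    "~ H H H H H H T T ~ ~ ",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mydata.dat")
    with open(path, "w") as fh:
        fh.write("\n".join(rows) + "\n")

    # r'\s+' splits on any run of whitespace; the trailing space
    # does not create an extra, empty field.
    via_csv = pd.read_csv(path, sep=r"\s+", names=colnames)
    via_table = pd.read_table(path, sep=r"\s+", names=colnames)

print(via_csv)
```

Unlike read_fwf, this does not depend on the values being aligned in fixed character positions, only on them being separated by whitespace.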