How to separate columns with special characters in Pandas, Python

My data file contains some characters that cannot be typed on a keyboard, so I cannot pass them as a separator. Is there any way to read this data properly?

My data looks different in the .txt file, but when I paste it here it looks like this:

PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0

I have also attached the original data file.

To read the data, I simply tried this:

import pandas as pd
pd.read_table('data.txt', sep=r'\s+')

Is there a better way to do that?

CodePudding user response：

You have to strip the invisible characters from your file before parsing it:

import pandas as pd
import io
import re

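# Read the whole file, drop the control characters \x01-\x03, then parse
# the cleaned text from an in-memory buffer, splitting on runs of whitespace.
with open('pwd_data.txt') as fp:
    buffer = io.StringIO(re.sub(r'[\x01-\x03]', '', fp.read()))
    df = pd.read_table(buffer, header=None, sep=r'\s+')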

Output:

>>> df
    0    1     2     3  4   5   6   7    8    9   10
0   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
1   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
2   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
...
32  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
33  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
34  PW  100  2000  2000  C   0   0   0  0.0  0.0   0

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       35 non-null     object 
 1   1       35 non-null     int64  
 2   2       35 non-null     int64  
 3   3       35 non-null     int64  
 4   4       35 non-null     object 
 5   5       35 non-null     int64  
 6   6       35 non-null     int64  
 7   7       35 non-null     int64  
 8   8       35 non-null     float64
 9   9       35 non-null     float64
 10  10      35 non-null     int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 3.1+ KB
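
If you are not sure which invisible characters the file contains, it can help to list them before stripping anything. Below is a minimal sketch, assuming the stray bytes are ASCII control characters. The string `raw` is made up for illustration and stands in for the result of `fp.read()`; the real file may contain different characters.

```python
import io
import re

import pandas as pd

# Hypothetical sample standing in for the file contents: the control
# characters \x01-\x03 are glued to some of the fields.
raw = "PW\x01  100\x02  2000  2000  C\x03   0  0  0   0.00   0.00    0\n" * 2

# ASCII control characters except tab, line feed and carriage return.
CONTROL = r"[\x00-\x08\x0b\x0c\x0e-\x1f]"

# Step 1: list the distinct control characters that actually occur.
control_chars = sorted(set(re.findall(CONTROL, raw)))
print([hex(ord(c)) for c in control_chars])

# Step 2: remove them, then split the cleaned text on runs of whitespace.
cleaned = re.sub(CONTROL, "", raw)
df = pd.read_csv(io.StringIO(cleaned), sep=r"\s+", header=None)
print(df.shape)
```

Printing the code points first tells you whether a narrow range such as `\x01-\x03` is enough or whether a wider class is needed.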

CodePudding user response：

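Try changing your sep to r'[\s\x00-\x19]+', so that any run of whitespace and control characters counts as one separator:

pd.read_table('data.txt', sep=r'[\s\x00-\x19]+', header=None, engine='python')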
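
One thing to watch with a character-class separator: without a quantifier the class matches a single character, so a run of several spaces produces empty fields. A small sketch of the difference, assuming the invisible characters fall in the ASCII control range `\x00-\x1f` (that range and the sample `line` are my assumptions, not taken from the real file):

```python
import io
import re

import pandas as pd

# Hypothetical line: control characters followed by one or more spaces.
line = "PW\x01 100\x02  2000"

# One separator character per match: empty fields appear between them.
single = re.split(r"[\s\x00-\x1f]", line)

# Quantifier outside the class: each run collapses into one separator.
collapsed = re.split(r"[\s\x00-\x1f]+", line)

print(single)
print(collapsed)

# The same pattern as a pandas separator. Regex separators other than
# r"\s+" are handled by the Python parsing engine.
raw = (line + "\n") * 2
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"[\s\x00-\x1f]+",
    header=None,
    engine="python",
)
print(df.shape)
```

Note that a control character at the very start of a line would still yield an empty first column with this approach, in which case stripping the characters first is the safer route.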