I want to import a txt dataset into pandas. The dataset is like:
|2014|,|H0AK00097|,|N00032846|,|John Cox (R)|,|R|,|AK01|,| |,| |,|Y|,|C|,|RL|,| | |2014|,|H0AL02087|,|N00030768|,|Martha Roby (R)|,|R|,|AL02|,|AL02|,|Y|,|Y|,|I|,|RW|,| |
The delimiters are "|,|', or just '|" at the start and end of a row. I tried several ways to set delimiters in pd.read_csv() function, but all failed. How to set this parameter?
I tried
ind=pd.read_csv('indivs14.txt',sep=',', header=None, engine='python',error_bad_lines=False)
But the outcome is incorrect.
CodePudding user response:
The following may work: it reads the CSV with normal comma separators, then strips the |
on the sides of each cell:
df = pd.read_csv('bab.txt', header=None).applymap(lambda s: s.strip('|'))
Result:
0 1 2 3 4 5 6 7 8 9 10 11
0 2014 H0AK00097 N00032846 John Cox (R) R AK01 Y C RL
1 2014 H0AL02087 N00030768 Martha Roby (R) R AL02 AL02 Y Y I RW
Caveats:
- extra
|
s that are part of the actual cell are also stripped - all columns will be of string type (with the given example data, only for the first column may this be a problem). No integers, floats or datetimes; you'd have to do any conversions yourself.
CodePudding user response:
Another possibility, depending on how the data should be interpreted, is to interpret the pipe character as a quote character (and I think this is the actual format of the input file). Then it's simply:
df = pd.read_csv('bab.txt', quotechar="|", header=None)
and the result
0 1 2 3 4 5 6 7 8 9 10 11
0 2014 H0AK00097 N00032846 John Cox (R) R AK01 Y C RL
1 2014 H0AL02087 N00030768 Martha Roby (R) R AL02 AL02 Y Y I RW
CodePudding user response:
This seems to work with the example you provided:
ind = pd.read_csv('indivs14.txt',sep=',', header=None, engine='python',error_bad_lines=False)
ind.replace({'\|': ''}, regex=True, inplace=True)
However this should not be used if the character '|' is part of some values (i.e., should not be removed) as mentioned by @timgeb in the comments.