How to set delimiters like "|,|" in python/pandas-CodePudding

I want to import a txt dataset into pandas. The dataset is like:

|2014|,|H0AK00097|,|N00032846|,|John Cox (R)|,|R|,|AK01|,| |,| |,|Y|,|C|,|RL|,| | |2014|,|H0AL02087|,|N00030768|,|Martha Roby (R)|,|R|,|AL02|,|AL02|,|Y|,|Y|,|I|,|RW|,| |

The delimiters are "|,|', or just '|" at the start and end of a row. I tried several ways to set delimiters in pd.read_csv() function, but all failed. How to set this parameter?

I tried

ind=pd.read_csv('indivs14.txt',sep=',', header=None, engine='python',error_bad_lines=False)

But the outcome is incorrect.

CodePudding user response：

The following may work: it reads the CSV with normal comma separators, then strips the | on the sides of each cell:

df = pd.read_csv('bab.txt', header=None).applymap(lambda s: s.strip('|'))

Result:

     0          1          2                3  4     5     6  7  8  9   10 11
0  2014  H0AK00097  N00032846     John Cox (R)  R  AK01           Y  C  RL
1  2014  H0AL02087  N00030768  Martha Roby (R)  R  AL02  AL02  Y  Y  I  RW

Caveats:

extra |s that are part of the actual cell are also stripped
all columns will be of string type (with the given example data, only for the first column may this be a problem). No integers, floats or datetimes; you'd have to do any conversions yourself.

CodePudding user response：

Another possibility, depending on how the data should be interpreted, is to interpret the pipe character as a quote character (and I think this is the actual format of the input file). Then it's simply:

df = pd.read_csv('bab.txt', quotechar="|", header=None)

and the result

     0          1          2                3  4     5     6  7  8  9   10 11
0  2014  H0AK00097  N00032846     John Cox (R)  R  AK01           Y  C  RL
1  2014  H0AL02087  N00030768  Martha Roby (R)  R  AL02  AL02  Y  Y  I  RW

CodePudding user response：

This seems to work with the example you provided:

ind = pd.read_csv('indivs14.txt',sep=',', header=None, engine='python',error_bad_lines=False)
ind.replace({'\|': ''}, regex=True, inplace=True)

However this should not be used if the character '|' is part of some values (i.e., should not be removed) as mentioned by @timgeb in the comments.