Pandas/Python read file with different separators-CodePudding

I have a .txt file as follows:

columnA;columnB;columnC;columnD
2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY       LONDON

You can observe that the names of the columns are separated by a semi column ;, however, row values, have different separators. In this example, columnA has 3 spaces, columnB has 3, columnC has 2, and columnD has 7.

Considering I have a schema, that tells me for each column what is the amount of spaces (separators?) I have, how can I turn it into a pandas dataframe?

CodePudding user response：

One way is to use a double regex separator with (|) and pandas.read_csv :

df = pd.read_csv("/tmp/file.txt", sep=";|(?<=\d)\s (?=\B)", engine="python")

Output :

print(df)

                           columnA  columnB   columnC                  columnD
0  2022040200000000000000000000011  8000702  79005889   SPECIAL_AGENCY  LONDON

NB: If needed, you can add pandas.Series.replace to clean up the extra (\s) in the columnD.

CodePudding user response：

If there's a typo in the question, and there's a 5th column name (say columnE), something like this should work. If that's not the case, the OP should edit the question and clarify.

In [17]: data = Path("data.txt").read_text().splitlines()

In [18]: hdr = data[0].split(";")

In [19]: df = pd.DataFrame([row.split() for row in data[1:]], columns=hdr)

In [20]: df
Out[20]: 
                           columnA  columnB   columnC         columnD columnE
0  2022040200000000000000000000011  8000702  79005889  SPECIAL_AGENCY  LONDON