I have a .txt file as follows:
columnA;columnB;columnC;columnD
2022040200000000000000000000011 8000702 79005889 SPECIAL_AGENCY LONDON
You can observe that the names of the columns are separated by a semi column ;
, however, row values, have different separators. In this example, columnA
has 3 spaces, columnB
has 3, columnC
has 2, and columnD
has 7.
Considering I have a schema, that tells me for each column what is the amount of spaces (separators?) I have, how can I turn it into a pandas dataframe?
CodePudding user response:
One way is to use a double regex separator with (|
) and pandas.read_csv
:
df = pd.read_csv("/tmp/file.txt", sep=";|(?<=\d)\s (?=\B)", engine="python")
Output :
print(df)
columnA columnB columnC columnD
0 2022040200000000000000000000011 8000702 79005889 SPECIAL_AGENCY LONDON
NB: If needed, you can add pandas.Series.replace
to clean up the extra (\s
) in the columnD
.
CodePudding user response:
If there's a typo in the question, and there's a 5th column name (say columnE
), something like this should work. If that's not the case, the OP should edit the question and clarify.
In [17]: data = Path("data.txt").read_text().splitlines()
In [18]: hdr = data[0].split(";")
In [19]: df = pd.DataFrame([row.split() for row in data[1:]], columns=hdr)
In [20]: df
Out[20]:
columnA columnB columnC columnD columnE
0 2022040200000000000000000000011 8000702 79005889 SPECIAL_AGENCY LONDON