I'm struggling to parse a text file into a pandas DataFrame.
I was thinking of using pandas.read_csv(sep=''),
but I cannot figure out the right configuration, since the file uses blanks as the separator but also contains text that includes the separator.
A sample data row looks like this:
<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas
which should become one row of this table:
type | time | part | ids | message |
---|---|---|---|---|
<123> | 2022-12-08T14:00:00 | tag | [id="451" tid="145] | text message with commas |
CodePudding user response:
Use the read_csv function with the sep parameter set to '\s+', which splits the fields on any run of whitespace.
import pandas as pd
import csv
df = pd.read_csv('data.txt', sep=r'\s+', quoting=csv.QUOTE_NONE)
print(df)
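As a quick check, and only as a sketch, the same pattern can be applied to the sample row from the question using Python's re module (the variable names are illustrative). It shows how many fields the row is split into, which is worth comparing against the five columns the question asks for.

```python
import re

# Sample row copied from the question
sample = '<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas'

# Split on any run of whitespace, as sep=r'\s+' would
tokens = re.split(r"\s+", sample)

print(len(tokens))
print(tokens)
```

Every whitespace-separated word becomes its own field, so the bracketed ids and the free-text message are broken up as well.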
CodePudding user response:
I would suggest not using the read_csv function to parse this text file, since I believe this is a rather specific use case where a blank space should only sometimes be treated as a separator.
I wrote a small sample that parses the file by reading it line by line and splitting each line according to the general logical structure of your data, taking advantage of the square brackets that enclose the "ids" field.
Here is the code sample:
import pandas as pd

data_list = []
with open("example.csv", "r") as f:
    for line in f:
        # Separate the line by spaces into a list (dropping the trailing newline)
        data = {}
        line = line.rstrip("\n").split(" ")
        # The first 3 elements correspond to "type", "time" and "part"
        data["type"] = line[0]
        data["time"] = line[1]
        data["part"] = line[2]
        # Then, from index 3 onward, collect each element until we find one ending in a closing square bracket
        # We will call this the "ids" field
        ids = []
        for i in range(3, len(line)):
            ids.append(line[i])
            if line[i].endswith("]"):
                break
        data["ids"] = " ".join(ids)
        # Finally, we will concatenate the rest of the elements into the "message" field
        data["message"] = " ".join(line[i + 1:])
        # And we will append the data to a list
        data_list.append(data)

# Now we will create a pandas dataframe from the list
df = pd.DataFrame(data_list)
print(df)
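The same idea can also be sketched with a single regular expression instead of manual index bookkeeping. This is a minimal sketch, not a definitive implementation: the names LINE_RE and parse_line are hypothetical, and it assumes that every line has three whitespace-free fields followed by a bracketed ids field that contains no "]" of its own.

```python
import re

# Assumed layout: type, time, part, [ids ...], then free-text message
LINE_RE = re.compile(
    r"^(?P<type>\S+)\s+"
    r"(?P<time>\S+)\s+"
    r"(?P<part>\S+)\s+"
    r"(?P<ids>\[.*?\])\s*"   # non-greedy: stop at the first closing bracket
    r"(?P<message>.*)$"      # everything after the ids field
)


def parse_line(line):
    """Return a dict of the five fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


sample = '<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message, with commas'
row = parse_line(sample)
print(row)
```

Calling parse_line on each line of the file and keeping the results that are not None gives a list of dicts that can be passed to pd.DataFrame in the same way as data_list above.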