pandas.read_csv seperator with Regex-CodePudding

I'm struggling how to parse a text file into a pandas dataframe. I think to use pandas.read_csv(sep='') but I can not figure out the right configuration since the file uses blanks as seperator, but also contains text with separator

A sample data rows looks like this

<123> 2022-12-08T14:00:00 tag [id="451" tid="145] text message with commas

which is a line for this table

type	time	part	ids	message
<123>	2022-12-08T14:00:00	tag	[id="451" tid="145]	text message with commas

CodePudding user response：

Use the read_csv function with the sep parameter set to '\s ', which will split the fields by any number of spaces.

import pandas as pd
import csv

df = pd.read_csv('data.txt', sep='\s ', quoting=csv.QUOTE_NONE)

print(df)

CodePudding user response：

I would propose not to use the read_csv function to parse this text file (as I believe it is a rather specific use case where a blank space is to be considered only sometimes as a separator).

I wrote a small sample that shows how to parse the file by programmatically reading line by line and parsing based on the general logical structure of your data. Basically taking advantage of the square brackets that the "ids" field has.

Here is the code sample:

import pandas as pd

data_list = []
for line in open("example.csv","r"):
    # Separate the line by spaces into a list
    data = {}
    line = line.split(" ")

    # The first 3 elements correspond to "type", "time" and "part"
    data["type"] = line[0]
    data["time"] = line[1]
    data["part"] = line[2]

    # Then from the third position onward, concatenate each element until we find a closing square bracket
    # We will call this the id's field
    data['ids'] = ""
    for i in range(3,len(line)):
        data["ids"] = " ".join((data["ids"],line[i]))
        if line[i][-1] == "]":
            break

    # Finally, we will concatenate the rest of the elements into the "message" field
    data["message"] = " ".join(line[i 1:])

    # And we will append the data to a list
    data_list.append(data)

# Now we will create a pandas dataframe from the list
df = pd.DataFrame(data_list)

print(df)