Home > Blockchain >  Parsing Text file with irregular column width into Pandas
Parsing Text file with irregular column width into Pandas

Time:11-19

I need help in parsing below text into Pandas dataframe columns

(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs

Below is the expected output.

Column 1 Column 2 Column 3 Column 4
(11/15/22 2:48:46 PM EST) 14:48:33.671 42300146 GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 31009225 GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 28624922 Timed out waiting for incoming connections, retry after couple of secs

Note: The first and fourth column have different width.

I have tried to use read_fwf with specific width but didn't produce the output that I want. I couldn't use read_csv since there is no delimiter (other than whitespace) between columns.

Any help would be appreciated!

Thank you in advance.

CodePudding user response:

You can use regular expression to parse the text (regex demo.)

text = """\
(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs"""

import re
import pandas as pd

pat = re.compile(r"(\(.*?\))\s (\S )\s -\s (\d ):\s (.*)")

data = []
for line in text.splitlines():
    m = pat.search(line)
    if m:
        data.append(m.groups())

df = pd.DataFrame(
    data, columns=["Column 1", "Column 2", "Column 3", "Column 4"]
)
print(df)

Prints:

                     Column 1      Column 2  Column 3                                                                Column 4
0   (11/15/22 2:48:46 PM EST)  14:48:33.671  42300146            GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
1  (11/15/22 11:55:59 AM EST)  11:40:22.750  31009225              GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
2    (11/8/22 8:03:39 AM EST)  07:32:38.188  28624922  Timed out waiting for incoming connections, retry after couple of secs
  • Related