I need help in parsing below text into Pandas dataframe columns
(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs
Below is the expected output.
Column 1 | Column 2 | Column 3 | Column 4 |
---|---|---|---|
(11/15/22 2:48:46 PM EST) | 14:48:33.671 | 42300146 | GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal. |
(11/15/22 11:55:59 AM EST) | 11:40:22.750 | 31009225 | GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots |
(11/8/22 8:03:39 AM EST) | 07:32:38.188 | 28624922 | Timed out waiting for incoming connections, retry after couple of secs |
Note: The first and fourth column have different width.
I have tried to use read_fwf with specific width but didn't produce the output that I want. I couldn't use read_csv since there is no delimiter (other than whitespace) between columns.
Any help would be appreciated!
Thank you in advance.
CodePudding user response:
You can use regular expression to parse the text (regex demo.)
text = """\
(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs"""
import re
import pandas as pd
pat = re.compile(r"(\(.*?\))\s (\S )\s -\s (\d ):\s (.*)")
data = []
for line in text.splitlines():
m = pat.search(line)
if m:
data.append(m.groups())
df = pd.DataFrame(
data, columns=["Column 1", "Column 2", "Column 3", "Column 4"]
)
print(df)
Prints:
Column 1 Column 2 Column 3 Column 4
0 (11/15/22 2:48:46 PM EST) 14:48:33.671 42300146 GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
1 (11/15/22 11:55:59 AM EST) 11:40:22.750 31009225 GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
2 (11/8/22 8:03:39 AM EST) 07:32:38.188 28624922 Timed out waiting for incoming connections, retry after couple of secs