I have the following text:
4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET
The format of the log line may change and more fields may be added to it, but they are all single words. I only want to join the date and time.
I want to tokenize it into:
['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
I have used this regex "\\[|\\]|\,|\\s+|\W:|=" which gives me the output as:
['4/21/2021', '11:43:32', 'PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
What changes should I make to the regex so that I get my desired output, with the entire date and time as one token?
CodePudding user response:
You could just use a single regex pattern along with re.findall:
import re

inp = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"
# One capture group per field; the first group keeps date, time and AM/PM together.
matches = re.findall(r'^(\S+ \S+ [AP]M) (\S+) (\S+) \[(\S+)\] (\w+)', inp)
print(matches) # [('4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET')]
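Since the question says more single-word fields may be added later, a pattern with a fixed number of groups would stop matching once the field count changes. Below is a minimal sketch of one way around that, assuming the timestamp always has the `M/D/YYYY H:MM:SS AM|PM` shape and every other field is a single word, optionally wrapped in square brackets. The names `TOKEN_RE` and `tokenize` are made up for illustration.

```python
import re

# The timestamp alternative is listed first so it is tried before the
# generic "single word" alternative at each position.
TOKEN_RE = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M"  # date, time and AM/PM as one token
    r"|[^\s\[\]]+"                                        # any other word, brackets excluded
)

def tokenize(line):
    # No capture groups are used, so findall returns the full matches as a flat list.
    return TOKEN_RE.findall(line)

print(tokenize("4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"))
```

Because the pattern describes tokens rather than the whole line, extra trailing fields are picked up without changing the regex.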
CodePudding user response:
You can also use the Python ttp module. See the example:
from ttp import ttp
import json
import re
data_to_parse = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"
ttp_template = """
{{ TIME | PHRASE }} PM {{REST | PHRASE}}
"""
parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()
# print result in JSON format
results = parser.result(format='json')[0]
#print(results)
#converting str to json.
result = json.loads(results)
rest_split = result[0]['REST'].split()
desired_data = []
desired_data.append(f"{result[0]['TIME']} PM")
pattern = r"\[(\S+)\]" # Just to capture everything between [], your OUTPUT data.
for i in rest_split:
    if re.match(pattern, i):
        # Bracketed word: keep only the text inside the brackets.
        i = re.findall(pattern, i)
        desired_data.append(i[0])
        continue
    desired_data.append(i)
print(desired_data)
CodePudding user response:
Why bother with regex?
s = '4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET'
s = s.replace(' [', ' ').replace('] ', ' ')  # the desired output shows no square brackets
tags = [' AM ', ' PM ']
for t in tags:
    if t in s:
        # Split once at the AM/PM marker, then put the marker back on the timestamp.
        start, end = s.split(t, 1)
        start = (start + t).strip()
        result = [start, *end.split()]
        break
print(result)
# ['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
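The same no-regex idea can be sketched without searching for the AM/PM marker at all, assuming the first three whitespace-separated pieces are always the date, the time and AM/PM, and that at least one field follows them. The function name `tokenize_no_regex` is hypothetical.

```python
def tokenize_no_regex(line):
    # Split off the first three whitespace-separated pieces (date, time, AM/PM);
    # everything after them stays in `rest` as one string.
    date, time, meridiem, rest = line.split(maxsplit=3)
    timestamp = f"{date} {time} {meridiem}"
    # The remaining fields are single words; drop any surrounding square brackets.
    fields = [word.strip("[]") for word in rest.split()]
    return [timestamp, *fields]

print(tokenize_no_regex("4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"))
```

This relies purely on field position, so it would raise a ValueError on a line with fewer than four pieces.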