Customize tokenization using regex

Time:05-23

I have the following text:

4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET

The format of the log line may change, and more fields may be added to it, but they are all single words. I only want to join the date and time into one token.

I want to tokenize it to:

['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']

I have used this regex "\\[|\\]|\,|\\s+|\W:|=" which gives me the output as:

['4/21/2021', '11:43:32', 'PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']

What changes should I make to the regex so that I get my desired output, with the entire date and time as one token?

CodePudding user response:

You could just use a single regex pattern along with re.findall:

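import re

inp = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"
# Group 1 is the date, time and AM/PM marker; the remaining groups are the single-word fields.
matches = re.findall(r'^(\S+ \S+ [AP]M) (\S+) (\S+) \[(\S+)\] (\w+)', inp)
print(matches)  # [('4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET')]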
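Since the question says more single-word fields may be added later, a pattern with a fixed number of groups may stop matching. A minimal sketch of a more flexible variant follows, assuming the timestamp always looks like M/D/YYYY h:mm:ss AM/PM. The function name tokenize_log_line is hypothetical and only used for illustration.

```python
import re

# Assumption: the timestamp has the shape M/D/YYYY h:mm:ss AM|PM.
# The second alternative matches any other run of characters that is
# neither whitespace nor a square bracket, so "[OUTPUT]" yields "OUTPUT".
TOKEN_RE = re.compile(
    r'\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M'  # date + time as one token
    r'|[^\s\[\]]+'                                      # any other single word
)

def tokenize_log_line(line):
    """Return the tokens of a log line, keeping date and time together."""
    return TOKEN_RE.findall(line)

print(tokenize_log_line("4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"))
```

Because the alternation is tried left to right, the timestamp alternative wins whenever it can match, and every other field falls through to the generic word alternative, regardless of how many fields the line has.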

CodePudding user response:

You can also use the Python ttp module. See the example:

from ttp import ttp
import json
import re

data_to_parse = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"

ttp_template = """
{{ TIME | PHRASE }} PM {{REST | PHRASE}}
"""

parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()

# print result in JSON format
results = parser.result(format='json')[0]
#print(results)

#converting str to json. 
result = json.loads(results)

rest_split = result[0]['REST'].split()

desired_data = []
desired_data.append(f"{result[0]['TIME']} PM")

pattern = r"\[(\S+)\]" # Just to capture everything between [], your OUTPUT data. 

for i in rest_split:
    if re.match(pattern, i):
        i = re.findall(pattern, i)
        desired_data.append(i[0])
        continue
    desired_data.append(i)

print(desired_data)


CodePudding user response:

Why bother with regex?

s = '4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET'
s = s.replace(' [',' ').replace('] ',' ') #the output shows no square brackets

tags = [' AM ', ' PM ']
for t in tags:
    if t in s:
        # split only at the first AM/PM marker
        start, end = s.split(t, 1)
        start = (start + t).strip()
        result = [start, *end.split()]
        break
        
print(result)
#['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
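If the timestamp is always the first three whitespace-separated fields (date, time, AM/PM), which is an assumption not stated in the question, the same idea can be sketched without searching for the AM/PM marker at all. The helper name split_log_line is hypothetical.

```python
def split_log_line(line):
    """Split on whitespace, then rejoin the first three fields as the timestamp."""
    parts = line.split()
    # Assumption: fields 0-2 are the date, the time and the AM/PM marker.
    timestamp = ' '.join(parts[:3])
    # Strip square brackets from the remaining single-word fields.
    rest = [p.strip('[]') for p in parts[3:]]
    return [timestamp, *rest]

print(split_log_line('4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET'))
```

This keeps working when extra single-word fields are appended, but it would break if the log ever switched to a 24-hour time without the AM/PM field.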