I have a school assignment where I was tasked with writing an Apache log parser in Python. The parser extracts all the IP addresses and all the HTTP methods using regex and stores them in a nested dictionary. The code can be seen below:
from re import search

def aggregatelog(filename):
    keyvaluepairscounter = {"IP": {}, "HTTP": {}}
    with open(filename, "r") as file:
        for line in file:
            # Combines the regexes: IP (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) and HTTP method "(\b[A-Z]+\b)
            result = search(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)', line).groups()
            if result[0] in set(keyvaluepairscounter["IP"].keys()): # Using set will lower look up time complexity from O(n) to O(1)
                keyvaluepairscounter["IP"][result[0]] += 1
            else:
                keyvaluepairscounter["IP"][result[0]] = 1
            if result[1] in set(keyvaluepairscounter["HTTP"].keys()):
                keyvaluepairscounter["HTTP"][result[1]] += 1
            else:
                keyvaluepairscounter["HTTP"][result[1]] = 1
    return keyvaluepairscounter
This code works (it gives me the expected data for the log files we were given). However, when extracting data from large log files (in my case, ~500 MB), the program is VERY slow: it takes ~30 minutes for the script to finish. According to my teacher, a good script should be able to process the large file in under 3 minutes (wth?). My question is: is there anything I can do to speed up my script? I have already done some things, like replacing lists with sets for their better lookup times.
CodePudding user response:
At minimum, pre-compile your regex once, before the loop, i.e.

pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')

then later, in your loop, call the compiled pattern's own search method:

for line in file:
    result = pattern.search(line).groups()
You should also consider optimizing your pattern, especially the .*, since a greedy wildcard can force expensive backtracking on every line.
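Putting both suggestions together, here is a sketch of one possible faster version (this is my own rewrite, not the asker's code: it assumes standard Apache access-log lines, uses collections.Counter so the membership checks disappear, and replaces the greedy .* with a negated character class [^"]* so the scan up to the request field never backtracks):

```python
import re
from collections import Counter

# Compiled once at module level. [^"]* jumps straight to the first quote
# without backtracking; ([A-Z]+) then captures the HTTP method.
LOG_PATTERN = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3})[^"]*"([A-Z]+)')

def aggregatelog(filename):
    counts = {"IP": Counter(), "HTTP": Counter()}
    with open(filename, "r") as file:
        for line in file:
            match = LOG_PATTERN.search(line)
            if match is None:  # skip malformed lines instead of crashing
                continue
            ip, method = match.groups()
            counts["IP"][ip] += 1        # Counter handles the first-seen case itself
            counts["HTTP"][method] += 1
    return counts
```

Counter is a dict subclass, so the returned structure is still the nested dictionary the assignment asks for; missing keys simply start at 0, which removes two branches per line.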