Apologies in advance if this question seems quite basic.
Given:
Apache HTTP Access Log file as follows:
sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'
Goal:
I extract the fields with the following pattern:
import re
ACCESS_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
matched_line = re.match(ACCESS_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups() # WORKS OK
I then collect the fields in a list for further processing:
cleaned_lines = []
cleaned_lines.append({
    "timestamp": l[0],
    "client_request_line": l[1],
    "status": l[2],
    "bytes_sent": l[3],
    "referer": l[4],
    "user_agent": l[5],
    "session_id": l[6],
})
Problem:
Sometimes a line has a broken referer URL (starting with http://192.168.8.1/), for example:
sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
I would like to use a regex so that the extracted URL always starts with http://LETTERS, which is why I changed my pattern to:
CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
<<<<<PROBLEM>>>>>
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print (matched_line)
l = matched_line.groups() # ERROR
print(l)
But then I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-88-c7a93cfbce61> in <module>
4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
5 print (matched_line)
----> 6 l = matched_line.groups()
7 print(l)
AttributeError: 'NoneType' object has no attribute 'groups'
Am I doing something wrong with re.match().groups()?
CodePudding user response:
If the URL is all you need, you can just use split():
sample_apache_access_log_line = [
'- - [01/Feb/2017:12:34:51 0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995',
'- - [01/Feb/2017:12:34:53 0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]
for i in sample_apache_access_log_line:
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
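A variation on the same idea, if you would rather not split on the literal text 'address=': the standard library's urllib.parse can read the address query parameter out of the referer. This is only a sketch; the helper name unwrap_referer is made up for illustration, and it assumes the wrapped URL always arrives in a parameter called address, as in the sample line.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_referer(referer):
    """Return the URL held in the 'address' query parameter, if present.

    Hypothetical helper: falls back to the referer unchanged when there is
    no 'address' parameter.
    """
    query = urlsplit(referer).query
    wrapped = parse_qs(query).get("address")
    return wrapped[0] if wrapped else referer


line = (
    '- - [01/Feb/2017:12:34:51 0200] '
    '"GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 '
    '"http://192.168.8.1/html/home.html?url&address='
    'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
)

referer = line.split('"')[3]  # fourth quote-delimited piece is the referer
print(unwrap_referer(referer))
```

One caveat: parse_qs splits the query on & and decodes percent-escapes, so a wrapped URL that itself contains & would be cut short. The plain split('address=') above does not have that limitation.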
CodePudding user response:
Use re.findall(), then re.split().
import re
pattern = r'(http://\D.*)'  # 'http://' followed by a non-digit character, then the rest of the line
url_start = re.findall(pattern, log_file_string)  # list of matches; empty if nothing matched
url = re.split(r"\s", url_start[0])  # split the first match on whitespace
url = url[0]  # the URL, possibly with a trailing quote
You may need to use str.strip() to remove any remaining special characters enclosing the URL.
If you must use re.match(), try simplifying the pattern.
pattern = r'(.*?)(http://\D.*)"(.*)'  # no quote required before http://, so a URL after address= also matches
url_start = re.match(pattern, log_file_string)
url_string = url_start.group(2)
url = re.split(r"\s", url_string)
url = url[0].strip('"')
Match.groups() returns a tuple of all captured groups, whereas the example above used Match.group(). With groups(), try:
pattern = r'(.*?)(http://\D[^"]*)"'  # group 2 stops at the closing quote
url_start = re.match(pattern, log_file_string)
url = url_start.groups()  # tuple of all captured groups
url = url[1]  # the second group is the URL
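As for the AttributeError itself: re.match() returns None when the pattern does not match, and in the modified pattern the http:// requirement sits in the group that captures the request line ("GET ..."), so that pattern never matches. One way around it, as a rough sketch, is to keep the pattern general, check for None before calling anything on the match, and clean up the referer afterwards. The names LOG_PATTERN and parse_line are made up for this example, the group names simply mirror the dict keys in the question, and the address= cleanup assumes broken referers always look like the sample.

```python
import re

# Named groups mirror the keys used in the question's dict.
LOG_PATTERN = re.compile(
    r'- - \[(?P<timestamp>.*?)\] "(?P<client_request_line>.*?)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>.*?)" (?P<session_id>\S+)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:  # guard: avoids the 'NoneType' AttributeError
        return None
    fields = match.groupdict()
    # Assumption: a broken referer wraps the real URL after 'address='.
    if "address=" in fields["referer"]:
        fields["referer"] = fields["referer"].split("address=", 1)[1]
    return fields


sample = (
    '- - [01/Feb/2017:12:34:51 0200] '
    '"GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 '
    '"http://192.168.8.1/html/home.html?url&address='
    'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
)

cleaned_lines = []
parsed = parse_line(sample)
if parsed is not None:
    cleaned_lines.append(parsed)
print(cleaned_lines)
```

Lines that do not match are skipped here rather than raising, which may or may not be what you want; logging them instead would make malformed entries easier to spot.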