Parsing a URL using re.match().groups() in Python

Time:10-05

Apologies in advance if this question seems quite basic.

Given:

A line from an Apache HTTP access log, as follows:

sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'

Goal:

I extract information with the following pattern:

import re

CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^"]+)" "(.*?)" (.*)'
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups()  # WORKS OK

I then dump all the info into a list for further processing:

cleaned_lines = []
cleaned_lines.append({
                "timestamp":            l[0],
                "client_request_line":  l[1],
                "status":               l[2],
                "bytes_sent":           l[3],
                "referer":              l[4],
                "user_agent":           l[5],
                "session_id":           l[6],
})

Problem:

Sometimes there are lines with a broken URL in the referer field (starting with http://192.168.8.1/), similar to:

sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'

I would like to use a regex so that the extracted URL always starts with http://LETTERS, which is why I changed my code to:

CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^"]+)" "(.*?)" (.*)'
#                                     <<<<<PROBLEM>>>>>
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups()  # ERROR
print(l)

But then I get this error:


AttributeError                            Traceback (most recent call last)

<ipython-input-88-c7a93cfbce61> in <module>
      4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
      5 print (matched_line)
----> 6 l = matched_line.groups()
      7 print(l)

AttributeError: 'NoneType' object has no attribute 'groups'

Is there anything I'm doing wrong with re.match().groups()?

CodePudding user response:

If the URL is all you need, you can just use split():

sample_apache_access_log_line = [
    '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995',
    '- - [01/Feb/2017:12:34:53 +0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]


for i in sample_apache_access_log_line:
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])

# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
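The same split() idea could be wrapped in a small function that also tolerates lines with too few quoted fields. This is only a sketch: the name effective_referer is hypothetical, and it assumes the referer is always the fourth quote-delimited field and that a wrapped URL always follows 'address='.

```python
def effective_referer(log_line):
    """Return the referer, unwrapping a URL embedded after 'address=' if present."""
    fields = log_line.split('"')
    if len(fields) < 4:
        return None  # line does not have the expected quoted fields
    referer = fields[3]  # fields[1] is the request line, fields[3] the referer
    # partition() returns (before, separator, after); separator is '' if not found
    _, sep, wrapped = referer.partition('address=')
    return wrapped if sep else referer
```

Returning None for a malformed line lets the caller decide whether to skip or log it, instead of failing with an IndexError.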

CodePudding user response:

Use re.findall() then re.split(). Note that re.findall() returns a list, so take its first element before splitting.

pattern = r'(http://\D.*)'                       # 'http://' followed by a non-digit, then the rest of the line
url_start = re.findall(pattern, log_file_string) # list of matches; the first one starts at the URL
url = re.split(r"\s", url_start[0])              # to get the URL alone by
                                                 # splitting on whitespace

url = url[0]

You may need to use str.strip() to remove any remaining special characters enclosing the url.
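As a possible alternative that avoids the strip() step, the pattern itself can stop at whitespace or a double quote. The sketch below uses [A-Za-z] rather than \D for the first host character, following the "http://LETTERS" wording in the question; the name first_named_host_url is hypothetical.

```python
import re

# 'http://' + one letter, then everything up to whitespace or a double quote
URL_RE = re.compile(r'http://[A-Za-z][^\s"]*')

def first_named_host_url(text):
    """Return the first http:// URL whose host starts with a letter, or None."""
    m = URL_RE.search(text)  # search() scans the whole string, unlike match()
    return m.group(0) if m else None
```

Because re.search() is not anchored at the start of the string, it skips over http://192.168.8.1/... and finds the URL embedded after it.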

If you must use re.match(), try simplifying the pattern.

# lazy (.*?) with no leading quote, so a URL embedded after 'address=' is found too
pattern = r'(.*?)(http://\D.*)"(.*)'
url_start = re.match(pattern, log_file_string)
url_string = url_start.group(2)
url = re.split(r"\s", url_string)
url = url[0].strip('"')

Match.groups() returns a tuple of all groups, whereas Match.group(), used above, returns a single group. To use groups() instead, try:

pattern = r'(.*?)(http://\D[^"]*)"'
url_start = re.match(pattern, log_file_string)
url = url_start.groups()
url = url[1]
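As for the AttributeError in the question: re.match() returns None when the pattern does not match, and the modified pattern requires http:// at the start of the first quoted field, which is the request line (GET ...), not the referer. A minimal sketch of a full-line parser with a None check is shown below. The names LOG_RE, URL_RE and parse_line are hypothetical, and the field layout is assumed from the two sample lines only.

```python
import re

# Field layout assumed from the sample lines in the question.
LOG_RE = re.compile(
    r'- - \[(?P<timestamp>.*?)\] "(?P<client_request_line>.*?)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>.*?)" (?P<session_id>.*)'
)
# 'http://' followed by a letter, up to whitespace or a double quote
URL_RE = re.compile(r'http://[A-Za-z][^\s"]*')

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:  # guard: calling .groups() on None raises AttributeError
        return None
    record = m.groupdict()
    # If the referer wraps a URL with a named host, keep only that URL.
    inner = URL_RE.search(record["referer"])
    if inner:
        record["referer"] = inner.group(0)
    return record
```

With this shape, lines that do not match can be filtered out (for example, by skipping None results) before appending to cleaned_lines.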