I have a list of links; some of the pages have page numbers and some don't. I'm trying to scrape the site to get the last page number, but the pager text comes back in this format: '\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3\n\n\x85\n\n23'
Can someone help me extract just the last 2 characters of that string?
Here's the code I am using and the output I am getting.
for i in range(0, len(links)):
    url = links[i]
    response = requests.get(url, cookies)
    soup = BeautifulSoup(response.content)
    pr = [f.text for f in soup.find_all(class_='lia-paging-full-wrapper lia-paging-pager lia-paging-full-left-position lia-discussion-page-message-pager lia-forum-topic-page-gte-5-pager lia-component-message-pager')]
    ed = [i.split('\n\n\n\n\n\nNext\n»\n\n\n\n', 1)[0] for i in pr]
    print(ed)
The output I'm getting is this:
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3\n\n\x85\n\n23']
[]
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3']
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3']
[]
[]
[]
[]
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2']
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3\n\n\x85\n\n16']
[]
[]
[]
['\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2']
[]
How can I get just the last 2-3 characters, since those represent the page numbers?
CodePudding user response:
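You could slice the string with ed[0][-2:] (ed is a list, so ed[-2:] alone would give you list items, not characters).
However, I noticed your page numbers have either 1 or 2 digits, so a fixed-width slice would pick up a newline for the single-digit ones. There are many ways to handle this; one is to match the digits at the end of the string with a regex: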
import re

import requests
from bs4 import BeautifulSoup

# One or more digits at the very end of the string
pattern = re.compile(r'\d+$')

# `links` and `cookies` are the objects from the question
for url in links:
    # Pass cookies by keyword: the second positional argument of requests.get is params
    response = requests.get(url, cookies=cookies)
    soup = BeautifulSoup(response.content)
    pr = [f.text for f in soup.find_all(class_='lia-paging-full-wrapper lia-paging-pager lia-paging-full-left-position lia-discussion-page-message-pager lia-forum-topic-page-gte-5-pager lia-component-message-pager')]
    ed = [i.split('\n\n\n\n\n\nNext\n»\n\n\n\n', 1)[0] for i in pr]
    if ed:
        # The trailing digits are the last page number
        page_count = pattern.findall(ed[0])
        print(page_count[0])
    else:
        print('ed is empty!')
OUTPUT:
23
ed is empty!
3
3
ed is empty!
ed is empty!
ed is empty!
ed is empty!
2
16
ed is empty!
ed is empty!
ed is empty!
2
ed is empty!
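If you want to try the regex step on its own, without hitting the site, here is a minimal sketch that runs it against a few of the strings from your output. The names last_page_number and TRAILING_NUMBER are made up for this example; I'm also assuming you want the number as an int and None when there is no pager text.

```python
import re

# Hypothetical helper names, not part of requests or BeautifulSoup.
# Matches the run of digits at the end of the text, allowing trailing whitespace.
TRAILING_NUMBER = re.compile(r'(\d+)\s*$')


def last_page_number(text):
    """Return the trailing number as an int, or None if the text has none."""
    match = TRAILING_NUMBER.search(text)
    return int(match.group(1)) if match else None


# Sample pager strings copied from the question's output.
samples = [
    '\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3\n\n\x85\n\n23',
    '\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2\n\n3',
    '\n\n\n\n\n\n\n\n«\nPrevious\n\n\n\n\n\n\n1\n\n2',
]

for sample in samples:
    print(last_page_number(sample))
```

This prints 23, 3 and 2. Because it matches however many digits sit at the end, it should keep working if a thread ever reaches three-digit page counts, which a fixed-width slice would not.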