Regex not matching text scraped with soup.get_text()

Time:01-25

The code below works until:

print(salary_range)

This is the code:

from bs4 import BeautifulSoup
import re
import requests

url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')

salary_range = soup2.get_text().strip()
print(salary_range)

# error on line below
bottom_salary = re.search(r"^(\d{0,2} ?\d{3})", salary_range).group(1)
print(bottom_salary)

bottom_salary_int = re.sub(" ", "", bottom_salary)
print(bottom_salary_int)

Why doesn't re.search() find any match? I've tried many other regular expressions, but it never finds a match and I always get this error:

AttributeError: 'NoneType' object has no attribute 'group'

CodePudding user response:

The re.search() function returns a match object when the pattern matches and None when it does not. Here the pattern did not match anything, so re.search() returned None, and calling .group() on None raises the AttributeError.
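A minimal sketch of guarding against a failed match, so that a non-matching pattern produces a clear result instead of an AttributeError. The helper name first_number and the sample strings are made up for illustration; they are not from the question.

```python
import re


def first_number(text):
    """Return the leading run of digits in text, or None if there is none."""
    match = re.search(r"^\d+", text)
    if match is None:
        # re.search() found nothing, so there is no match object to call .group() on
        return None
    return match.group(0)


print(first_number("10 000 PLN"))
print(first_number("PLN 10 000"))
```

This does not explain why the pattern fails, but it makes the failure explicit and easier to debug.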

CodePudding user response:

The issue is that the character you think is a space is not actually a space; it is a non-breaking space (NBSP, U+00A0). The two look identical but are different characters, so a literal space in a pattern does not match an NBSP. An NBSP tells the browser not to break the line at that position. See this small diagram:

10 000  – 16 000  PLN
  ^   ^^
 NBSP SP  ... same deal here 
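One way to check which whitespace characters a scraped string really contains is to list them by code point and Unicode name. The sketch below uses a hard-coded sample string written with explicit \xa0 escapes; that sample is an assumption modelled on the diagram, not the exact text returned by the site, and describe_whitespace is a hypothetical helper.

```python
import unicodedata


def describe_whitespace(text):
    """Return (index, code point, Unicode name) for every whitespace character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace()
    ]


# Assumed sample: NBSP as the thousands separator, regular spaces elsewhere.
salary_range = "10\xa0000 – 16\xa0000 PLN"

# ascii() escapes every non-ASCII character, which makes the NBSP visible.
print(ascii(salary_range))

for index, code_point, name in describe_whitespace(salary_range):
    print(index, code_point, name)
```

Running the same helper on the real salary_range from soup.get_text() should show where the site uses NO-BREAK SPACE and where it uses SPACE.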

To match the non-breaking space instead, put it in the pattern using its hex escape, \xa0 (code point 0xA0). Like this:

from bs4 import BeautifulSoup
import re
import requests

url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')

salary_range = soup2.get_text().strip()
print(salary_range)

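# \xa0 is the non-breaking space used as the thousands separator
bottom_salary = re.search(r"^(\d{0,2}\xa0?\d{3})", salary_range).group(1)
print(bottom_salary)

# \s matches the non-breaking space as well as a regular space, so this strips either
bottom_salary_int = int(re.sub(r"\s", "", bottom_salary))
print(bottom_salary_int)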

If you're trying to match a space but the regular space character doesn't match, it may be an NBSP instead. You can also check the website's source code: an NBSP is usually encoded there as &nbsp; rather than as a regular space.
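As an alternative to hard-coding \xa0, the pattern can use \s, which for str patterns in Python 3 matches Unicode whitespace, including the non-breaking space. The sketch below pulls every amount out of a salary string; parse_amounts and the sample strings are hypothetical and assume thousands are grouped in blocks of three digits.

```python
import re


def parse_amounts(text):
    """Return every space-grouped number in text as an int."""
    # One or more digits, then any number of "<whitespace><3 digits>" groups.
    groups = re.findall(r"\d+(?:\s\d{3})*", text)
    # Remove the separators (regular space or NBSP) before converting.
    return [int(re.sub(r"\s", "", group)) for group in groups]


print(parse_amounts("10\xa0000 – 16\xa0000 PLN"))
print(parse_amounts("10 000 - 16 000 PLN"))
```

Because \s covers both characters, the same code works whether the site emits regular spaces or NBSPs.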
