Python - How to extract the name between two characters (> and <)

Time:12-05

I am trying to build a scraper for Yelp and get the usernames from reviews.

Here is the code I have:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])

CHROMEDRIVER_PATH ='../Selenium/chromedriver.exe'
service = ChromeService(executable_path=CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
User = []
content = driver.page_source
soup = BeautifulSoup(content)
data = soup.findAll("li", attrs={'class':'margin-b5__09f24__pTvws border-color--default__09f24__NPAKY'})
for each in data:
    users = BeautifulSoup(str(each))
    names = users.find_all("a", attrs={'class':'css-1422juy' , "href": lambda L: L and L.startswith("/user_details?userid=")})
    for name in names:
        print(name)

which gives me output like this:

<a class="css-1422juy" href="/user_details?userid=70GQY6hu-iW2LZA6RoYXfw">Melissa G.</a>
<a class="css-1422juy" href="/user_details?userid=u766nLu7-4ptpYTPdlGbnA">Shannon B.</a>
<a class="css-1422juy" href="/user_details?userid=Icf-gy1YWYck0P3zW3f-pg">Jason Z.</a>
<a class="css-1422juy" href="/user_details?userid=m-BAEdY7IfKmXm5zk_XoLw">Marty H.</a>
<a class="css-1422juy" href="/user_details?userid=lbY_bZoZJwXNV7NiVAUr9w">Jules R.</a>
<a class="css-1422juy" href="/user_details?userid=aXCzU04l53gmgiQvISreaA">Christine H.</a>
<a class="css-1422juy" href="/user_details?userid=tbAN5YiUmBbccxdZAC4Daw">Tanisha K.</a>
<a class="css-1422juy" href="/user_details?userid=ZcG3eoa9mI_-CEgp5XagiQ">Steven T.</a>
<a class="css-1422juy" href="/user_details?userid=DZqA7lTwUsNUW8gF5dUjyw">Dan D.</a>
<a class="css-1422juy" href="/user_details?userid=e7xkB5I6BygG8yNX2URvgA">Ashley Y.</a>

Now, I am having a hard time getting the names out of these strings. I tried using several different methods to extract the name as a string between two characters > and < but I haven't had any luck.

Here is what I tried:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])

CHROMEDRIVER_PATH ='../Selenium/chromedriver.exe'
service = ChromeService(executable_path=CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
User = []
content = driver.page_source
soup = BeautifulSoup(content)
data = soup.findAll("li", attrs={'class':'margin-b5__09f24__pTvws border-color--default__09f24__NPAKY'})
for each in data:
    users = BeautifulSoup(str(each))
    names = users.find_all("a", attrs={'class':'css-1422juy' , "href": lambda L: L and L.startswith("/user_details?userid=")})
    for name in names:
        start = name.find('>')
        end = name.find('<', start + 1)
        substring = name.find[start + 1:end]
        print(f"Start: {start}, End: {end}")
        print(substring)

But this doesn't work either. Any suggestion on how to get these names out into the list?

CodePudding user response:

If you want to correct your code, the problem is here:

    start = name.find('>')
    end = name.find('<', start + 1)
    substring = name.find[start + 1:end]

which needs to be changed to:

    namestr = str(name)
    start = namestr.find('>')
    end = namestr.find('<', start + 1)
    substring = namestr[start + 1:end]

to create a slice of the string. Normally you should not need str(), but here name is a BeautifulSoup Tag rather than a string, so name.find() searches for child tags instead of characters. You could also use a regular expression, as in Tim's answer. Generally speaking, regular expressions can be slow if they are complex, so sometimes I just use something like yours.
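For comparison, here is a minimal sketch of the regular-expression route using only the standard re module. The sample string is copied from the output in the question; the pattern and the variable names html and username are my own assumptions, not taken from Tim's answer.

```python
import re

# Sample input: the string form of one <a> tag from the question's output.
html = '<a class="css-1422juy" href="/user_details?userid=70GQY6hu-iW2LZA6RoYXfw">Melissa G.</a>'

# Capture everything between the first '>' and the next '<'.
match = re.search(r">([^<]*)<", html)

# re.search returns None when nothing matches, so guard before reading the group.
username = match.group(1) if match else None
print(username)
```

With a real Tag object you would pass str(name) instead of the literal string.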

Update:

I tried a simplified version of your code and it works for me. Here is that code:

names = ['<a  href="/user_details?userid=70GQY6hu-iW2LZA6RoYXfw">Melissa G.</a>',
'<a  href="/user_details?userid=u766nLu7-4ptpYTPdlGbnA">Shannon B.</a>',
'<a  href="/user_details?userid=Icf-gy1YWYck0P3zW3f-pg">Jason Z.</a>',
'<a  href="/user_details?userid=m-BAEdY7IfKmXm5zk_XoLw">Marty H.</a>',
'<a  href="/user_details?userid=lbY_bZoZJwXNV7NiVAUr9w">Jules R.</a>',
'<a  href="/user_details?userid=aXCzU04l53gmgiQvISreaA">Christine H.</a>',
'<a  href="/user_details?userid=tbAN5YiUmBbccxdZAC4Daw">Tanisha K.</a>',
'<a  href="/user_details?userid=ZcG3eoa9mI_-CEgp5XagiQ">Steven T.</a>',
'<a  href="/user_details?userid=DZqA7lTwUsNUW8gF5dUjyw">Dan D.</a>',
'<a  href="/user_details?userid=e7xkB5I6BygG8yNX2URvgA">Ashley Y.</a>']

for name in names:
    namestr = str(name)
    start = namestr.find('>')
    end = namestr.find('<', start + 1)
    substring = namestr[start + 1:end]
    print(f"Start: {start}, End: {end}")
    print(substring)

CodePudding user response:

Simply use this:

s='<a  href="/user_details?userid=70GQY6hu-iW2LZA6RoYXfw">Melissa G.</a>'
Username = s[s.find('">') + len('">'):s.find('</a>')]
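Since str.find returns -1 when a marker is missing, the one-line slice can silently return the wrong text on unexpected input. Below is a rough sketch of the same idea wrapped in a helper that checks for that case and collects the results into a list. The function name text_between and the two-item sample list are hypothetical, introduced only for illustration.

```python
def text_between(s, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = s.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)        # move past the opening marker
    end = s.find(end_marker, start)   # search only after the opening marker
    if end == -1:
        return None
    return s[start:end]


# Two of the strings from the question, used as sample data.
links = [
    '<a  href="/user_details?userid=70GQY6hu-iW2LZA6RoYXfw">Melissa G.</a>',
    '<a  href="/user_details?userid=u766nLu7-4ptpYTPdlGbnA">Shannon B.</a>',
]

# Build the list of user names the question asks for.
User = [text_between(link, '">', '</a>') for link in links]
print(User)
```

In the scraper, each element would be str(name) rather than a literal string.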