I want to extract every <span> whose 'data-test-id' matches '^fdbk-item-.*$' from the linked page (source).
Furthermore, from those elements I only want to keep the ones whose text contains mirror, tray, or ceramic, in either upper or lower case.
Retrieving the elements with 'data-test-id'='^fdbk-item-.*$' using find_all() was successful:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import re
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options, executable_path='/Users/user/Desktop/pytest/chromedriver')
driver.implicitly_wait(10)
url="https://www.ebay.com/fdbk/feedback_profile/blueberbestmall?filter=feedback_page:RECEIVED_AS_SELLER,period:TWELVE_MONTHS,overall_rating:NEGATIVE&commentType=NEGATIVE"
driver.get(url)
time.sleep(3)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
s=soup.find_all('span', attrs={'data-test-id': re.compile('^fdbk-item-.*$')})
time.sleep(3)
for i in s:
    print(i)
But I'm struggling to get find_all and find_elements_by_xpath to work together.
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
time.sleep(3)
item_lst=[]
s=soup.find_all('span', attrs={'data-test-id': re.compile('^fdbk-item-.*$')})
try:
    for t in s:
        p=item_lst.append(t)
        item_fil=[]
        i=p.find_elements_by_xpath("//div[contains(text(),'mirror|tray|ceramic',flags=re.IGNORECASE)]")
        for j in i:
            k=item_fil.append(j)
            print(k)
except:
    pass
I have no idea how to modify:
find_elements_by_xpath("//div[contains(text(),'mirror|tray|ceramic',flags=re.IGNORECASE)]")
Is it possible to further refine the extracted elements?
CodePudding user response:
Here are two alternatives to reach your goal:
Select your elements with a CSS selector and check their text with :-soup-contains-own() (note that this match is case sensitive):
soup.select('div:has(>span:-soup-contains-own("Mirror","Tray","Ceramic"))')
Select your elements more specifically and use a list comprehension to check against the matches, so no regex is needed:
matches = ['mirror','tray','ceramic']
[e for e in soup.select('[data-feedback-id] .card__item') if any([x in e.text.lower() for x in matches])]
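If you would rather keep the find_all() call from the question, a third option is to filter its results with a compiled regex and re.IGNORECASE. This is only a minimal sketch: the html string is made-up sample markup standing in for driver.page_source, and keyword_re and filtered are illustrative names.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for driver.page_source
html = '''
<div><span data-test-id="fdbk-item-1">1PC Swallow Mirror Wall Stickers</span></div>
<div><span data-test-id="fdbk-item-2">CERAMIC mug set</span></div>
<div><span data-test-id="fdbk-item-3">Phone case</span></div>
<div><span data-test-id="fdbk-comment-1">The tray was broken</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Case-insensitive pattern for the three keywords
keyword_re = re.compile(r'mirror|tray|ceramic', re.IGNORECASE)

# Step 1: same attribute filter as in the question
items = soup.find_all('span', attrs={'data-test-id': re.compile(r'^fdbk-item-')})

# Step 2: keep only the tags whose text contains one of the keywords
filtered = [t for t in items if keyword_re.search(t.get_text())]

for t in filtered:
    print(t.get_text())
```

Note that list.append() returns None, which is why p and k in the question's loop never hold an element; collecting with a list comprehension avoids that.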
Example
from bs4 import BeautifulSoup
html = '''
<tr data-feedback-id="1638468213026"><td><div ><div ><svg data-test-id="fdbk-rating-13" data-test-type="negative" viewBox="0 0 22 22" height="24" width="24" aria-label="Negative feedback rating" role="img"><path fill="#E0103A" d="M10.969 0C4.911 0 0 4.911 0 10.969s4.911 10.969 10.969 10.969 10.969-4.91 10.969-10.969C21.938 4.912 17.027 0 10.969 0zm4.113 12.339H6.854a1.37 1.37 0 1 1 0-2.74h8.227a1.37 1.37 0 1 1 0 2.74h.001z"></path></svg></div><div ><div ><span data-test-id="fdbk-comment-13" aria-label="Such a shame, the pack contained 6 swallows, but 5 were broken.">Such a shame, the pack contained 6 swallows, but 5 were broken.</span></div><div ><span data-test-id="fdbk-item-13">1PC Swallow Mirror Non-toxic Peel and Stick Creative Wall Stickers for Bedroom (#384870311555)</span></div></div></div></td><td><div ><span data-test-id="fdbk-context-13" aria-label="Feedback left by buyer.">2***w</span><span > (<span data-test-id="fdbk-rating-score-13">340</span><span data-test-id="fdbk-rating-icon-13"></span>)</span></div><div ><span data-test-id="fdbk-price-13">GBP 5.62</span></div></td><td><div><span data-test-id="fdbk-time-13" aria-label="Past year">Past year</span></div><div ></div></td></tr>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('tr:has(span:-soup-contains("Mirror","Tray","Ceramic"))'))
matches = ['mirror','tray','ceramic']
[e for e in soup.select('[data-feedback-id] .card__item') if any([x in e.text.lower() for x in matches])]
Output
[<div ><span data-test-id="fdbk-item-13">1PC Swallow Mirror Non-toxic Peel and Stick Creative Wall Stickers for Bedroom (#384870311555)</span></div>]
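On the XPath part of the question: BeautifulSoup tags have no find_elements_by_xpath method, and XPath 1.0 (which browsers evaluate) has neither regex alternation nor a flags argument. If you want to stay in Selenium, one common workaround is to lower-case the node text with translate() and chain contains() tests with or. The helper below is a hypothetical sketch that only builds the expression; it assumes the keywords contain no quote characters, and the commented-out Selenium call is not executed here.

```python
import string

def ci_contains_xpath(words, attr_prefix="fdbk-item-"):
    """Build an XPath 1.0 expression matching <span> elements whose
    data-test-id starts with attr_prefix and whose text contains any
    of the given words, ignoring ASCII case."""
    # translate() maps A-Z to a-z, emulating a case-insensitive comparison
    lowered = (
        "translate(normalize-space(.), "
        f"'{string.ascii_uppercase}', '{string.ascii_lowercase}')"
    )
    tests = " or ".join(f"contains({lowered}, '{w.lower()}')" for w in words)
    return f"//span[starts-with(@data-test-id, '{attr_prefix}') and ({tests})]"

xpath = ci_contains_xpath(["mirror", "tray", "ceramic"])
print(xpath)

# With a live Selenium 4 session (assumed, not run here):
# from selenium.webdriver.common.by import By
# elements = driver.find_elements(By.XPATH, xpath)
```

The call must be made on the driver (or a WebElement), not on a BeautifulSoup tag, and in Selenium 4 it is spelled find_elements(By.XPATH, ...).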