How do I find data of a parent that includes a specific child?

I want to extract data from a <div> only if that parent contains a specific child <p>.

This is what I tried, but it still gives me data from the other <div> elements that do not contain the <p>.

import requests
from bs4 import BeautifulSoup
import smtplib
import os
from csv import writer

for n in range (1, 2):
  url = f'https://www.musinsa.com/app/reviews/lists?type=&year_date=2022&month_date=&day_date=&max_rt=2022&min_rt=2009&brand=&page={n}'
  headers = {"User-Agent": ""}
  page = requests.get(url, headers=headers)
  Soup1 = BeautifulSoup(page.content, "html.parser")
  Soup2 = BeautifulSoup(Soup1.prettify(), "html.parser")
  if Soup2.find_all('p', {'class':'review-profile__body_information'}):
    ss = Soup2.find_all('p', {'class':'review-profile__body_information'})
    product_name = Soup2.find_all('a',{'class':'review-goods-information__name'})
    rating = Soup2.find_all('span',{'class':'review-list__rating__active'})
    comment = Soup2.find_all('div',{'class','review-contents__text '})
    eval = Soup2.find_all('li',{'class':'review-evaluation__item'})
    images = Soup2.find_all('li',{'class':'review-content-photo__item'})
    allinfo = [ss, product_name, rating, comment, eval, images]
    print(allinfo)

How should I write this so that it only gives me data from a <div> that contains the specific <p>? Thanks.

CodePudding user response：

I recommend using XPath. There are more details about it here: https://www.zyte.com/blog/an-introduction-to-xpath-with-examples/

As for the code, I recommend the lxml module (https://pypi.org/project/lxml/), which supports XPath expressions.

# pip install lxml
from lxml import html
import requests


url = "your url you want to get"
response = requests.get(url)

# Parse the response body (the page) into an lxml HTML element tree
page_tree = html.fromstring(response.content)

# xpath() returns a list of all matching elements, not a single element
elements = page_tree.xpath("xpath to object")

for element in elements:
    # Text of the element
    text = element.text

    # Value of a specific attribute (href here, or another attribute of the element)
    value = element.get("href")

So how do you get a child? To get a child of an element, append "/child::" followed by the child's tag name to the end of the parent's XPath. For example:

<div class='main_div'>
    <div class='parent'>
        <p class='content'>
            "text you need"
        </p>
    </div>
</div>

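To get "text you need", you can use the following XPath (note that 'content' is the class of the <p>, so the parent is matched by its own class, 'parent'): //div[@class='parent']/child::p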
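The question asks for the opposite direction: keep a parent only if it contains a certain child. In XPath that can be written as a predicate on the parent. The following is a minimal sketch, not a drop-in solution for the real page: the SAMPLE markup is made up for illustration (it extends the snippet above with a second 'parent' div that has no <p>), and the class names would need to be replaced with the ones from your page.

```python
from lxml import html

# Hypothetical markup: two 'parent' divs, only the first one contains the <p>
SAMPLE = """
<div class='main_div'>
    <div class='parent'>
        <p class='content'>text you need</p>
    </div>
    <div class='parent'>
        <span>no paragraph here</span>
    </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# The predicate [p[@class='content']] keeps only the divs that have
# a direct <p class='content'> child; the div itself is what is returned
parents = tree.xpath("//div[@class='parent'][p[@class='content']]")

# From each matching parent, read the text of that <p>
texts = [parent.xpath("string(p[@class='content'])").strip() for parent in parents]

print(len(parents), texts)
```

Once you hold the parent element, any further lookups (product name, rating and so on) can be done with relative XPath expressions on that parent, so data from non-matching divs never gets mixed in.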

It is worth learning more about XPath; I find it simpler than using Beautiful Soup to find elements.

CodePudding user response：

One approach is to select your elements more specifically, e.g. with CSS selectors:

soup.select('div.review-list:has(p.review-profile__body_information)')

You should also change your strategy: instead of generating a lot of separate lists, it is better to iterate over each item, check it, and pick the info you need.
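To see the filtering in isolation, without a network request, here is a small sketch on made-up markup. The SAMPLE string is hypothetical and only reuses the class names from the question; it assumes a Beautiful Soup version that bundles soupsieve, which is what provides :has(). It compares the :has() selector with an explicit loop that checks each container for the <p>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the question
SAMPLE = """
<div class="review-list">
  <p class="review-profile__body_information">profile info</p>
  <a class="review-goods-information__name">Product A</a>
</div>
<div class="review-list">
  <a class="review-goods-information__name">Product B</a>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Variant 1: let the CSS selector keep only containers that include the <p>
with_has = soup.select("div.review-list:has(p.review-profile__body_information)")

# Variant 2: loop over every container and check for the <p> explicitly
with_loop = [
    item
    for item in soup.find_all("div", class_="review-list")
    if item.find("p", class_="review-profile__body_information")
]

# Pick the needed info from each matching container only
names = [
    item.select_one(".review-goods-information__name").text.strip()
    for item in with_has
]

print(names)
```

Both variants select the same containers here; the selector version is shorter, while the loop makes the check explicit if you need extra conditions per item.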

Example

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}

data = []
for n in range (1,3):
    url = f'https://www.musinsa.com/app/reviews/lists?type=&year_date=2022&month_date=&day_date=&max_rt=2022&min_rt=2009&brand=&page={n}'
    res = requests.get(url, headers=headers)
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(res.text, 'html.parser')
    for e in soup.select('div.review-list:has(p.review-profile__body_information)'):
        data.append({
            'product_name':e.select_one('.review-goods-information__name').text.strip(),
            'comment':e.select_one('.review-contents__text').text.strip(),
            'something':'more ...'
        })
data

Output

[{'product_name': '[패키지] DUMBY BEAR 2PACK T-SHIRTS',
  'comment': '중학생 둘째아들 입힐려고 구매했습니다 \n편하고 이쁘네요. 잘 입을거 같아요',
  'something': 'more ...'},
 {'product_name': '컴피 워셔 후드 숏 자켓_차콜',
  'comment': '주문한지 얼마 되지 않아서 바로 도착을 하였네요',
  'something': 'more ...'},
 {'product_name': '컬러플러스 미니로고 버뮤다 숏팬츠 MINT GREEN',
  'comment': '색상도 이뿌고 반바지 기장도 넉넉하니 부담없이\n자주 입게 되겠네요~',
  'something': 'more ...'},
 {'product_name': 'CARGO STRING PANTS _ OLIVE',
  'comment': '검은색 사고 재구매인데 너무 편하고 이쁩니다 \n만족!',
  'something': 'more ...'}]