Parse data with no class

Time:07-19

I have the following Python code:

from bs4 import BeautifulSoup
import pandas as pd
import requests
data0 = []
data1 = []
response = requests.get(
    "https://www.comicshoplocator.com/StoreLocatorPremier?query=75077&showCsls=true"
)
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup.find_all('div', class_="LocationName"):
    title = tag.text
    data0.append({
        'title': title
    })

for button in soup.find_all('div', class_="LocationDetails"):
    for childdiv in button.find_all('div', class_="LocationShopProfile"):
        for zb in childdiv.find_all('a'):
            if zb.get_text() == 'Shop Profile':
                website = zb.get('href')
                forsite = requests.get('https://www.comicshoplocator.com/' + website)
                soup = BeautifulSoup(forsite.text, "html.parser")
                for tag in soup.find_all('div', class_="StoreWeb"):
                    site = tag.text.replace('Web: http://', '')
                    data1.append({
                        'site': site
                    })
df = pd.DataFrame(columns=['Name', 'Website'])

df[df.columns[0]] = pd.DataFrame(data0)
df[df.columns[1]] = pd.DataFrame(data1)

My output is:

                        Name                         Website
0       TWENTY ELEVEN COMICS      WWW.TWENTYELEVENCOMICS.COM
1                READ COMICS         www.boomerangcomics.com
2           BOOMERANG COMICS  www.facebook.com/morefuncomics
3  MORE FUN COMICS AND GAMES   www.madnesscomicsandgames.com
4     MADNESS COMICS & GAMES                             NaN
5  SANCTUARY BOOKS AND GAMES                             NaN

The correct output should be:

                        Name                         Website
0       TWENTY ELEVEN COMICS      WWW.TWENTYELEVENCOMICS.COM
1                READ COMICS                             NaN
2           BOOMERANG COMICS         www.boomerangcomics.com
3  MORE FUN COMICS AND GAMES  www.facebook.com/morefuncomics
4     MADNESS COMICS & GAMES   www.madnesscomicsandgames.com
5  SANCTUARY BOOKS AND GAMES                             NaN

Some stores may not have a "LocationShopProfile" or "StoreWeb" element, so nothing is appended for them. That is why the second column is in the wrong order.

How can I fix that?

Thanks

Answer:

Try:

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.comicshoplocator.com/StoreLocatorPremier?query=75077&showCsls=true"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for shop in soup.select(".CslsLocationItem"):
    name = shop.select_one(".LocationName").text
    u = shop.select_one(".LocationShopProfile a")
    if u:
        s = BeautifulSoup(
            requests.get(
                "https://www.comicshoplocator.com" + u["href"]
            ).content,
            "html.parser",
        )
        u = s.select_one(".StoreWeb a")

    all_data.append((name, u["href"] if u else np.nan))

df = pd.DataFrame(all_data, columns=["Name", "Website"])
print(df.to_markdown(index=False))

Prints:

| Name                      | Website                               |
|:--------------------------|:--------------------------------------|
| TWENTY ELEVEN COMICS      | http://WWW.TWENTYELEVENCOMICS.COM     |
| READ COMICS               | nan                                   |
| BOOMERANG COMICS          | http://www.boomerangcomics.com        |
| MORE FUN COMICS AND GAMES | http://www.facebook.com/morefuncomics |
| MADNESS COMICS & GAMES    | http://www.madnesscomicsandgames.com  |
| SANCTUARY BOOKS AND GAMES | nan                                   |
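
Why this works: the code in the question fills two independent lists, so a shop without a website adds nothing to the second list and every later website moves up one row. The loop in the answer builds one row per shop and writes a placeholder when the website is missing. The following is a minimal offline sketch of that difference. The shop names and URLs are made-up stand-ins, not data from the site, and plain dicts replace the parsed HTML.

```python
# Hypothetical stand-in for the scraped shops; the second one has no website.
shops = [
    {"name": "A COMICS", "website": "www.a.example"},
    {"name": "B COMICS"},
    {"name": "C COMICS", "website": "www.c.example"},
]

# Approach from the question: names and websites are collected separately.
names = [s["name"] for s in shops]
sites = [s["website"] for s in shops if "website" in s]
# sites is one entry short, so padding at the end shifts C's site next to B.
padded = sites + [None] * (len(names) - len(sites))
misaligned = list(zip(names, padded))

# Approach from the answer: one row per shop, with an explicit placeholder.
aligned = [(s["name"], s.get("website")) for s in shops]

print(misaligned)
print(aligned)
```

In the scraping code the same idea means looping over the element that wraps a whole shop and looking up both the name and the website inside it, rather than running two separate searches over the full page.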