using beautiful soup to get consolidated data from a list of urls instead of just the first url


I'm trying to get the data for three states, based on the same URL format.

states = ['123', '124', '125']

urls = []
for state in states:
    url = f'www.something.com/geo={state}'
    urls.append(url)

and from there I have three separate URLs, each containing a different state ID.

However, when I got to processing them via BeautifulSoup, the output only showed data for state 123.

for url in urls:
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')

Subsequently, I extracted the columns I wanted using this:

listings = doc.select('.is-9-desktop')

rows = []

for listing in listings:
    row = {}
    try:
        row['name'] = listing.select_one('.result-title').text.strip()
    except:
        print("no name")
    try:
        row['add'] = listing.select_one('.address-text').text.strip()
    except:
        print("no add")
    try:
        row['mention'] = listing.select_one('.review-mention-block').text.strip()
    except:
        pass
    
    rows.append(row)

But as mentioned, it only showed data for state 123. I'd hugely appreciate it if anyone could let me know where I went wrong, thank you!

EDIT

I appended the BeautifulSoup output for each URL to a list, and was able to fetch the pages for all three states.

doc = []
for url in urls:
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    docs = BeautifulSoup(response.text, 'html.parser')
    doc.append(docs)

However, when I ran the extraction step on it, it resulted in the error message:

AttributeError: 'list' object has no attribute 'select'

Do I run it through another loop?

CodePudding user response:

You do not need all of these loops - just iterate over the states and append the listings to rows.

The most important thing is that rows = [] is placed outside the for loop, so it is not overwritten on each iteration.

Example

states = ['123', '124', '125']

rows = []

for state in states:
    url = f'www.something.com/geo={state}'
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')

    listings = doc.select('.is-9-desktop')

    for listing in listings:
        row = {}
        try:
            row['name'] = listing.select_one('.result-title').text.strip()
        except:
            print("no name")
        try:
            row['add'] = listing.select_one('.address-text').text.strip()
        except:
            print("no add")
        try:
            row['mention'] = listing.select_one('.review-mention-block').text.strip()
        except:
            pass

        rows.append(row)
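As a side note, the repeated try/except blocks can be factored into a small helper that returns a default when a selector matches nothing. A minimal sketch - the CSS selectors are the ones from the question, but the sample HTML here is invented purely for illustration:

```python
from bs4 import BeautifulSoup

def select_text(listing, selector, default=None):
    """Return the stripped text of the first match, or `default` if absent."""
    node = listing.select_one(selector)
    return node.text.strip() if node is not None else default

# Invented sample HTML, just to show the helper in action.
html = """
<div class="is-9-desktop">
  <span class="result-title"> Acme Corp </span>
  <span class="address-text">1 Main St</span>
</div>
"""

doc = BeautifulSoup(html, 'html.parser')
rows = []
for listing in doc.select('.is-9-desktop'):
    rows.append({
        'name': select_text(listing, '.result-title'),
        'add': select_text(listing, '.address-text'),
        'mention': select_text(listing, '.review-mention-block'),
    })

print(rows)
# [{'name': 'Acme Corp', 'add': '1 Main St', 'mention': None}]
```

This keeps each row the same shape even when a field is missing, instead of silently dropping keys the way the bare except version does.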

CodePudding user response:

As noted in the comments, there are some errors in your code. Try this version with the changes.

states = ['123', '124', '125']

urls = []
for state in states:
    url = f'www.something.com/geo={state}'
    urls.append(url)

rows = []
for url in urls:
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    listings = doc.select('.is-9-desktop')
    for listing in listings:
        row = {}
        try:
            row['name'] = listing.select_one('.result-title').text.strip()
        except:
            print("no name")
        try:
            row['add'] = listing.select_one('.address-text').text.strip()
        except:
            print("no add")
        try:
            row['mention'] = listing.select_one('.review-mention-block').text.strip()
        except:
            pass
        
        rows.append(row)
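Once rows has been collected across all states, it can be written out in one go with the standard library's csv module. A minimal sketch - the field names match the keys used above, and the sample rows are invented for illustration:

```python
import csv
import io

# Sample rows as they might come out of the scraping loop above.
rows = [
    {'name': 'Acme Corp', 'add': '1 Main St', 'mention': 'great'},
    {'name': 'Beta LLC', 'add': '2 Oak Ave'},  # 'mention' key may be missing
]

buf = io.StringIO()  # use open('listings.csv', 'w', newline='') for a real file
writer = csv.DictWriter(buf, fieldnames=['name', 'add', 'mention'],
                        restval='')  # fill missing keys with an empty string
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The `restval=''` argument matters here because the try/except extraction can leave keys out of a row; without it, DictWriter would write `None` rendered as an empty field only for keys explicitly set to None, and missing keys are what restval covers.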