Parsing identical elements from separate blocks. Error: Repeating an entry in an array output

Time:12-19

My parser is not working correctly. I'm trying to extract several identical elements from a site, where the elements are located in separate blocks. Here is the parser code:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "http://127.0.0.1:5500/rr.html"
FILE_NAME = "img.csv"


def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    vacancies_li = soup.find_all('li')
    for name in vacancies_names:
        for i in vacancies_li:
            result_list['id'].append(i.a['href'])
        result_l['id'].append(result_list['id'])
        result_list['id'] = []

    return result_l


df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)

Here is the page the parser is processing

<ul class="PhotoListSmall">
    <li>
        <a href="one_1"></a>
    </li>
    <li>
        <a href="one_2"></a>
    </li>
</ul>
<ul class="PhotoListSmall">
    <li>
        <a href="two_1"></a>
    </li>
    <li>
        <a href="two_2"></a>
    </li>
</ul>

And this is what I get

,id
0,"['one_1', 'one_2', 'two_1', 'two_2']"
1,"['one_1', 'one_2', 'two_1', 'two_2']"

And here is what I want to get

,id
0,"['one_1', 'one_2']"
1,"['two_1', 'two_2']"

What have I done wrong?

Please don't judge too harshly, I'm just a beginner developer.

CodePudding user response:

vacancies_li comes from soup.find_all('li'), and soup contains the li elements of both PhotoListSmall lists. The inner loop therefore collects all four hrefs on every pass of the outer loop, which is why the same array is appended twice.
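The difference can be seen in a small sketch that runs on an inline copy of the page instead of the local server. The HTML string and the variable names all_hrefs and per_list_hrefs are assumptions for illustration, and the class attribute is taken from the selector used in the question.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page from the question (assumed markup).
HTML = """
<ul class="PhotoListSmall">
    <li><a href="one_1"></a></li>
    <li><a href="one_2"></a></li>
</ul>
<ul class="PhotoListSmall">
    <li><a href="two_1"></a></li>
    <li><a href="two_2"></a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Searching from the top-level soup sees every <li> on the page.
all_hrefs = [li.a["href"] for li in soup.find_all("li")]

# Searching from each <ul> tag sees only that list's own <li> elements.
per_list_hrefs = [
    [li.a["href"] for li in ul.find_all("li")]
    for ul in soup.find_all("ul", class_="PhotoListSmall")
]

print(all_hrefs)
print(per_list_hrefs)
```

The first list mixes both blocks together, while the second keeps one inner list per block, which is the shape the question asks for.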

To get the li elements of only one list at a time, create a separate soup from the current ul inside the loop instead of searching the original soup.

for name in vacancies_names:
    li_soup = bs(str(name), "html.parser")

The new soup is built from name, so it only has access to the li elements of the current PhotoListSmall list. Then build vacancies_li by finding all li elements in li_soup.

The final code should look something like this.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "http://192.168.100.196:5500/test.html"
FILE_NAME = "img.csv"


def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}

    r = requests.get(url)
    soup = bs(r.text, "html.parser")

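    # Each <ul class="PhotoListSmall"> is one block of links.
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    # Removed: soup.find_all('li') returned the <li> elements of every block.

    for name in vacancies_names:
        # Re-parse only this block so the search is limited to its own <li> elements.
        li_soup = bs(str(name), "html.parser")
        vacancies_li = li_soup.find_all('li')

        for i in vacancies_li:
            result_list['id'].append(i.a['href'])

        # Store this block's hrefs as one row, then start a fresh list for the next block.
        result_l['id'].append(result_list['id'])
        result_list['id'] = []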

    return result_l


df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)
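As a possible alternative to re-parsing each block, a Tag returned by find_all has its own find_all, so the search can be scoped to the current ul directly. The sketch below is one way to write that, not the only one. The function name parse_html and the SAMPLE string are hypothetical, and the function takes the HTML text as an argument so it can run without the local server.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_html(html):
    """Return one list of hrefs per <ul class="PhotoListSmall"> block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for ul in soup.find_all("ul", class_="PhotoListSmall"):
        # find_all called on the tag only searches inside this <ul>.
        # href=True skips any <a> that has no href attribute.
        # Note: this collects every <a> in the block, not just the
        # first <a> of each <li> as the original loop does.
        rows.append([a["href"] for a in ul.find_all("a", href=True)])
    return {"id": rows}


# Assumed markup, mirroring the page from the question.
SAMPLE = """
<ul class="PhotoListSmall">
    <li><a href="one_1"></a></li>
    <li><a href="one_2"></a></li>
</ul>
<ul class="PhotoListSmall">
    <li><a href="two_1"></a></li>
    <li><a href="two_2"></a></li>
</ul>
"""

df = pd.DataFrame(data=parse_html(SAMPLE))
print(df.to_csv())
```

With the real page, the text from requests.get(url).text could be passed to parse_html in place of SAMPLE, and df.to_csv(FILE_NAME) would write the file as before.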

I hope it helped you :)
