I wrote some crooked parser code. I'm trying to write a parser that extracts several identical elements from a site, located in separate blocks. Here is the parser code:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
URL_TEMPLATE = "http://127.0.0.1:5500/rr.html"
FILE_NAME = "img.csv"
def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    vacancies_li = soup.find_all('li')
    for name in vacancies_names:
        for i in vacancies_li:
            result_list['id'].append(i.a['href'])
        result_l['id'].append(result_list['id'])
        result_list['id'] = []
    return result_l

df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)
Here is the page the parser is processing:
<ul class="PhotoListSmall">
  <li>
    <a href="one_1"></a>
  </li>
  <li>
    <a href="one_2"></a>
  </li>
</ul>
<ul class="PhotoListSmall">
  <li>
    <a href="two_1"></a>
  </li>
  <li>
    <a href="two_2"></a>
  </li>
</ul>
And this is what I get
,id
0,"['one_1', 'one_2', 'two_1', 'two_2']"
1,"['one_1', 'one_2', 'two_1', 'two_2']"
And here is what I want to get
,id
0,"['one_1', 'one_2']"
1,"['two_1', 'two_2']"
What have I done wrong?
Do not judge strictly, I'm just a beginner developer
CodePudding user response:
You used the original soup, which contains all the li elements from both of the PhotoListSmall lists. That's why it appends the same array twice. To get the li elements from only one list at a time, create another soup inside the loop instead of using the original soup:

for name in vacancies_names:
    li_soup = bs(str(name), "html.parser")
The new soup takes name as its data, so it only has access to the li elements of a single PhotoListSmall list. Now create vacancies_li by finding all li elements in li_soup.
The final code should look something like this.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
URL_TEMPLATE = "http://192.168.100.196:5500/test.html"
FILE_NAME = "img.csv"
def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    # vacancies_li = soup.find_all('li')
    for name in vacancies_names:
        li_soup = bs(str(name), "html.parser")
        vacancies_li = li_soup.find_all('li')
        for i in vacancies_li:
            result_list['id'].append(i.a['href'])
        result_l['id'].append(result_list['id'])
        result_list['id'] = []
    return result_l

df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)
I hope it helped you :)
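As an aside, building a second soup isn't strictly necessary: every ul returned by find_all is itself a Tag, and calling find_all on it only searches inside that element. A minimal sketch of this variant, using an inline copy of the sample HTML instead of a live request (parse_html and SAMPLE_HTML are illustrative names, not from the original code):

```python
from bs4 import BeautifulSoup as bs

SAMPLE_HTML = """
<ul class="PhotoListSmall">
  <li><a href="one_1"></a></li>
  <li><a href="one_2"></a></li>
</ul>
<ul class="PhotoListSmall">
  <li><a href="two_1"></a></li>
  <li><a href="two_2"></a></li>
</ul>
"""

def parse_html(html):
    soup = bs(html, "html.parser")
    # Each <ul> Tag supports find_all, so searching it yields only its own <li> children.
    return {'id': [[li.a['href'] for li in ul.find_all('li')]
                   for ul in soup.find_all('ul', class_='PhotoListSmall')]}

print(parse_html(SAMPLE_HTML))
# {'id': [['one_1', 'one_2'], ['two_1', 'two_2']]}
```

This gives one inner list per PhotoListSmall block, which is exactly the shape the desired CSV needs, without any str()/re-parse round trip.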