How do I get every <tr> from multiple nested <div> tags with BeautifulSoup in Python?
This is the HTML:
<div id="compare">
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">sami</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fadi</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">achraf</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">john</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">noor</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">dadi</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">ham</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fathe</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">kali</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
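I am trying to get the value of every <td> in every row, but my code only gives me the <td> values from the first <tr> of each <div class="students">. I can't reach the <td> elements in the other rows because the same tags are repeated in every row.
This is my Python code: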
from bs4 import BeautifulSoup

# response_url is the HTTP response fetched earlier (not shown here)
contents = BeautifulSoup(response_url.content, "lxml")
table_body = contents.find('div', {'id': 'compare'})
print(len(table_body))
rowss = table_body.find_all('div', {'class': 'students'})
print(len(rowss))
for section in rowss:
    print('--------------------------')
    try:
        # find_next() returns only the first match after `section`
        name = section.find_next('td', {'class': 'argaam-font company-short-name'}).text
        print(name)
        name1 = section.find_next('td', {'class': 'center', 'rid': 'number1'}).text
        print(name1)
        name2 = section.find_next('td', {'rid': 'number'}).text
        print(name2)
        name3 = section.find_next('td', {'class': 'center'}).text
        print(name3)
        name4 = section.find_next('td', {'class': 'center'}).text
        print(name4)
        name5 = section.find_next('td', {'class': 'center'}).text
        print(name5)
    except AttributeError:
        # find_next() returned None, so .text failed
        name = ''
        print('error')
Result:
--------------------------
sami
32
20
first1
second1
third1
--------------------------
john
32
20
first1
second1
third1
--------------------------
ham
32
20
first1
second1
third1
But I want results like this:
--------------------------
sami
32
20
first1
second1
third1
fadi
10
36
first2
second2
third2
achraf
32
20
first3
second3
third3
--------------------------
john
32
20
first1
second1
third1
noor
10
36
first2
second2
third2
dadi
32
20
first3
second3
third3
--------------------------
ham
32
20
first1
second1
third1
fathe
10
36
first2
second2
third2
kali
32
20
first3
second3
third3
I am using Python to scrape the data with BeautifulSoup.
CodePudding user response:
You can do that with a CSS selector and stripped_strings:
html_doc='''
<div id="compare">
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">sami</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fadi</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">achraf</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">john</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">noor</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">dadi</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">ham</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fathe</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">kali</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
'''
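from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# Every <tr> below a <tbody> inside an element with class "students"
for tr in soup.select('.students tbody tr'):
    # stripped_strings yields each text fragment with whitespace removed
    print(list(tr.stripped_strings))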
Output:
['sami', '32', '20', 'first1', 'second1', 'third1']
['fadi', '10', '36', 'first2', 'second2', 'third2']
['achraf', '32', '20', 'first3', 'second3', 'third3']
['john', '32', '20', 'first1', 'second1', 'third1']
['noor', '10', '36', 'first2', 'second2', 'third2']
['dadi', '32', '20', 'first3', 'second3', 'third3']
['ham', '32', '20', 'first1', 'second1', 'third1']
['fathe', '10', '36', 'first2', 'second2', 'third2']
['kali', '32', '20', 'first3', 'second3', 'third3']
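If you need the fields by name rather than as one flat list per row, the same selector can be combined with attribute selectors on rid. The following is only a minimal sketch and not part of the answer above: it uses an abbreviated copy of the HTML, assumes the class="students" attribute that the selector relies on, and the records list and its dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated sample in the same (unclosed) shape as the question's HTML.
# class="students" is an assumption taken from the selectors used in this thread.
html = '''
<div id="compare">
<div class="students">
<tbody>
<tr><td><a href="/overview">sami</a></td><td rid="number1">32</td><td rid="number">20</td></tr>
<tr><td><a href="/overview">fadi</a></td><td rid="number1">10</td><td rid="number">36</td></tr>
<div class="students">
<tbody>
<tr><td><a href="/overview">john</a></td><td rid="number1">32</td><td rid="number">20</td></tr>
<tr><td><a href="/overview">noor</a></td><td rid="number1">10</td><td rid="number">36</td></tr>
'''

soup = BeautifulSoup(html, "html.parser")

records = []  # hypothetical container: one dict per <tr>
for tr in soup.select(".students tbody tr"):
    # Search inside the current row only, so repeated tags in other rows do not interfere
    link = tr.select_one("td a")
    first = tr.select_one('td[rid="number1"]')
    second = tr.select_one('td[rid="number"]')
    records.append({
        "name": link.get_text(strip=True) if link else "",
        "number1": first.get_text(strip=True) if first else "",
        "number": second.get_text(strip=True) if second else "",
    })

for record in records:
    print(record)
```

Searching from each tr instead of from the enclosing div is what keeps the duplicated td tags apart: every lookup is limited to one row.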
CodePudding user response:
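The main issue here is the ill-formed nesting: the <div> and <tbody> tags in the HTML shown are never closed, so each section ends up nested inside the previous one. To get your expected output, you have to select your elements more specifically. One possible strategy is to select the first <tr> that is a direct child of each <tbody>, and then only that row's <tr> siblings: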
for d in soup.select('div.students tbody > tr:nth-of-type(1)'):
    print('---------------------------')
    l = []
    # Text of the first row in this section
    l.append(list(d.stripped_strings))
    # Text of the remaining rows on the same level
    l.extend([list(n.stripped_strings) for n in d.find_next_siblings('tr')])
    # Flatten and print one value per line
    print(*[e for lst in l for e in lst], sep='\n')
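The nesting can be made visible with a quick check. This is a small sketch under stated assumptions: the HTML is parsed with html.parser (other parsers may build a different tree from invalid markup), the sections carry class="students", the sample is shortened to two sections of two rows, and counts and nested are illustrative names.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same unclosed <div>/<tbody> tags as the question.
# class="students" is assumed from the selectors used in this thread.
html = '''
<div id="compare">
<div class="students">
<tbody>
<tr><td><a href="/overview">sami</a></td><td rid="number1">32</td></tr>
<tr><td><a href="/overview">fadi</a></td><td rid="number1">10</td></tr>
<div class="students">
<tbody>
<tr><td><a href="/overview">john</a></td><td rid="number1">32</td></tr>
<tr><td><a href="/overview">noor</a></td><td rid="number1">10</td></tr>
'''

soup = BeautifulSoup(html, "html.parser")
sections = soup.select("div.students")

# Number of <tr> descendants per section: the outer section also "owns"
# the rows of the section nested inside it.
counts = [len(div.find_all("tr")) for div in sections]

# True when a section sits inside another section with the same class
nested = [div.find_parent("div", class_="students") is not None for div in sections]

print(counts)
print(nested)
```

With this sample the first section reports four rows and the second reports two, which is why a plain descendant search per section returns more rows than expected.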
Example
html = '''
<div id="compare">
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">sami</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fadi</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">achraf</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">john</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">noor</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">dadi</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
<div class="students">
<tbody>
<tr>
<td >
<a href="/overview">ham</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first1</td>
<td >second1</td>
<td >third1</td>
</tr>
<tr>
<td >
<a href="/overview">fathe</a></td>
<td rid="number1">10</td>
<td rid="number">36</td>
<td >first2</td>
<td >second2</td>
<td >third2</td>
</tr>
<tr>
<td >
<a href="/overview">kali</a></td>
<td rid="number1">32</td>
<td rid="number">20</td>
<td >first3</td>
<td >second3</td>
<td >third3</td>
</tr>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for d in soup.select('div.students tbody > tr:nth-of-type(1)'):
    print('---------------------------')
    l = []
    # Text of the first row in this section
    l.append(list(d.stripped_strings))
    # Text of the remaining rows on the same level
    l.extend([list(n.stripped_strings) for n in d.find_next_siblings('tr')])
    # Flatten and print one value per line
    print(*[e for lst in l for e in lst], sep='\n')
Output
---------------------------
sami
32
20
first1
second1
third1
fadi
10
36
first2
second2
third2
achraf
32
20
first3
second3
third3
---------------------------
john
32
20
first1
second1
third1
noor
10
36
first2
second2
third2
dadi
32
20
first3
second3
third3
---------------------------
ham
32
20
first1
second1
third1
fathe
10
36
first2
second2
third2
kali
32
20
first3
second3
third3
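As an alternative to the sibling-based selection, you could limit find_all() to direct children with recursive=False. This is a hedged sketch rather than the approach used above: it assumes html.parser, the class="students" attribute, and a shortened sample; groups is a hypothetical name for the collected result.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same unclosed tags as the question.
# class="students" is assumed from the selectors used in this thread.
html = '''
<div id="compare">
<div class="students">
<tbody>
<tr><td><a href="/overview">sami</a></td><td rid="number1">32</td><td rid="number">20</td></tr>
<tr><td><a href="/overview">fadi</a></td><td rid="number1">10</td><td rid="number">36</td></tr>
<div class="students">
<tbody>
<tr><td><a href="/overview">john</a></td><td rid="number1">32</td><td rid="number">20</td></tr>
<tr><td><a href="/overview">noor</a></td><td rid="number1">10</td><td rid="number">36</td></tr>
'''

soup = BeautifulSoup(html, "html.parser")

groups = []  # one list of rows per section
for div in soup.select("div.students"):
    tbody = div.find("tbody")  # the section's own <tbody> comes first in document order
    if tbody is None:
        continue
    # recursive=False keeps only rows that are direct children of this <tbody>,
    # so rows of the nested sections are skipped
    rows = tbody.find_all("tr", recursive=False)
    groups.append([list(tr.stripped_strings) for tr in rows])

for group in groups:
    print('---------------------------')
    for row in group:
        print(*row, sep='\n')
```

Keeping the rows grouped in a list per section makes it easy to print the separator once per section or to pass the data on to other code.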