Hello everyone. I have pulled the information I want using BeautifulSoup, but I can't seem to get it printed out correctly to send to pandas and Excel.
html_f ='''
<li class="list-group-item">
<div>
<div >
<p >
07/01/2022 Date
<span > </span>
</p>
</div>
<div style="display: block; overflow: hidden;">
<p >
<span >Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''
The code I use to pull the data I want:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_f, 'html.parser')
# Iterate over the direct children of the first matching <li>
for child in soup.findAll('li', class_='list-group-item')[0]:
    print(child.text)
Here is the info it pulls, but it prints with a lot of extra whitespace:
07/01/2022 Date
Comment
[1] Comments
Ideally, I only need the top portion (the date and File Date) printed out, but at the very least I need help getting it into a list format like:
07/01/2022 Date
Comment
[1] Comments
CodePudding user response:
Here is my attempt:
doc='''
<li class="list-group-item">
<div>
<div >
<p >
07/01/2022 Date
<span >
</span>
</p>
</div>
<div style="display: block; overflow: hidden;">
<p >
<span >
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)
Output:
['07/01/2022, Comments']
You could also try these variants:
# Join all results into a single string
text = ' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]', ',') for child in soup.find_all('li', class_='list-group-item')]).strip()
# Or keep the list and combine individual entries
# (indices 1 and 2 assume the page has at least three matching <li> elements)
text = [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]', ',') for child in soup.find_all('li', class_='list-group-item')]
final_text = text[1] + ',' + text[2]
final_text = [text[1]] + text[2].split()  # if you want a list
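As a less brittle alternative to replacing the literal ' DateComment[1]' string, here is a minimal sketch that asks get_text() for a separator and splits on it. The '|' separator and the class attribute on the <li> are assumptions (the class is taken from the find_all() call above); the sketch also assumes '|' never appears in the page text.

```python
from bs4 import BeautifulSoup

# Sample markup; the class attribute on <li> is assumed from the find_all() call
doc = '''
<li class="list-group-item">
<div>
<div>
<p>
07/01/2022 Date
<span>
</span>
</p>
</div>
<div style="display: block; overflow: hidden;">
<p>
<span>
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(doc, 'html.parser')

# One list of cleaned text fragments per matching <li>
rows = [
    li.get_text(separator='|', strip=True).split('|')
    for li in soup.find_all('li', class_='list-group-item')
]
print(rows)
```

Each inner list can then be indexed directly, for example rows[0][0] for the date line.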
CodePudding user response:
To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:
for e in soup.find_all('li', class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Note: In new code use find_all() instead of the old syntax findAll().
Example
html='''
<li class="list-group-item">
<div>
<div >
<p >
07/01/2022 Date
<span >
</span>
</p>
</div>
<div class="tyler-toggle-container" style="display: block; overflow: hidden;">
<p >
<span >
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('li', class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Output
07/01/2022 Date
Comment
[1] Comments
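If only the top portion is needed, as the question mentions, a minimal sketch could take just the first item from stripped_strings for each <li>. The variable names first_lines and dates are made up for illustration, and the class attribute on the <li> is assumed from the find_all() call.

```python
from bs4 import BeautifulSoup

# Sample markup; the class attribute on <li> is assumed from the find_all() call
html = '''
<li class="list-group-item">
<div>
<div>
<p>
07/01/2022 Date
<span>
</span>
</p>
</div>
<div style="display: block; overflow: hidden;">
<p>
<span>
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')

# next() pulls only the first non-empty string; '' is the fallback for an empty <li>
first_lines = [
    next(li.stripped_strings, '')
    for li in soup.find_all('li', class_='list-group-item')
]

# Drop the trailing " Date" label to keep the bare date
dates = [line.replace(' Date', '').strip() for line in first_lines]
print(first_lines)
print(dates)
```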
Since you mention pandas, you could also pick out each piece of information, clean it up, and append it to a list of dicts:
import pandas as pd
data = []
for e in soup.find_all('li', class_='list-group-item'):
    data.append({
        # The first <p> holds the date followed by the "Date" label
        'date': e.p.text.strip().replace(' Date', ''),
        # The comment text is the node right after the <br>
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)
Output
| date | comment |
|---|---|
| 07/01/2022 | [1] Comments |
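Since the question also mentions Excel, here is a rough sketch of the full path from markup to spreadsheet. It builds the rows from stripped_strings rather than the class-based selector, so it only assumes the class on the <li>. The export() helper and the file name are hypothetical, and DataFrame.to_excel() needs an Excel engine such as openpyxl installed, so the helper is defined but not called here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample markup; the class attribute on <li> is assumed from the find_all() call
html = '''
<li class="list-group-item">
<div>
<div>
<p>
07/01/2022 Date
<span>
</span>
</p>
</div>
<div style="display: block; overflow: hidden;">
<p>
<span>
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for li in soup.find_all('li', class_='list-group-item'):
    parts = list(li.stripped_strings)
    if not parts:
        continue  # skip empty list items
    rows.append({
        'date': parts[0].replace(' Date', ''),  # first string, minus the label
        'comment': parts[-1],                   # last string is the comment text
    })

df = pd.DataFrame(rows)
print(df)


def export(frame, path):
    """Hypothetical helper: write the frame to an .xlsx file without the index."""
    frame.to_excel(path, index=False)
```

Calling export(df, 'comments.xlsx') would then write the sheet; the file name is only an example.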