I am currently trying to scrape the information I want from a website.
The information that I want is contained within a ul>li>em. I have scraped tables before, but I have never scraped lists.
How should I scrape the information I want?
In addition, is there a way to take the inner text of every <em> and put it all in a dataframe?
The <ul> basically looks like this:
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
......
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
CodePudding user response:
Select your <ul> and, in this case, use stripped_strings to get all of its text. It returns a generator, which you can wrap in list() or pass straight to pandas:
data = soup.select_one('ul.reportData').stripped_strings
or, to be more specific, use a list comprehension over all the em elements:
data = [e.text for e in soup.select('ul.reportData em')]
Example
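import pandas as pd
from bs4 import BeautifulSoup

html = '''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# stripped_strings yields each non-empty text node with whitespace trimmed
data = soup.select_one('ul.reportData').stripped_strings
pd.DataFrame(data, columns=['date'])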
Output
date |
---|
2015-12-28 |
2015-12-28 |
2015-12-28 |
2015-12-28 |
2015-12-28 |
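Because stripped_strings is a generator, it can only be consumed once. If you need the values more than once, materialize them first. A minimal sketch, assuming the list carries the reportData class used in the selector above; the two date values are made-up sample data:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup; the class name matches the selector, the dates are illustrative
html = '''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-29</em></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
ul = soup.select_one('ul.reportData')

# Wrap the generator in list() so the values can be reused
dates = list(ul.stripped_strings)

df = pd.DataFrame(dates, columns=['date'])
print(df)
```

The whitespace-only text nodes between the <li> tags are skipped, so only the text inside each <em> ends up in the list.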
CodePudding user response:
find_all returns a list-like ResultSet of tags. Extract the text of each tag and you can load the resulting list directly into pandas:
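from bs4 import BeautifulSoup
import pandas as pd

html = '''<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# Find the list by class, then collect the text of every <em> inside it
ems = soup.find('ul', class_='reportData').find_all('em')
df = pd.DataFrame([i.get_text() for i in ems], columns=['date'])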
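On a real page the <ul> may be missing, in which case find returns None and the chained find_all call raises an AttributeError. One possible way to guard against that is sketched below; em_texts_to_frame is a hypothetical helper name, not part of BeautifulSoup or pandas, and the sample dates are made up:

```python
from bs4 import BeautifulSoup
import pandas as pd


def em_texts_to_frame(html, ul_class='reportData', column='date'):
    """Hypothetical helper: put the text of every <em> inside
    <ul class=ul_class> into a one-column DataFrame."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', class_=ul_class)
    if ul is None:
        # No matching list on the page: return an empty frame, not an error
        return pd.DataFrame(columns=[column])
    texts = [em.get_text(strip=True) for em in ul.find_all('em')]
    return pd.DataFrame(texts, columns=[column])


# Illustrative inputs: one page with the list, one without
sample = '<ul class="reportData"><li><em>2015-12-28</em></li><li><em>2015-12-29</em></li></ul>'
found = em_texts_to_frame(sample)
missing = em_texts_to_frame('<p>no list here</p>')
print(found)
print(len(missing))
```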