How should I scrape all <em> tag innertexts within a <ul> and make them into a pandas da

Time:03-19

I am currently trying to scrape the information I want from a website.

The information that I want is contained within a ul>li>em. I have scraped tables before, but I have never scraped lists.

How should I scrape the information I want?

In addition, I want to know if there is a way to collect the inner text of every <em> and put the results in a DataFrame.

The <ul> basically looks like this.

<ul class="reportData">
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>

                   ......

        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
</ul>

CodePudding user response:

Select your <ul> and, in this case, use stripped_strings to get every non-empty text string inside it. Note that it is a generator, so wrap it in list() if you need an actual list:

data = soup.select_one('ul.reportData').stripped_strings

or, to be more specific, use a list comprehension over all the <em> tags:

data = [e.text for e in soup.select('ul.reportData em')]
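The two approaches give the same result only when the <em> tags hold all the text in the list. A minimal sketch of the difference, assuming hypothetical markup in which one <li> carries extra text outside its <em> (the "(revised)" note is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <li> has text outside its <em>.
html = '''
<ul class="reportData">
    <li><em>2015-12-28</em> (revised)</li>
    <li><em>2015-12-29</em></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
ul = soup.select_one('ul.reportData')

# stripped_strings yields every non-empty text node under the <ul>,
# including the text that sits outside the <em> tags.
all_strings = list(ul.stripped_strings)

# Selecting the <em> tags keeps only the text inside them.
em_texts = [em.get_text(strip=True) for em in ul.select('em')]

print(all_strings)  # ['2015-12-28', '(revised)', '2015-12-29']
print(em_texts)     # ['2015-12-28', '2015-12-29']
```

If the list items may contain anything besides the <em>, the selector-based version is probably the safer choice.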

Example

import pandas as pd
from bs4 import BeautifulSoup

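html = '''
<ul class="reportData">
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
</ul>
'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

# stripped_strings is a generator, so materialize it before building the frame
data = list(soup.select_one('ul.reportData').stripped_strings)

pd.DataFrame(data, columns=['date'])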

Output

         date
0  2015-12-28
1  2015-12-28
2  2015-12-28
3  2015-12-28
4  2015-12-28

CodePudding user response:

find_all returns a list-like ResultSet of tags. Extract the text of each tag and you can pass the resulting list directly to pandas:

from bs4 import BeautifulSoup
import pandas as pd

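html = '''<ul class="reportData">
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
        <li><em>2015-12-28</em></li>
</ul>'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <em> inside the target <ul>
ems = soup.find('ul', class_='reportData').find_all('em')
df = pd.DataFrame([em.get_text() for em in ems], columns=['date'])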
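On a real page, find returns None when no <ul> with that class exists, and chaining .find_all onto it then raises an AttributeError. The sketch below wraps the same idea in a small helper that guards against that case. The function name em_texts_to_frame and its parameters are hypothetical, chosen only for illustration:

```python
from bs4 import BeautifulSoup
import pandas as pd


def em_texts_to_frame(html, ul_class, column='date'):
    """Return a one-column DataFrame with the text of each <em> in the <ul>.

    Hypothetical helper: the name and parameters are not from the question.
    """
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', class_=ul_class)
    if ul is None:
        # No matching <ul>: return an empty frame with the expected column
        return pd.DataFrame(columns=[column])
    texts = [em.get_text(strip=True) for em in ul.find_all('em')]
    return pd.DataFrame({column: texts})


sample = '<ul class="reportData"><li><em>2015-12-28</em></li></ul>'
print(em_texts_to_frame(sample, 'reportData'))
```

Returning an empty frame rather than raising is a design choice. If a missing list should count as an error in your scraper, raise an exception there instead.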