How to get value of span tag using BeautifulSoup and organizing them as dict

Time:02-15

I'm trying to scrape a page where I have the following code:

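<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>

<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>

<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">No</span></span>
</li>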

I have already found the class where these tags are located using bs4. My goal is to read the class and text of each span, collect them in a dict, and turn that into a DataFrame.

I'm really stuck at this point! Any help would be appreciated.

CodePudding user response:

Extract the key from each span's class and the value from its text with a dict comprehension:

{x.get('class')[0]: x.text for x in li.select('span')} 

Because x.get('class') returns a list, we have to pick its first element to make the dict comprehension work: a list is unhashable and cannot be used as a dict key.
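A minimal sketch of that behaviour, assuming the spans carry class attributes such as name and organization (inferred from the column names in the answer's output; the one-line snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same shape as the question's markup.
snippet = '<li><span class="name">Person One</span><span class="organization">Mall</span></li>'
li = BeautifulSoup(snippet, 'html.parser').li

# class is a multi-valued attribute, so Beautiful Soup returns it as a list.
classes = li.select_one('span').get('class')
print(classes)  # ['name']

# Taking element [0] gives a plain string that can serve as a dict key.
row = {x.get('class')[0]: x.text for x in li.select('span')}
print(row)  # {'name': 'Person One', 'organization': 'Mall'}
```

If a span had no class attribute at all, x.get('class') would return None and the [0] lookup would raise a TypeError, so this approach relies on every span having a class.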

Adjust the employee values, which pick up extra text from the nested <span>s (x.text for employee also includes the contractor text, e.g. -YesYes):

df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
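To see what that expression does to a single value, here is the same chain of string operations in plain Python. The helper name clean_employee is hypothetical and only wraps the lambda body from the line above:

```python
def clean_employee(employee_text, contractor_text):
    # 1. strip('-') drops the leading dash:           '-YesNo' -> 'YesNo'
    # 2. split(contractor_text, 1) cuts out the first
    #    occurrence of the nested contractor text:    'YesNo'  -> ['Yes', '']
    # 3. ''.join(...) glues the remaining pieces:     ['Yes', ''] -> 'Yes'
    return ''.join(employee_text.strip('-').split(contractor_text, 1))

print(clean_employee('-YesYes', 'Yes'))  # Yes
print(clean_employee('-YesNo', 'No'))    # Yes
```

Note that split(..., 1) removes the first occurrence of the contractor text, not necessarily the trailing one. For the values shown here the result is the same either way.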

Example

from bs4 import BeautifulSoup
import pandas as pd

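html = '''
<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>

<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>

<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">No</span></span>
</li>
'''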

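# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')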

data = []
for li in soup.select('li'):
    data.append({x.get('class')[0]: x.text for x in li.select('span')})

df = pd.DataFrame(data)
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
df

Output

name          organization  employee  contractor
Person One    Mall          Yes       Yes
Person Two    Market        Yes       Yes
Person Three  Mall          Yes       No
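As a possible alternative to cleaning the employee column afterwards, each span's own text can be read without the text of its nested children. This is a different technique from the answer above, sketched under the assumption that every span has a class attribute and at least one direct text node; the class names are inferred from the output columns:

```python
from bs4 import BeautifulSoup

# Assumed markup: class names inferred from the output columns above.
html = '''
<li><span class="name">Person One</span><span class="organization">Mall</span><span class="employee">-Yes<span class="contractor">Yes</span></span></li>
<li><span class="name">Person Three</span><span class="organization">Mall</span><span class="employee">-Yes<span class="contractor">No</span></span></li>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for li in soup.select('li'):
    row = {}
    for span in li.select('span'):
        # First text node that is a direct child of this span;
        # recursive=False keeps text inside nested tags out of the result.
        own_text = span.find(string=True, recursive=False)
        row[span.get('class')[0]] = own_text.strip().strip('-')
    rows.append(row)

print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as data in the example.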