I'm trying to scrape a page where I have the following code:
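<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">No</span></span>
</li>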
I have already located the element that contains these tags with bs4. My goal is to read the class and text of each of these spans, collect them into a dict, and turn that into a DataFrame.
I'm really stuck at this point! Any help would be nice
CodePudding user response:
Extract the key from each span's class and the value from its text with a dict comprehension:
{x.get('class')[0]: x.text for x in li.select('span')}
Because x.get('class') returns a list, we have to pick its first element for the dict comprehension to work (a list is not hashable, so it cannot be used as a dict key).
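To see why the [0] is needed, here is a minimal sketch. The markup and the second class name ("active") are made up for illustration and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second span carries two classes.
html = (
    '<li><span class="name">Person One</span>'
    '<span class="employee active">-Yes</span></li>'
)
soup = BeautifulSoup(html, "html.parser")
spans = soup.select("span")

# class is a multi-valued attribute, so bs4 returns a list of strings.
first_classes = spans[0].get("class")    # ['name']
second_classes = spans[1].get("class")   # ['employee', 'active']

# A list cannot be a dict key, but its first element (a str) can.
row = {s.get("class")[0]: s.text for s in spans}
print(row)  # {'name': 'Person One', 'employee': '-Yes'}
```

If a span had no class attribute at all, get('class') would return None and the [0] would raise a TypeError, so this assumes every selected span has a class.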
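The employee values need an adjustment because of the nested <span>s: x.text of the outer span also contains the text of the inner one, so that text and the leading '-' are removed afterwards: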
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
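The one-liner above can be easier to follow when unpacked on plain strings. The sketch below uses a helper named clean_employee, which is only an illustrative name and not part of the original answer.

```python
def clean_employee(employee_text, contractor_text):
    """Drop the leading '-' and the first occurrence of the nested span's text."""
    stripped = employee_text.strip('-')          # '-YesNo' -> 'YesNo'
    parts = stripped.split(contractor_text, 1)   # split on 'No' -> ['Yes', '']
    return ''.join(parts)                        # -> 'Yes'

print(clean_employee('-YesYes', 'Yes'))  # Yes
print(clean_employee('-YesNo', 'No'))    # Yes
```

Note that split(..., 1) removes the first occurrence of the contractor text, not necessarily the trailing one. For values like those in the question this makes no difference, but it may matter if the contractor text can also appear earlier in the employee text.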
Example
from bs4 import BeautifulSoup
import pandas as pd
html='''
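<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor">No</span></span>
</li>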
'''
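soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# Build one dict per <li>: first class of each span -> its text
data = []
for li in soup.select('li'):
    data.append({x.get('class')[0]: x.text for x in li.select('span')})

df = pd.DataFrame(data)
# The employee text includes the nested contractor text: strip the leading '-' and remove it
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
df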
Output
name | organization | employee | contractor |
---|---|---|---|
Person One | Mall | Yes | Yes |
Person Two | Market | Yes | Yes |
Person Three | Mall | Yes | No |
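As a possible alternative to cleaning the employee column afterwards, each span's own text can be read without the text of its nested span. This is only a sketch and differs from the approach above. It assumes class names that match the output columns, and it uses find_all(string=True, recursive=False) to collect only the direct text nodes.

```python
from bs4 import BeautifulSoup

# Assumed markup: class names chosen to match the output columns.
html = (
    '<li><span class="employee">-Yes'
    '<span class="contractor">No</span></span></li>'
)
soup = BeautifulSoup(html, "html.parser")

row = {}
for span in soup.select("li span"):
    # recursive=False limits the search to direct children of the span,
    # so text inside the nested span is not included.
    own_text = "".join(span.find_all(string=True, recursive=False))
    row[span.get("class")[0]] = own_text.strip().lstrip("-")

print(row)  # {'employee': 'Yes', 'contractor': 'No'}
```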