How can I get my python code to scrape the correct part of a website?

Time:07-21

I am trying to get Python to scrape a page on the Mississippi state legislature's website. My goal is to scrape the page and write what I've scraped to a new CSV file. My command prompt doesn't give me any errors, but the only thing I scrape is a single " symbol. Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['http://www.legislature.ms.gov/legislation/all-measures/']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict = [item.text for item in soup.select('tbody')]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Bills.csv')

I believe the problem is with line 13:

    temp_dict = [item.text for item in soup.select('tbody')]

What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.

CodePudding user response:

EDIT: Please see Sergey K's comment below for a more elegant solution.

That table is loaded in an iframe, so the HTML you requested does not contain its rows; you have to scrape the iframe's source instead. The following code returns a dataframe with 3 columns (measure, short_title, author):

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

list_for_df = []
r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))

df = pd.DataFrame(list_for_df, columns = ['measure', 'short_title', 'author'])
df

Result:

     measure  short_title                                        author
0    HB 1     Use of technology portals by those on probatio...  Bell (65th)
1    HB 2     Youth court records; authorize judge to releas...  Bell (65th)
2    HB 3     Sales tax; exempt retail sales of severe weath...  Bell (65th)
3    HB 4     DPS; require to establish training component r...  Bell (65th)
4    HB 5     Bonds; authorize issuance to assist City of Ja...  Bell (65th)
...  ...      ...                                                ...

You can add more columns to that table, such as measurelink, authorlink, action, etc.: whatever tags are available in the XML document.
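As a rough illustration of pulling extra tags and writing them to a CSV (the original goal), here is a minimal sketch that runs offline. It swaps in the standard library (xml.etree.ElementTree and csv) for BeautifulSoup and pandas. The SAMPLE_XML string, its root element, its titles and its link values are made up for the example; only the tag names msrgroup, measure, shorttitle, author, measurelink and action come from the answer above, and the real file may nest them differently or spell them in a different case.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <root> element, titles, and links are invented.
# Only the tag names are taken from the answer above.
SAMPLE_XML = """<root>
  <msrgroup>
    <measure>HB 1</measure>
    <shorttitle>Example title one</shorttitle>
    <author>Bell (65th)</author>
    <measurelink>example/hb0001.xml</measurelink>
    <action>Example action</action>
  </msrgroup>
  <msrgroup>
    <measure>HB 2</measure>
    <shorttitle>Example title two</shorttitle>
    <author>Bell (65th)</author>
    <measurelink>example/hb0002.xml</measurelink>
  </msrgroup>
</root>"""

# Columns to extract; extend this list with any other tag found in the document.
FIELDS = ["measure", "shorttitle", "author", "measurelink", "action"]


def rows_from_xml(xml_text, fields=FIELDS):
    """Return one dict per <msrgroup>, using '' for tags that are missing."""
    root = ET.fromstring(xml_text)
    rows = []
    for group in root.iter("msrgroup"):
        row = {}
        for name in fields:
            # './/name' searches all descendants of the group, not just children.
            row[name] = (group.findtext(f".//{name}", default="") or "").strip()
        rows.append(row)
    return rows


def rows_to_csv(rows, fields=FIELDS):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = rows_from_xml(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To save a file instead of printing, the same text could be written with open('3-New Bills.csv', 'w', newline=''). With the pandas code above, df.to_csv('3-New Bills.csv', index=False) does the equivalent job.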

CodePudding user response:

Try get_text instead

https://beautiful-soup-4.readthedocs.io/en/latest/#get-text

temp_dict = [item.get_text() for item in soup.select('tbody')]

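In XPath, text() selects only the direct child text nodes, not the text of descendant elements (see "XPath - Difference between node() and text()"). Beautiful Soup's .text does not work that way, though: it is a property that calls get_text(), so both return the text of all descendant tags. If the two give the same result for you, the data is most likely not inside the tbody you fetched.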
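Whether .text and get_text() differ can be checked on a small nested snippet. The sketch below uses a made-up HTML fragment (not the legislature page) and compares the two. It also shows find_all(string=True, recursive=False), which is one way to get only the direct child strings, and a per-cell get_text call that is often more useful for tables.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only; not taken from the real page.
html = "<table><tbody><tr><td>HB 1</td><td>Bell <b>(65th)</b></td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")
tbody = soup.select_one("tbody")

# Both spellings collect the text of every descendant tag.
via_property = tbody.text
via_method = tbody.get_text()

# Direct child strings only; tbody has none here because its only child is <tr>.
direct_only = tbody.find_all(string=True, recursive=False)

# One string per cell, with inner pieces joined by a space and stripped.
cells = [td.get_text(" ", strip=True) for td in tbody.select("td")]

print(via_property == via_method)
print(cells)
```

If the real tbody comes back almost empty with either spelling, that points to the rows being loaded from another document rather than to a difference between .text and get_text().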
