I am trying to get Python to scrape a page on the Mississippi state legislature's website. My goal is to scrape the page and write what I've scraped to a new CSV file. My command prompt doesn't give me any errors, but the only thing I scrape is a " symbol. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['http://www.legislature.ms.gov/legislation/all-measures/']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict = [item.text for item in soup.select('tbody')]
df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Bills.csv')
I believe the problem is with line 13:
temp_dict = [item.text for item in soup.select('tbody')]
What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.
CodePudding user response:
EDIT: Please see Sergey K's comment below for a more elegant solution.
That table is loaded in an iframe, so you have to scrape the iframe's source for the data. The following code returns a DataFrame with three columns (measure, short_title, author):
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
list_for_df = []
r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))
df = pd.DataFrame(list_for_df, columns = ['measure', 'short_title', 'author'])
df
Result:
     measure  short_title                                        author
0    HB 1     Use of technology portals by those on probatio...  Bell (65th)
1    HB 2     Youth court records; authorize judge to releas...  Bell (65th)
2    HB 3     Sales tax; exempt retail sales of severe weath...  Bell (65th)
3    HB 4     DPS; require to establish training component r...  Bell (65th)
4    HB 5     Bonds; authorize issuance to assist City of Ja...  Bell (65th)
...  ...      ...                                                ...
You can add more columns to that table, such as measurelink, authorlink, action, etc.: whatever tags are available in the XML document.
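As a rough sketch of pulling extra tags, the snippet below uses only the standard library and runs on a small made-up sample instead of the live feed. The helper name extract_rows, the sample data, and the sample's tag case and nesting are all assumptions for illustration; the real document may be laid out differently. Tag names are matched case-insensitively, and a tag that is missing from a group becomes an empty string, so every row ends up with the same keys.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; the real feed's tag case,
# nesting, and values are assumptions and may differ.
SAMPLE_XML = """
<MSRLIST>
  <MSRGROUP>
    <MEASURE>HB 1</MEASURE>
    <SHORTTITLE>Example title A</SHORTTITLE>
    <AUTHOR>Example Author</AUTHOR>
    <MEASURELINK>example-link.xml</MEASURELINK>
  </MSRGROUP>
  <MSRGROUP>
    <MEASURE>HB 2</MEASURE>
    <SHORTTITLE>Example title B</SHORTTITLE>
    <AUTHOR>Example Author</AUTHOR>
  </MSRGROUP>
</MSRLIST>
"""


def extract_rows(xml_text, fields, group_tag="msrgroup"):
    """Hypothetical helper: one dict per group element, case-insensitive tags."""
    root = ET.fromstring(xml_text.strip())
    wanted = [name.lower() for name in fields]
    rows = []
    for group in root.iter():
        if group.tag.lower() != group_tag.lower():
            continue
        # Record the first descendant seen for each lowercased tag name.
        found = {}
        for node in group.iter():
            if node is not group:
                found.setdefault(node.tag.lower(), (node.text or "").strip())
        # Missing tags become empty strings so all rows share the same keys.
        rows.append({name: found.get(name, "") for name in wanted})
    return rows


rows = extract_rows(SAMPLE_XML, ["measure", "shorttitle", "author", "measurelink"])
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) and then written out with to_csv, as in the question.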
CodePudding user response:
Try get_text instead
https://beautiful-soup-4.readthedocs.io/en/latest/#get-text
temp_dict = [item.get_text() for item in soup.select('tbody')]
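Note that in Beautiful Soup, .text is a property that calls get_text(), so the two return the same string, including the text of all descendant tags. The direct-child-only behavior described in "XPath - Difference between node() and text()" is specific to XPath's text() and does not carry over to .text. What get_text() adds is its optional arguments, such as separator and strip.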