BeautifulSoup / How to extract a specific paragraph of text?

Time:11-20

I'm using BeautifulSoup to extract information from individual MP pages, e.g. https://publications.parliament.uk/pa/cm/cmregmem/211115/cox_geoffrey.htm

I want to extract the text under each numbered bold heading (e.g. '1. Employment and earnings') and save them individually. The headings change for each different MP (e.g. some declare '3. Gifts, benefits and hospitality from UK sources' and some do not) - and I want a script which works for any MP's page.

At the moment I'm getting into a terrible mess trying to do it with loops. I'm quite new to BeautifulSoup (and Python), so I feel I might be missing a trick. Does anyone have any ideas?

import requests
from bs4 import BeautifulSoup

#urls
home_url = "https://publications.parliament.uk/pa/cm/cmregmem/211101/"

# extract the list of MP names and links; save as tuples in a list (mp_list)
home_page = requests.get(home_url + 'contents.htm')
home_soup = BeautifulSoup(home_page.content, "html.parser")

mp_list = []
mp_elements = home_soup.find_all("p", attrs={'class':None, 'xmlns':'http://www.w3.org/1999/xhtml'})

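for mp_element in mp_elements:
    try:
        # the second child of each <p> is expected to be the <a> link to the MP's page
        link = list(mp_element.children)[1]
        mp_name = link.text.strip()
        mp_url = link['href']
        mp_list.append((mp_name, mp_url))
    except (IndexError, KeyError, TypeError, AttributeError):
        # skip paragraphs that do not contain a link in that position
        pass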

#extract text from mp page
mp_url = home_url + mp_list[115][1]  # this is just to pick out an example MP page to test with
print(mp_url)
mp_page = requests.get(mp_url)
mp_soup = BeautifulSoup(mp_page.content, "html.parser")
mp_text_all = mp_soup.find_all("p")

mp_text_list = []
for item in mp_text_all:
    mp_text_list.append(item.text)

CodePudding user response:

You can do it like this.

  • The text you need is inside <p> tags with class="indent". Select all of those <p> tags using .find_all().
  • If you also want the heading, select the <p> that comes before each selected <p> tag. I have used .findPreviousSibling() here to do that.

Here is the full code that works for any MP's page. You just need to call the function get_data() by passing in the MP's url.

import requests
from bs4 import BeautifulSoup

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    p = soup.find_all('p', class_='indent')

    for i in p:
        # the heading, if any, is the <strong> inside the preceding <p>
        prev = i.findPreviousSibling('p')
        heading = prev.find('strong') if prev is not None else None
        if heading:
            print(heading.text.strip())
        print(f'{i.text.strip()}\n')


url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'

print(' URL-1 '.center(50, '*'))
get_data(url1)
print(' URL-2 '.center(50, '*'))
get_data(url2)

Here is the output for two different MPs' pages.

********************* URL-1 **********************
1. Employment and earnings
From 6 May 2020 to 5 May 2022, Adviser to Mere Plantations Ltd of Unit 1 Cherry Tree Farm, Cherry Tree Lane, Rostherne WA14 3RZ; a company which grows teak in Ghana. I provide advice on business and international politics. I will be paid £12,000 a year for an expected monthly commitment of 8 hrs. (Registered 17 June 2020; updated 23 December 2020)

Payments from Open Dialogus Ltd, 14 London Street, Andover SP11 6UA, for writing articles:

7. (i) Shareholdings: over 15% of issued share capital
AB Produce PLC; processing and distribution of fresh vegetables.

AB Produce Trading Ltd; holding company.

Bridgen Investments Ltd; investment company, investing in shares, property, building projects.

From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)

********************* URL-2 **********************
2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation
Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)

7. (i) Shareholdings: over 15% of issued share capital
Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)

8. Miscellaneous
From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)

From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)
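
If you want to save each section separately instead of printing it, one possible approach is to remember the most recent heading and collect the indented paragraphs under it. The following is only a minimal sketch: SAMPLE_HTML is a made-up, simplified stand-in for an MP page, group_by_heading() is a hypothetical helper name, and it assumes that a <strong> tag only ever appears in heading paragraphs.

```python
from bs4 import BeautifulSoup

# Made-up, simplified stand-in for an MP page (assumed structure).
SAMPLE_HTML = """
<div>
<p><strong>1. Employment and earnings</strong></p>
<p class="indent">First entry.</p>
<p class="indent">Second entry.</p>
<p><strong>8. Miscellaneous</strong></p>
<p class="indent">Third entry.</p>
</div>
"""

def group_by_heading(html):
    """Return a dict mapping each heading to the list of entries below it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # most recent heading seen so far
    for para in soup.find_all("p", class_="indent"):
        prev = para.find_previous_sibling("p")
        strong = prev.find("strong") if prev is not None else None
        if strong is not None:
            # the preceding <p> is a heading, so a new section starts here
            current = strong.get_text(strip=True)
        if current is None:
            continue  # indented text before any heading is ignored
        sections.setdefault(current, []).append(para.get_text(strip=True))
    return sections

sections = group_by_heading(SAMPLE_HTML)
for heading, entries in sections.items():
    print(heading, entries)
```

For a real page you would pass r.text from requests.get(url) instead of SAMPLE_HTML. Headings that an MP has not declared simply never appear as keys, so the same function should cope with pages that have different sets of headings.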

CodePudding user response:

So far, the desired solution is as follows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
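data = []  # not populated yet; rows of (heading, details) would go here

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # every <p> carrying this xmlns attribute: MP name, headings and details, in document order
    h1 = [x.get_text(strip=True) for x in soup.select('p[xmlns="http://www.w3.org/1999/xhtml"]')]
    print(h1)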


url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'

print(' URL-1 '.center(50, '*'))
get_data(url1)
print(' URL-2 '.center(50, '*'))
get_data(url2)

cols = ["heading", "details"]

df = pd.DataFrame(data, columns= cols)
#print(df)
#df.to_csv('info.csv',index = False)

Output:

********************* URL-1 **********************
['Bridgen, Andrew (North West Leicestershire)', '1. Employment and earnings', 'From 6 May 2020 to 5 May 2022, Ady, building projects.', 'From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)', '']
********************* URL-2 **********************
['Robinson, Mary (Cheadle)', '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)', '7. (i) Shareholdings: over 15% of issued share capital', 'Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)', '8. Miscellaneous', 'From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)', 'From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)', '']
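
The flat list above still has to be turned into (heading, details) rows before it can fill the DataFrame. One way to do that, sketched below, is to treat any string that starts with a number and a dot as a heading and attach the strings that follow to it. This is a heuristic rather than something the page guarantees: HEADING_RE and to_rows() are hypothetical names, and a detail paragraph that happens to start with "1. " would be mistaken for a heading.

```python
import re

# Assumption: section headings start with a number and a dot, e.g. "1. " or "7. (i) ".
HEADING_RE = re.compile(r"^\d+\.\s")

def to_rows(texts):
    """Pair each detail paragraph with the most recent heading above it."""
    rows = []
    heading = None
    for text in texts[1:]:  # texts[0] is the MP's name in the printed list
        if not text:
            continue  # skip empty strings such as the trailing ''
        if HEADING_RE.match(text):
            heading = text
        elif heading is not None:
            rows.append((heading, text))
    return rows

# Shortened copy of the list printed for URL-2.
sample = [
    'Robinson, Mary (Cheadle)',
    '7. (i) Shareholdings: over 15% of issued share capital',
    'Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)',
    '8. Miscellaneous',
    'From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)',
    'From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)',
    '',
]

rows = to_rows(sample)
for heading, details in rows:
    print(heading, '|', details)
```

If get_data() returned h1 instead of only printing it, data.extend(to_rows(h1)) would give pd.DataFrame(data, columns=cols) one row per declared item.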