How can I dynamically split the text based on multiple sub-titles in a given text for every new text

I have an original text that looks like this:

We are AMS. We are a global total workforce solutions firm; we enable organisations to thrive in an age of constant change by building, re-shaping, and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment team and provide professional interim and temporary resources.

We are currently working with our client, Royal London.

Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we're owned by our members and work for their benefit, not for shareholder profits. We've grown rapidly and have been recognised as one of the UK's top-rated places to work.

Today, Royal London has over £114 billion of funds under management, and around 3,500 employees working in six offices across the UK and Ireland. We've worked hard to become experts in our specialist markets, building a trusted brand - and our teams have plenty of awards to show for it. Whatever team you're interested in joining and whatever role you play; we'll help you to make a difference.

We are looking for a Business Analyst for a 6-month contract based in London.

Purpose of the Role:

You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades

As the Business Analyst, you will be responsible for:

Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.

The skills, attributes and capabilities we are seeking from you include:

Strong communication both verbal and written
Strong teamworking within the scrum team and with other BAs and directly with business users
Significant asset management experience
Working knowledge of the key data sets that are used by an asset manager
Experience of Master Data Management tools, ideally IHS Markit EDM
Agile working experience
Ability to write user stories to detail the requirements that both the development team and the QA team will use
Strong SQL skills, ideally using Microsoft SQL Server
Experience of managing data interface mapping documentation
Familiarity with data modelling concepts
Project experience based on ETL and Data Warehousing advantageous
Technical (development) background advantageous
Have an asset management background.
Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable. This client will only accept workers operating via a engagement model.

If you are interested in applying for this position and meet the criteria outlined above, please click the link to apply and speak to one of our sourcing specialists now.

AMS, a Recruitment Process Outsourcing Company, may in the delivery of some of its services be deemed to operate as an Employment Agency or an Employment Business

I used the code below to split the text and extract each section by sub-title from the original HTML using Beautiful Soup. The aim is to:

Separate the HTML extract by bold text.
From this list of bold texts, keep those that also contain ':', which marks them as legitimate sub-titles.
Find the positions of the first and last legitimate sub-titles in the list of bold texts. This helps to split the text when other bold texts without ':' appear below the last sub-title's text.
Split on the condition that the last sub-title really is the last element in the list of bold texts; if it is not, split the text further to separate the sub-titles' texts from the remaining text.

The code below demonstrates this:

import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent


def build_headers():
    # Build request headers with a random Chrome User-Agent string.
    ua = UserAgent()
    return {'User-Agent': ua.chrome}


headers = build_headers()

r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=/jobs/business-jobs-in-london?agency=True&direct=True", headers=headers, timeout=20)

soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop':'description'})
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]

total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])

# If the last sub-title is not the last <strong> tag, also split on the bold text that follows it.
if Position_of_last_sub_title != total_length_of_strong_tags - 1:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}| {j_description_subtitles[Position_of_last_sub_title + 1]}', j_description.text)[1:Position_of_last_sub_title]
else:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}', j_description.text)[1:]

final_dict_with_sub_t_n_prec_txt= {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2]
    
}

The problem is the splitting of the text by sub-title. It's too manual, and I have tried other methods to make it dynamic, to no avail. How would I go about making this part dynamic, given that the number of sub-titles will differ in future texts?

CodePudding user response:

You could simplify this, or make it more generic, by using CSS selectors to select your elements, e.g. p:has(strong:-soup-contains(":")) selects every <p> that contains a <strong> whose text includes a ':'. To get the text that follows each sub-title, use find_next_sibling():

dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

Note: I added | as the separator in get_text() so that you can split the list elements later. You can also replace it with whitespace: get_text(' ', strip=True).
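To try the selector without hitting the site, here is a minimal offline sketch. The HTML string is made up and only assumes the same layout as the page (each sub-title is a <strong> inside its own <p>, and its content is the next sibling element). The sections name, the guard for a missing sibling and the split on '|' are additions for illustration, not part of the one-liner above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to mirror the page: each sub-title is a
# <strong> inside its own <p>, and the body is the next sibling element.
html = """
<span itemprop="description">
  <p>We are looking for a Business Analyst.</p>
  <p><strong>Purpose of the Role:</strong></p>
  <p>Work with the internal data squad.</p>
  <p><strong>Skills:</strong></p>
  <ul><li>Strong SQL skills</li><li>Agile working experience</li></ul>
  <p><strong>Bold text without a colon</strong></p>
  <p>Footer text.</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

sections = {}
for p in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'):
    body = p.find_next_sibling()  # None when the sub-title is the last element
    text = body.get_text("|", strip=True) if body is not None else ""
    # Split on the separator so list-style sections become Python lists.
    sections[p.get_text(strip=True)] = text.split("|") if text else []

print(sections)
```

Because the selector only matches bold text containing ':', the third bold paragraph is ignored, and the number of sub-titles no longer has to be known in advance.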

Example

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=/jobs/business-jobs-in-london?agency=True&direct=True", headers=headers, timeout=20)

soup = BeautifulSoup(r.text, 'html.parser')

data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

print(data)

Output

{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
 'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
 'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}
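If you would rather keep the regex approach from the question, the split itself can also be made dynamic. The following is only a sketch under assumptions: split_by_subtitles and stop_markers are hypothetical names, the sub-titles are taken to appear literally in the plain text, and the sample text is invented. In the question's code, subtitles would be sub_titles_in_description and stop_markers would be the bold texts that follow the last sub-title.

```python
import re


def split_by_subtitles(text, subtitles, stop_markers=()):
    """Map each sub-title to the text that follows it, up to the next marker."""
    markers = list(subtitles) + list(stop_markers)
    if not markers:
        return {}
    # Escape each marker so characters such as '(' or '?' are matched literally;
    # longest first so a marker that is a prefix of another does not win.
    pattern = "|".join(re.escape(m) for m in sorted(markers, key=len, reverse=True))
    # A capturing group makes re.split keep the matched markers in the result,
    # giving [preamble, marker1, body1, marker2, body2, ...].
    parts = re.split(f"({pattern})", text)
    result = {}
    for marker, body in zip(parts[1::2], parts[2::2]):
        if marker in subtitles:  # drop the text that follows a stop marker
            result[marker] = body.strip()
    return result


text = (
    "Intro paragraph. "
    "Purpose of the Role: Work with the data squad. "
    "You will be responsible for: Analysing data. "
    "Apply now Click the link to apply."
)
subtitles = ["Purpose of the Role:", "You will be responsible for:"]
sections = split_by_subtitles(text, subtitles, stop_markers=["Apply now"])
print(sections)
```

This works for any number of sub-titles because the pattern is built from the list rather than from fixed indexes. If the same sub-title occurs twice in the text, the later occurrence overwrites the earlier one.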