Home > Blockchain >  How to extract paragraph content and mark the text embedded to a link using BeautifulSoup
How to extract paragraph content and mark the text embedded to a link using BeautifulSoup

Time:08-16

Given a paragraph on a Wikipedia page (i.e. Lana Del Ray) using BeautifulSoup I need to extract the text from the paragraph and mark the word or set of words that have enclosed a link.

For example, consider the first sentences of the first paragraph:

Elizabeth Woolridge Grant (born June 21, 1985), known professionally as Lana Del Rey, is an American singer and songwriter. Her music is noted for its cinematic quality and exploration of tragic romance, glamour, and melancholia, containing references to contemporary pop culture and 1950s–1960s Americana.

I need to have it in this format:

Elizabeth Woolridge Grant (born June 21, 1985), known professionally as Lana Del Rey, is an American singer and songwriter. Her music is noted for its cinematic quality and exploration of tragic romance, START_A glamour END_A, and START_A melancholia END_A, containing references to contemporary START_A pop culture END_A and START_A 1950s–1960s Americana END_A.

Until now I could extract either one paragraph or link separately using something like this:

from urllib import request
from bs4 import BeautifulSoup
soup = BeautifulSoup(request.urlopen("https://en.wikipedia.org/wiki/Lana_Del_Rey").read())

 for tag in soup.select('p a[href]'):
     if tag['href'].startswith('/wiki/'):
         text = tag.text.strip()
         print(text)

for tag in soup.select('p'):
        text = tag.text.strip()
        print(text)

Is it possible to extract the text of the paragraph

tag and identify which words have a link attached to them?

CodePudding user response:

You could use replaceWith() to convert the <a> and its text to your expected output:

a.replaceWith('START_A ' a.text ' END_A')

Example

from urllib import request
from bs4 import BeautifulSoup
soup = BeautifulSoup(request.urlopen("https://en.wikipedia.org/wiki/Lana_Del_Rey").read())

for tag in soup.select('p'):
    for a in tag.select('a'):
        a.replaceWith('START_A ' a.text ' END_A')
    print(tag.text)
Output

Elizabeth Woolridge Grant (born June 21, 1985), known professionally as Lana Del Rey, is an American singer and songwriter. Her music is noted for its cinematic quality and exploration of tragic romance, START_A glamour END_A, and START_A melancholia END_A, containing references to contemporary START_A pop culture END_A and START_A 1950s END_A–START_A 1960s END_A START_A Americana END_A.START_A [1] END_A She is the recipient of START_A various accolades END_A, including two START_A Brit Awards END_A, two START_A MTV Europe Music Awards END_A, and a START_A Satellite Award END_A, in addition to nominations for six START_A Grammy Awards END_A and a START_A Golden Globe Award END_A.START_A [2] END_A START_A Variety END_A honored her at their START_A Hitmakers Awards END_A for being "one of the most influential singer-songwriters of the 21st century."START_A [3] END_ASTART_A [4] END_A

...

  • Related