soup: extract all paragraphs with a specific class excluding those that are in tables

Time:10-17

I have a messy old Word document of multiple-choice questions (MCQs) that I converted to HTML so that I can extract the MCQs cleanly and use them to create a Microsoft Forms quiz.

The question sets that I want to extract the MCQs from were shared as a link and a screenshot in the original post.

I wrote the following code to extract the paragraphs I need, but it also extracts the paragraphs inside tables, which gets in the way of building one list of questions and one list of candidate answers for each question. My code so far is as follows:

from bs4 import BeautifulSoup
import os
from nltk.tokenize import RegexpTokenizer

# Pick the first .htm/.html file in the current working directory
file=[x for x in os.listdir() if '.htm' in x][0]

# Create a soup to parse information
soup = BeautifulSoup(open(file), "html.parser")

# Find all paragraph elements that contains required information
results = soup.find_all("p", class_="MsoNormal")

# Tokenizer used to count the words in each paragraph
tokenizer = RegexpTokenizer(r'\w+')

# Extract questions
Extract_questions=[x.text for x in results if len(tokenizer.tokenize(x.text))>1]

Could you please help me create the docx file I want? I really do not know where to start.

CodePudding user response:

This is by no means complete code, but it can give you a start:

import pandas as pd
from itertools import groupby
from bs4 import BeautifulSoup
from textwrap import wrap


with open("page.html", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")

groups = [group := []]
for r in results:
    if r.text.startswith("Question "):
        groups.append(group := [r])
    else:
        group.append(r)

for g in groups:
    for p in g:
        if p["class"] == ["MsoNormalTable"]:
            df = pd.read_html(str(p))[0].fillna("")
            print()
            print(df.to_csv(index=False, header=None, sep="\t"))
        else:
            t = p.get_text(strip=True).replace("\n", " ").strip()
            if (
                t
                and "Question " not in t
                and "L1EC" not in t
                and "Lesson " not in t
            ):
                print("\n".join(wrap(t, 70)))
    print("-" * 80)

Prints:

--------------------------------------------------------------------------------
The price of ABC Financial News is increased from $2.00 to $2.50; this
leads to an increase in the sales of a competing financial
magazine, XYZ Finance, which now sells 120,000 copies a week, up from
100,000 copies a week. The cross-price elasticity of demand is closest
to:

        0.8
        1.22
        1.25

--------------------------------------------------------------------------------
The following table lists the market shares of three major firms in an
industry. The industry's three-firm Herfindahl-Hirschman Index
is closest to:

Firms   Market Share
X       20%
Y       30%
Z       10%


        0.14
        0.33
        0.6

--------------------------------------------------------------------------------
Over a period of 1 year, a country’s real GDP increases from $168
billion to $179 billion, and the GDP deflator increases from 115 to
122.
The increase in the country’s nominal GDP over the year is closest to:

        6.55%
        13.03%
        4.34%

--------------------------------------------------------------------------------
Consider the following statements:
Statement 1: A government is said to have a trade deficit if its
expenditure exceeds net taxes.
Statement 2: An economy must finance a trade deficit by borrowing from
the rest of the world.
Which of the following is most likely?

        Only Statement 1 is incorrect.
        Only Statement 2 is incorrect.
        Both statements are correct.

--------------------------------------------------------------------------------
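
For the narrower problem in the title (keeping `MsoNormal` paragraphs only when they are not inside a table), one possible sketch is to test each paragraph for a `<table>` ancestor with `find_parent`. The HTML string below is a made-up miniature of a Word export, not the asker's real file, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of Word-exported HTML (an assumption, not the real file).
html = """
<body><div>
  <p class="MsoNormal">Question 1</p>
  <p class="MsoNormal">Which option is closest?</p>
  <table class="MsoNormalTable">
    <tr><td><p class="MsoNormal">0.8</p></td></tr>
    <tr><td><p class="MsoNormal">1.22</p></td></tr>
  </table>
</div></body>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the MsoNormal paragraphs that have no <table> ancestor.
outside_tables = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="MsoNormal")
    if p.find_parent("table") is None
]

# The paragraphs inside tables can be collected separately with a descendant selector.
inside_tables = [p.get_text(strip=True) for p in soup.select("table p.MsoNormal")]

print(outside_tables)
print(inside_tables)
```

The `body > div > .MsoNormal` selector used in the answer above gets a similar result by matching only direct children of the top-level `div`; the ancestor check may be more forgiving if the export nests paragraphs at other depths.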

CodePudding user response:

I want to thank @Andrej Kesely; he really helped me work through this.

I was finally able to do it like so:

from bs4 import BeautifulSoup
import os
# from nltk.tokenize import RegexpTokenizer
from textwrap import wrap
import pandas as pd
from itertools import groupby

#  Creating docx
from docx import Document
from docx.shared import Inches


# Pick an .htm/.html file in the CWD (index 1 selects the second match)
file=[x for x in os.listdir() if '.htm' in x][1]

# Create a soup to parse information
soup = BeautifulSoup(open(file), "html.parser")

# Select the top-level paragraphs and tables, in document order
# (paragraphs nested inside tables are reached through their table)
results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")
groups = [group := []]
for r in results:
    if r.text.startswith("Question "):
        groups.append(group := [r])
    else:
        group.append(r)


# Extract questions and solution of each question
Questions=[Question:=[]]
Solutions=[Solution:=[]]


not_welcomed_phrases=["Question ","L1EC","Lesson ","L1R","L100"]

for g in groups:
    # Get each question 
    q=[]
    # length of the tables
    Numberoftables=len([p1 for p1 in g if "<table" in str(p1)])
    i_table=0
    for p in g:
        # Plain paragraph (not a table): part of the question text
        if p["class"] != ["MsoNormalTable"]:
            t=p.text.replace("\n", " ")
            if ( t and not any(word in t for word in not_welcomed_phrases)):
                q.append(t)
        else:
            # With a single table in the group, that table holds the answer options
            if Numberoftables==1:
                # t1=p.text
                t1=[x.text for x in p.select("td > p") if len(x.text)>1]
                # print(t1)
                if t1:
                    Solutions.append(t1)
            else:
                if i_table==0:
                    # Get tables
                    Tables=[p1 for p1 in g if "<table" in str(p1)]

                    # Extract the first table into the question
                    Table1=Tables[0]
                    df = pd.read_html(str(Table1))[0].fillna("")
                    q.append(df.to_csv(index=False, header=None, sep="\t"))
                    i_table=1
                else:
                    # Extract the second table as the answer options
                    Table1=Tables[1]
                    t1=[x.text for x in Table1.select("td > p") if len(x.text)>1]
                    if t1:
                        Solutions.append(t1)

    Questions.append(Question := ["\n".join(q)])


# Print them into a new word document
document = Document()
for i in range(2,len(Questions)):
    document.add_paragraph(Questions[i], style='List Number')
    document.add_paragraph('\t a. ' + Solutions[i-1][0])
    document.add_paragraph('\t b. ' + Solutions[i-1][1])
    document.add_paragraph('\t c. ' + Solutions[i-1][2])
document.save('NewStyleMCQ.docx')

Now one can simply use MS Forms to convert this file into a form for students to use.
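
The loop that writes the document above hard-codes exactly three options (a, b, c) per question. If some questions have more or fewer options, a small helper along these lines might be more robust. `label_options` is a hypothetical name introduced here for illustration, and the sketch assumes each question has at most 26 options:

```python
from string import ascii_lowercase


def label_options(options):
    """Prefix each option with a tab and a letter label: a., b., c., ..."""
    # zip stops at the shorter input, so any number of options up to 26 works.
    return [f"\t {letter}. {text}" for letter, text in zip(ascii_lowercase, options)]


labelled = label_options(["0.8", "1.22", "1.25"])
print("\n".join(labelled))
```

Each returned string could then be passed to `document.add_paragraph` in place of the three hard-coded calls.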
