Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.

What I want: To count the number of words in the 10-K only, i.e., everything before the first instance of ‘</DOCUMENT>’.

How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.

What is happening: No matter where I put my regex match break, the program counts all of the words in the entire submission. I get no error, but I cannot get the desired result.

How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I get about 750,000 fewer words.

Current output: [screenshot not available]

What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe that the two functions may be performed over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only performs on the 10-K, but I have had no luck.

The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.

The remaining code (not shown) extracts firm and report identifiers (e.g., ‘file’ or ‘cik’) from the header section between the tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. When extracting header information, I use the same regex match/break logic and it works perfectly. I need help understanding why this logic isn’t working when I try to count the number of words, and how to correct my code. Any help is appreciated.

regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)

       for line in f:
                
            def get_text_from_html(html:str):
                doc = lxml.html.fromstring(html)
                for table in doc.xpath('.//table'):   # optional: removes tables from HTML source code
                    table.getparent().remove(table)
                for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
                    for element in doc.findall(tag):
                        if element.text:
                            element.text = element.text + "\n"
                        else:
                            element.text = "\n"
                return doc.text_content() 
            
            
            to_clean = f.read()
            clean = get_text_from_html(to_clean)
            #print(clean[:20000])
            
            def count_words(clean):
                words = re.findall(r"\b[a-zA-Z\'\-]+\b", clean)
                word_count = len(words)
                return word_count

            header_vars["words"] = count_words(clean)
            
            match = regex_end10k.search(line) # This should do it, but it doesn't.
            if match:
                break

CodePudding user response:

You don't need a regex for this. Just split your original string on the tag and count the words in the part before it. A simple example is below:

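text = 'Text before </DOCUMENT> text after'
# Split on the closing tag; element 0 is everything before its first occurrence.
splited_text = text.split('</DOCUMENT>')
splited_text_before = splited_text[0]
# Whitespace-separated tokens are counted as words here.
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)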

output

Text before 
2
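
As far as I can tell, the reason the break never helps in the question's code is that f.read() sits inside the for loop. On the first iteration it reads everything remaining in the file, so the word count is computed over the whole submission before the regex check runs. The check itself only ever sees the first line. Below is a minimal sketch of applying the split idea to that situation: read the file once, cut at the first ‘</DOCUMENT>’, then count. It reuses regex_end10k and the word pattern from the question. The helper name text_before_first_end_tag and the sample string are made up for illustration, and io.StringIO stands in for the real file handle.

```python
import io
import re

# Same pattern as in the question; IGNORECASE also catches '</document>'.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)


def text_before_first_end_tag(raw):
    """Return everything before the first </DOCUMENT> (all of raw if no tag)."""
    return regex_end10k.split(raw, maxsplit=1)[0]


def count_words(text):
    """Count words using the pattern from the question."""
    return len(re.findall(r"\b[a-zA-Z'\-]+\b", text))


# Hypothetical miniature submission holding two documents.
sample = (
    "<DOCUMENT>Annual report text here</DOCUMENT>"
    "<DOCUMENT>List of subsidiaries</DOCUMENT>"
)

# Read the file once, outside of any loop over its lines.
with io.StringIO(sample) as f:
    raw = f.read()

first_doc = text_before_first_end_tag(raw)

# Tags are not stripped in this sketch, so tag names are counted as words too.
print(count_words(first_doc), "words before the first </DOCUMENT>")
print(count_words(raw), "words in the whole submission")
```

In the real function you would pass first_doc (rather than the full f.read() result) to get_text_from_html() and then to count_words(), so that both steps only ever see the 10-K.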