Is there a way to manipulate numbered paragraphs in Python to remove certain paragraphs which do not

Time:10-15

I have a string of text with numbered paragraphs from '1.' to '221.', however, there are certain paragraphs that do not follow the order and I want to remove them. Here is how the data looks:

text = """1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
"3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges.
43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
14. NBFCs are betting big time on the IPO. 
6. Paras Defence is one of the few players having an edge in defence deals."""

From the above text, I want to remove the paragraphs that aren't in order, i.e. '42.', '43.' and '14.'.

Output Desired:

relevant_text = '1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
6. Paras Defence is one of the few players having an edge in defence deals.'

I tried to match the pattern but don't know how to proceed forward. Also, I'm not sure if the regex pattern is correct as it matches '1.', '2.' etc. but not '"3.'. Here's what I came up with:

text_sequence = []

pattern = re.compile('(\s|["])[0-9]{1,3}\.\s')
matches = pattern.finditer(text)


for match in matches:
  for r in range(1, 999):
    if str(r) in match.group():
      text_sequence.append(match.span())
      text_sequence.append(match.group())

print(text_sequence)

Is there a way to get the desired output? I am a beginner and any help/suggestion/advice is much appreciated. Thanks in advance. :)

P.S: The matches I am getting from this code have repeated results.

CodePudding user response:

You can do something like this:

text = """1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges.
43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
14. NBFCs are betting big time on the IPO. 
6. Paras Defence is one of the few players having an edge in defence deals."""
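lines = text.split("\n")
output = ""
i = 0  # number of the last paragraph kept so far
for l in lines:
    # keep the line only if it starts with the next expected number
    if l.startswith("{}. ".format(i + 1)):
        output += l + "\n"
        i += 1

print(output)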

This works provided you get rid of the extra " at the start of line 3. You can also consider using "in" instead of "startswith" if you can guarantee that the line number followed by a dot does not appear anywhere else in the string.
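If editing the input by hand is not an option, the same line-by-line idea can tolerate a stray leading quote. The following is only a sketch under that assumption: the helper name keep_sequential and the shortened sample are made up for illustration, and it treats a single optional " before the number as noise.

```python
import re

def keep_sequential(text):
    """Keep only lines whose leading number continues the 1, 2, 3, ... sequence."""
    kept = []
    expected = 1
    for line in text.split("\n"):
        # Optional stray quote, then the paragraph number, a dot and a space.
        m = re.match(r'"?(\d+)\. ', line)
        if m and int(m.group(1)) == expected:
            # Drop the stray quote and any trailing whitespace before keeping the line.
            kept.append(line.lstrip('"').rstrip())
            expected += 1
    return "\n".join(kept)

# Hypothetical shortened sample, shaped like the text in the question.
sample = '1. First.\n2. Second. \n"3. Third.\n42. Stray.\n4. Fourth.\n14. Stray.\n5. Fifth.'
print(keep_sequential(sample))
```

Comparing the captured number as an integer, rather than testing with "in", avoids false hits such as "14. " containing "4. ".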

CodePudding user response:

If you can match all these bullet points with a single pattern like

(?s)((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)

(see this regex demo) and assuming they come in ascending order, then it can be solved with

import re
pattern = r"((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)"
text = "1. Shares of Paras Defence and Space Technologies gained 2.85 times. 2. The company, engaged in manufacturing and testing of defence and space engineering products. \"3. Its stock ended at Rs 499 versus issue price of Rs 175 per share. 42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges. 43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 5. It surpassed the previous record of Salasar Technologies’ IPO. 14. NBFCs are betting big time on the IPO. 6. Paras Defence is one of the few players having an edge in defence deals."
result = []
idx = 1
for sent, num in re.findall(pattern, text, re.S):
    if int(num) == idx:
        result.append(sent)
        idx += 1

print("\n".join(result))

See this Python demo. The regex matches

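  • ((\d+)\. .*?) - Group 1: one or more digits (captured into Group 2), a dot, a space, and then any zero or more chars, as few as possible
  • [^\w!?.…]* - any zero or more chars other than word chars and final sentence punctuation
  • (?=\d+\. |\Z) - a positive lookahead that requires either one or more digits followed by a dot and a space, or the end of string (\Z), immediately to the right of the current location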

Output:

1. Shares of Paras Defence and Space Technologies gained 2.85 times.
2. The company, engaged in manufacturing and testing of defence and space engineering products.
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore.
5. It surpassed the previous record of Salasar Technologies’ IPO.
6. Paras Defence is one of the few players having an edge in defence deals.
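To see what the two capturing groups hold before any filtering happens, here is a small sketch on a shortened, made-up sample. With two groups in the pattern, re.findall returns a list of (sentence, number) tuples, which is what the loop above unpacks.

```python
import re

pattern = r"((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)"
# Hypothetical sample: a decimal number inside a sentence and a stray quote before "2."
sample = '1. Gained 2.85 times. "2. Ended at Rs 499 per share. 42. Listed on July 23.'

# Each item is (Group 1, Group 2): the whole bullet point and its number as a string.
matches = re.findall(pattern, sample, re.S)
for sent, num in matches:
    print(repr(num), repr(sent))
```

Note that "2.85" is not treated as a bullet number because the lookahead needs a space right after the dot, and the stray quote is consumed by the negated character class, so it does not end up in either group.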

NOTE: If your bullet points are not in ascending order, this can be tweaked by first sorting the matches by num.
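One possible reading of that tweak, sketched on a made-up shuffled sample: sort the (sentence, number) pairs numerically and then apply the same sequential filter. This is an assumption about what the note intends, and it changes the behaviour slightly, since after sorting only numbers beyond the first gap in the sequence are dropped.

```python
import re

pattern = r"((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)"
# Hypothetical input where the valid points 1..3 are shuffled and 7 is a stray.
shuffled = "2. Second point. 1. First point. 7. Stray point. 3. Third point."

# Sort the (sentence, number) pairs by the captured number, compared as an integer.
pairs = sorted(re.findall(pattern, shuffled, re.S), key=lambda p: int(p[1]))

result = []
idx = 1
for sent, num in pairs:
    if int(num) == idx:
        result.append(sent)
        idx += 1

print("\n".join(result))
```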
