Python XML: finding the specific location of a tag

I am currently parsing an XML file using lxml.etree in Python. I am running into some issues extracting the text within the element tags.

The following example markup illustrates my current problem.

<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>

My problem is the following:

I am using the first P tag to capture the title of each body tag, if there is a title. The title is (in most cases) the first P tag right after the body tag (see example lines 1 and 4). I don't have a fixed list of title names, which is why I am using this method to capture titles.

The problem arises when a body has no title but does contain a P tag somewhere that is not right after the opening body tag (see example lines 2 and 3): the program takes that first P tag and the text within it as a title. In this scenario the P tag is not a title and shouldn't be treated as one, but because it is, any text before that P tag is discarded and not written to the new text file.

For further clarification, the following is what is currently written to the text file.

Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Desired output in the text file:

Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Possible Solution:

1. Is there any way to find the location of the first P tag? If the first P tag comes right after the body tag, I would like to keep it. For any other P tag, I would like to strip the tag but keep its text, which I can do with the following lxml.etree function:

strip_tags()
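
A minimal sketch of that idea, assuming each body element is parsed on its own with lxml.etree. In lxml, the text between the opening body tag and its first child is stored in body.text, so a whitespace-only body.text together with a first child named P suggests a title. The helper name split_title and the shortened sample strings are made up for illustration.

```python
from lxml import etree


def split_title(body):
    """Return (title, text) for one <body> element; title is None if absent.

    Hypothetical helper, not part of lxml.
    """
    children = list(body)
    # body.text is the text between <body> and its first child element.
    leading = (body.text or "").strip()
    title = None
    if children and children[0].tag == "P" and not leading:
        first = children[0]
        title = " ".join("".join(first.itertext()).split())
        # Empty the title element but keep its tail (the text that follows it),
        # so the title is not repeated in the body text.
        tail = first.tail
        first.clear()
        first.tail = tail
    # Remove every <P> tag while merging its text and tail into the parent.
    etree.strip_tags(body, "P")
    text = " ".join("".join(body.itertext()).split())
    return title, text


# Shortened, made-up samples in the same shape as the markup above.
samples = [
    "<body> <P> Title 1 </P> Text after the title. </body>",
    "<body> leading text <P> not a title </P> trailing text. </body>",
]

for raw in samples:
    title, text = split_title(etree.fromstring(raw))
    print(f"{title} : {text}" if title is not None else text)
```

The tail is saved by hand because removing or clearing an lxml element would otherwise drop the text that follows it.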

Any insight on this problem or another possible solution is greatly appreciated ... thank you in advance!

CodePudding user response：

I was able to identify the titles with BeautifulSoup and a regular expression.

from bs4 import BeautifulSoup
import re


markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


# html.parser lower-cases tag names, so <P> is serialized back as <p>.
parsed = BeautifulSoup(markup, 'html.parser')

bodies = parsed.select('body')

for body in bodies:
    # A title exists only when a <p> directly follows the opening <body> tag.
    has_title = re.match(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
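
The snippet above only prints the titles. To get lines shaped like the desired output in the question, one possible extension is sketched below. It assumes the same html.parser setup and checks the first non-whitespace child of each body instead of using a regular expression. The helper names first_significant_child and format_body, and the shortened sample markup, are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def first_significant_child(body):
    """Return the first child that is not a whitespace-only string, or None."""
    for child in body.children:
        if isinstance(child, NavigableString) and not child.strip():
            continue
        return child
    return None


def format_body(body):
    """Build one output line: 'title : text' if a title exists, else just the text."""
    first = first_significant_child(body)
    title = None
    # html.parser lower-cases tag names, so <P> shows up as 'p'.
    if isinstance(first, Tag) and first.name == "p":
        title = " ".join(first.get_text().split())
        # extract() removes only the tag; the text after it stays in the body.
        first.extract()
    # get_text() drops the remaining tags but keeps their text.
    text = " ".join(body.get_text().split())
    return f"{title} : {text}" if title is not None else text


# Shortened, made-up sample in the same shape as the markup above.
sample = (
    "<body> <P> Title 1 </P> Text after the title. </body>"
    "<body> leading text <P> not a title </P> trailing text. </body>"
)

for body in BeautifulSoup(sample, "html.parser").find_all("body"):
    print(format_body(body))
```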