Find the paragraphs and URLs in text using BeautifulSoup and put them in a DataFrame

I want to extract the paragraphs in this XML file (the text between <p> tags):

'<p>jo\\section{$430 \\quad$ Part $4 \\quad$ The First Inequality, from Physical Causes}</p>\n<p>center $A$, with radius $A M$, let the arc $M N$ be described, intersecting $P V$ at $O$, and $P B$ at $N$. So, by analogy with the above, just as we are taking I in place of F, let us now take $N$ for $O$, and let us consider that just as $A N$ is the correct distance in length, it is also correct in position. Now the points $I, N$, and the like do indeed make the planet\'s path puff-cheeked. For the arcs $G D$ and $Q P$ are equal. And $B D, B P$, projected from a common center, intersect the lunule cut off. But DI and PN, the breadths of the lunule extended towards the center are unequal. And DI is smaller breadihs of the lunule extended towaras the center, are unequal. And DI is smaller, and $P N$ greater. For since $E D$ and $M P$ are equal, and $E D I, M P N$ are right, while EI is a greater circle, since its radius $A E$ is greater, and $M N$ is a smaller circle, since its radius AM is smaller, therefore, PN will definitely be greater, and DI smaller. Therefore the lunule cut off is narrower above at $D$ and broader belowe at $P$. In Therefore, the lunule cut off is narrower above, at $D$, and broader below, at $P$. In the ellipse, in contrast, this iunule is of equal breadth at points equally removed from the apsides $G$ and $Q .$ So it is clear that the path is puff-cheeked, so it is not an ellipse. And since the ellipse gives the correct equations, this puff-cheeked path should by rights give incorrect ones.</p>\n<p><img src="https://cdn.mathpix.com/cropped/2022_01_11_c20225c6c8f29d2e0e54g-1.jpg?height=608&width=332&top_left_y=642&top_left_x=440" alt="" />

I want to put them in one column of a DataFrame and, in a second column, collect the URLs that appear in the text.

Here is how I read my file:

import mistletoe

with open('../AN_markdown/kapitel_{0}.mmd'.format(ch), 'r', encoding='utf-8') as fin:
    rendered = mistletoe.markdown(fin)

I used BeautifulSoup, but I still cannot add the links. Here is my code:

import pandas as pd
from bs4 import BeautifulSoup

text = BeautifulSoup(rendered, 'html.parser')
para = []
for p in text.find_all("p"):
    para.append(p)

df = pd.DataFrame({"Paragraphs": para})

It only finds the paragraphs. I do not know how to extract the links and add them as a column next to the paragraphs.

Thank you in advance.

CodePudding user response：

Try the following approach. It finds the three <p> tags and extracts their text; the last one has no text but does contain the image:

import pandas as pd
from bs4 import BeautifulSoup

data = []

xml = """<p>jo\\section{$430 \\quad$ Part $4 \\quad$ The First Inequality, from Physical Causes}</p>\n<p>center $A$, with radius $A M$, let the arc $M N$ be described, intersecting $P V$ at $O$, and $P B$ at $N$. So, by analogy with the above, just as we are taking I in place of F, let us now take $N$ for $O$, and let us consider that just as $A N$ is the correct distance in length, it is also correct in position. Now the points $I, N$, and the like do indeed make the planet\'s path puff-cheeked. For the arcs $G D$ and $Q P$ are equal. And $B D, B P$, projected from a common center, intersect the lunule cut off. But DI and PN, the breadths of the lunule extended towards the center are unequal. And DI is smaller breadihs of the lunule extended towaras the center, are unequal. And DI is smaller, and $P N$ greater. For since $E D$ and $M P$ are equal, and $E D I, M P N$ are right, while EI is a greater circle, since its radius $A E$ is greater, and $M N$ is a smaller circle, since its radius AM is smaller, therefore, PN will definitely be greater, and DI smaller. Therefore the lunule cut off is narrower above at $D$ and broader belowe at $P$. In Therefore, the lunule cut off is narrower above, at $D$, and broader below, at $P$. In the ellipse, in contrast, this iunule is of equal breadth at points equally removed from the apsides $G$ and $Q .$ So it is clear that the path is puff-cheeked, so it is not an ellipse. And since the ellipse gives the correct equations, this puff-cheeked path should by rights give incorrect ones.</p>\n<p><img src="https://cdn.mathpix.com/cropped/2022_01_11_c20225c6c8f29d2e0e54g-1.jpg?height=608&width=332&top_left_y=642&top_left_x=440" alt="" />"""
soup = BeautifulSoup(xml, 'html.parser')

for p in soup.find_all('p'):
    text = p.get_text(strip=True)

    # Use the image source as the URL when the paragraph contains an <img>
    if p.find('img'):
        url = p.img['src']
    else:
        url = ''

    data.append([text, url])
    
df = pd.DataFrame(data, columns=['Paragraphs', 'URLs'])
print(df)

This gives you a DataFrame like:

                                          Paragraphs                                               URLs
0  jo\section{$430 \quad$ Part $4 \quad$ The Firs...                                                   
1  center $A$, with radius $A M$, let the arc $M ...                                                   
2                                                     https://cdn.mathpix.com/cropped/2022_01_11_c20...
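
The loop above keeps only the first image of each paragraph. If a paragraph might contain several images, or links in <a> tags (an assumption; the sample in the question has a single <img>), a variant could collect all of them. A minimal sketch, where the function name paragraphs_with_urls and the example.com sample are hypothetical:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample: one paragraph with two images, one with a link, one with no URL.
sample_html = (
    '<p>First <img src="https://example.com/a.jpg" alt="" />'
    '<img src="https://example.com/b.jpg" alt="" /></p>\n'
    '<p>See <a href="https://example.com/page">this page</a>.</p>\n'
    '<p>No URL here.</p>'
)


def paragraphs_with_urls(html):
    """Return a DataFrame with one row per <p>: its text and every URL found in it."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for p in soup.find_all('p'):
        urls = []
        # find_all accepts a list of tag names; <img> carries src, <a> carries href.
        for tag in p.find_all(['img', 'a']):
            url = tag.get('src') or tag.get('href')
            if url:
                urls.append(url)
        # Join the URLs so the column stays a plain string, as in the loop above.
        rows.append([p.get_text().strip(), ', '.join(urls)])
    return pd.DataFrame(rows, columns=['Paragraphs', 'URLs'])


df_all = paragraphs_with_urls(sample_html)
print(df_all)
```

Paragraphs without any URL still get an empty string, so the two columns stay aligned row by row.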

CodePudding user response：

It turns out I can do it using BeautifulSoup and mistletoe. Here is my code:

import pandas as pd
from bs4 import BeautifulSoup
import mistletoe

# my_file is the path to the Markdown (.mmd) file
with open(my_file, 'r', encoding='utf-8') as fin:
    rendered = mistletoe.markdown(fin)

XML = BeautifulSoup(rendered, 'html.parser')
para = XML.find_all("p")
df = pd.DataFrame({"Paragraphs": para})

def urlFinder(text):
    # Collect the src attribute of every <img> inside one paragraph tag
    links = []
    for link in text.find_all('img'):
        links.append(link.get('src'))
    return links

df['Links'] = df['Paragraphs'].apply(urlFinder)

However, the solution of @Martin Evans is also perfect!

Here is my result:
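
As a possible follow-up to the code in this answer: the Paragraphs column holds BeautifulSoup tag objects and the Links column holds lists. If plain text and one row per URL are preferred, something like the sketch below might work. The rendered string here is a hypothetical stand-in for the mistletoe output, and the example.com URLs are made up:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML string that mistletoe.markdown() returns.
rendered = (
    '<p>Some text.</p>\n'
    '<p><img src="https://example.com/fig1.jpg" alt="" />'
    '<img src="https://example.com/fig2.jpg" alt="" /></p>'
)

soup = BeautifulSoup(rendered, 'html.parser')

# Store the paragraph text (not the tag object) and the list of image URLs.
records = [
    {
        'Paragraphs': p.get_text(strip=True),
        'Links': [img.get('src') for img in p.find_all('img')],
    }
    for p in soup.find_all('p')
]
df = pd.DataFrame(records)

# explode() gives each list element its own row; an empty list becomes NaN.
flat = df.explode('Links').reset_index(drop=True)
print(flat)
```

Whether to keep the lists or explode them depends on what the DataFrame is used for afterwards; the exploded form is usually easier to filter or export.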