Hi folks,
I'm extracting some sentences from a document and trying to build a DataFrame with BeautifulSoup and pandas, as shown below. There is a lot of repetition, so I think it could be written in a cleaner, more idiomatic way. Could you help me improve this code? Thank you!
import pandas as pd
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')
t1 = bs.find_all('h1')[1].text.replace('_room1',"")
t2 = bs.find_all('h1')[2].text.replace('_room1',"")
t3 = bs.find_all('h1')[3].text.replace('_room1',"")
t4 = bs.find_all('h1')[4].text.replace('_room1',"")
p1 = bs.find_all('p')[3].text
p2 = bs.find_all('p')[4].text + bs.find_all('p')[5].text + bs.find_all('p')[6].text + bs.find_all('p')[7].text
p3 = bs.find_all('p')[8].text
p4 = bs.find_all('p')[9].text
data = {t1: p1,
        t2: p2,
        t3: p3,
        t4: p4}
df = pd.DataFrame(data, index=[0])
df
CodePudding user response:
How about getting the text from all of your h1 and p elements in one go:
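# Skip the first h1, take the next four, and drop the '_room1' suffix
h1s = [h1.text.replace('_room1', '') for h1 in bs.select('h1')[1:5]]
ps = [p.text for p in bs.select('p')]
# Paragraphs 4 to 7 belong to the second heading, so join them into one string
df = pd.DataFrame({
    h1: p
    for h1, p in zip(h1s, [ps[3], ''.join(ps[4:8]), ps[8], ps[9]])
}, index=[0])  # all values are scalars, so an explicit index is required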
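If it helps, here is a self-contained sketch of the same idea that runs as-is. The sample `html` string and the `p_slices` list are my own assumptions for illustration, since the real document isn't shown in the question. It only mirrors the indices used there (headings 1 to 4, paragraphs 3 to 9).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample document; the real `html` from the question is not shown.
html = (
    "<h1>Report</h1>"
    + "".join(f"<h1>title{i}_room1</h1>" for i in range(1, 5))
    + "".join(f"<p>p{i}</p>" for i in range(10))
)

bs = BeautifulSoup(html, "html.parser")

# Query each tag once instead of once per element.
h1s = [h1.text.replace("_room1", "") for h1 in bs.select("h1")[1:5]]
ps = [p.text for p in bs.select("p")]

# One slice of <p> indices per heading (assumed layout);
# slice(4, 8) covers paragraphs 4 to 7.
p_slices = [slice(3, 4), slice(4, 8), slice(8, 9), slice(9, 10)]

row = {h1: "".join(ps[s]) for h1, s in zip(h1s, p_slices)}
df = pd.DataFrame([row])  # a list holding one dict gives a one-row frame
print(df)
```

Keeping the index ranges in one list means that if the page layout changes, only `p_slices` needs to be adjusted rather than a series of separate assignments.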