I'm working in Jupyter and need to combine (merge) multiple HTML files into a single HTML file.
Any ideas how?
I did this with Excel files, but the same approach didn't work with HTML files:
import os
import pandas as pd

# Raw string so the backslashes in the Windows path are taken literally.
data_folder = r'C:\Users\hhhh\Desktop\test'
df = []
for file in os.listdir(data_folder):
    if file.endswith('.xlsx'):
        print('Loading file {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_folder, file), sheet_name='sheet1'))
CodePudding user response:
Sounds like a task for Beautiful Soup.
You would get anything inside the <body> tag of each HTML document, I assume, and then combine them.
Maybe something like:
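import os
from bs4 import BeautifulSoup

data_folder = r'C:\Users\hhhh\Desktop\test'  # same folder as in the question

# Start from an empty document and give it an <html><body> skeleton.
output_doc = BeautifulSoup("", "html.parser")
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))

for file in os.listdir(data_folder):
    if not file.lower().endswith('.html'):
        continue
    # os.listdir returns bare file names, so join them with the folder.
    with open(os.path.join(data_folder, file), 'r') as html_file:
        body = BeautifulSoup(html_file.read(), "html.parser").body
    if body is not None:
        # Copy the child list first: extend() moves nodes out of body.
        output_doc.body.extend(list(body.contents))

print(output_doc.prettify())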
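Since the goal is a single HTML file rather than printed output, the same idea can be wrapped in a function that writes the result to disk. This is only a sketch: the name `merge_html_bodies` and its parameters are made up for illustration, the files are assumed to be UTF-8, and files are processed in alphabetical order, which may or may not be the order you want.

```python
import os
from bs4 import BeautifulSoup


def merge_html_bodies(folder, output_path):
    """Merge the <body> contents of every .html file in `folder` into one file.

    Hypothetical helper, not part of Beautiful Soup.
    """
    merged = BeautifulSoup("<html><body></body></html>", "html.parser")
    # Resolve the output file so it is never read back in as an input.
    output_abs = os.path.abspath(output_path)
    for name in sorted(os.listdir(folder)):  # sorted for a stable order
        path = os.path.join(folder, name)
        if not name.lower().endswith(".html"):
            continue
        if os.path.abspath(path) == output_abs:
            continue
        with open(path, "r", encoding="utf-8") as html_file:  # assumes UTF-8
            soup = BeautifulSoup(html_file.read(), "html.parser")
        # Fall back to the whole document when a file has no <body> tag.
        source = soup.body if soup.body is not None else soup
        # Copy the child list first: appending moves nodes out of `source`.
        merged.body.extend(list(source.contents))
    with open(output_path, "w", encoding="utf-8") as out_file:
        out_file.write(str(merged))
    return merged
```

You could call it as `merge_html_bodies(data_folder, os.path.join(data_folder, 'merged.html'))`. Note that only the body contents are kept, so anything in each file's <head> (styles, scripts, titles) is dropped and would need to be handled separately.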