Combine multiple HTML files into one html file Using Python

Time:08-25

I'm working in Jupyter and need to combine (merge) multiple HTML files into a single HTML file.

Any ideas how?

I did this with Excel files, but the same approach didn't work with HTML files:

import os
import pandas as pd

data_folder = 'C:\\Users\\hhhh\\Desktop\\test'


df = []
for file in os.listdir(data_folder):
    if file.endswith('.xlsx'):
        print('Loading file {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_folder , file), sheet_name='sheet1'))

CodePudding user response:

Sounds like a task for Beautiful Soup.

I assume you want to take everything inside the <body> tag of each HTML document and then combine those pieces into one document.

Maybe something like:

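import os
from bs4 import BeautifulSoup

data_folder = 'C:\\Users\\hhhh\\Desktop\\test'  # same folder as in the question

# Start from an empty document and give it <html><body></body></html>
output_doc = BeautifulSoup("", "html.parser")
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))

for file in sorted(os.listdir(data_folder)):
    if not file.lower().endswith('.html'):
        continue

    # os.listdir returns bare file names, so join them with the folder.
    # encoding='utf-8' is an assumption about the input files.
    with open(os.path.join(data_folder, file), 'r', encoding='utf-8') as html_file:
        body = BeautifulSoup(html_file.read(), "html.parser").body

    if body is not None:  # skip files that have no <body> tag
        # Copy the list first: appending moves nodes out of the source tree
        output_doc.body.extend(list(body.contents))

print(output_doc.prettify())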
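
Since the goal is a single HTML file and not just printed output, here is a rough sketch of the same idea wrapped in a function that writes the result to disk. The function name merge_html_bodies and the output file name are made up for illustration. It assumes the input files are UTF-8 and that only the <body> contents matter, so anything in <head> (styles, scripts, titles) is dropped.

```python
import os
from bs4 import BeautifulSoup


def merge_html_bodies(folder, output_path):
    """Merge the <body> contents of every .html file in `folder` into one file.

    Hypothetical helper: files are processed in sorted name order and
    are assumed to be UTF-8 encoded.
    """
    merged = BeautifulSoup("<html><body></body></html>", "html.parser")
    output_abs = os.path.abspath(output_path)

    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".html"):
            continue
        path = os.path.join(folder, name)
        # Do not merge a previous output file back into itself
        if os.path.abspath(path) == output_abs:
            continue
        with open(path, "r", encoding="utf-8") as f:
            body = BeautifulSoup(f.read(), "html.parser").body
        if body is None:
            continue  # no <body> tag in this file, nothing to copy
        # Iterate over a copy: append() moves each node out of `body`
        for node in list(body.contents):
            merged.body.append(node)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(str(merged))
    return output_path
```

Calling merge_html_bodies(data_folder, os.path.join(data_folder, 'merged.html')) should then leave the combined document in merged.html, which you can open in a browser to check the result.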