I have log files that contain many lines of the form:
<log uri="Brand" t="2017-01-24T11:33:54" u="Rohan" a="U" ref="00000000-2017-01" desc="This has been updated."></log>
I am trying to convert each line in the log file into a DataFrame and store it in CSV or Excel format. I only want the values of uri, t (the timestamp), u (the username) and desc (the description).
Something like this
Columns :- uri Date Time User Description
Brand 2017-01-24 11:33:54 Rohan This has been updated.
and so on.
CodePudding user response:
As mentioned by @Corralien in the comments, you can use some BeautifulSoup functions (the BeautifulSoup constructor and find_all) to parse each log entry in your log file, then use the pandas.DataFrame constructor with a list comprehension to make a DataFrame for each line:
import pandas as pd
import bs4  # pip install beautifulsoup4 html5lib

with open("/tmp/logfile.txt", "r") as f:
    logFile = f.read()

soupObj = bs4.BeautifulSoup(logFile, "html5lib")

# one single-row DataFrame per <log> element; "t" is split into Date and Time
dfList = [pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])],
                       columns=["uri", "Date", "Time", "User", "Description"])
          for x in soupObj.find_all("log")]

# this block creates an Excel file for each df
for lineNumber, df in enumerate(dfList, start=1):
    df.to_excel(f"logfile_{lineNumber}.xlsx", index=False)
Output :
print(dfList[0])
     uri        Date      Time   User             Description
0  Brand  2017-01-24  11:33:54  Rohan  This has been updated.
Update: If you need a single DataFrame/spreadsheet for all the lines, use this:
with open("/tmp/logfile.txt", "r") as f:
    soupObj = bs4.BeautifulSoup(f, "html5lib")

# one row per <log> element, collected into a single DataFrame
df = pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])
                   for x in soupObj.find_all("log")],
                  columns=["uri", "Date", "Time", "User", "Description"])

df.to_excel("logfile.xlsx", index=False)
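Since the question also mentions CSV, here is a rough sketch of the same idea using only the standard library (xml.etree.ElementTree and csv) instead of BeautifulSoup and pandas. It assumes every log entry sits on its own line and is a well-formed XML element like the sample line; the helper names parse_log_line and logs_to_csv are made up for this illustration, and malformed lines are simply skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["uri", "Date", "Time", "User", "Description"]


def parse_log_line(line):
    """Turn one <log ...></log> line into a row, or None if it cannot be parsed.

    Hypothetical helper: assumes one well-formed XML element per line.
    """
    line = line.strip()
    if not line:
        return None
    try:
        elem = ET.fromstring(line)
    except ET.ParseError:
        return None  # skip malformed lines instead of aborting the whole run
    if elem.tag != "log":
        return None
    # "2017-01-24T11:33:54" -> ("2017-01-24", "11:33:54")
    date, _, time = elem.get("t", "").partition("T")
    return [elem.get("uri", ""), date, time, elem.get("u", ""), elem.get("desc", "")]


def logs_to_csv(lines, out):
    """Write a header plus one CSV row per parsable log line to the file-like `out`."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for line in lines:
        row = parse_log_line(line)
        if row is not None:
            writer.writerow(row)


# Demo on the sample line from the question, written to an in-memory buffer.
sample = [
    '<log uri="Brand" t="2017-01-24T11:33:54" u="Rohan" a="U" '
    'ref="00000000-2017-01" desc="This has been updated."></log>'
]
buffer = io.StringIO()
logs_to_csv(sample, buffer)
print(buffer.getvalue())
```

To work on a real file, pass the opened log file as `lines` and an output file opened with newline="" as `out` (for example a file named logfile.csv, which is just a placeholder name). If you already have the DataFrame from the answer above, df.to_csv("logfile.csv", index=False) should give an equivalent result.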