I'm building a program that searches through many different file formats looking for specified keywords, and I'm having some issues finding information on how to read .doc files. I was able to get .docx files to work with the following in a function:
import logging
import os
import re
import xml.dom.minidom
import zipfile

logger = logging.getLogger(__name__)

def docx_file(root: str, file: str, keywords: list) -> list:
    hits = []
    keywords = keywords.copy()
    try:
        # .docx is a zip archive; the document text lives in word/document.xml
        document = zipfile.ZipFile(os.path.join(root, file))
        docXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml()
        for keyword in keywords:
            if re.search(keyword, docXml):
                hits.append(keyword)
    except Exception:
        logger.warning(f"failed to open {root}\\{file}")
        return []
    return hits
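For context, I drive that function from a directory walk, roughly like this (a simplified sketch of my setup; the search root and keyword list below are only placeholders):

# Simplified sketch of how docx_file() is driven; path and keywords are placeholders.
for root, dirs, files in os.walk("/path/to/search"):
    for file in files:
        if file.lower().endswith(".docx"):
            found = docx_file(root, file, ["confidential", "invoice"])
            if found:
                print(f"{os.path.join(root, file)}: {found}")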
Unfortunately, .doc files don't work the same way. I'm looking for a lightweight solution, and would really prefer not to import additional libraries for this functionality.
Thank you in advance for your assistance.
I took a look through the output of the following:
document = zipfile.ZipFile(os.path.join(root,file))
document.filelist
I then ran:
for doc in document.filelist:
    docXml = xml.dom.minidom.parseString(document.read(doc.filename)).toprettyxml(indent=" ")
    print(docXml)
Based on the output, I don't think the .docx solution will work with .doc files.
Edit: Additionally, I'm currently looking into how I might use the output from:
document = open(os.path.join(root, file), 'rb')
for line in document.readlines():
    print(line)
I've tried decoding the output, which just gives:
☻☺àùOh«
'³Ù0l☺◄☺↑Microsoft Office Word@@€ћs¶юШ☺@€ћs¶юШ☺♥☻♥♥.♥юя
for any encoding I've tried so far.
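To illustrate what I mean by trying encodings, here is a rough sketch; the helper name and the codec list are only illustrative, not exhaustive:

# Rough sketch of the decoding attempts; the codec list is only illustrative.
import os

def dump_decodings(root: str, file: str) -> None:
    with open(os.path.join(root, file), 'rb') as stream:
        raw = stream.read()
    for codec in ('utf-8', 'utf-16-le', 'cp1252', 'latin-1'):
        # errors='replace' so undecodable bytes don't abort the dump
        print(codec, raw[:200].decode(codec, errors='replace'))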
CodePudding user response:
I believe I found a suitable solution for my problem. It's a slight variation of a snippet posted here, written by Viktor.
special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}
from pathlib import Path

def doc_strings(path: Path) -> str:
    output_string = ''
    with open(path, 'rb') as stream:
        # Skip ahead to where the readable text appeared to start in my files
        # (offset 2560 = 5 * 512-byte sectors).
        stream.seek(2560)
        current_stream = stream.read(1)
        # Read byte by byte; 0xfa appeared to mark the end of the text.
        while not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                output_string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum() or char == ' ':
                        output_string += char
                except UnicodeDecodeError:
                    output_string += ''
            current_stream = stream.read(1)
    return output_string
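To plug this into the same keyword search as the .docx version, a wrapper along these lines should work (a sketch only; doc_file is just an illustrative name that mirrors the docx_file function from the question):

import re
from pathlib import Path

def doc_file(root: str, file: str, keywords: list) -> list:
    # Sketch: extract the text with doc_strings() above, then search it
    # the same way docx_file() searches the document XML.
    try:
        text = doc_strings(Path(root) / file)
    except OSError:
        return []
    return [keyword for keyword in keywords if re.search(keyword, text)]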
CodePudding user response:
While .docx is XML-in-a-Zip, which is pretty easy to read, .doc files (without the X) are in a binary format that is much more difficult to parse, especially without a third-party library doing exactly that.
You can implement part of a parser for this file format (based on the spec linked by @micromoses), but it may be tedious.
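To give a sense of what that would involve: legacy .doc files are OLE2 / Compound File Binary containers, so the very first step of a hand-rolled parser would be recognizing the container by its 8-byte signature (a minimal sketch, not part of any full parser):

# Minimal sketch: detect an OLE2 / Compound File Binary container (the outer
# format of legacy .doc files) by its well-known 8-byte signature.
# A real parser would then read the sector tables and the WordDocument stream.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def is_ole2_container(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(8) == OLE2_MAGIC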
Another solution, much quicker but also much less reliable, is to just grep the file. I just tested it: I wrote my own name in plain text in a file, saved it as a .doc, and I can grep it. That's because the word is probably byte-aligned in the file format, so that it can be matched.
Here is a quick demo:
from pathlib import Path

filepath = Path("/home/stack_overflow/parse_me.doc")
word = "Pinjon"

with open(filepath, "rb") as doc_file:
    binary_data = doc_file.read()

if word.encode("utf-16-le") in binary_data:
    print("word found (LE)")
elif word.encode("utf-16-be") in binary_data:
    print("word found (BE)")
else:
    print("word not found")
word found (LE)
My computer is little-endian, but I covered both cases (LE/BE).
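If you want to slot that into your existing keyword loop, a rough adaptation could look like the following (doc_grep is just an illustrative name, and the set of encodings to try is an assumption, since .doc files may store text as UTF-16 or in a legacy 8-bit code page):

import os

def doc_grep(root: str, file: str, keywords: list) -> list:
    # Sketch: search the raw bytes of the .doc for each keyword, encoded in a
    # few plausible ways. The list of encodings is an assumption.
    hits = []
    with open(os.path.join(root, file), "rb") as doc_file:
        binary_data = doc_file.read()
    for keyword in keywords:
        for encoding in ("utf-16-le", "utf-16-be", "cp1252"):
            if keyword.encode(encoding, errors="ignore") in binary_data:
                hits.append(keyword)
                break
    return hits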