We have a stack of roughly 3000 files: a mix of Office documents, PDFs, zips, what look like some .DB files, and some file types I have not seen before (.bfa, .ses). I was asked to review and confirm whether all the files are viewable in a standard browser. Does anyone know of a smart way to check this, versus having people open the files one at a time?
I do not have much experience writing code, but I have used existing SQL and shell scripts in the past.
CodePudding user response:
You can do a simple magic-number test as a first pass. For PDFs, the comparison below shows a suspected rogue file, b4ascii.pdf:
>dir /B *.pdf >pdfs.txt
>findstr /B /M "%PDF-" *.pdf>match.txt
>fc pdfs.txt match.txt
Comparing files pdfs.txt and MATCH.TXT
***** pdfs.txt
b4.pdf
b4ascii.pdf
bad2.pdf
***** MATCH.TXT
b4.pdf
bad2.pdf
*****
You can apply a stricter check to suspect PDFs by also looking for the end-of-file marker:
>findstr /B /M "%%EOF" *.pdf>match.txt
This is likely to weed out bad downloads, though a file that fails this test may still be valid, just more suspect. In my typical top 100 PDFs, that first suspect is now joined by three others. On testing, those three seem fine, just perhaps not exactly standard, but the one file that appears in both lists will not display, and turns out to be a badly named HTML (.htm) file.
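If you would rather script this than eyeball fc output, the same two PDF checks can be sketched in a few lines of Python (assuming Python 3 is available on the machine; the function name is my own, not a standard API):

```python
from pathlib import Path

def looks_like_pdf(path):
    """Cheap structural check: a real PDF starts with '%PDF-' and
    carries the '%%EOF' marker near the end of the file."""
    data = Path(path).read_bytes()
    # Checking only the last 1 KB keeps this fast on large files;
    # trailing whitespace or junk after %%EOF is tolerated.
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

A mis-served HTML error page saved as .pdf fails both tests, which is exactly the kind of rogue file the fc comparison above surfaces.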
You can do something similar with zips, which covers all the Office .xxxX formats too (.docx, .xlsx, .pptx are zip containers). Note it does not matter if one of the types is not found; the two lists should still match.
>dir /B *.zip *.docx *.xlsx >zips.txt
>findstr /B /M "PK" *.zip *.docx *.xlsx>match.txt
FINDSTR: Cannot open *.docx
FINDSTR: Cannot open *.xlsx
>fc zips.txt match.txt
Comparing files zips.txt and MATCH.TXT
FC: no differences encountered
>
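The "PK" signature only proves the first two bytes look right. Python's standard-library zipfile module can go further and verify every member's CRC, which catches truncated or corrupted archives that still start with "PK" (a sketch; the function name is mine):

```python
import zipfile

def zip_is_ok(path):
    """Return True if the file is a readable zip archive whose
    members all pass their CRC check (.docx/.xlsx/.pptx included)."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first bad member,
            # or None if every member checks out.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

This is slower than the findstr pass, but it is a real integrity check rather than a two-byte sniff.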
Generally, without a dedicated checking utility, there is no way to walk every part of a PDF or every member of a zip to know if there is a rogue one. The simplest method is to run, say, a text extractor that sets an errorlevel to indicate it hit a problem.
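To scale the magic-number idea across the whole 3000-file stack in one pass, you can map each extension to its expected leading bytes and report mismatches. A minimal sketch (the MAGIC table and function name are my own; extend the table for the other types you find):

```python
import os

# Hypothetical extension -> expected leading-bytes map; extend as needed.
MAGIC = {
    ".pdf":  b"%PDF-",
    ".zip":  b"PK",
    ".docx": b"PK",
    ".xlsx": b"PK",
    ".pptx": b"PK",
}

def find_suspects(folder):
    """Yield names of files whose first bytes do not match the
    signature expected for their extension."""
    for name in sorted(os.listdir(folder)):
        ext = os.path.splitext(name)[1].lower()
        magic = MAGIC.get(ext)
        if magic is None:
            continue  # unknown type (.bfa, .ses, .db): skip, don't guess
        path = os.path.join(folder, name)
        with open(path, "rb") as f:
            if not f.read(len(magic)).startswith(magic):
                yield name
```

Unknown extensions like .bfa and .ses are deliberately skipped here; those need identifying before you can decide what "valid" means for them.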