Batch search multiple keywords across multiple documents and return the text before and after each keyword

Time:11-18

A question for the experts here: I have 2 million PDF documents and 20 keywords. I want to batch search the PDFs for these keywords, return roughly 50 words of text around each keyword hit, and export the results to Excel. I have read a lot of posts but don't understand them. Could an expert please give me some guidance?

CodePudding user response:

Do word segmentation first, then count the statistics.
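
A minimal sketch of "segment, then count", assuming the text is already extracted and is whitespace/punctuation-delimited (for Chinese text a dedicated segmenter would be needed instead of the regex). The function name count_keywords is hypothetical, and only single-word keywords are handled:

```python
import re
from collections import Counter

def count_keywords(text, keywords):
    """Split text into lowercase word tokens, then count each keyword."""
    # Assumption: a "word" is a run of letters/digits; this is a crude segmenter.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for keywords that never occur.
    return {kw: counts[kw.lower()] for kw in keywords}

stats = count_keywords("Risk and return. Risk again.", ["risk", "return", "bond"])
print(stats)
```

This only gives counts per keyword; it does not return the surrounding text, which needs the position of each hit.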

CodePudding user response:

Quoting tianfang's reply on the 1st floor:
Do word segmentation first, then count the statistics.
Can anyone help write the code?

CodePudding user response:

First make sure the PDFs contain readable text. What you actually need to do is traverse the files and try to extract the full text of each PDF first.
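
The traversal could look roughly like the sketch below. It assumes the PDFs sit under one root folder and that the pypdf package is installed; pypdf is my choice for illustration, not something named in this thread. The names extract_pdf_text and iter_pdf_texts are hypothetical. Files that raise an error or yield no text (for example scanned images) are reported with None so they can be handled separately:

```python
import os

def extract_pdf_text(path):
    """Extract the full text of one PDF using pypdf (assumed installed)."""
    from pypdf import PdfReader  # imported lazily so the traversal runs without it
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def iter_pdf_texts(root, extract=extract_pdf_text):
    """Yield (path, text) for every PDF under root; text is None if unreadable."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            try:
                text = extract(path)
            except Exception:
                text = None  # corrupt or encrypted file
            if text is not None and not text.strip():
                text = None  # no text layer, probably a scanned PDF
            yield path, text
```

Because it is a generator, only one document's text is held in memory at a time, which matters with 2 million files.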

CodePudding user response:

Quoting weixin_45903952's reply on the 3rd floor:
First make sure the PDFs contain readable text. What you actually need to do is traverse the files and try to extract the full text of each PDF first.
Yes, the text is readable: I can search it in Foxit PDF, but I can't export the results.
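
For the export step, one hedged option is to write the hits to a CSV file, which Excel opens directly. This is a sketch only: the function name write_hits_csv and the column layout (file, keyword, before, after) are assumptions, not something specified in the thread.

```python
import csv

def write_hits_csv(path, rows):
    """Write (file, keyword, before, after) rows to a CSV file Excel can open."""
    # utf-8-sig adds a byte order mark so Excel detects UTF-8 for non-ASCII text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "keyword", "before", "after"])
        writer.writerows(rows)
```

Usage would be something like write_hits_csv("hits.csv", rows), where rows is a list of 4-item tuples collected during the search.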

CodePudding user response:

Let me give you an idea.
First read the PDF content into a string STR.
Set a substring keywords.
Then index = STR.find(keywords)
This function finds the position where the substring first occurs in the string. If it is found, it returns the position of the substring; if it is not found, it returns -1, so you can test for that.
Once you have the position of the substring, you can slice the string directly to take the 50 words before and after it.
If an article contains the same keyword (substring) many times:
You can call STR.find() again after the first call. Because it only finds the position of the first substring, later calls need to specify the position at which to start searching:
index_2 = STR.find(keywords, index + len(keywords)), where index + len(keywords) is the position at which the search starts.
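
The idea above can be sketched as a loop over str.find(). This is only an illustration: the function name find_contexts is hypothetical, and it assumes that "50 words" means 50 whitespace-separated words (for Chinese text you would probably slice by characters instead).

```python
def find_contexts(text, keyword, window=50):
    """Return (position, before, keyword, after) for every occurrence of keyword."""
    results = []
    if not keyword:
        return results  # an empty keyword would match everywhere
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:  # str.find returns -1 when nothing more is found
            break
        end = index + len(keyword)
        # Take up to `window` words on each side of the hit.
        before = " ".join(text[:index].split()[-window:])
        after = " ".join(text[end:].split()[:window])
        results.append((index, before, keyword, after))
        start = end  # continue searching after this occurrence
    return results

for hit in find_contexts("a b c KEY d e f KEY g", "KEY", window=2):
    print(hit)
```

Splitting the whole prefix on every hit is simple but slow for very long documents; slicing a bounded number of characters around index first would be a cheaper variant.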

CodePudding user response:

As for how to read the PDF content, you can see this article: https://blog.csdn.net/weixin_42812527/article/details/90166966