Searching text in PDF using PHP

Time:11-14

I have a large database with roughly 5 lakh (500K) entries, and every entry has at least one PDF file associated with it. I need a robust way to search those PDF files for a particular text and, when it is found, return the corresponding 'id'.

Please share some fast, optimized ways to search for text in a PDF using PHP. Any ideas will be appreciated.

Note: Converting each PDF to text and then searching it is not what I am looking for, since that will obviously take longer.

In short, I need the best way to search for text in PDFs using PHP.

CodePudding user response:

If this is a one-time task, there is probably no 'fast' solution.

If this is a recurring task,

  1. Extract the text via some tool. (Sorry, I don't know of a tool.)
  2. Store that text in a database table.
  3. Apply a FULLTEXT index to that table.

Now the searching will be fast.
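
To illustrate why indexing once makes later searches fast, here is a minimal sketch of a toy in-memory inverted index. It is not the database FULLTEXT index recommended above, only a stand-in that shows the same idea: the text is scanned once, and each search afterwards is a lookup. The `records` data and the `buildIndex` and `search` names are hypothetical, and the text is assumed to have been extracted from the PDFs already.

```javascript
// Toy inverted index: maps each lowercase word to the set of record ids
// whose extracted text contains that word.
function buildIndex(records) {
  const index = new Map();
  for (const { id, text } of records) {
    // Split the text into lowercase alphanumeric words.
    for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(id);
    }
  }
  return index;
}

// A search is a single map lookup; no document is re-read.
function search(index, word) {
  return [...(index.get(word.toLowerCase()) || [])];
}

// Hypothetical sample data standing in for text extracted from the PDFs.
const records = [
  { id: 101, text: "Invoice for chocolate cake" },
  { id: 102, text: "Shipping manifest" },
  { id: 103, text: "Cake recipe, revised" },
];

const index = buildIndex(records);
console.log(search(index, "cake")); // ids 101 and 103
```

A real FULLTEXT index adds stemming, stop words and relevance ranking on top of this, and keeps the index on disk.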

CodePudding user response:

I wrote a website in ReactJS to search for information in PDF files (books), which I indexed using the Apache Solr search engine.

What I did in React is, in essence:

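    // Combine the search terms into a single OR group, e.g. (chocolate OR cake)
    queryValue = "(" + queryValueTerms.join(" OR ") + ")"

    let query = "http://localhost:8983/solr/richText/select?q="
    let queryElements = []

    if (searchValue) {
      // Search within the "text" field of the index
      queryElements.push("text:" + queryValue)
    }

    ...

    // Send the query to Solr and store the matching documents and hit count
    fetch(query)
      .then(res => res.json())
      .then((result) => {
        setSearchResults(prepareResults(result.response.docs, result.highlighting))
        setTotal(result.response.numFound)
        setHasContent(result.response.numFound > 0)
      })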

This results in an HTTP call:

http://localhost:8983/solr/richText/select?q=text:(chocolate OR cake)
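
Since the goal in the question is to get back the matching ids, the request above can be sketched as a small self-contained function. This is only an illustration under some assumptions: `buildSearchUrl` and `findIds` are hypothetical names, each indexed document is assumed to carry the database id in a field called `id`, and the terms are assumed to be plain words (Solr query syntax characters are not escaped here). `fl=id` is Solr's field-list parameter and limits the response to that field.

```javascript
// Core name and field name taken from the call shown above.
const SOLR_SELECT_URL = "http://localhost:8983/solr/richText/select";

// Build the select URL for an OR query over the "text" field,
// URL-encoding the query so spaces and colons survive the request.
function buildSearchUrl(terms) {
  const q = "text:(" + terms.join(" OR ") + ")";
  return SOLR_SELECT_URL + "?q=" + encodeURIComponent(q) + "&fl=id";
}

// Run the query and return only the ids of the matching documents.
// Assumes each indexed document has an "id" field.
async function findIds(terms) {
  const res = await fetch(buildSearchUrl(terms));
  if (!res.ok) throw new Error("Solr returned HTTP " + res.status);
  const body = await res.json();
  return body.response.docs.map((doc) => doc.id);
}

console.log(buildSearchUrl(["chocolate", "cake"]));
```

Note that Solr returns only the first 10 matches unless a `rows` parameter is added.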

Since this is ReactJS, and only fragments of the code, it is of little direct use to you in PHP; I just wanted to demonstrate the approach. From PHP you would make the same HTTP request with cURL or a similar HTTP client.

I did the indexing itself in a separate service: a rather small Java program that uses SolrJ, Solr's own Java client library, to add PDF files to the Solr index.

If you opt for indexing with Java and SolrJ (it was the easiest option for me, even though I hadn't written Java in years), here are some useful resources and examples that I collected after an extensive search:

https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj

I basically copied what's here: https://lucidworks.com/post/indexing-with-solrj/ and tweaked it for my needs.

Tip: Since I was very rusty with Java, instead of setting up classpaths etc., my quick solution was to copy ALL the libraries from Solr's solrj folder into my Java project, and possibly some other libraries as well. It may be ugly, but it did the job for me.
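
If Java is not an option, Solr can also extract and index a PDF over plain HTTP through its extracting request handler (Solr Cell). This is a different route from the SolrJ program described above, sketched here under several assumptions: the handler is enabled for the `richText` core at `/update/extract`, the extracted body should land in the `text` field, and `buildExtractUrl` and `indexPdf` are hypothetical names. `literal.id` attaches your own database id to the document, so a later search can return it.

```javascript
// Assumes the extracting request handler is enabled on this core.
const SOLR_EXTRACT_URL = "http://localhost:8983/solr/richText/update/extract";

// Build the extract URL: literal.id sets the document id,
// fmap.content maps the extracted body to the "text" field,
// and commit=true makes the document searchable right away.
function buildExtractUrl(id) {
  const params = new URLSearchParams({
    "literal.id": String(id),
    "fmap.content": "text",
    commit: "true",
  });
  return SOLR_EXTRACT_URL + "?" + params.toString();
}

// POST the raw PDF bytes (e.g. a Buffer or Uint8Array) to Solr.
async function indexPdf(id, pdfBytes) {
  const res = await fetch(buildExtractUrl(id), {
    method: "POST",
    headers: { "Content-Type": "application/pdf" },
    body: pdfBytes,
  });
  if (!res.ok) throw new Error("Solr returned HTTP " + res.status);
}

console.log(buildExtractUrl(42));
```

With hundreds of thousands of files, committing after every document is slow; it is usually better to drop `commit=true` and commit once per batch.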
