I want to take query parameters from a client and search through a large list of files (stored on the server computer), so that if the query words are found in those files, they are recommended to the client, who can then choose to load those files into a .
My original plan was MySQL and PHP, and I started learning them, but I have realized this is probably not possible, for the following reason:
I have to store files (.docx, .txt, .html) and query THROUGH their content, not merely through their names.
So what technology should I use for this task? I thought of a Python bot that runs on the server PC and reads the files, based on the query information saved by PHP to a MySQL database, and then returns the target files, which are then loaded into the app with PHP.
CodePudding user response:
That's not a project I would suggest for a beginner, but here is the way I would do it:
Two folders on a webserver: one for new documents, one for processed ones. Whenever you have a new document (like .docx, .txt), you put it in the "new" folder.
Then you need a cronjob that runs, let's say, every 5 minutes and checks whether there are files in the "new" folder. If there are, the cron job processes them, and once a file is done, it is moved to the "processed" folder.
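The answer is PHP-centric, but since the question mentions a Python bot, here is a minimal Python sketch of what the cron job's body could look like (the function name and signature are illustrative, not from the answer):

```python
import shutil
from pathlib import Path

def process_new_documents(new_dir: str, processed_dir: str) -> list:
    """Handle every file waiting in new_dir, then move it to processed_dir.

    Returns the names of the files that were handled.
    """
    new_path = Path(new_dir)
    done_path = Path(processed_dir)
    done_path.mkdir(parents=True, exist_ok=True)

    handled = []
    for doc in sorted(new_path.iterdir()):
        if not doc.is_file():
            continue
        # ...build the word index for this document here...
        shutil.move(str(doc), str(done_path / doc.name))
        handled.append(doc.name)
    return handled
```

A cron entry would then just invoke this script every 5 minutes.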
Now to the processing part:
You build a word index for every file: get all the words that are in the document (explode the content on " " (space) so you get an array, then array_unique it so every word appears just once). With this word list you go to your database.
There you have 3 tables: one with every document you have (table_files), one with every word that exists in any of your documents (table_words), and a last table that puts words in relation to the files in which they can be found (table_words_to_files), which has the following cols: files_id, words_id.
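The three-table layout could be sketched like this. SQLite is used here only to keep the example self-contained; on MySQL, as the answer intends, the id columns would typically be INT AUTO_INCREMENT and the text columns VARCHAR:

```python
import sqlite3

# The answer's three tables: files, words, and the word-to-file relation.
SCHEMA = """
CREATE TABLE table_files (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL UNIQUE
);
CREATE TABLE table_words (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
CREATE TABLE table_words_to_files (
    files_id INTEGER NOT NULL REFERENCES table_files(id),
    words_id INTEGER NOT NULL REFERENCES table_words(id),
    PRIMARY KEY (files_id, words_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```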
Let's say your document "test_doc.txt" contains the word "test". Then you have one entry in table_files with, let's say, the filename (test_doc.txt) and a unique id (211, for example). The word "test" is stored in your index, also having an ID (3323), and finally you add an entry to table_words_to_files like this: files_id=211, words_id=3323. Of course, when you add words to table_words, check that they are not already in the table.
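Assuming the three tables the answer describes already exist, the indexing step (split, dedupe, insert missing words, link them to the file) might look like this in Python; the function name is mine:

```python
import sqlite3

def index_document(conn, filename, text):
    """Store each unique word of `text` and link it to `filename`."""
    words = set(text.split())  # split on whitespace, then dedupe
    cur = conn.cursor()
    cur.execute("INSERT INTO table_files (filename) VALUES (?)", (filename,))
    file_id = cur.lastrowid
    for word in words:
        # only add the word if it is not already in table_words
        cur.execute("INSERT OR IGNORE INTO table_words (word) VALUES (?)",
                    (word,))
        cur.execute("SELECT id FROM table_words WHERE word = ?", (word,))
        word_id = cur.fetchone()[0]
        cur.execute("INSERT INTO table_words_to_files (files_id, words_id) "
                    "VALUES (?, ?)", (file_id, word_id))
    conn.commit()
```

On MySQL the INSERT OR IGNORE would be INSERT IGNORE, but the flow is the same.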
When a user searches for "test", you search table_words for it. If you find it (ID 3323), you go to table_words_to_files and search for all entries that have words_id=3323. With these entries you find the files containing that word.
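The lookup can be sketched the same way, mirroring the two-step search described here (word to words_id, then the relation table to the files); a single JOIN across all three tables would also work:

```python
import sqlite3

def find_files_containing(conn, word):
    """Return the filenames of all indexed files that contain `word`."""
    row = conn.execute("SELECT id FROM table_words WHERE word = ?",
                       (word,)).fetchone()
    if row is None:
        return []  # word is not in the index, so no file contains it
    (word_id,) = row
    rows = conn.execute(
        "SELECT f.filename FROM table_words_to_files wf "
        "JOIN table_files f ON f.id = wf.files_id "
        "WHERE wf.words_id = ?", (word_id,)).fetchall()
    return [filename for (filename,) in rows]
```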
Actually this is the easy part of the program.
The difficult part is to read all kinds of files and extract their content in order to build that word index. Text files are easy, because the whole file content is the actual content you need to index. For Word documents there is a class that can do it (never used it); for HTML I don't know, there might be one, you need to google it ;) otherwise you can still parse the HTML as XML and extract the content yourself, but that's a lot of work.
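For HTML, one way that avoids full manual XML parsing is Python's standard-library html.parser; the rough sketch below collects text content and skips script/style blocks. (For .docx, which is a zip of XML files, a library such as python-docx is one option.)

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html: str) -> str:
    """Return the visible text of `html` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())
```

The output of html_to_text can then go straight into the word-index step like a plain text file.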
Of course you also need a frontend where you let the user search for words and display the results. But that's the easy part.
You can make the index better by not only telling which word is in which file, but also how often each word occurs in each file, or even in which line you find the words. But the basic idea, described above, stays the same.
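Storing counts instead of mere presence would just mean an extra count column on table_words_to_files; the counts themselves are cheap to compute, for example:

```python
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs, instead of only noting presence."""
    return Counter(text.split())
```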
You can skip the "building an index" part and just use something like Elasticsearch, but where is the fun in using something instead of writing it yourself? Besides, it can be more challenging to learn how Elasticsearch works than to build such a simple search index yourself...
I hope this roadmap helps you, but as I said: it's not too difficult, but I don't think it's something an absolute beginner can handle.
Stuff you need:
HTML/CSS for the frontend (JS/ajax to make it fancy)
Backend: PHP and MySQL. If you are new, try XAMPP - that's an easy-to-use web server package that gives you PHP and MySQL.
IDE: PhpStorm comes with a 30-day eval licence. It's the best IDE for PHP (my opinion).