I have a MySQL database where I add news articles. Before adding a new article, I try to compare it with the last 100 articles to see whether it is similar to any of them.
If it is 95% or more similar, I can tag it as the same as, say, article 122; if it is 70-95% similar, I can tag it as similar to, say, article 133.
What is the best way to do this?
Is there a built-in MySQL function that can do it?
Or do I need to use Python and compare the new article against the other 100 articles in a loop?
From what I read in forums, Python is the best way, but I tried a library to compare string1 (article1) with string2 (article2), and even when the articles are totally different it tells me they are 70% the same.
I think this is because of common words such as: and, he, she, will, news, text, or, the, I.
CodePudding user response:
If you are on Linux, you can call the diff command from Python and play with its parameters. A teacher did this a few years ago to detect copying in a programming exam, and it worked even after the code had been reformatted.
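If shelling out to diff is inconvenient, a similar line-based comparison can be sketched with Python's standard difflib module. This swaps in difflib for the external diff command, so it is not the exact approach described above; the function name line_similarity is made up for illustration.

```python
import difflib

def line_similarity(text_a, text_b):
    """Return a 0..1 ratio of matching lines (hypothetical helper)."""
    # Strip whitespace and drop blank lines so reformatting has less effect.
    lines_a = [ln.strip() for ln in text_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in text_b.splitlines() if ln.strip()]
    # ratio() is 2 * matches / total number of lines in both inputs.
    return difflib.SequenceMatcher(None, lines_a, lines_b).ratio()
```

Because it compares whole lines, this is probably better suited to code or line-oriented text than to news articles that have been reworded.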
CodePudding user response:
You mention that the library reports 70% similarity even for totally different articles, and that you suspect common words such as and, he, she, will, or, the, I.
I would suggest removing stopwords before comparing; that might help. MySQL ships a default list:
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
That query returns MySQL's default stopword list. For more information, see the MySQL Full-Text Stopwords documentation and Fine-Tuning MySQL Full-Text Search.
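As a rough sketch of the stopword idea in Python: strip stopwords, compare word counts with cosine similarity, and apply the 95% and 70% thresholds from the question. The short STOPWORDS set, the function names, and the choice of cosine similarity are all assumptions for illustration; in practice the list could be loaded from the query above, and the thresholds would need tuning on real articles.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not MySQL's actual list).
STOPWORDS = {"a", "an", "and", "or", "the", "i", "he", "she", "it",
             "is", "in", "of", "to", "will", "news", "text"}

def tokens(text):
    """Lowercase the text, split into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two word-count vectors, from 0.0 to 1.0."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(new_article, recent_articles):
    """recent_articles maps article id -> text (e.g. the last 100 rows).

    Returns (label, best_matching_id_or_None, best_score).
    """
    best_id, best_score = None, 0.0
    for article_id, text in recent_articles.items():
        score = cosine_similarity(new_article, text)
        if score > best_score:
            best_id, best_score = article_id, score
    if best_score >= 0.95:
        return ("same", best_id, best_score)
    if best_score >= 0.70:
        return ("similar", best_id, best_score)
    return ("new", None, best_score)
```

With the stopwords removed, two unrelated articles that only share words like "the" and "and" score 0 instead of inflating the similarity.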