I am currently building a news API. To fetch news, I use Java to parse XML from a list of RSS feeds (URLs) and write the articles to a MySQL database. This runs at a regular interval, i.e. every 5 minutes.
Because the fetched feeds are often identical or only slightly changed since the previous run, I currently get a lot of duplicate-entry exceptions:
2021-10-08 11:29:10.296 WARN 51007 --- [ scheduling-1] o.h.engine.jdbc.spi.SqlExceptionHelper : SQL Error: 1062, SQLState: 23000
2021-10-08 11:29:10.296 ERROR 51007 --- [ scheduling-1] o.h.engine.jdbc.spi.SqlExceptionHelper : (conn=1850) Duplicate entry 'https://www.bild.de/regional/nuernberg/nuernberg-news/verwaltung' for key 'article.UK_o0bdhqfwhuu9g9y35687dmqhq'
I could check whether each entry already exists in the database via its GUID, but issuing such a check for every single article seems bad performance-wise.
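For illustration, that per-article check would look roughly like this, using a Spring Data derived query (the existsByGuid method name and the guid field are sketched, not actual code):

// In ArticleRepository: derived query, assumes Article has a unique guid field.
boolean existsByGuid(String guid);

// In the scheduled fetch loop: costs one extra SELECT per article.
if (!articleRepository.existsByGuid(article.getGuid())) {
    articleRepository.save(article);
}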
I also thought about loading all articles once and building an in-memory map of GUIDs at runtime, so I could tell whether an article already exists without hitting the database each time. But with well over 100,000 articles I dismissed that option too.
I would be happy to hear how you would approach this problem and whether my assumptions about performance are wrong. Thanks in advance!
This is my current implementation:
for (SyndEntry syndEntry : feed.getEntries()) {
    Article article = customMappingSyndEntryImplToArticle(syndEntry, rssFeed);
    try {
        articleRepository.save(article);
    } catch (DataIntegrityViolationException e) {
        log.error("Duplicate record found while saving data {}", e.getLocalizedMessage());
    } catch (Exception e) {
        log.error("Error while saving data {}", e.getLocalizedMessage());
    }
}
CodePudding user response:
Can you really tell if two documents are duplicates? For example, I have seen two identical articles with different headlines.
So, assuming you can say which part(s) need to be checked for duplicates, create a UNIQUE index on them in the table holding the news articles.
But there is a problem: a UNIQUE index is limited in size. In particular, the full text of an article is likely to exceed that limit.
So... take a "hash" or "digest" of that string and store it in a column with a UNIQUE index. Then, when you try to insert the same article again, you will get a duplicate-key error.
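A minimal sketch of the schema change, assuming the table is named article and the deduplicated text lives in a content column (both names are illustrative):

-- Illustrative names; on an already-populated table, backfill the digest
-- before adding the index, or the empty values will collide.
ALTER TABLE article ADD COLUMN content_digest CHAR(32);
UPDATE article SET content_digest = MD5(content);
ALTER TABLE article ADD UNIQUE KEY uk_content_digest (content_digest);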
Well, the "error" can be avoided by saying INSERT IGNORE ....
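For example, with the digest column sketched above (column names are still illustrative):

-- IGNORE turns the duplicate-key error into a warning and skips the row.
INSERT IGNORE INTO article (guid, title, content, content_digest)
VALUES (?, ?, ?, MD5(?));

Note that IGNORE downgrades other errors as well (e.g. invalid values), so INSERT ... ON DUPLICATE KEY UPDATE is a common alternative when you want duplicates skipped but everything else to fail loudly.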
A simple and adequate hash for this task is the function MD5(), available in SQL and most application languages. It generates a constant-length string (32 hex characters) that is, for practical purposes, as unique as the text it is "digesting".
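If you would rather compute the digest on the Java side and keep saving through your repository, a sketch (UTF-8 encoding assumed; how you attach the digest to your Article entity is up to you):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Produces the same 32-character hex string as MySQL's MD5() function.
static String md5Hex(String text) {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException("MD5 unavailable", e); // MD5 ships with every standard JVM
    }
}

Set the digest on the entity before articleRepository.save(article) and let the UNIQUE index reject the repeats.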