Finding links from the website using threads

Time:11-23

I'm working on a program which gets all the links from a website and searches them for an input word. It then enters each of those links and searches again, and so on. The program does this 3 times (that's why n is 3). The code below does it recursively and seems to work just fine.

However, I would like to speed this process up by using threads. How can I implement this? From what I've heard, I can probably use fork/join for that.

 public static void getLinks(String url, Set<String> urls, String word, int n) {
    if(url.contains(word)) {
        System.out.println("Found: " + url);
    }

    if (urls.contains(url)) {
        return;
    }
    urls.add(url);

    if(n<3) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("a[href]");
            for (Element element : elements) {
                System.out.println(element.absUrl("href"));
                getLinks(element.absUrl("href"), urls, word, n + 1);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    } else return;
}

public static void main(String[] args) {
    Set<String> links = new HashSet<>();
    String word = "root";
    getLinks("https://example.com", links, word, 0);
}

PS: in the final version of the program, the links that match the input word will be printed in a GUI.

CodePudding user response:

You can use a worker queue to which you submit runnables to be executed. As you discover links, you submit tasks to crawl the pages behind them.

Basically, have a producer of work and a consumer of work.

https://www.baeldung.com/java-blocking-queue
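
To make the pattern concrete, here is a minimal sketch of such a worker queue. It uses a hypothetical in-memory link graph (`GRAPH`) in place of real Jsoup fetches so it runs standalone; the class name `QueueCrawler`, the `crawl` method, and the `pending` counter are illustrative choices, not from the question:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueCrawler {

    // A task is a URL plus the depth at which it was found.
    record Task(String url, int depth) {}

    // Hypothetical in-memory link graph standing in for Jsoup's
    // Document/Elements parsing, so the sketch runs without network access.
    static final Map<String, List<String>> GRAPH = Map.of(
            "https://example.com", List.of("https://example.com/root", "https://example.com/a"),
            "https://example.com/a", List.of("https://example.com/root/b"));

    public static Set<String> crawl(String start, String word, int maxDepth, int workers) {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Set<String> matches = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(0); // tasks queued or in progress

        visited.add(start);
        pending.incrementAndGet();
        queue.add(new Task(start, 0));

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        // Consumer side: take the next page to process.
                        Task t = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (t == null) {
                            if (pending.get() == 0) return; // nothing queued, nothing running
                            continue;
                        }
                        try {
                            if (t.url().contains(word)) matches.add(t.url());
                            if (t.depth() < maxDepth) {
                                // Producer side: discovered links become new tasks.
                                for (String link : GRAPH.getOrDefault(t.url(), List.of())) {
                                    if (visited.add(link)) {
                                        pending.incrementAndGet();
                                        queue.add(new Task(link, t.depth() + 1));
                                    }
                                }
                            }
                        } finally {
                            pending.decrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(crawl("https://example.com", "root", 3, 4));
    }
}
```

The `pending` counter is what lets the workers decide when to stop: a worker only exits when the queue is empty and no other worker is still processing a task that could produce more links.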

CodePudding user response:

The simple way is to submit getLinks to a thread pool while iterating through the Elements:

    static ExecutorService executorService = Executors.newCachedThreadPool();

    public static void getLinks(String url, Set<String> urls, String word, int n) {
        if (url.contains(word)) {
            System.out.println("Found: " + url);
        }
        // urls is now accessed from several threads, so it must be a
        // concurrent set, e.g. ConcurrentHashMap.newKeySet()
        if (!urls.add(url)) {
            return;
        }
        if (n < 3) {
            try {
                Document doc = Jsoup.connect(url).get();
                for (Element element : doc.select("a[href]")) {
                    executorService.submit(() -> getLinks(element.absUrl("href"), urls, word, n + 1));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Note that main must also shut the pool down and await termination before reading the results, otherwise the program may finish while workers are still crawling.
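
Since the question also mentions fork/join: the same recursion maps naturally onto a RecursiveAction, where each page forks one subtask per discovered link. This is a minimal sketch, again using a hypothetical in-memory link graph in place of Jsoup so it runs standalone; the names `ForkJoinCrawler` and `CrawlTask` are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinCrawler {

    // Hypothetical in-memory link graph standing in for Jsoup fetches.
    static final Map<String, List<String>> GRAPH = Map.of(
            "https://example.com", List.of("https://example.com/root", "https://example.com/a"),
            "https://example.com/a", List.of("https://example.com/root/b"));

    static class CrawlTask extends RecursiveAction {
        final String url;
        final int depth;
        final String word;
        final Set<String> visited; // must be a thread-safe set
        final Set<String> matches;

        CrawlTask(String url, int depth, String word, Set<String> visited, Set<String> matches) {
            this.url = url; this.depth = depth; this.word = word;
            this.visited = visited; this.matches = matches;
        }

        @Override
        protected void compute() {
            if (!visited.add(url)) return;           // already crawled
            if (url.contains(word)) matches.add(url);
            if (depth >= 3) return;                  // same depth limit as the question
            List<CrawlTask> subtasks = new ArrayList<>();
            for (String link : GRAPH.getOrDefault(url, List.of())) {
                subtasks.add(new CrawlTask(link, depth + 1, word, visited, matches));
            }
            invokeAll(subtasks);                     // fork all children and wait for them
        }
    }

    public static Set<String> crawl(String start, String word) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Set<String> matches = ConcurrentHashMap.newKeySet();
        new ForkJoinPool().invoke(new CrawlTask(start, 0, word, visited, matches));
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(crawl("https://example.com", "root"));
    }
}
```

A nice property of fork/join here is that `invoke` blocks until the whole crawl tree is done, so there is no separate "how do I know when it's finished" problem.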