Finding links from the website using threads

Time:11-23

I'm working on a program which gets all the links from a website and searches them for an input word. It then enters each of those links and searches again, and so on. The program does this 3 times (that's why n is 3). The code below does it recursively and seems to work just fine.

However, I would like to speed this process up by using threads. How can I implement this? From what I've heard, I can probably use fork/join for that.

 public static void getLinks(String url, Set<String> urls, String word, int n) {
    if(url.contains(word)) {
        System.out.println("Found: " + url);
    }

    if (urls.contains(url)) {
        return;
    }
    urls.add(url);

    if(n<3) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("a[href]");
            for (Element element : elements) {
                System.out.println(element.absUrl("href"));
                getLinks(element.absUrl("href"), urls, word, n + 1);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    } else return;
}

public static void main(String[] args) {
    Set<String> links = new HashSet<>();
    String word = "root";
    getLinks("https://example.com", links, word, 0);
}

PS: in the final version of the program, the links that match the input word will be printed in a GUI.

CodePudding user response:

You can use a worker queue to which you submit runnables to be executed. As you discover links, you submit tasks to crawl the pages behind them.

Basically, have a producer of work and a consumer of work.

https://www.baeldung.com/java-blocking-queue
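
To make the pattern concrete, here is a minimal sketch of such a worker queue. It uses a hypothetical in-memory link graph (`GRAPH`) in place of real Jsoup fetches so it runs standalone; the class name `QueueCrawler`, the `crawl` method, and the `pending` counter are illustrative choices, not from the question:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueCrawler {

    // A task is a URL plus the depth at which it was found.
    record Task(String url, int depth) {}

    // Hypothetical in-memory link graph standing in for Jsoup's
    // Document/Elements parsing, so the sketch runs without network access.
    static final Map<String, List<String>> GRAPH = Map.of(
            "https://example.com", List.of("https://example.com/root", "https://example.com/a"),
            "https://example.com/a", List.of("https://example.com/root/b"));

    public static Set<String> crawl(String start, String word, int maxDepth, int workers) {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Set<String> matches = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(0); // tasks queued or in progress

        visited.add(start);
        pending.incrementAndGet();
        queue.add(new Task(start, 0));

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        // Consumer side: take the next page to process.
                        Task t = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (t == null) {
                            if (pending.get() == 0) return; // nothing queued, nothing running
                            continue;
                        }
                        try {
                            if (t.url().contains(word)) matches.add(t.url());
                            if (t.depth() < maxDepth) {
                                // Producer side: discovered links become new tasks.
                                for (String link : GRAPH.getOrDefault(t.url(), List.of())) {
                                    if (visited.add(link)) {
                                        pending.incrementAndGet();
                                        queue.add(new Task(link, t.depth() + 1));
                                    }
                                }
                            }
                        } finally {
                            pending.decrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(crawl("https://example.com", "root", 3, 4));
    }
}
```

The `pending` counter is what lets the workers decide when to stop: a worker only exits when the queue is empty and no other worker is still processing a task that could produce more links.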

CodePudding user response:

The simple way is to submit getLinks to a thread pool while iterating through the Elements:

    static ExecutorService executorService = Executors.newCachedThreadPool();

    public static void getLinks(String url, Set<String> urls, String word, int n) {
        if (url.contains(word)) {
            System.out.println("Found: " + url);
        }
        // urls is now accessed from several threads, so it must be a
        // concurrent set, e.g. ConcurrentHashMap.newKeySet()
        if (!urls.add(url)) {
            return;
        }
        if (n < 3) {
            try {
                Document doc = Jsoup.connect(url).get();
                for (Element element : doc.select("a[href]")) {
                    executorService.submit(() -> getLinks(element.absUrl("href"), urls, word, n + 1));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Note that main must also shut the pool down and await termination before reading the results, otherwise the program may finish while workers are still crawling.
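
Since the question also mentions fork/join: the same recursion maps naturally onto a RecursiveAction, where each page forks one subtask per discovered link. This is a minimal sketch, again using a hypothetical in-memory link graph in place of Jsoup so it runs standalone; the names `ForkJoinCrawler` and `CrawlTask` are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinCrawler {

    // Hypothetical in-memory link graph standing in for Jsoup fetches.
    static final Map<String, List<String>> GRAPH = Map.of(
            "https://example.com", List.of("https://example.com/root", "https://example.com/a"),
            "https://example.com/a", List.of("https://example.com/root/b"));

    static class CrawlTask extends RecursiveAction {
        final String url;
        final int depth;
        final String word;
        final Set<String> visited; // must be a thread-safe set
        final Set<String> matches;

        CrawlTask(String url, int depth, String word, Set<String> visited, Set<String> matches) {
            this.url = url; this.depth = depth; this.word = word;
            this.visited = visited; this.matches = matches;
        }

        @Override
        protected void compute() {
            if (!visited.add(url)) return;           // already crawled
            if (url.contains(word)) matches.add(url);
            if (depth >= 3) return;                  // same depth limit as the question
            List<CrawlTask> subtasks = new ArrayList<>();
            for (String link : GRAPH.getOrDefault(url, List.of())) {
                subtasks.add(new CrawlTask(link, depth + 1, word, visited, matches));
            }
            invokeAll(subtasks);                     // fork all children and wait for them
        }
    }

    public static Set<String> crawl(String start, String word) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Set<String> matches = ConcurrentHashMap.newKeySet();
        new ForkJoinPool().invoke(new CrawlTask(start, 0, word, visited, matches));
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(crawl("https://example.com", "root"));
    }
}
```

A nice property of fork/join here is that `invoke` blocks until the whole crawl tree is done, so there is no separate "how do I know when it's finished" problem.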