I'm trying to create a web crawler. I've created a class that tracks the URLs already visited and the URLs still to visit. Multiple threads access this class to retrieve and update those lists. The problem I'm facing, or at least I think so, is in nextRandom() and probably also in next(). I think the threads are interfering with each other because the methods are partly synchronized but not atomic as a whole. Is there a way to make sure a block of code is executed without interruption by other threads?
Here is the code:
The URL handler
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

public class UrlHandler {
    private volatile Set<String> visited = new HashSet<String>();
    private volatile List<String> toVisit = new ArrayList<String>();

    public void addToVisit(String url) {
        synchronized (this){
            if (!visited.contains(url)) toVisit.add(url);
        }
    }

    public void addToVisit(Collection<String> urls) {
        synchronized (this){
            for (String url : urls)
                if (!visited.contains(url)) toVisit.add(url);
        }
    }

    public void addVisited(String url){
        synchronized (this){
            visited.add(url);
        }
    }

    public void addVisited(Collection<String> urls){
        synchronized (this){
            visited.addAll(urls);
        }
    }

    public String next() {
        while (toVisit.size() == 0) {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        synchronized (this){
            String url = toVisit.get(0);
            toVisit.remove(0);
            return url;
        }
    }

    public String nextRandom() {
        synchronized (this){
            int n = 0;
            if (toVisit.size() > 1){
                n = ThreadLocalRandom.current().nextInt(toVisit.size());
            }
            String url = toVisit.get(n);
            toVisit.remove(n);
            return url;
        }
    }

    public List<String> getToVisit() {
        synchronized (this){
            return toVisit;
        }
    }

    public Set<String> getVisited() {
        synchronized (this){
            return visited;
        }
    }
}
Web Crawler
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WebCrawler {
    private final ExecutorService executor;

    public WebCrawler(int nOfThreads) {
        this.executor = Executors.newFixedThreadPool(nOfThreads);
    }

    public void add(Runnable runnable) {
        this.executor.execute(runnable);
    }

    // Shuts down safely, waiting up to 5 seconds for unfinished tasks
    public void shutdown() {
        this.executor.shutdown();
        try {
            this.executor.awaitTermination(5, TimeUnit.SECONDS);
            if (!this.executor.isTerminated()) {
                System.err.println("Timed out waiting for executor to terminate cleanly. Shutting down.");
                this.executor.shutdownNow();
            }
        } catch (final InterruptedException e) {
            System.err.println("Interrupted while waiting for executor shutdown.");
            Thread.currentThread().interrupt();
        }
    }
}
Failing test example
import org.junit.jupiter.api.Test;
import java.util.ArrayList;
import java.util.List;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class UrlHandlerTest {
    List<String> testList = new ArrayList<>(List.of("test1", "test2", "test3", "test3"));
    List<String> uniqueTestList = new ArrayList<>(List.of("test1", "test2", "test3"));
    UrlHandler urlHandler = new UrlHandler();

    @Test
    public void concurrentAccess(){
        urlHandler.addToVisit(testList);
        WebCrawler webCrawler = new WebCrawler(10);
        for (int i = 0; i < urlHandler.getToVisit().size(); i++) {
            webCrawler.add(new Runnable() {
                @Override
                public void run() {
                    String url = urlHandler.nextRandom();
                    urlHandler.addVisited(url);
                    System.out.println("Here thread " + Thread.currentThread().getId() + " working on: " + url);
                }
            });
        }
        webCrawler.shutdown();
        System.out.println(urlHandler.getVisited());
        assertEquals(true, urlHandler.getVisited().containsAll(uniqueTestList));
    }
}
CodePudding user response:
In the next method this code is a problem:
    while (toVisit.size() == 0) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
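The lock isn't held during this loop, so the size check can see a stale value, and the list can change between the check and the synchronized block that follows. Instead of this, try something like
    while (toVisit.size() == 0)
        wait();
Do this inside a synchronized block (or a synchronized method) so the lock is held while checking the collection size. wait() releases the lock while the thread is blocked and reacquires it before returning. Code that adds to the collection should call notify() or notifyAll() to wake up the waiting threads.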
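To make the wait/notify suggestion concrete, here is a minimal sketch. BlockingUrlHandler, pending() and the demo class are hypothetical names, and the class is a cut-down stand-in for the UrlHandler in the question rather than a drop-in replacement. It assumes next() is allowed to throw InterruptedException instead of swallowing it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, reduced variant of UrlHandler: every access to the two
// collections happens while holding the monitor of "this".
class BlockingUrlHandler {
    private final Set<String> visited = new HashSet<>();
    private final List<String> toVisit = new ArrayList<>();

    public synchronized void addToVisit(String url) {
        if (!visited.contains(url)) {
            toVisit.add(url);
            notifyAll(); // wake up any thread blocked in next()
        }
    }

    public synchronized void addVisited(String url) {
        visited.add(url);
    }

    // Blocks until a URL is available. The emptiness check and the removal
    // happen under the same lock, so no other thread can slip in between.
    public synchronized String next() throws InterruptedException {
        while (toVisit.isEmpty()) {
            wait(); // releases the lock while waiting, reacquires it on wake-up
        }
        return toVisit.remove(0);
    }

    // Helper added for this sketch only: number of URLs still queued.
    public synchronized int pending() {
        return toVisit.size();
    }
}

class BlockingUrlHandlerDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingUrlHandler handler = new BlockingUrlHandler();
        Thread consumer = new Thread(() -> {
            try {
                System.out.println("Got: " + handler.next());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();            // may block in next() if the list is still empty
        handler.addToVisit("test1"); // notifyAll() releases the consumer
        consumer.join();
    }
}
```

The while loop around wait() matters: a woken thread has to re-check the condition, because another consumer may already have taken the URL, and wait() is allowed to return spuriously.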
CodePudding user response:
This piece of code is problematic:
    for (int i = 0; i < urlHandler.getToVisit().size(); i++) {
        webCrawler.add(new Runnable() {
            // ...
        });
    }
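The value of urlHandler.getToVisit().size() keeps changing while the loop runs, because the worker threads remove URLs asynchronously while the loop is still submitting tasks. The number of iterations is therefore unpredictable and can end up smaller than the number of URLs in the list.
Read the size once before the loop: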
    int size = urlHandler.getToVisit().size();
    for (int i = 0; i < size; i++) {
        webCrawler.add(new Runnable() {
            // ...
        });
    }
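As a rough, self-contained illustration of why the snapshot helps, the sketch below reads the size once and then submits exactly that many tasks, even though the workers shrink the list while the loop is still running. SnapshotLoopSketch and drain are hypothetical names, and a plain ArrayList guarded by synchronized stands in for the UrlHandler from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SnapshotLoopSketch {
    // Submits one task per URL that was queued when the loop started and
    // returns how many tasks actually ran.
    static int drain(List<String> urls) throws InterruptedException {
        List<String> toVisit = new ArrayList<>(urls); // guarded by synchronized (toVisit)
        AtomicInteger processed = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(10);

        // Snapshot taken before any worker exists, so the loop bound is fixed.
        int size = toVisit.size();
        for (int i = 0; i < size; i++) {
            executor.execute(() -> {
                synchronized (toVisit) {
                    toVisit.remove(0); // workers shrink the list concurrently
                }
                processed.incrementAndGet();
            });
        }

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Same four entries as testList in the question ("test3" appears twice).
        System.out.println(drain(List.of("test1", "test2", "test3", "test3")));
    }
}
```

With the original loop condition, the bound would be re-read on every iteration and could drop below the loop counter before all four tasks were submitted.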