Given a theoretical system in which files are downloaded from the Web if not found in the local file system, and assuming:
- The download mechanism and the retrieval/placement of files from/in the cache (the local file system) are already taken care of.
- Single-threaded operation and a single request per URL.
I wrote a method that uses getFileFromLocalFS() and getFileFromWeb() to implement a simplified caching logic:
public InputStream getFile(String url) { // #1
    InputStream retStream = getFileFromLocalFS(url);
    if (retStream != null) {
        return retStream;
    }
    else {
        retStream = getFileFromLocalFS(url);
        if (retStream == null) {
            return getFileFromWeb(url);
        }
    }
    return retStream;
}
This schematic solution then needed to be improved to accommodate concurrent requests for the same URL, while limiting the actual "from Web" fetch to a single download (i.e. all other requests will get the file from the local file system). So I synchronized the entire method:
public synchronized InputStream getFile(String url) { // #2
    InputStream retStream = getFileFromLocalFS(url);
    if (retStream != null) {
        return retStream;
    }
    else {
        retStream = getFileFromLocalFS(url);
        if (retStream == null) {
            return getFileFromWeb(url);
        }
    }
    return retStream;
}
This essentially meets the requirement but has a performance issue: it prevents the entire method from being run by another thread until it finishes. That is, even if the file can be fetched from the local FS, getFileFromLocalFS(url) cannot be reached while the method is being run by another thread.
A performance improvement suggested by my interviewer was to synchronize the getFileFromLocalFS(url) block:
public synchronized InputStream getFile(String url) { // #3
    InputStream retStream = getFileFromLocalFS(url);
    if (retStream != null) {
        return retStream;
    }
    else {
        synchronized (this) {
            retStream = getFileFromLocalFS(url);
            if (retStream == null) {
                return getFileFromWeb(url);
            }
        }
    }
    return retStream;
}
I said "fine, but for that optimization to work, the method synchronization needs to be removed", i.e.:
public InputStream getFile(String url) { // #4
    InputStream retStream = getFileFromLocalFS(url);
    if (retStream != null) {
        return retStream;
    }
    else {
        synchronized (this) {
            retStream = getFileFromLocalFS(url);
            if (retStream == null) {
                return getFileFromWeb(url);
            }
        }
    }
    return retStream;
}
The interviewer disagreed and insisted on keeping both synchronized keywords in place.
Which one would perform better in a concurrent environment, #3 or #4? And why?
CodePudding user response:
There seems to be some Cargo Cult Programming going on. The third variant resembles Double-Checked Locking but misses the point. Even more striking, the first variant also resembles Double-Checked Locking, calling getFileFromLocalFS(url) twice without a reason. And that’s the starting point…
Your suspicion about the interviewer’s “improvement” is right: nested synchronization serves no purpose, and in the best case the JVM’s optimizer will be able to eliminate it.
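For reference, the actual double-checked locking shape in this setting would check once without the lock and re-check under the lock before doing the expensive work. This is essentially variant #4 restructured (a sketch only, reusing the question's methods; as argued below, it still doesn't address correctness):
public InputStream getFile(String url) {
    InputStream retStream = getFileFromLocalFS(url);  // first check, without the lock
    if (retStream == null) {
        synchronized (this) {
            retStream = getFileFromLocalFS(url);      // re-check under the lock
            if (retStream == null) {
                retStream = getFileFromWeb(url);      // expensive work; at most one thread per miss gets here
            }
        }
    }
    return retStream;
}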
However, neither solution is recommended. There’s a much deeper problem here. First, we should focus on correctness, before discussing performance.
Apparently, getFileFromLocalFS(url) is supposed to find a local copy if it exists, but not to create it. This would be pointless if no one ever created such copies, and the whole question about overlapping caching attempts would be moot if this method didn’t perform the caching attempt. So the only other method involved, getFileFromWeb(url), must be the one that creates the local copy.
This implies that there is a hidden protocol between these two methods: getFileFromLocalFS(url) somehow knows how to get the cached file created by getFileFromWeb(url).
This raises questions like: what happens if one thread is in the middle of getFileFromWeb(url), creating a cached version, and another thread calls getFileFromLocalFS(url) for the same URL? Does it return an input stream over the still-incomplete local file? There are two possibilities:
- This issue has been solved somehow. That would imply there is already some thread-safe construct behind the scenes preventing such a race condition; hence, it should be used by getFileFromWeb(url) to also solve the issue of multiple caching attempts in the first place.
- Or this issue wasn’t solved and the starting point was a broken method, so the first task should be to fix this instead of trying to improve the performance. If we really try to fix this outside the two methods, i.e. in getFile, only the second variant (#2) would be correct, as it calls both methods inside synchronized.
In either case, the getFile method is the wrong place to address the issue. The concurrency issues should be addressed by the hidden protocol that already exists between the two methods.
For example:
- If these two methods use a map from url to local file, they should use a concurrent map and its atomic update operations to solve the concurrency issues (see the first sketch after this list).
- If these two methods use a mapping scheme to convert the url to a local path where the copy is supposed to be, and the getFileFromLocalFS(url) method merely checks for the presence of the file, then the getFileFromWeb(url) method should create the file with the CREATE_NEW option to atomically check for presence and create the file if absent, together with file locking to prevent other threads from reading while the caching is still in progress (see the second sketch below).
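A minimal sketch of the first option, assuming a url-to-local-file map and that the web download can be wrapped in a helper (the class and helper names here are hypothetical). ConcurrentHashMap.computeIfAbsent is atomic per key, so the download for a given URL runs at most once, concurrent callers for that URL wait only for that one entry, and lookups for other URLs are not blocked at all:
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;

class CachingFetcher {
    // url -> path of the local copy; the atomic update operation replaces all explicit synchronization
    private final ConcurrentHashMap<String, Path> localCopies = new ConcurrentHashMap<>();

    public InputStream getFile(String url) throws IOException {
        Path local = localCopies.computeIfAbsent(url, this::downloadToLocalFS);
        return Files.newInputStream(local);
    }

    // hypothetical stand-in for getFileFromWeb(): downloads the resource,
    // stores it in the local file system and returns the path of the copy
    private Path downloadToLocalFS(String url) {
        try {
            Path target = Files.createTempFile("cache-", ".bin");
            try (InputStream in = java.net.URI.create(url).toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // computeIfAbsent's mapping function can't throw checked exceptions
        }
    }
}
Note that the mapping function runs while the entry is being installed, so a long download briefly blocks other updates falling into the same bin; the point of the sketch is only that the atomicity of computeIfAbsent is what prevents two threads from caching the same URL twice.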
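And a minimal sketch of the second option, limited to the atomic check-and-create part (localPathFor and the class name are hypothetical stand-ins for the mapping scheme). The file-locking part mentioned above is omitted here because java.nio file locks coordinate between processes rather than between threads of one JVM, so readers would still need some in-process signal, or an atomic rename of a finished temporary file, before streaming a copy that may still be in progress:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFileCache {
    // hypothetical mapping scheme: url -> local path where the copy is supposed to be
    private Path localPathFor(String url) throws IOException {
        Path dir = Files.createDirectories(Path.of("cache"));
        return dir.resolve(Integer.toHexString(url.hashCode()));
    }

    public InputStream getFile(String url) throws IOException {
        Path local = localPathFor(url);
        try (OutputStream out = Files.newOutputStream(local, StandardOpenOption.CREATE_NEW);
             InputStream web = java.net.URI.create(url).toURL().openStream()) {
            // CREATE_NEW succeeded: this caller atomically created the file and performs the download
            // (a failed download would leave a partial file behind; a real implementation would clean up)
            web.transferTo(out);
        } catch (FileAlreadyExistsException alreadyCached) {
            // another caller created the copy (or is still writing it; see the locking caveat above)
        }
        return Files.newInputStream(local);
    }
}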