Why is HashSet in Java taking so much memory?


I'm loading a 1GB ASCII text file with about 38 million rows into a HashSet. Using Java 11, the process takes about 8GB of memory.

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.stream.Stream;

HashSet<String> addresses = new HashSet<>(38741847);
try (Stream<String> lines = Files.lines(Paths.get("test.txt"), Charset.defaultCharset())) {
    lines.forEach(addresses::add);
}
System.out.println(addresses.size());
Thread.sleep(100000);

Why is Java taking so much memory?

In comparison, I've implemented the same thing in Python, which takes only 4GB of memory.

import time

s = set()
with open("test.txt") as file:
    for line in file:
        s.add(line)
print(len(s))
time.sleep(1000)

CodePudding user response:

A HashSet has a load factor, which defaults to 0.75. That means the backing table is resized (and every element rehashed) once the set is 75% full. If your hash set should hold 38741847 elements, you have to initialize it with a capacity of 38741847 / 0.75, or set a higher load factor:

new HashSet<>(38741847, 1.0f); // load factor 1.0 (100%)
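
Alternatively, here is a minimal sketch (assuming, as in the question, that the element count is known up front) that keeps the default load factor but rounds the initial capacity up so that capacity × loadFactor covers all elements, which avoids any rehash:

int expected = 38741847;
float loadFactor = 0.75f;
// HashMap resizes once size exceeds capacity * loadFactor, so round up.
int capacity = (int) Math.ceil(expected / (double) loadFactor);
HashSet<String> addresses = new HashSet<>(capacity, loadFactor);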

CodePudding user response:

Meanwhile I found the answer here, where I also discovered a few alternative HashSet implementations that are part of the trove4j, hppc and Guava libraries. I've tested them with the same code. Here are the results:

trove4j (5.5GB)

THashSet<String> s = new THashSet<>(38742847, 1.0f);

hppc (4.7GB)

ObjectHashSet<String> s2 = new ObjectHashSet<>(38742847, 0.99); // expected elements, load factor

Guava (5GB)

ImmutableSet<String> s2;
ImmutableSet.Builder<String> b = ImmutableSet.builder();
lines.forEach(b::add);
s2 = b.build();

I decided on Guava, because its builder doesn't need to know the exact number of elements to be inserted. So I don't have to count the lines of the file first.
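
For reference, a self-contained sketch of the Guava approach (the class name LoadSet is made up for illustration; the file name test.txt is taken from the question):

import com.google.common.collect.ImmutableSet;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LoadSet {
    public static void main(String[] args) throws IOException {
        // The builder grows as needed; no element count is required up front.
        ImmutableSet.Builder<String> builder = ImmutableSet.builder();
        try (Stream<String> lines = Files.lines(Paths.get("test.txt"))) {
            lines.forEach(builder::add);
        }
        ImmutableSet<String> addresses = builder.build();
        System.out.println(addresses.size());
    }
}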
