Home > database >  Storing a very large set of numbers in Java
Storing a very large set of numbers in Java

Time:11-15

I am trying to store a set of numbers that range from 0 to ~60 billion, where the set starts out empty and gradually becomes denser until it contains every number in the range. The set does not have to be capable of removing numbers. Currently my approach is to represent the set as a very long boolean array and store that array in a text file. I have made a class for this, and have tested both RandomAccessFile and FileChannel with the range of the numbers restricted from 0 to 2 billion, but in both cases the class is much slower at adding and querying numbers than using a regular boolean array. Here is the current state of my class:

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.*;
public class FileSet {
    private static final int BLOCK=10_000_000;
    private final long U;
    private final String fname;
    private final FileChannel file;
    public FileSet(long u, String fn) throws IOException {
        U=u;
        fname=fn;
        BufferedOutputStream out=new BufferedOutputStream(new FileOutputStream(fname));
        long n=u/8 1;
        for (long rep=0; rep<n/BLOCK; rep  ) out.write(new byte[BLOCK]);
        out.write(new byte[(int)(n%BLOCK)]);
        out.close();
        file=new RandomAccessFile(fn,"rw").getChannel();
    }
    public void add(long v) throws IOException {
        if (v<0||v>=U) throw new RuntimeException(v " out of range [0," U ")");
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        file.position(v/8);
        file.write(ByteBuffer.wrap(new byte[] {(byte)(b.get(0)|(1<<(v%8)))}));
    }
    public boolean has(long v) throws IOException {
        if (v<0||v>=U) return false;
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        return ((b.get(0)>>(v%8))&1)!=0;
    }
    public static void main(String[] args) throws IOException {
        long U=2000_000_000;
        SplittableRandom rnd=new SplittableRandom(1);
        List<long[]> actions=new ArrayList<>();
        for (int i=0; i<1000000; i  ) actions.add(new long[] {rnd.nextInt(2),rnd.nextLong(U)});

        StringBuilder ret=new StringBuilder(); {
            System.out.println("boolean[]:");
            long st=System.currentTimeMillis();
            boolean[] b=new boolean[(int)U];
            System.out.println("init time=" (System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions)
                if (act[0]==0) b[(int)act[1]]=true;
                else ret.append(b[(int)act[1]]?"1":"0");
            System.out.println("query time=" (System.currentTimeMillis()-st));
        }

        StringBuilder ret2=new StringBuilder(); {
            System.out.println("FileSet:");
            long st=System.currentTimeMillis();
            FileSet fs=new FileSet(U,"FileSet/" U "div8.txt");
            System.out.println("init time=" (System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions) {
                if (act[0]==0) fs.add(act[1]);
                else ret2.append(fs.has(act[1])?"1":"0");
            }
            System.out.println("query time=" (System.currentTimeMillis()-st));
            fs.file.close();
        }
        if (!ret.toString().equals(ret2.toString())) System.out.println("MISMATCH");
    }
}

and the output:

boolean[]:
init time=1248
query time=148
FileSet:
init time=269
query time=3014

Additionally, when increasing the range from 2 billion to 10 billion, there is a large jump in total running time for the queries, even though in theory the total running time should stay roughly constant. When I use the class by itself (since a boolean array no longer works for this big of a range), the query time goes from ~3 seconds to ~50 seconds. When I increase the range to 60 billion, the time increases to ~240 seconds. My questions are: is there a faster way of accessing and modifying very large files at arbitrary indices? and is there an entirely different approach to storing large integer sets that is faster than my current approach?

CodePudding user response:

Boolean arrays are a very inefficient way to store information as each boolean takes up 8 bits. You should use a BitSet instead. But BitSets also have the 2 billion limit as it uses primitive int values as parameters (and Integer.MAX_VALUE limits the size of the internal long array).

A space efficient in-memory alternative that spans beyond 2 billion entries would be to create your own BitSet wrapper that splits the data into subsets and does the indexing for you:

public class LongBitSet {
    // TODO: Initialize array and add error checking.
    private final BitSet bitSets = new BitSet[64];
    public void set(long index) {
        bitSets[(int) (index / Integer.MAX_VALUE)]
            .set((int) (index % Integer.MAX_VALUE));
    }
}

But there are other alternatives too. If you have a very dense data, using run length encoding would be a cheap way to increase memory capacity. But that would likely involve a B-tree structure to make accessing it more efficient. These are just pointers. A lot of what makes the correct answer depend solely on how you actually use the data structure.

CodePudding user response:

Yeah, you can store a very large number of set data.

  • Related