I'm trying to replace a big text file that we keep in memory with an SQLite DB.
We have a large file of common passwords that we reject when a user creates a new password. It's about 4M lines and 41MB on disk (uncompressed; 11MB compressed). We historically loaded it at startup into a Java Set and queried it. The issue is that it takes a whopping 400MB of RAM once loaded.
So I figured I would put these words in an SQLite DB and query it at runtime. To my surprise, the database is humongous, even after a VACUUM:
-rw-r--r-- 1 guillaume admin 137M 11 mar 18:47 commonPasswords.db
I'm using a very simple schema:
CREATE TABLE common_passwords (
    id INTEGER PRIMARY KEY,
    password VARCHAR(50) NOT NULL UNIQUE
);
I figured that SQLite would compress that textual list to somewhere between 40MB and the gzipped size (11MB). Gzipping the DB shrinks it to 63MB, but at runtime I'd still have to unzip it (and so hold it in memory), so that's useless.
Do you have any idea how to store that list more efficiently so that it stays queryable from Java? Or how to tune SQLite to optimize for size? I'm not interested in performance, since this DB is only queried a few times per hour.
CodePudding user response:
If performance is really no problem, what about a sorted array that you run a binary search on? This takes memory space closest to the raw size of the data, and once sorted, access to the entries of the array is relatively fast.
final String[] passwords = … // Array is already sorted!
final var index = Arrays.binarySearch( passwords, password );
if( index >= 0 ) // a non-negative index means the password was found in the list
{
    out.printf( "Password '%1$s' is discouraged!%n", password );
}
else
{
    out.printf( "Password '%1$s' seems to be ok!%n", password );
}
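Filling in the elided loading step, a self-contained sketch might look like this (the file name is a placeholder for the real 4M-line list):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class PasswordCheck
{
    // Sort once at load time; Arrays.binarySearch requires sorted input.
    static String[] load(List<String> lines)
    {
        String[] passwords = lines.toArray(new String[0]);
        Arrays.sort(passwords);
        return passwords;
    }

    // A non-negative index means the candidate is on the blacklist.
    static boolean isCommon(String[] sortedPasswords, String candidate)
    {
        return Arrays.binarySearch(sortedPasswords, candidate) >= 0;
    }

    public static void main(String[] args) throws IOException
    {
        // "commonPasswords.txt" is a placeholder path.
        String[] passwords = load(Files.readAllLines(Path.of("commonPasswords.txt")));
        System.out.println(isCommon(passwords, "123456"));
    }
}
```

The array costs roughly the raw size of the strings plus one reference per entry, far below the 400MB a HashSet needs.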
Admittedly, updating the password list would be a nightmare, but as I understood it, that is not your issue here.
And of course, this is not the answer to your question on how to optimise your SQLite database.
CodePudding user response:
Don't use SQLite or any database for your purpose. Creating a table with just 2 columns for the sake of compressing a long list of data defeats the purpose of the Structure in SQL, and you also have to deal with the extra communication overhead.
High Level Overview
In your case, I am assuming you have an unencrypted text file of 4 million lines where each password is stored on a new line.
Three simple steps for processing any large amount of sequential data:
Pagify: Divide your file into chunks, or pages. Out of the 4 million lines, process every 100, 1000, or N lines and write them as separate chunks inside your file.
Process: Take the chunks above and compress/deflate them.
The Java Deflater class has a setDictionary method where you pass the most frequently occurring byte sequence inside your target chunk, which improves overall compression quality.
Finding what value to pass to this parameter is a little complicated, as we will see in the coming sections.
Lazy Read: The problem with your previous approach is that you treated your text file as a single unit which you loaded fully into memory. Now that your file is in chunks, we read, check, and free one chunk at a time. There are 2 advantages to this approach:
After reading a chunk, you can free it from memory if it does not match your criteria.
If a chunk contains the password you are looking for, you can stop processing right there and don't have to process further chunks, saving computation time. Chunks towards the end of the file can be skipped if what you are looking for is near the beginning of the file.
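To make the chunk layout concrete before diving into the full code: each chunk is written as a dictionary length, the dictionary bytes, a payload length, and the payload bytes, with -1 marking the end of the stream. Here is a minimal sketch of just that framing (compression omitted so the record structure stays visible):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkFraming
{
    // Write [dictLen][dict][dataLen][data] per chunk, then -1 as the end marker.
    static byte[] write(List<byte[]> chunks) throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        for (byte[] chunk : chunks)
        {
            dos.writeInt(0);             // no dictionary in this sketch
            dos.writeInt(chunk.length);  // payload length prefix
            dos.write(chunk);
        }
        dos.writeInt(-1);                // lets the reader detect the end
        return baos.toByteArray();
    }

    // Read chunks back one at a time; a real reader could stop at the first hit.
    static List<byte[]> read(byte[] file) throws IOException
    {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(file));
        List<byte[]> chunks = new ArrayList<>();
        int dictLen;
        while ((dictLen = dis.readInt()) != -1)
        {
            dis.readFully(new byte[dictLen]);       // skip the (empty) dictionary
            byte[] data = new byte[dis.readInt()];
            dis.readFully(data);
            chunks.add(data);
        }
        return chunks;
    }
}
```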
Test Case
I used this code to generate your 4 million lines of text for testing purposes:
public static void main(String[] args) throws Exception
{
    try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
    {
        //for random password generation
        Random random=new Random();
        int ch;
        int passlength;
        String password;
        //4 million lines [comes up to roughly 40MB]
        for(int i=0;i<4*1000*1000;i++)
        {
            password="";
            //generate a password whose length is between 5 and 15 characters
            passlength=random.nextInt(5,15);
            for(int j=0;j<passlength;j++)
            {
                //65..122 spans the alphabetic characters in US ASCII
                ch=random.nextInt(65,122);
                //exclude the non-letter characters between 'Z' and 'a', like [ and so on
                if(ch>=91 && ch<=96){continue;}
                password+=(char)(ch);
            }
            //write the password to the file
            writer.println(password);
        }
    }
}
So we have our test file; now let's get started.
Finding the longest and most frequently occurring string in your list of passwords [Optional operation]
As mentioned, setDictionary accepts the bytes described above for proper compression. In most cases you don't need to use this, as the compressor is usually smart enough to figure it out on its own, but if you pass it the exact pattern occurring in your list of passwords, compression speed and ratio improve. Now comes the complicated part:
Figuring out the sequence. Let's say we have an example string:
ABC/123/ABC/456/ABC/789
The most frequently occurring string is /ABC/ (including the slashes). How we find this string is as follows:
Find the length of the longest password in your list, which is also the longest word to check for occurrences in the list [call its length N].
Create a for loop with n going from 2 -> N, where n is the length of the substring from i -> i+n which we generate and then check for occurrences in the password list.
Remember: the longer the string we find, the better for compression. For example, say we have 2 passwords:
ABC
ABCD
Longest password = length 4 [2nd line]
for i = 0 -> length of full text
    for n = 2 -> N
        generate the substring from i -> i+n
        check for occurrences
For each of the substrings generated, we find the number of occurrences in the password list and store the one with the highest count.
But if we encounter a combination longer than the previous one, we override the previous longest word irrespective of its occurrences, as longer matches are always favoured in compression.
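In isolation, the occurrence-counting step (the indexOf loop used below) can be sketched as:

```java
public class OccurrenceCount
{
    // Count non-overlapping occurrences of needle in haystack, advancing past
    // each match, mirroring the do-while loop in the Dictionary class below.
    static int count(String haystack, String needle)
    {
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1)
        {
            from += needle.length();
            count++;
        }
        return count;
    }
}
```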
The Dictionary class does this
import java.util.List;
import java.util.stream.Collectors;
final class Dictionary
{
    private static final class Info
    {
        private final int longestLine;
        private int highestOccurence;
        private String longestWord="";
        public Info(int longestLine) {this.longestLine = longestLine;}
    }
    private static Info getInfo(List<String> lines)
    {
        int currentLine;
        int longestLine=0;
        for(String line:lines)
        {
            //find the longest line in the password list
            if((currentLine=line.length())>longestLine)
            {
                longestLine=currentLine;
            }
        }
        return new Info(longestLine);
    }
    static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
    {
        //the user doesn't want a dictionary, hence return an empty string
        if(minLength<0){return "";}
        Info info=getInfo(lines);
        minLength=Math.max(2,minLength);
        maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
        String fullText=lines.stream().collect(Collectors.joining("\n"));
        int startIndex;
        int count;
        int length=fullText.length();
        outer: for(int i=0;i<length;i++)
        {
            for(int n=minLength;n<maxLength;n++)
            {
                //if the substring would run past the end of the text, stop
                if((i+n+1)>length){break outer;}
                //skip combinations shorter than the current longestWord, as we don't care about shorter words
                if(n+1<info.longestWord.length()){continue;}
                //get the combination
                String combination=fullText.substring(i,i+n+1);
                //count non-overlapping occurrences, searching from the start of the text
                count=0;
                startIndex=0;
                do
                {
                    startIndex=fullText.indexOf(combination,startIndex);
                    if(startIndex!=-1)
                    {
                        startIndex+=combination.length();
                        count++;
                    }
                }
                while(startIndex!=-1);
                //we only care about combinations that occur more than once
                if(count>1)
                {
                    if(combination.length()>info.longestWord.length()) //a longer combination always takes precedence
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                    else if(count>info.highestOccurence) //equal length: keep the one with more occurrences
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                }
            }
        }
        return info.longestWord;
    }
}
Here we look for a frequently occurring word of minimum length 2 and maximum length 4:
public static void main(String[] args)
{
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"Pass A"
,"Pass B"
,"Pass C"
,"Pass D"
)
,2,4
)
);
}
We get
Dictionary=Pass
Difference in compression?
With all that complex logic, what compression do we get? Let's see:
public static void main(String[] args)throws Exception
{
String text="Pass A\nPass B\n Pass C\nPass D";
Deflater def=new Deflater(9);
def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
byte[] original=text.getBytes();
try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
{
try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
{
dos.write(original);
dos.flush();
}
System.out.println(original.length+"/"+baos.toByteArray().length);
}
}
Output
28/24
So, using the dictionary, the ratio of uncompressed to compressed data is 28/24.
If you comment out the line
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
We get
28/26
Not so great, but we saved 2 whole bytes by using the dictionary. Of course, this is a small test case, but if you sort your passwords so that very similar passwords are close to each other, the savings become much greater.
Of course, the downside is that finding the dictionary takes a LONG time, depending on your passwords. If you are processing this list only once, then maybe you are OK with waiting 5 to 10 minutes to process the whole 4 million lines; otherwise you can skip this part.
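As a side note on the sorting remark above: a quick pre-sort pass over the password file (paths are placeholders) would group similar passwords into the same chunks before compressing, giving Deflate longer back-references to exploit:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortPasswords
{
    // Sorting puts passwords with shared prefixes next to each other,
    // so each chunk holds highly similar lines that compress better.
    static List<String> sorted(List<String> lines)
    {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) throws IOException
    {
        // Placeholder paths; point these at the real password files.
        List<String> lines = Files.readAllLines(Path.of("Passwords.txt"));
        Files.write(Path.of("PasswordsSorted.txt"), sorted(lines));
    }
}
```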
Pagifying your data [Main Core]
This is the most important part of the operation, where all your savings come into play.
It's pretty simple, as the code explains it all:
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Compressor
{
    public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
    {
        try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
            FileOutputStream fos=new FileOutputStream(output))
        {
            ArrayList<String> lines=new ArrayList<>();
            Deflater compressor=new Deflater(9);
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(fos);
            int linesDone=0;
            String line;
            while(true)
            {
                line=reader.readLine();
                //add only if more input is available
                if(line!=null){lines.add(line);}
                //parse if we have linesPerChunk lines or if this was the last input
                if(line==null || lines.size()>=linesPerChunk)
                {
                    //occurs if this was the last input and there are no leftover lines to parse
                    if(lines.isEmpty()){break;}
                    byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
                    //compress the chunk
                    baos.reset();
                    compressor.reset();
                    if(dictionary.length!=0){compressor.setDictionary(dictionary);}
                    try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compressor))
                    {
                        deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
                        deos.flush();
                    }
                    byte[] compressed=baos.toByteArray();
                    //write the chunk
                    dos.writeInt(dictionary.length);
                    dos.write(dictionary);
                    dos.writeInt(compressed.length);
                    dos.write(compressed);
                    dos.flush();
                    System.out.println("Compressed Lines="+(linesDone+=lines.size()));
                    //clear to read the next chunk
                    lines.clear();
                }
                if(line==null){break;}
            }
            //used while reading to detect the end point
            dos.writeInt(-1);
        }
    }
}
Basically, we divide your file into pages of N lines each, compress each page, and write it as a chunk into a new file. This method also accepts the min and max word lengths to use for the dictionary. You can pass -1 for either value if you wish to skip the dictionary. The benefit of the dictionary depends largely on your password data and is your preference.
Lazy Reading [Final]
Finally, we read each chunk; if the chunk contains the password we are looking for, we return from the method right there and don't process further chunks:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
public class Validater
{
    public static boolean isAllowed(String password,File compressed)throws Exception
    {
        Inflater inf=new Inflater();
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed)))
        {
            int size;
            while((size=dis.readInt())!=-1)
            {
                byte[] dictionary=new byte[size];
                dis.readFully(dictionary);
                byte[] data=new byte[dis.readInt()];
                dis.readFully(data);
                inf.reset();
                try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
                {
                    baos.reset();
                    int length;
                    byte[] output=new byte[1000];
                    while(true)
                    {
                        length=iis.read(output);
                        //a preset dictionary must be supplied before inflation can proceed
                        if(inf.needsDictionary()){inf.setDictionary(dictionary);}
                        else
                        {
                            if(length==-1){break;}
                            baos.write(output,0,length);
                        }
                    }
                }
                //the actual validation (note: contains() also matches substrings of longer passwords)
                if(new String(baos.toByteArray()).contains(password)){return false;}
            }
        }
        return true;
    }
}
Performance
public static void main(String[] args)throws Exception
{
//beginning of the file
long start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//middle of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//end of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
}
The above test case uses some random passwords I copied from different parts of the generated password file.
Before executing, you must compress your file using:
public static void main4(String[] args)throws Exception
{
//-1 for minWord skips the dictionary, but you can specify whatever value; each chunk is 100 lines long
Compressor.compress(new File("E:/Passwords.txt"),100,-1,4,new File("E:/Compressed.bin"));
}
Here are the results
false
5
false
713
false
1554
As you can see, the worst case took just 1554 milliseconds over 4 million lines of passwords, and memory usage didn't even exceed 1MB on my potato system.
The Full Code
Dictionary.java
import java.util.List;
import java.util.stream.Collectors;
final class Dictionary
{
    private static final class Info
    {
        private final int longestLine;
        private int highestOccurence;
        private String longestWord="";
        public Info(int longestLine) {this.longestLine = longestLine;}
    }
    private static Info getInfo(List<String> lines)
    {
        int currentLine;
        int longestLine=0;
        for(String line:lines)
        {
            //find the longest line in the password list
            if((currentLine=line.length())>longestLine)
            {
                longestLine=currentLine;
            }
        }
        return new Info(longestLine);
    }
    static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
    {
        //the user doesn't want a dictionary, hence return an empty string
        if(minLength<0){return "";}
        Info info=getInfo(lines);
        minLength=Math.max(2,minLength);
        maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
        String fullText=lines.stream().collect(Collectors.joining("\n"));
        int startIndex;
        int count;
        int length=fullText.length();
        outer: for(int i=0;i<length;i++)
        {
            for(int n=minLength;n<maxLength;n++)
            {
                //if the substring would run past the end of the text, stop
                if((i+n+1)>length){break outer;}
                //skip combinations shorter than the current longestWord, as we don't care about shorter words
                if(n+1<info.longestWord.length()){continue;}
                //get the combination
                String combination=fullText.substring(i,i+n+1);
                //count non-overlapping occurrences, searching from the start of the text
                count=0;
                startIndex=0;
                do
                {
                    startIndex=fullText.indexOf(combination,startIndex);
                    if(startIndex!=-1)
                    {
                        startIndex+=combination.length();
                        count++;
                    }
                }
                while(startIndex!=-1);
                //we only care about combinations that occur more than once
                if(count>1)
                {
                    if(combination.length()>info.longestWord.length()) //a longer combination always takes precedence
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                    else if(count>info.highestOccurence) //equal length: keep the one with more occurrences
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                }
            }
        }
        return info.longestWord;
    }
}
Compressor.java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Compressor
{
    public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
    {
        try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
            FileOutputStream fos=new FileOutputStream(output))
        {
            ArrayList<String> lines=new ArrayList<>();
            Deflater compressor=new Deflater(9);
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(fos);
            int linesDone=0;
            String line;
            while(true)
            {
                line=reader.readLine();
                //add only if more input is available
                if(line!=null){lines.add(line);}
                //parse if we have linesPerChunk lines or if this was the last input
                if(line==null || lines.size()>=linesPerChunk)
                {
                    //occurs if this was the last input and there are no leftover lines to parse
                    if(lines.isEmpty()){break;}
                    byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
                    //compress the chunk
                    baos.reset();
                    compressor.reset();
                    if(dictionary.length!=0){compressor.setDictionary(dictionary);}
                    try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compressor))
                    {
                        deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
                        deos.flush();
                    }
                    byte[] compressed=baos.toByteArray();
                    //write the chunk
                    dos.writeInt(dictionary.length);
                    dos.write(dictionary);
                    dos.writeInt(compressed.length);
                    dos.write(compressed);
                    dos.flush();
                    System.out.println("Compressed Lines="+(linesDone+=lines.size()));
                    //clear to read the next chunk
                    lines.clear();
                }
                if(line==null){break;}
            }
            //used while reading to detect the end point
            dos.writeInt(-1);
        }
    }
}
Validater.java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
public class Validater
{
    public static boolean isAllowed(String password,File compressed)throws Exception
    {
        Inflater inf=new Inflater();
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed)))
        {
            int size;
            while((size=dis.readInt())!=-1)
            {
                byte[] dictionary=new byte[size];
                dis.readFully(dictionary);
                byte[] data=new byte[dis.readInt()];
                dis.readFully(data);
                inf.reset();
                try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
                {
                    baos.reset();
                    int length;
                    byte[] output=new byte[1000];
                    while(true)
                    {
                        length=iis.read(output);
                        //a preset dictionary must be supplied before inflation can proceed
                        if(inf.needsDictionary()){inf.setDictionary(dictionary);}
                        else
                        {
                            if(length==-1){break;}
                            baos.write(output,0,length);
                        }
                    }
                }
                //the actual validation (note: contains() also matches substrings of longer passwords)
                if(new String(baos.toByteArray()).contains(password)){return false;}
            }
        }
        return true;
    }
}
All test cases; rename any method to main and test:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.PrintWriter;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Test
{
public static void main1(String[] args)throws Exception
{
String text="Pass A\nPass B\n Pass C\nPass D";
Deflater def=new Deflater(9);
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
byte[] original=text.getBytes();
try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
{
try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
{
dos.write(original);
dos.flush();
}
System.out.println(original.length+"/"+baos.toByteArray().length);
}
}
public static void main2(String[] args) throws Exception
{
try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
{
Random random=new Random();
int
ch,
passlength;
String password;
for(int i=0;i<4*1000*1000;i++)
{
password="";
passlength=random.nextInt(5,15);
for(int j=0;j<passlength;j++)
{
ch=random.nextInt(65,122);
if(ch>=91 && ch<=96){continue;}
password+=(char)(ch);
}
if(password.isEmpty()){continue;}
writer.println(password);
}
}
}
public static void main(String[] args)
{
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"ABC"
,"ABCD"
)
,2,3
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"Pass A1"
,"Pass B2"
,"Pass C3"
,"Pass D4"
)
,2,4
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"123/ABC/345/ABC/678/ABC/908"
)
,2,5
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"HHH123HHH456HHH789HHH092"
)
,2,4
)
);
}
public static void main4(String[] args)throws Exception
{
Compressor.compress(new File("E:/Passwords2.txt"),2,2,4,new File("E:/Compressed2.bin"));
}
public static void main5(String[] args)throws Exception
{
//beginning of the file
long start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//middle of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//end of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
}
}
Conclusion
Well, this was very insightful. All the classes used here are present in Java 7, except streams, which are Java 8 (which most Android versions support), so you should be OK on that end. If you can't be bothered to read my poorly documented explanations, you can go right ahead and copy the full code after the Full Code section. This took pretty much the whole day. One downside: while your original compressed file was 11MB, the final compressed file on my system came to 26.9MB for my randomly generated passwords, but as you have seen from the outputs, memory use and speed are overall very good. You can play around with the settings to get your preferred sizes. Also, if you have a better approach for finding the dictionary, please do edit this answer. Don't bother using SQLite.
Have a good day now :)
EDIT: The Dictionary class is now much faster and improved. Hope it helps.