I'm trying to replace a big text file that we keep in memory with an SQLite DB.
We have a large file of common passwords that we reject when a user creates a new password. It's about 4M lines and 41MB on disk (uncompressed; 11MB compressed). We historically loaded it at startup into a Java Set and queried it. The issue is that it takes a whopping 400MB of RAM once loaded.
So I figured I would put these words in an SQLite DB and query it at runtime. To my surprise, the database is humongous, even after a VACUUM:
-rw-r--r-- 1 guillaume admin 137M 11 mar 18:47 commonPasswords.db
I'm using a very simple schema:
CREATE TABLE common_passwords (
    id INTEGER PRIMARY KEY,
    password VARCHAR(50) NOT NULL UNIQUE
);
I figured that SQLite would compress that textual list to somewhere between 40MB and the gzipped size (11MB). Gzipping the DB shrinks it to 63MB, but at runtime I'd still have to unzip it (and so hold it in memory), so that's useless.
Do you have any idea how to store that list more efficiently so that it stays queryable from Java? Or how to tune SQLite to optimize for size? I'm not interested in performance, since this DB is only queried a few times per hour.
CodePudding user response:
If performance is really no problem, what about a sorted array that you run a binary search on? This takes memory space closest to the raw size of the data, and once sorted, access to the entries of the array is relatively fast.
final String[] passwords = … // Array is already sorted!
final var index = Arrays.binarySearch( passwords, password );
if( index >= 0 ) // a non-negative index means the password was found in the list
{
    out.printf( "Password '%1$s' is discouraged!%n", password );
}
else
{
    out.printf( "Password '%1$s' seems to be ok!%n", password );
}
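Filling in the elided loading step, a self-contained sketch might look like this (the file name is a placeholder for the real 4M-line list):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class PasswordCheck
{
    // Sort once at load time; Arrays.binarySearch requires sorted input.
    static String[] load(List<String> lines)
    {
        String[] passwords = lines.toArray(new String[0]);
        Arrays.sort(passwords);
        return passwords;
    }

    // A non-negative index means the candidate is on the blacklist.
    static boolean isCommon(String[] sortedPasswords, String candidate)
    {
        return Arrays.binarySearch(sortedPasswords, candidate) >= 0;
    }

    public static void main(String[] args) throws IOException
    {
        // "commonPasswords.txt" is a placeholder path.
        String[] passwords = load(Files.readAllLines(Path.of("commonPasswords.txt")));
        System.out.println(isCommon(passwords, "123456"));
    }
}
```

The array costs roughly the raw size of the strings plus one reference per entry, far below the 400MB a HashSet needs.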
Admittedly, updating the password list would be a nightmare, but as I understood it, that is not your issue here.
And of course, this is not the answer to your question on how to optimise your SQLite database.
CodePudding user response:
Don't use SQLite or any database for your purpose. Creating a table with just 2 columns for the sake of compressing a long list of data defeats the purpose of the Structure in SQL, and you also have to deal with the extra communication overhead.
High Level Overview
In your case, I am assuming you have an unencrypted text file of 4 million lines where each password is stored on a new line.
Three simple steps for processing any large amount of sequential data:
Pagify: Divide your file into chunks, or pages. Out of the 4 million lines, process every 100, 1000, or N lines and write them as separate chunks inside your file.
Process: Take the chunks above and compress/deflate them.
The Java Deflater class has a setDictionary method where you pass the most frequently occurring byte sequence inside your target chunk, which improves overall compression quality.
Finding what value to pass to this parameter is a little complicated, as we will see in the coming sections.
Lazy Read: The problem with your previous approach is that you treated your text file as a single unit which you loaded fully into memory. Now that your file is in chunks, we read, check, and free one chunk at a time. There are 2 advantages to this approach:
After reading a chunk, you can free it from memory if it does not match your criteria.
If a chunk contains the password you are looking for, you can stop processing right there and don't have to process further chunks, saving computation time. Chunks towards the end of the file can be skipped if what you are looking for is near the beginning of the file.
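To make the chunk layout concrete before diving into the full code: each chunk is written as a dictionary length, the dictionary bytes, a payload length, and the payload bytes, with -1 marking the end of the stream. Here is a minimal sketch of just that framing (compression omitted so the record structure stays visible):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkFraming
{
    // Write [dictLen][dict][dataLen][data] per chunk, then -1 as the end marker.
    static byte[] write(List<byte[]> chunks) throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        for (byte[] chunk : chunks)
        {
            dos.writeInt(0);             // no dictionary in this sketch
            dos.writeInt(chunk.length);  // payload length prefix
            dos.write(chunk);
        }
        dos.writeInt(-1);                // lets the reader detect the end
        return baos.toByteArray();
    }

    // Read chunks back one at a time; a real reader could stop at the first hit.
    static List<byte[]> read(byte[] file) throws IOException
    {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(file));
        List<byte[]> chunks = new ArrayList<>();
        int dictLen;
        while ((dictLen = dis.readInt()) != -1)
        {
            dis.readFully(new byte[dictLen]);       // skip the (empty) dictionary
            byte[] data = new byte[dis.readInt()];
            dis.readFully(data);
            chunks.add(data);
        }
        return chunks;
    }
}
```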
Test Case
I used this code to generate your 4 million lines of text for testing purposes:
public static void main(String[] args) throws Exception
{
    try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
    {
        //for random password generation
        Random random=new Random();
        int ch;
        int passlength;
        String password;
        //4 million lines [comes up to roughly 40MB]
        for(int i=0;i<4*1000*1000;i++)
        {
            password="";
            //generate a password whose length is between 5 and 15 characters
            passlength=random.nextInt(5,15);
            for(int j=0;j<passlength;j++)
            {
                //65..122 spans the alphabetic characters in US ASCII
                ch=random.nextInt(65,122);
                //exclude the non-letter characters between 'Z' and 'a', like [ and so on
                if(ch>=91 && ch<=96){continue;}
                password+=(char)(ch);
            }
            //write the password to the file
            writer.println(password);
        }
    }
}
So we have our test file; now let's get started.
Finding the longest and most frequently occurring string in your list of passwords [Optional operation]
As mentioned, setDictionary accepts the bytes described above for proper compression. In most cases you don't need to use this, as the compressor is usually smart enough to figure it out on its own, but if you pass it the exact pattern occurring in your list of passwords, compression speed and ratio improve. Now comes the complicated part:
Figuring out the sequence. Let's say we have an example string:
ABC/123/ABC/456/ABC/789
The most frequently occurring string is /ABC/ (including the slashes). How we find this string is as follows:
Find the length of the longest password in your list, which is also the longest word to check for occurrences in the list [call its length N].
Create a for loop with n going from 2 -> N, where n is the length of the substring from i -> i+n which we generate and then check for occurrences in the password list.
Remember: the longer the string we find, the better for compression. For example, say we have 2 passwords:
ABC
ABCD
Longest password = length 4 [2nd line]
for i = 0 -> length of full text
    for n = 2 -> N
        generate the substring from i -> i+n
        check for occurrences
For each of the substrings generated, we find the number of occurrences in the password list and store the one with the highest count.
But if we encounter a combination longer than the previous one, we override the previous longest word irrespective of its occurrences, as longer matches are always favoured in compression.
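In isolation, the occurrence-counting step (the indexOf loop used below) can be sketched as:

```java
public class OccurrenceCount
{
    // Count non-overlapping occurrences of needle in haystack, advancing past
    // each match, mirroring the do-while loop in the Dictionary class below.
    static int count(String haystack, String needle)
    {
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1)
        {
            from += needle.length();
            count++;
        }
        return count;
    }
}
```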
The Dictionary class does this
import java.util.List;
import java.util.stream.Collectors;
final class Dictionary
{
    private static final class Info
    {
        private final int longestLine;
        private int highestOccurence;
        private String longestWord="";
        public Info(int longestLine) {this.longestLine = longestLine;}
    }
    private static Info getInfo(List<String> lines)
    {
        int currentLine;
        int longestLine=0;
        for(String line:lines)
        {
            //find the longest line in the password list
            if((currentLine=line.length())>longestLine)
            {
                longestLine=currentLine;
            }
        }
        return new Info(longestLine);
    }
    static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
    {
        //the user doesn't want a dictionary, hence return an empty string
        if(minLength<0){return "";}
        Info info=getInfo(lines);
        minLength=Math.max(2,minLength);
        maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
        String fullText=lines.stream().collect(Collectors.joining("\n"));
        int startIndex;
        int count;
        int length=fullText.length();
        outer: for(int i=0;i<length;i++)
        {
            for(int n=minLength;n<maxLength;n++)
            {
                //if the substring would run past the end of the text, stop
                if((i+n+1)>length){break outer;}
                //skip combinations shorter than the current longestWord, as we don't care about shorter words
                if(n+1<info.longestWord.length()){continue;}
                //get the combination
                String combination=fullText.substring(i,i+n+1);
                //count non-overlapping occurrences, searching from the start of the text
                count=0;
                startIndex=0;
                do
                {
                    startIndex=fullText.indexOf(combination,startIndex);
                    if(startIndex!=-1)
                    {
                        startIndex+=combination.length();
                        count++;
                    }
                }
                while(startIndex!=-1);
                //we only care about combinations that occur more than once
                if(count>1)
                {
                    if(combination.length()>info.longestWord.length()) //a longer combination always takes precedence
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                    else if(count>info.highestOccurence) //equal length: keep the one with more occurrences
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                }
            }
        }
        return info.longestWord;
    }
}
Here we look for a frequently occurring word of minimum length 2 and maximum length 4:
public static void main(String[] args)
{
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"Pass A"
,"Pass B"
,"Pass C"
,"Pass D"
)
,2,4
)
);
}
We get
Dictionary=Pass
Difference in compression?
With all that complex logic, what compression do we get? Let's see:
public static void main(String[] args)throws Exception
{
String text="Pass A\nPass B\n Pass C\nPass D";
Deflater def=new Deflater(9);
def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
byte[] original=text.getBytes();
try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
{
try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
{
dos.write(original);
dos.flush();
}
System.out.println(original.length+"/"+baos.toByteArray().length);
}
}
Output
28/24
So, using the dictionary, the ratio of uncompressed to compressed data is 28/24.
If you comment out the line
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
We get
28/26
Not so great, but we saved 2 whole bytes by using the dictionary. Of course, this is a small test case, but if you sort your passwords so that very similar passwords are close to each other, the savings become much greater.
Of course, the downside is that finding the dictionary takes a LONG time, depending on your passwords. If you are processing this list only once, then maybe you are OK with waiting 5 to 10 minutes to process the whole 4 million lines; otherwise you can skip this part.
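As a side note on the sorting remark above: a quick pre-sort pass over the password file (paths are placeholders) would group similar passwords into the same chunks before compressing, giving Deflate longer back-references to exploit:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortPasswords
{
    // Sorting puts passwords with shared prefixes next to each other,
    // so each chunk holds highly similar lines that compress better.
    static List<String> sorted(List<String> lines)
    {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) throws IOException
    {
        // Placeholder paths; point these at the real password files.
        List<String> lines = Files.readAllLines(Path.of("Passwords.txt"));
        Files.write(Path.of("PasswordsSorted.txt"), sorted(lines));
    }
}
```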
Pagifying your data [Main Core]
This is the most important part of the operation, where all your savings come into play.
It's pretty simple, as the code explains it all:
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Compressor
{
    public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
    {
        try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
            FileOutputStream fos=new FileOutputStream(output))
        {
            ArrayList<String> lines=new ArrayList<>();
            Deflater compressor=new Deflater(9);
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(fos);
            int linesDone=0;
            String line;
            while(true)
            {
                line=reader.readLine();
                //add only if more input is available
                if(line!=null){lines.add(line);}
                //parse if we have linesPerChunk lines or if this was the last input
                if(line==null || lines.size()>=linesPerChunk)
                {
                    //occurs if this was the last input and there are no leftover lines to parse
                    if(lines.isEmpty()){break;}
                    byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
                    //compress the chunk
                    baos.reset();
                    compressor.reset();
                    if(dictionary.length!=0){compressor.setDictionary(dictionary);}
                    try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compressor))
                    {
                        deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
                        deos.flush();
                    }
                    byte[] compressed=baos.toByteArray();
                    //write the chunk
                    dos.writeInt(dictionary.length);
                    dos.write(dictionary);
                    dos.writeInt(compressed.length);
                    dos.write(compressed);
                    dos.flush();
                    System.out.println("Compressed Lines="+(linesDone+=lines.size()));
                    //clear to read the next chunk
                    lines.clear();
                }
                if(line==null){break;}
            }
            //used while reading to detect the end point
            dos.writeInt(-1);
        }
    }
}
Basically, we divide your file into pages of N lines each, compress each page, and write it as a chunk into a new file. This method also accepts the min and max word lengths to use for the dictionary. You can pass -1 for either value if you wish to skip the dictionary. The benefit of the dictionary depends largely on your password data and is your preference.
Lazy Reading [Final]
Finally, we read each chunk; if the chunk contains the password we are looking for, we return from the method right there and don't process further chunks:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
public class Validater
{
    public static boolean isAllowed(String password,File compressed)throws Exception
    {
        Inflater inf=new Inflater();
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed)))
        {
            int size;
            while((size=dis.readInt())!=-1)
            {
                byte[] dictionary=new byte[size];
                dis.readFully(dictionary);
                byte[] data=new byte[dis.readInt()];
                dis.readFully(data);
                inf.reset();
                try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
                {
                    baos.reset();
                    int length;
                    byte[] output=new byte[1000];
                    while(true)
                    {
                        length=iis.read(output);
                        //a preset dictionary must be supplied before inflation can proceed
                        if(inf.needsDictionary()){inf.setDictionary(dictionary);}
                        else
                        {
                            if(length==-1){break;}
                            baos.write(output,0,length);
                        }
                    }
                }
                //the actual validation (note: contains() also matches substrings of longer passwords)
                if(new String(baos.toByteArray()).contains(password)){return false;}
            }
        }
        return true;
    }
}
Performance
public static void main(String[] args)throws Exception
{
//beginning of the file
long start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//middle of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//end of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
}
The above test case uses some random passwords I copied from different parts of the generated password file.
Before executing, you must compress your file using:
public static void main4(String[] args)throws Exception
{
//-1 for minWord skips the dictionary, but you can specify whatever value; each chunk is 100 lines long
Compressor.compress(new File("E:/Passwords.txt"),100,-1,4,new File("E:/Compressed.bin"));
}
Here are the results
false
5
false
713
false
1554
As you can see, the worst case took just 1554 milliseconds over 4 million lines of passwords, and memory usage didn't even exceed 1MB on my potato system.
The Full Code
Dictionary.java
import java.util.List;
import java.util.stream.Collectors;
final class Dictionary
{
    private static final class Info
    {
        private final int longestLine;
        private int highestOccurence;
        private String longestWord="";
        public Info(int longestLine) {this.longestLine = longestLine;}
    }
    private static Info getInfo(List<String> lines)
    {
        int currentLine;
        int longestLine=0;
        for(String line:lines)
        {
            //find the longest line in the password list
            if((currentLine=line.length())>longestLine)
            {
                longestLine=currentLine;
            }
        }
        return new Info(longestLine);
    }
    static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
    {
        //the user doesn't want a dictionary, hence return an empty string
        if(minLength<0){return "";}
        Info info=getInfo(lines);
        minLength=Math.max(2,minLength);
        maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
        String fullText=lines.stream().collect(Collectors.joining("\n"));
        int startIndex;
        int count;
        int length=fullText.length();
        outer: for(int i=0;i<length;i++)
        {
            for(int n=minLength;n<maxLength;n++)
            {
                //if the substring would run past the end of the text, stop
                if((i+n+1)>length){break outer;}
                //skip combinations shorter than the current longestWord, as we don't care about shorter words
                if(n+1<info.longestWord.length()){continue;}
                //get the combination
                String combination=fullText.substring(i,i+n+1);
                //count non-overlapping occurrences, searching from the start of the text
                count=0;
                startIndex=0;
                do
                {
                    startIndex=fullText.indexOf(combination,startIndex);
                    if(startIndex!=-1)
                    {
                        startIndex+=combination.length();
                        count++;
                    }
                }
                while(startIndex!=-1);
                //we only care about combinations that occur more than once
                if(count>1)
                {
                    if(combination.length()>info.longestWord.length()) //a longer combination always takes precedence
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                    else if(count>info.highestOccurence) //equal length: keep the one with more occurrences
                    {
                        info.longestWord=combination;
                        info.highestOccurence=count;
                    }
                }
            }
        }
        return info.longestWord;
    }
}
Compressor.java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Compressor
{
    public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
    {
        try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
            FileOutputStream fos=new FileOutputStream(output))
        {
            ArrayList<String> lines=new ArrayList<>();
            Deflater compressor=new Deflater(9);
            ByteArrayOutputStream baos=new ByteArrayOutputStream();
            DataOutputStream dos=new DataOutputStream(fos);
            int linesDone=0;
            String line;
            while(true)
            {
                line=reader.readLine();
                //add only if more input is available
                if(line!=null){lines.add(line);}
                //parse if we have linesPerChunk lines or if this was the last input
                if(line==null || lines.size()>=linesPerChunk)
                {
                    //occurs if this was the last input and there are no leftover lines to parse
                    if(lines.isEmpty()){break;}
                    byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
                    //compress the chunk
                    baos.reset();
                    compressor.reset();
                    if(dictionary.length!=0){compressor.setDictionary(dictionary);}
                    try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compressor))
                    {
                        deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
                        deos.flush();
                    }
                    byte[] compressed=baos.toByteArray();
                    //write the chunk
                    dos.writeInt(dictionary.length);
                    dos.write(dictionary);
                    dos.writeInt(compressed.length);
                    dos.write(compressed);
                    dos.flush();
                    System.out.println("Compressed Lines="+(linesDone+=lines.size()));
                    //clear to read the next chunk
                    lines.clear();
                }
                if(line==null){break;}
            }
            //used while reading to detect the end point
            dos.writeInt(-1);
        }
    }
}
Validater.java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
public class Validater
{
    public static boolean isAllowed(String password,File compressed)throws Exception
    {
        Inflater inf=new Inflater();
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed)))
        {
            int size;
            while((size=dis.readInt())!=-1)
            {
                byte[] dictionary=new byte[size];
                dis.readFully(dictionary);
                byte[] data=new byte[dis.readInt()];
                dis.readFully(data);
                inf.reset();
                try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
                {
                    baos.reset();
                    int length;
                    byte[] output=new byte[1000];
                    while(true)
                    {
                        length=iis.read(output);
                        //a preset dictionary must be supplied before inflation can proceed
                        if(inf.needsDictionary()){inf.setDictionary(dictionary);}
                        else
                        {
                            if(length==-1){break;}
                            baos.write(output,0,length);
                        }
                    }
                }
                //the actual validation (note: contains() also matches substrings of longer passwords)
                if(new String(baos.toByteArray()).contains(password)){return false;}
            }
        }
        return true;
    }
}
All test cases; rename any method to main and test:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.PrintWriter;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
final class Test
{
public static void main1(String[] args)throws Exception
{
String text="Pass A\nPass B\n Pass C\nPass D";
Deflater def=new Deflater(9);
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
byte[] original=text.getBytes();
try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
{
try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
{
dos.write(original);
dos.flush();
}
System.out.println(original.length+"/"+baos.toByteArray().length);
}
}
public static void main2(String[] args) throws Exception
{
try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
{
Random random=new Random();
int
ch,
passlength;
String password;
for(int i=0;i<4*1000*1000;i++)
{
password="";
passlength=random.nextInt(5,15);
for(int j=0;j<passlength;j++)
{
ch=random.nextInt(65,122);
if(ch>=91 && ch<=96){continue;}
password+=(char)(ch);
}
if(password.isEmpty()){continue;}
writer.println(password);
}
}
}
public static void main(String[] args)
{
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"ABC"
,"ABCD"
)
,2,3
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"Pass A1"
,"Pass B2"
,"Pass C3"
,"Pass D4"
)
,2,4
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"123/ABC/345/ABC/678/ABC/908"
)
,2,5
)
);
System.out.println
(
"Dictionary="+Dictionary.getLongestOccurence
(
List.of
(
"HHH123HHH456HHH789HHH092"
)
,2,4
)
);
}
public static void main4(String[] args)throws Exception
{
Compressor.compress(new File("E:/Passwords2.txt"),2,2,4,new File("E:/Compressed2.bin"));
}
public static void main5(String[] args)throws Exception
{
//beginning of the file
long start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//middle of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
//end of the file
start=System.currentTimeMillis();
System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
System.out.println(System.currentTimeMillis()-start);
}
}
Conclusion
Well, this was very insightful. All the classes used here are present in Java 7, except streams, which are Java 8 (which most Android versions support), so you should be OK on that end. If you can't be bothered to read my poorly documented explanations, you can go right ahead and copy the full code after the Full Code section. This took pretty much the whole day. One downside: while your original compressed file was 11MB, the final compressed file on my system came to 26.9MB for my randomly generated passwords, but as you have seen from the outputs, memory use and speed are overall very good. You can play around with the settings to get your preferred sizes. Also, if you have a better approach for finding the dictionary, please do edit this answer. Don't bother using SQLite.
Have a good day now :)
EDIT: The Dictionary class is now much faster and improved. Hope it helps.