I'm learning how to operate on files and decided to go big. I generated a 30 GB file of random text using a small dictionary; the file consists of 1 billion rows, and every row is 2 strings delimited by a space.
My specs: Ryzen 3600, 16 GB 3200 MHz RAM, 512 GB SSD
Text sample:
orthodox_jungle sharp_opera
vertical_referee close_steward
express_wheel intermediate_building
painful_damage similar_fly
violent_justification colourful_opposition
So far I've been able to chunk the main txt file into many temporary files that will later be sorted and merged back into one file (removing duplicates).
Main Problem
I'm stuck with the while ((line = bufferedReader.readLine()) != null)
condition. Because the file is split strictly every time the counter reaches 1.5 million (or any other split size, 1 million or 100k), the remaining text after the last full chunk is never processed and saved. Also, creating a new List
within the while
loop's local scope doesn't let me save that list to a file once the loop is over.
Method that does the chunking
public void readFromFileTest(String filePath) throws IOException {
    long start = System.nanoTime();
    String path = "/home/developer/Downloads/tmpfiles/temporaryToSort%d.txt";
    BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath));
    String line;
    List<String> listToSort = new ArrayList<>();
    int currentLineCounter = 0;
    int temporaryFileCounter = 0;
    while ((line = bufferedReader.readLine()) != null) {
        if (currentLineCounter == 1500000) {
            String tmpFileLocation = String.format(path, temporaryFileCounter);
            sortAndSaveListToFile(tmpFileLocation, listToSort);
            currentLineCounter = 0;
            temporaryFileCounter++;
            listToSort = new ArrayList<>();
        }
        String[] arrayOfWords = line.split(" ");
        for (String word : arrayOfWords) {
            listToSort.add(word + "\n");
        }
        // \n is needed, because otherwise my temporary text file would be considered
        // one single big String of 50 MB size,
        // hence I can't use listToSort.addAll(Arrays.asList(line.split(" ")));
        //listToSort.addAll(Arrays.asList(line.split(" ")));
        currentLineCounter++;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
}
Proof that the last piece of text is not processed
Reading the last rows from the main file and from the last saved temporary file, I get different text samples:
public List<String> getLastLinesFromFile(int numLastLineToRead) throws FileNotFoundException {
    File f = new File("/home/developer/Downloads/verylargefile/verylargefile.txt");
    // File f = new File("/home/developer/Downloads/tmpfiles/temporaryToSort665.txt");
    List<String> result = new ArrayList<>();
    try (ReversedLinesFileReader reader = new ReversedLinesFileReader(f, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null && result.size() < numLastLineToRead) {
            result.add(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result;
}
returns the last rows for the main file and the tmp file respectively:
optimistic_direction senior_ghost
exact_admiration influential_cereal
useful_rice charismatic_syndrome
and
main_sweater
general_review
current_passion
afraid_lemon
stunning_garbage
presidential_dialect
low_cathedral
full_accountant
crude_survivor
Possible workaround
Simply calculate the number of lines in the file, then replace the while
condition with a simple for loop and save everything when i == lineCount
. But it takes 85 seconds just to count the lines on my machine, and that stinks.
long lineCount;
try (Stream<String> stream = Files.lines(Path.of(filePath), StandardCharsets.UTF_8)) {
    lineCount = stream.count();
}
System.out.println(lineCount);
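(If counting lines really were necessary, Files.lines pays for UTF-8 decoding of every row. A sketch that just counts '\n' bytes in a raw buffered scan is usually much faster; note it undercounts by one if the file doesn't end with a newline. Class and method names here are made up.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineCounter {
    // Counts '\n' bytes with a plain buffered byte scan,
    // skipping all String decoding that Files.lines would do.
    public static long countLines(Path file) throws IOException {
        long count = 0;
        byte[] buf = new byte[1 << 20]; // 1 MB read buffer
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buf[i] == '\n') count++;
                }
            }
        }
        return count;
    }
}
```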
My initial plan was to use RandomAccessFile and read the file into a byte[] array, allocating 128 MB of memory for that array, and then just save the remainder of the array to the last file. But it was too much of a hassle reinventing BufferedReader's readLine()
and the file-pointer repositioning and seeking to the newline byte, then saving the chopped string piece to a new byte array.
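(For what it's worth, that plan doesn't strictly need RandomAccessFile or pointer seeking. A minimal sketch of the same idea with a plain InputStream: read a fixed-size block, cut it at the last newline, and carry the chopped tail over to the front of the next block. Names are illustrative, and real code would process each chunk instead of collecting them all in memory.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkReader {
    // Reads the file in fixed-size byte blocks and splits each block at the
    // last newline, carrying the chopped tail over to the next block.
    public static List<String> readInChunks(Path file, int bufferSize) throws IOException {
        List<String> chunks = new ArrayList<>();
        byte[] carry = new byte[0];
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[bufferSize];
            int read;
            while ((read = in.read(buf)) != -1) {
                // stitch the carried-over tail onto the front of this block
                byte[] block = new byte[carry.length + read];
                System.arraycopy(carry, 0, block, 0, carry.length);
                System.arraycopy(buf, 0, block, carry.length, read);
                int lastNl = lastNewline(block);
                if (lastNl < 0) {   // no newline at all: carry everything over
                    carry = block;
                    continue;
                }
                chunks.add(new String(block, 0, lastNl, StandardCharsets.UTF_8));
                carry = Arrays.copyOfRange(block, lastNl + 1, block.length);
            }
        }
        if (carry.length > 0) {     // don't lose the final partial line
            chunks.add(new String(carry, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    private static int lastNewline(byte[] block) {
        for (int i = block.length - 1; i >= 0; i--) {
            if (block[i] == '\n') return i;
        }
        return -1;
    }
}
```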
Any other examples of how to effectively chunk big text files would be appreciated; I know my implementation is hot garbage.
Send help :)
CodePudding user response:
I recommend taking a look at split from AT&T's Unix. A simple solution here is to invert the loop (keep going until readLine returns null) and OR an extra condition into the write block (... || line == null), so the final partial chunk is also written.
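Sketched with the chunking reduced to a plain iterator so it is easy to test (class and method names are made up): the flush block fires either when the counter hits the limit or when the reader has returned null, so the tail is never dropped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Chunker {
    // The flush condition is ORed with end-of-input, so the last
    // partial chunk is written out instead of being lost.
    public static List<List<String>> chunk(Iterator<String> lines, int limit) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String line = lines.hasNext() ? lines.next() : null;
        while (true) {
            boolean eof = (line == null);
            if (current.size() == limit || eof) {
                if (!current.isEmpty()) {
                    chunks.add(current);       // in the real code: sortAndSaveListToFile
                    current = new ArrayList<>();
                }
                if (eof) break;
            }
            current.add(line);
            line = lines.hasNext() ? lines.next() : null;
        }
        return chunks;
    }
}
```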
CodePudding user response:
IMHO all you need is the following, placed after the while ((line = bufferedReader.readLine()) != null) { /*...*/ }
block:
if (!listToSort.isEmpty()) {
    String tmpFileLocation = String.format(path, temporaryFileCounter);
    sortAndSaveListToFile(tmpFileLocation, listToSort);
}
or, equivalently, using size():
if (listToSort.size() > 0) {
    String tmpFileLocation = String.format(path, temporaryFileCounter);
    sortAndSaveListToFile(tmpFileLocation, listToSort);
}