Issues with splitting a txt file (1 billion rows) into chunks (1.5 mil rows each), unable to figure how to handle the last partial chunk


I'm learning how to operate with files and decided to go big. I generated a 30 GB file of random text using a small dictionary; the file consists of 1 billion rows, and every row is 2 strings delimited by a space.

My specs: Ryzen 3600, 16 GB 3200 MHz RAM, 512 GB SSD

Text sample:

orthodox_jungle sharp_opera
vertical_referee close_steward
express_wheel intermediate_building
painful_damage similar_fly
violent_justification colourful_opposition

So far I've been able to chunk the main txt file into many temporary files that will later be sorted and merged back into one whole file (removing duplicates).
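For the record, the merge step I have in mind is a standard k-way merge. A minimal sketch, assuming each temporary file is already sorted one word per line; the ChunkMerger class, parameter names, and paths are placeholders, not part of my actual code:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.PriorityQueue;

    public class ChunkMerger {

        // Merges sorted chunk files (pathPattern contains a %d index) into one
        // sorted, de-duplicated output file using a k-way merge.
        public static void mergeChunks(String pathPattern, int chunkCount, String outPath)
                throws IOException {
            BufferedReader[] readers = new BufferedReader[chunkCount];
            // Each queue entry is {headLine, readerIndex}, ordered by the line text.
            PriorityQueue<Object[]> queue =
                    new PriorityQueue<>((a, b) -> ((String) a[0]).compareTo((String) b[0]));

            try (BufferedWriter writer = new BufferedWriter(new FileWriter(outPath))) {
                for (int i = 0; i < chunkCount; i++) {
                    readers[i] = new BufferedReader(new FileReader(String.format(pathPattern, i)));
                    String first = readers[i].readLine();
                    if (first != null) {
                        queue.add(new Object[] {first, i});
                    }
                }

                String previous = null; // last line written; a repeat is a duplicate
                while (!queue.isEmpty()) {
                    Object[] head = queue.poll();
                    String line = (String) head[0];
                    int idx = (int) head[1];

                    if (!line.equals(previous)) { // skip duplicates
                        writer.write(line);
                        writer.newLine();
                        previous = line;
                    }

                    // Refill the queue from the chunk that just lost its head line.
                    String next = readers[idx].readLine();
                    if (next != null) {
                        queue.add(new Object[] {next, idx});
                    }
                }
            } finally {
                for (BufferedReader r : readers) {
                    if (r != null) {
                        r.close();
                    }
                }
            }
        }
    }

The priority queue holds only one line per chunk file at a time, so memory stays flat no matter how big the chunks are.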

Main Problem

I'm stuck with the while ((line = bufferedReader.readLine()) != null) condition. Because the file is split STRICTLY each time the counter reaches 1.5 million (or any other split size, say 1 million or 100k), the remaining text after the last full chunk is never processed and saved. Also, creating a new List inside the while loop's local scope doesn't let me save that list to a file once the loop is over.

Method that does the chunking

    public void readFromFileTest(String filePath) throws IOException {

        long start = System.nanoTime();

        String path = "/home/developer/Downloads/tmpfiles/temporaryToSort%d.txt";

        BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath));
        String line;
        List<String> listToSort = new ArrayList<>();

        int currentLineCounter = 0;
        int temporaryFileCounter = 0;

        while ((line = bufferedReader.readLine()) != null) {
            if (currentLineCounter == 1500000) {

                String tmpFileLocation = String.format(path, temporaryFileCounter);

                sortAndSaveListToFile(tmpFileLocation, listToSort);

                currentLineCounter = 0;
                temporaryFileCounter++;

                listToSort = new ArrayList<>();
            }

            String[] arrayOfWords = line.split(" ");

            for (String word : arrayOfWords) {
                listToSort.add(word + "\n ");
            }
            // \n is needed, because otherwise my temporary text file would be
            // considered one single big String of 50 MB size,
            // hence I can't use listToSort.addAll(Arrays.asList(line.split(" ")));

            currentLineCounter++;
        }

        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f seconds to read, sort and write to a file%n", time / 1e9);
    }

Proof that the last piece of text is not processed

Reading the last rows from the main file and from the last saved temporary file, I get different text samples:

    import org.apache.commons.io.input.ReversedLinesFileReader;

    public List<String> getLastLinesFromFile(int numLastLineToRead) throws FileNotFoundException {

        File f = new File("/home/developer/Downloads/verylargefile/verylargefile.txt");
        //File f = new File("/home/developer/Downloads/tmpfiles/temporaryToSort665.txt");

        List<String> result = new ArrayList<>();

        try (ReversedLinesFileReader reader = new ReversedLinesFileReader(f, StandardCharsets.UTF_8)) {
            String line = "";
            while ((line = reader.readLine()) != null && result.size() < numLastLineToRead) {
                result.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }

This returns the last rows of the main file and of the tmp file, respectively:

optimistic_direction senior_ghost
exact_admiration influential_cereal
useful_rice charismatic_syndrome

and

 main_sweater
 general_review
 current_passion
 afraid_lemon
 stunning_garbage
 presidential_dialect
 low_cathedral
 full_accountant
 crude_survivor

Possible workaround

Simply calculate the number of lines in the file, then replace the while condition with a simple for loop and save everything when i == lineCount. But it takes 85 seconds just to count all the lines on my machine, and that stinks.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

        long lineCount;

        try (Stream<String> stream = Files.lines(Path.of(filePath), StandardCharsets.UTF_8)) {
            lineCount = stream.count();
        }

        System.out.println(lineCount);

My initial plan was to use RandomAccessFile and read the file into a byte[] array, allocating 128 MB of memory for that array, then just save the remainder of the array to the last file. But it was too much of a hassle reinventing BufferedReader's readLine(), with the file-pointer repositioning and seeking to the newline byte, then saving the chopped string piece to a new byte array.
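For what it's worth, the only genuinely fiddly part of that plan is carrying the chopped line across block boundaries. Here is a rough sketch of just that part, assuming '\n' line endings; the 128 MB buffer matches the plan above, while the BlockReader class and the process(text) step are placeholders of mine:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class BlockReader {

        // Reads the file in large fixed-size blocks and carries the chopped
        // last line of each block over to the next one, so no bytes are lost.
        public static void readInBlocks(String filePath) throws IOException {
            byte[] buffer = new byte[128 * 1024 * 1024]; // 128 MB per block
            byte[] leftover = new byte[0];               // partial line from the previous block

            try (FileInputStream in = new FileInputStream(filePath)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    // Everything after the block's last '\n' is an incomplete line.
                    int lastNewline = read - 1;
                    while (lastNewline >= 0 && buffer[lastNewline] != '\n') {
                        lastNewline--;
                    }
                    if (lastNewline < 0) {
                        // No newline in this block at all: append it to leftover and keep reading.
                        byte[] grown = new byte[leftover.length + read];
                        System.arraycopy(leftover, 0, grown, 0, leftover.length);
                        System.arraycopy(buffer, 0, grown, leftover.length, read);
                        leftover = grown;
                        continue;
                    }

                    // Stitch leftover + this block up to (and including) the newline,
                    // then decode once so multi-byte characters split across blocks survive.
                    byte[] chunk = new byte[leftover.length + lastNewline + 1];
                    System.arraycopy(leftover, 0, chunk, 0, leftover.length);
                    System.arraycopy(buffer, 0, chunk, leftover.length, lastNewline + 1);
                    String text = new String(chunk, StandardCharsets.UTF_8);
                    // process(text): split into words, sort, save a chunk file, etc.

                    // Stash the chopped tail for the next iteration.
                    int tailLength = read - (lastNewline + 1);
                    leftover = new byte[tailLength];
                    System.arraycopy(buffer, lastNewline + 1, leftover, 0, tailLength);
                }
                // A final leftover (file not ending in '\n') would still need processing here.
            }
        }
    }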

Any other examples of how to effectively chunk big text files would be appreciated; I know my implementation is hot garbage.

Send help :)

CodePudding user response:

I recommend taking a look at split from AT&T's Unix. A simple solution is to change your while to an inverted until loop (loop until readLine() returns null) and OR the end-of-file condition into the write block's test ("or not line"), so the last partial chunk gets written too.
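In Java that restructuring would look roughly like this (a sketch reusing the variables from the question's method; the 1.5 million chunk size is the asker's):

    do {
        line = bufferedReader.readLine();
        if (line != null) {
            for (String word : line.split(" ")) {
                listToSort.add(word + "\n ");
            }
            currentLineCounter++;
        }
        // Write when the chunk is full OR at end of file ("or not line"),
        // so the final partial chunk is flushed as well.
        if ((currentLineCounter == 1500000 || line == null) && !listToSort.isEmpty()) {
            String tmpFileLocation = String.format(path, temporaryFileCounter);
            sortAndSaveListToFile(tmpFileLocation, listToSort);
            currentLineCounter = 0;
            temporaryFileCounter++;
            listToSort = new ArrayList<>();
        }
    } while (line != null);

And for reference, the Unix tool itself would be something like split -l 1500000 verylargefile.txt, which does the line-based chunking in a single command.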

CodePudding user response:

IMHO all you need is the following, placed after the while ((line = bufferedReader.readLine()) != null) { /*...*/ } block:

    if (!listToSort.isEmpty()) {
        String tmpFileLocation = String.format(path, temporaryFileCounter);
        sortAndSaveListToFile(tmpFileLocation, listToSort);
    }

or, equivalently, checking the size directly:

    if (listToSort.size() > 0) {
        String tmpFileLocation = String.format(path, temporaryFileCounter);
        sortAndSaveListToFile(tmpFileLocation, listToSort);
    }