I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#
I have tried a couple of ways to convert this .txt into a .csv. One of them was using the csv library, but since I like pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#@#@#", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds to process a 300MB file, which is performance I can accept, but when I try to do the same for a much larger file, around 35GB, it produces a MemoryError.
I also tried the csv library, putting everything into a list and afterward writing it out to a CSV:
import csv
# split_content holds the rows, split the same way as in the pandas version above
split_content = [x.split('~') for x in content_filtered.split('\n')]
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
The results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files (35GB and above) into .csv?
I'd be happy to read any suggestions you may have, thanks in advance!
CodePudding user response:
Since your code just does straight-up replacement, you can read through all the data sequentially and detect the parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#@#@#'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)

process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in chunks rather than one byte at a time (sketched below).
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading speed.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
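To illustrate the "reading in chunks" idea mentioned above, here is a minimal sketch of my own (not the code above): read fixed-size blocks, split on the #@#@# separator, and carry the last, possibly incomplete, piece over to the next block so a separator that straddles a block boundary is still handled. The chunk size and file names are just illustrative choices.
def process_chunked(fn_in, fn_out, columns, chunk_size=1024 * 1024):
    sep = b'#@#@#'
    carry = b''
    with open(fn_in, 'rb') as f_in, open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            # split on the separator; the last piece may be incomplete,
            # so hold it back until the next chunk arrives
            parts = (carry + chunk).split(sep)
            carry = parts.pop()
            for part in parts:
                f_out.write(part.replace(b"'", b'').replace(b'~', b',') + b'\n')
        # anything left over after the last chunk is the final record
        if carry:
            f_out.write(carry.replace(b"'", b'').replace(b'~', b',') + b'\n')

process_chunked('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
The carried-over tail stays tiny because individual records are short, so memory use remains roughly one chunk.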
CodePudding user response:
I couldn't shake wanting to do this with a command-line util like sed, but feeding gigabytes of data into sed seemed like a non-starter, so I searched for "sed long lines" and found this very helpful unix.stackexchange Q/A.
Perl & sed
If you have access to Perl, you can set a special variable for the input record separator, $/, that tells Perl where to break the stream. That gets you:
#!/bin/sh
# Write the header separately, first
echo 'a,b,c,d' > out.csv
# Process your file
cat my_file.txt \
| perl -pe 'BEGIN{ $/="#@#@#" } s/#@#@#/\n/g' \
| sed "s/\'//g" \
| sed 's/~/,/g' \
>> out.csv
When I run that on your provided sample, I get:
a,b,c,d
28807644,0,Maun FCU,US
28855353,0,WNB Holdings LLC,US
29212330,0,Idaho First Bank,US
29278777,0,Republic Bank of Arizona,US
29633181,0,Friendly Hills Bank,US
29760145,0,The Freedom Bank of Virginia,US
100504846,0,Community First Fund Federal Credit Union,US
I duplicated your sample one million times to get a 313MB file and ran that... it ran in 12s and peaked at ~2.5MB of memory:
11.57 real 4.74 user 0.13 sys
...
2405440 peak memory footprint
Running @Grismar's Python3 example against the 313MB file took ~90s and peaked at ~5MB:
93.80 real 92.61 user 0.61 sys
...
4821952
I also made a test file that was ten million times your original, at 3.1GB. The Perl script ran in 117s, exactly 10 times longer... I didn't even try Python.
I answered another unix.stackexchange question about processing large (GB) text files, and the conclusion there was the same: Python is just slow; use it if you have to, avoid it if you can.
Go
I also ported @Grismar's Python script to Go. It ran the 3.1GB file in under 30s, which is about 1/4 the time of the shell script.
package main

import (
    "bufio"
    "io"
    "os"
)

const bufferSize = 16 * 1024
const newline = `#@#@#`

func main() {
    fileOut, _ := os.Create("out.csv")
    defer fileOut.Close()
    w := bufio.NewWriter(fileOut)

    fileIn, _ := os.Open("my_file.txt")
    defer fileIn.Close()
    r := bufio.NewReaderSize(fileIn, bufferSize)

    // Write header
    w.Write([]byte("a,b,c,d\n"))

    var newlineIdx int // a positive idx means "in a newline"

    for {
        b, err := r.ReadByte()
        if err != nil {
            if err == io.EOF {
                break
            } else {
                panic(err)
            }
        }

        // Convert newline
        if b == newline[newlineIdx] {
            newlineIdx++
            if newlineIdx == len(newline) {
                w.WriteByte('\n')
                newlineIdx = 0
            }
            continue
        } else if newlineIdx > 0 {
            // There was a string that started looking like newline, but in the end was something else
            w.Write([]byte(newline[:newlineIdx]))
            newlineIdx = 0
        }

        switch b {
        case '\'', '\n':
            // Discard single quotes because they're an unwanted artifact
            // Discard newlines because they shouldn't even be there
        case '~':
            // Convert field delimiter
            w.WriteByte(',')
        default:
            // Write field data
            w.WriteByte(b)
        }
    }

    w.Flush()
}