Restart reading csv file from a defined position


I need to process a big file in Go, so I don't want to load all the rows of my csv file at once but instead process them in groups.

To restart the computation from where I left off, I currently use a for loop to skip the rows already read:

for idx := 0; idx < startAt; idx++ {
    //Read rows and do nothing with the returned value
    if _, readErr := reader.Read(); readErr != nil {
        if readErr == io.EOF {
            //File end -> OK
            isEOF = true
            break
        } else {
            //Read failed
            return nil, errors.New(DATA_READ_ERROR)
        }
    }
}

This is a pretty simple solution; however, it is obviously inefficient: every restart has to re-read all the rows already processed just to skip them, so the time needed to reach the restart point keeps growing as I move through the file.

To reduce this time I tried different alternatives, but none of them works properly and they all make the reader fail (rows are not read from the right position).

For instance, I tried to save the current position of the file pointer (using file.Seek(0, io.SeekCurrent)) and then, on the next iteration, to move the pointer back with file.Seek(oldPosition, io.SeekStart), but it didn't work as expected.

Is there a way to avoid the loop above and improve the reading time when restarting from where I left off?

Update

The way I used file.Seek is very simple.

//compute data

func computeData(nrows int, startAt int64) int64 {
    //Open file
    if csvFile, openErr := os.Open(config.DataSrcFile); openErr == nil {
        //Create a reader
        reader := csv.NewReader(csvFile)
        //Position the file pointer to the start point
        csvFile.Seek(startAt, io.SeekStart)
        //Read n rows
        for idx := 0; idx < nrows && !isEOF; idx++ {
            if csvLine, readErr := reader.Read(); readErr == nil {
                //Do stuff with csvLine...
                _ = csvLine
            } else if readErr == io.EOF {
                //File end -> OK
                break
            } else {
                //Read failed -> return error (omitted here)
            }
        }
        //Return bytes read (actually simplified, in the real code the error is
        //not ignored)
        bytesRead, _ := csvFile.Seek(0, io.SeekCurrent)
        return bytesRead
    }
    return startAt
}

func main() {
    var startAt int64 = 0
    nrows := 1000
    for !isMyConditionMatched {
        bytesRead := computeData(nrows, startAt)
        startAt += bytesRead
    }
}

CodePudding user response:

The problem here is that encoding/csv internally uses a buffered reader, so when you execute file.Seek(0, io.SeekCurrent) you get the position in the underlying file, but part of the data that has already been read from the file is still sitting in that buffer, unconsumed.
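
To see the effect, here is a small illustration (the file name is made up and error handling is skipped for brevity): after reading a single record, the offset reported by Seek is typically well past the end of that record, because the internal buffered reader has already pulled a whole buffer from the file.

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
)

func main() {
    file, _ := os.Open("data.csv") // hypothetical file; errors ignored for brevity
    reader := csv.NewReader(file)
    record, _ := reader.Read()              // consumes only the first record...
    pos, _ := file.Seek(0, io.SeekCurrent)  // ...but pos is usually already well past
                                            // the end of that record, because the
                                            // internal buffer was filled from the file
    fmt.Println(record, pos)
}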

There are two possible solutions:

  • one is to use lower-level reading primitives that let you control exactly where you are in the file (a rough sketch of this is shown right after this list)
  • the other is to find out how much buffered data there is.
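
A rough sketch of the first option (not from the original answer, and assuming rows never contain embedded newlines inside quoted fields): read raw lines yourself with a bufio.Reader so you always know exactly how many bytes each row used, and parse each line separately with encoding/csv.

package main

import (
    "bufio"
    "encoding/csv"
    "io"
    "os"
    "strings"
)

// readRows reads up to nrows CSV rows starting at byte offset startAt and
// returns the offset to resume from on the next call.
func readRows(path string, nrows int, startAt int64) (int64, error) {
    file, err := os.Open(path)
    if err != nil {
        return startAt, err
    }
    defer file.Close()

    if _, err := file.Seek(startAt, io.SeekStart); err != nil {
        return startAt, err
    }

    br := bufio.NewReader(file)
    offset := startAt
    for i := 0; i < nrows; i++ {
        line, readErr := br.ReadString('\n')
        offset += int64(len(line)) // exact number of bytes this row used
        if len(line) > 0 {
            record, perr := csv.NewReader(strings.NewReader(line)).Read()
            if perr != nil && perr != io.EOF {
                return offset, perr
            }
            _ = record // do stuff...
        }
        if readErr == io.EOF {
            break
        }
        if readErr != nil {
            return offset, readErr
        }
    }
    return offset, nil
}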

I'll show you an implementation of the second option (note that this relies on some knowledge of the internal workings of the encoding/csv package and may stop working if they change).

First, you create your own bufio.Reader before creating the csv reader:

        //Position the file pointer to the start point
        file.Seek(startAt, io.SeekStart)
        bReader := bufio.NewReader(file)

        //Create a reader
        reader := csv.NewReader(bReader)

This gives you access to the buffer. You can use this reader exactly as you already do, but at the end you calculate the final position in the file like this:

        bufSize := bReader.Buffered()
        filePos, err := file.Seek(0, io.SeekCurrent)
        return filePos - int64(bufSize)

This takes the current position in the underlying file and subtracts the data that was read into the buffer but not yet consumed by the csv reader.

Note that the value returned is the position in the file and not the number of bytes read in this call to the function.
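
Putting the pieces together, here is a minimal end-to-end sketch of this approach (function and file names such as processChunk and data.csv are illustrative, not from the original post):

package main

import (
    "bufio"
    "encoding/csv"
    "fmt"
    "io"
    "os"
)

// processChunk reads up to nrows records starting at byte offset startAt and
// returns the file position to resume from, computed as described above.
func processChunk(path string, nrows int, startAt int64) (int64, error) {
    file, err := os.Open(path)
    if err != nil {
        return startAt, err
    }
    defer file.Close()

    // Position the file pointer at the restart point.
    if _, err := file.Seek(startAt, io.SeekStart); err != nil {
        return startAt, err
    }

    // Wrap the file in our own bufio.Reader so we can later ask how much
    // data is still buffered.
    bReader := bufio.NewReader(file)
    reader := csv.NewReader(bReader)

    for i := 0; i < nrows; i++ {
        record, readErr := reader.Read()
        if readErr == io.EOF {
            break
        }
        if readErr != nil {
            return startAt, readErr
        }
        _ = record // do stuff...
    }

    // Current position in the underlying file minus what is still sitting in
    // the buffer gives the position of the next unread record.
    filePos, err := file.Seek(0, io.SeekCurrent)
    if err != nil {
        return startAt, err
    }
    return filePos - int64(bReader.Buffered()), nil
}

func main() {
    var startAt int64
    for {
        next, err := processChunk("data.csv", 1000, startAt)
        if err != nil {
            fmt.Println("read error:", err)
            return
        }
        if next == startAt {
            // No progress was made, so the end of the file was reached.
            break
        }
        startAt = next
    }
}

This works because, as noted above, the csv reader ends up reading through the bufio.Reader you supplied, so bReader.Buffered() reflects exactly the data that was read from the file but not yet consumed.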
