Cleaning CSV file before reading

Time:06-10

I'm reading a big CSV file with the encoding/csv library.

But this file is a bit non-standard and contains non-escaped quotes " breaking the reader at parser.Read():

2022/06/09 17:33:54 parse error on line 2, column 5: extraneous or missing " in quoted-field

And if I use parser.LazyQuotes = true, I'm getting:

2022/06/09 17:34:15 record on line 2: wrong number of fields

Faulty CSV file (reduced to its minimum) foo.csv:

1|2
"a|b

So I need to remove all occurrences of double quotes ". I'm currently doing it on the whole file from the terminal using sed 's/"//g', but I want to remove them from the Go program instead.

How should I do it knowing that I'm reading the file like this:

func processCSV(filepath string) {
    file, err := os.Open(filepath) // e.g. "foo.csv"
    if err != nil {
        log.Fatal(err)
    }

    parser := csv.NewReader(file)
    parser.Comma = '|'
    // parser.LazyQuotes = true

    if _, err := parser.Read(); err != nil { // skip headers
        log.Fatal(err)
    }

    for {
        record, err := parser.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }

        // process record

    }
}

CodePudding user response:

Create an io.Reader that removes quotes from data read through an underlying io.Reader.

// rmquote reads r with " removed.
type rmquote struct {
    r io.Reader
}

func (c rmquote) Read(p []byte) (int, error) {
    n, err := c.r.Read(p)

    // i is output position for loop below
    i := 0

    // for each byte read from the file
    for _, b := range p[:n] {

        // skip quotes
        if b == '"' {
            continue
        }

        // copy byte to output position and advance position
        p[i] = b
        i++
    }

    // output position is the new length
    return i, err
}

Plumb it in between the CSV reader and file:

parser := csv.NewReader(rmquote{file})
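
For reference, here is a minimal, self-contained sketch that puts the pieces together and runs the reduced foo.csv content through the filter. The helper name parseClean and the use of an in-memory string instead of an opened file are assumptions made for illustration; they are not part of the original question.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

// rmquote reads r with " removed.
type rmquote struct {
	r io.Reader
}

func (c rmquote) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)

	// Compact the bytes that were read in place, dropping every quote.
	i := 0
	for _, b := range p[:n] {
		if b == '"' {
			continue
		}
		p[i] = b
		i++
	}

	// i is the number of bytes kept.
	return i, err
}

// parseClean is a hypothetical helper: it strips quotes from data,
// skips the header row, and returns the remaining records.
func parseClean(data string) ([][]string, error) {
	parser := csv.NewReader(rmquote{strings.NewReader(data)})
	parser.Comma = '|'

	if _, err := parser.Read(); err != nil { // skip headers
		return nil, err
	}
	return parser.ReadAll()
}

func main() {
	// Same content as the reduced foo.csv from the question.
	records, err := parseClean("1|2\n\"a|b\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(records) // prints [[a b]]
}
```

Because the filter works on raw bytes as they stream through, the file is never loaded into memory as a whole. Note that it removes every double quote, including ones that were quoting a field legitimately, so it only fits files where quotes carry no meaning.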