How to read the headers of a csv file using csv module in "rb" mode?

I am currently reading the csv file in "rb" mode and uploading the file to an s3 bucket.

with open(csv_file, 'rb') as DATA:
    s3_put_response = requests.put(s3_presigned_url,data=DATA,headers=headers)

All of this is working fine but now I have to validate the headers in the csv file before making the put call.

When I run the code below, I get an error.

with open(csv_file, 'rb') as DATA:
    csvreader = csv.reader(DATA)
    columns = next(csvreader)
    # run-some-validations
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)

This throws

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

As a workaround, I have created a new function which opens the file in "r" mode and does validation on the csv headers and this works ok.

def check_csv_headers():
    with open(csv_file, 'r') as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)

I do not want to read the same file twice: once for header validation and once for uploading to s3. The upload part also doesn't work if I do it in "r" mode.

Is there a way I can achieve this while reading the file only once in "rb" mode? I have to make this work using the csv module and not the pandas library.

Answer:

Doing what you want is possible but not very efficient. Opening a file a second time isn't that expensive, and csv.reader reads only one line at a time, not the entire file.
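A small sketch of that lazy behaviour, assuming a plain header with no quoted line breaks. The counting_lines generator and the sample rows are hypothetical stand-ins for a real file; they only make visible how many lines csv.reader pulls when you ask for the first row.

```python
import csv

def counting_lines(lines, counter):
    """Yield lines one by one, counting how many the consumer has pulled."""
    for line in lines:
        counter["n"] += 1
        yield line

counter = {"n": 0}
rows = ["id,name\n", "1,a\n", "2,b\n"]  # hypothetical sample data

reader = csv.reader(counting_lines(rows, counter))
header = next(reader)            # asks the reader for the first row only
lines_consumed = counter["n"]    # how many input lines that required

print(header, lines_consumed)
```

Only the header line is consumed here; the remaining rows are never touched unless you keep iterating.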

To do what you want, you have to:

Read the first line as bytes
Decode it into a string (using the correct encoding)
Wrap it in a list of strings, because csv.reader expects an iterable of lines
Parse it with csv.reader and finally
Seek back to the start of the stream.

Without the final seek, the upload starts after the header line, so you end up uploading only the data without the headers:

with open(csv_file, 'rb') as DATA:
    header = DATA.readline()          # first line, as bytes
    lines = [header.decode()]         # decode() defaults to UTF-8
    csvreader = csv.reader(lines)
    columns = next(csvreader)
    # run-some-validations
    DATA.seek(0)                      # rewind so the header is uploaded too

    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
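The same steps can be tried without a real file or an S3 bucket. This is only a sketch: it assumes the file is UTF-8, uses io.BytesIO as a stand-in for the file opened in "rb" mode, and read_header is a hypothetical helper name, not part of the csv module.

```python
import csv
import io

def read_header(binary_stream, encoding="utf-8"):
    """Return the first CSV row of a binary stream, then rewind the stream."""
    first_line = binary_stream.readline()      # bytes, up to and including the newline
    text = first_line.decode(encoding)         # assumption: the file is UTF-8
    columns = next(csv.reader([text]), [])     # parse just that one line
    binary_stream.seek(0)                      # rewind so a later upload sends the header too
    return columns

# Stand-in for open(csv_file, 'rb'); the content is made up for illustration.
stream = io.BytesIO(b"id,name,email\r\n1,Ann,ann@example.com\r\n")

columns = read_header(stream)
remaining = stream.read()   # what an upload would send after the header check

print(columns)
print(remaining)
```

Because of the seek(0), the bytes read afterwards still begin with the header line. The sketch does not handle a header containing a quoted field with an embedded line break, and a file saved with a byte order mark would need encoding="utf-8-sig".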

Opening the file as text is not only simpler, it allows you to separate the validation logic from the upload code.

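To keep the read small you can pass buffering=1, which requests line buffering in text mode. Line buffering mainly affects when writes are flushed; on reads Python still fetches one buffer-sized chunk, so only the start of the file is read for the header, not the whole file.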

def check_csv_headers():
    # Pass 1: open as text and read only the header row
    with open(csv_file, 'r', buffering=1) as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)
        # run-some-validations

    # Pass 2: reopen as bytes for the upload
    with open(csv_file, 'rb') as DATA:
        s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)

Alternatively, split validation and upload into separate functions:

def check_csv_headers(file_path):
    with open(file_path, 'r', buffering=1) as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)
        # run-some-validations
        # If successful
        return True

def upload_csv(file_path):
    if check_csv_headers(file_path):
        with open(file_path, 'rb') as DATA:
            s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
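The validation step itself is left open above. One possible shape, purely as an illustration: REQUIRED_COLUMNS and missing_columns are hypothetical names, and the case-insensitive, whitespace-tolerant comparison is an assumption about what "valid" means for your file.

```python
# Hypothetical set of columns the upload is expected to contain.
REQUIRED_COLUMNS = {"id", "name", "email"}

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the parsed header row."""
    # Assumption: header names are compared ignoring case and surrounding spaces.
    present = {column.strip().lower() for column in columns}
    return sorted(required - present)

# Example header row as csv.reader would return it.
missing = missing_columns(["ID", " name", "phone"])
print(missing)
```

An empty result would mean the header passes, so check_csv_headers could return True only in that case.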