Simplest ways of getting CSV headers from URL


I've built a web app where users can upload contacts via CSV files.

Once the CSV file is uploaded, I'd like to map the header fields to the app's fields.

The issue I'm facing is finding a simple, efficient way to get the list of column headers from a CSV URL.

Ideally, the list would be returned via an HTTP request, or alternatively fetched with JavaScript, so I can pass it back to the app.

For example, here is a CSV file containing hurricane counts: https://people.sc.fsu.edu/~jburkardt/data/csv/hurricanes.csv

The headers are:

Month, "Average", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"

My idea is to run a Python script in AWS Lambda that gets the CSV headers, then sends them to the app using an HTTP request.

Is there another way to do this client-side, or without having to set up backend infrastructure like AWS Lambda?

CodePudding user response:

requests can stream the response, so you can process the CSV line by line and stop after the first one. See: Body Content Workflow in the requests documentation.

Example:

import csv

import requests

url = "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"
r = requests.get(url, stream=True)

# Read only the first line, then break before the rest of the body is consumed.
for line in r.iter_lines():
    # csv.reader handles quoted fields that contain commas, unlike a plain split(",").
    header = next(csv.reader([line.decode("utf-8")]))
    print(header)
    break

This gives you only the header. I've used a raw CSV file from GitHub as an example: sample-csv

CodePudding user response:

The "simple" answer is obviously to download the whole file and work with that. But you may be able to save some time: if the server accepts partial downloads, you're in luck. To check:

import requests
requests.head("https://people.sc.fsu.edu/~jburkardt/data/csv/hurricanes.csv").headers

Yields

{'Date': 'Mon, 27 Dec 2021 14:00:21 GMT', 
'Server': 'Apache/2.4.46 (Fedora) OpenSSL/1.1.1g', 
'Last-Modified': 'Mon, 27 Jun 2016 12:37:53 GMT', 
'ETag': '"1f6-53641c9fb0a40"', 
'Accept-Ranges': 'bytes', 
'Content-Length': '502', 
'X-Frame-Options': 'SAMEORIGIN', 
'Keep-Alive': 'timeout=5, max=100', 
'Connection': 'Keep-Alive', 
'Content-Type': 'text/csv'}

Note the 'Accept-Ranges': 'bytes' entry: the server accepts requests for specific byte ranges of the file. Jackpot!
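That check can also be done programmatically; a small sketch (the helper names `advertises_byte_ranges` and `supports_ranges` are my own, not part of the original answer):

```python
import requests

def advertises_byte_ranges(headers) -> bool:
    """Return True if a response-header mapping advertises byte-range support."""
    return headers.get("Accept-Ranges", "").lower() == "bytes"

def supports_ranges(url: str) -> bool:
    """HEAD the URL and check whether the server accepts Range requests."""
    return advertises_byte_ranges(requests.head(url, allow_redirects=True).headers)
```

Servers that don't support ranges either omit the header or send `Accept-Ranges: none`, so the check covers both cases.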

Then it's just a matter of sending a Range header with requests:

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/hurricanes.csv"
resume_headers = {'Range': 'bytes=0-2048'}
r = requests.get(url, stream=True, headers=resume_headers)
for chunk in r.iter_content(chunk_size=1024):
    print(chunk)

Output:

b'"Month", "Average", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"\n"May",  0.1,  0,  0, 1, 1, 0, 0, 0, 2, 0,  0,  0  \n"Jun",  0.5,  2,  1, 1, 0, 0, 1, 1, 2, 2,  0,  1\n"Jul",  0.7,  5,  1, 1, 2, 0, 1, 3, 0, 2,  2,  1\n"Aug",  2.3,  6,  3, 2, 4, 4, 4, 7, 8, 2,  2,  3\n"Sep",  3.5,  6,  4, 7, 4, 2, 8, 5, 2, 5,  2,  5\n"Oct",  2.0,  8,  0, 1, 3, 2, 5, 1, 5, 2,  3,  0\n"Nov",  0.5,  3,  0, 0, 1, 1, 0, 1, 0, 1,  0,  1\n"Dec",  0.0,  1,  0, 1, 0, 0, 0, 0, 0, 0,  0,  1\n\n'

So we have our header, plus a few more lines.

Note that I print, but you could store the result in a variable, append it to a larger string, write it to a local CSV file, ... whatever suits your needs. Also, I requested a range of 2048 bytes and chunks of 1024 bytes; adapt these to your situation!
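Putting the two ideas together, here is a sketch of a header-only fetch. The helper names `parse_header_line` and `fetch_csv_header` are my own, and it assumes the header row fits within the first 2048 bytes:

```python
import csv
import io

import requests

def parse_header_line(line: str) -> list:
    """Parse a single CSV line into field names; handles quoted fields and
    the stray spaces after commas seen in the hurricanes file."""
    return next(csv.reader(io.StringIO(line), skipinitialspace=True))

def fetch_csv_header(url: str, nbytes: int = 2048) -> list:
    """Download only the first `nbytes` of the file via a Range request
    and return the parsed header row."""
    r = requests.get(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    r.raise_for_status()
    return parse_header_line(r.text.splitlines()[0])
```

Calling `fetch_csv_header` on the hurricanes URL above should return the header shown earlier ("Month", "Average", "2005", and so on) as a Python list.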
