Can data.table's fread() skip second empty row and keep first row headers?-CodePudding

I am trying to read a CSV where column headers are on row 1, but then row 2 is empty and data starts in row 3. I tried the various options below, but always end up with generic V# column names. Any ideas on how to retain column headers?

fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      header = F)
fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      skip = 0)
fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      skip = 1)
fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      blank.lines.skip = T)
fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      skip = 0, blank.lines.skip = T)
fread("https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv",
      header = F, skip = 0, blank.lines.skip = T)

CodePudding user response：

url ="https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv"
df = fread(url,header=F)
headers = names(fread(url, nrows=0))
setnames(df, old=1:length(headers), new = headers)

I notice that there are 20 headers, but 22 columns returned. Therefore, I named the first 20 columns with the 20 names in headers.

As suggested by r2evans in comments, to avoid the double download/read, you could do something like this:

url ="https://s3.amazonaws.com/nyc-tlc/trip data/green_tripdata_2013-08.csv"

# download file
tfile = tempfile()
curl::curl_download(url,destfile = tfile)

# read to get headers
headers = names(fread(tfile, nrows=0))

# read to get data
df = fread(tfile, header=F)

# set the names based on `headers`
setnames(df, old=1:length(headers), new = headers)

# remove the file
file.remove(tfile)