Read CSV using pandas-CodePudding

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:

import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')

However, I just need to get the dataset with the two columns: measure_code and measure_text. As this dataset as a title BP measure I also tried to read it like:

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)

But in this case it returns a dataset with just one column and I'm not being able to slipt it:

>>> df.columns
Index([' measure_code         measure_text'], dtype='object')

Any suggestion/idea on a better approach to get this dataset?

CodePudding user response：

It's definitely possible, but the format has a few quirks.

As you noted, the column headers start on line 2, so you need skiprows=1.
The file is space-separated, not tab-separated.
Column values are continued across multiple lines.

Issues 1 and 2 can be fixed using skiprows and sep. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.

Here's the code:

import requests
import re
import io
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'

# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
    .text \
    .replace("\r\n", "\n")

# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)

# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))

# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")

print(df)