I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I just need to get the dataset with the two columns: measure_code
and measure_text
. As this dataset as a title BP measure
I also tried to read it like:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in this case it returns a dataset with just one column and I'm not being able to slipt it:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestion/idea on a better approach to get this dataset?
CodePudding user response:
It's definitely possible, but the format has a few quirks.
- As you noted, the column headers start on line 2, so you need
skiprows=1
. - The file is space-separated, not tab-separated.
- Column values are continued across multiple lines.
Issues 1 and 2 can be fixed using skiprows
and sep
. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.
Here's the code:
import requests
import re
import io
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
.text \
.replace("\r\n", "\n")
# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)
# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))
# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)