I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However the data is not very weelk formated, the columns of the dataframe are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns, it seems all the data is merged into the fisrt two columns and the last two do not have any data. Any suggestion/idea on a better approach to fet this data?
EDIT The expexted result should be something like:
industry_code | SIC_code | publishing_status | industry_name |
---|---|---|---|
000000 | N/A | B | Total nonfarm 1 T 1 |
CodePudding user response:
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index()
and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining