Home > Software engineering >  How to read data from url to pandas dataframe
How to read data from url to pandas dataframe

Time:10-01

I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:

import pandas as pd

url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')

Also tried getting the separator:

import pandas as pd

url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter

df = pd.read_csv(url, sep=inferred_sep)

However the data is not very weelk formated, the columns of the dataframe are right:

>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')

But the data does not correspond to the columns, it seems all the data is merged into the fisrt two columns and the last two do not have any data. Any suggestion/idea on a better approach to fet this data?

EDIT The expexted result should be something like:

industry_code SIC_code publishing_status industry_name
000000 N/A B Total nonfarm 1 T 1

CodePudding user response:

The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:

>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list()   [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
   industry_code SIC_code publishing_status    industry_name  col0 col1  col2
0              0      NaN                 B    Total nonfarm     1    T     1
1           5000      NaN                 A    Total private     1    T     2
2           5100      NaN                 A  Goods-producing     1    T     3
3         100000    10-14                 A           Mining     2    T     4
4         101000       10                 A     Metal mining     3    T     5

You can then keep the first 4 columns if you want:

>>> df.iloc[:, :-n_missing_headers].head()
   industry_code SIC_code publishing_status    industry_name
0              0      NaN                 B    Total nonfarm
1           5000      NaN                 A    Total private
2           5100      NaN                 A  Goods-producing
3         100000    10-14                 A           Mining
4         101000       10                 A     Metal mining
  • Related