load Boston dataset with pandas

Time:08-19

I'm having an issue loading the Boston dataset with pandas. It seems like it's not recognizing that each record continues onto a second line. What am I missing?
python 3.9.0
pandas 1.3.5

import pandas as pd
pd.read_csv(filepath_or_buffer="http://lib.stat.cmu.edu/datasets/boston", sep="  ", skiprows=21)


CodePudding user response:

I don't know a good way to read in a table that has its rows split across multiple lines. Here's an approach that reads in the table, flattens it into a single array of values, drops the nulls, and reshapes the result into a new table with the correct number of columns.

import pandas as pd
import numpy as np

df = pd.read_csv(
    filepath_or_buffer="http://lib.stat.cmu.edu/datasets/boston",
    delim_whitespace=True,
    skiprows=21,
    header=None,
)

columns = [
    'CRIM',
    'ZN',
    'INDUS',
    'CHAS',
    'NOX',
    'RM',
    'AGE',
    'DIS',
    'RAD',
    'TAX',
    'PTRATIO',
    'B',
    'LSTAT',
    'MEDV',
]

#Flatten all the values into a single long list and remove the nulls
values_w_nulls = df.values.flatten()
all_values = values_w_nulls[~np.isnan(values_w_nulls)]

#Reshape the values to have 14 columns and make a new df out of them
df = pd.DataFrame(
    data = all_values.reshape(-1, len(columns)),
    columns = columns,
)

df
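To see why the flatten-and-reshape step works without downloading anything, here is a minimal sketch on placeholder text in the same wrapped layout (11 values on one line, the remaining 3 on the next). The numbers 1 to 28 and the names `sample`, `raw`, and `records` are made up for illustration; they are not real Boston values.

```python
import io

import numpy as np
import pandas as pd

# Placeholder text mimicking the wrapped layout of the Boston file:
# each record is 11 values on one line and 3 on the next.
# The numbers are made up for illustration only.
sample = (
    " 1 2 3 4 5 6 7 8 9 10 11\n"
    "  12 13 14\n"
    " 15 16 17 18 19 20 21 22 23 24 25\n"
    "  26 27 28\n"
)

# Short lines are padded with NaN up to the 11 columns of the first line
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Flatten row by row, drop the NaN padding, and regroup into 14-value records
flat = raw.to_numpy().flatten()
flat = flat[~np.isnan(flat)]
records = flat.reshape(-1, 14)

print(raw.shape)      # 4 physical lines, 11 columns
print(records.shape)  # 2 logical records, 14 columns
```

Because the flattening is row-major, the values keep their original reading order, so dropping the padding is enough to line the records back up. This only holds if the data itself contains no genuine missing values.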

CodePudding user response:

Try it the way the scikit-learn documentation does:

import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
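Note that this read still leaves each record split over two rows. As far as I recall, the scikit-learn docs then stitch the rows back together with slicing: even rows hold the first 11 values, odd rows hold the last 3. Here is a rough sketch of that step on placeholder text in the same layout, so it runs offline. The `sample` string and its numbers are made up and stand in for the downloaded file.

```python
import io

import numpy as np
import pandas as pd

# Placeholder stand-in for the downloaded file (made-up numbers):
# 11 values on one line, then the remaining 3 on the next.
sample = (
    " 1 2 3 4 5 6 7 8 9 10 11\n"
    "  12 13 14\n"
    " 15 16 17 18 19 20 21 22 23 24 25\n"
    "  26 27 28\n"
)

raw_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Even rows: first 11 features. Odd rows: 2 more features, then the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

print(data.shape)    # one row per record, 13 feature columns
print(target.shape)  # one target value per record
```

With the real file, `data` would hold the 13 feature columns and `target` the MEDV column.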