Home > Blockchain >  Reading semicolon separated data with read_html
Reading semicolon separated data with read_html

Time:12-02

I want to read winequality-white.csv data using pandas.read_html() function.

Here is my code:

import pandas as pd

wine = pd.DataFrame(
    pd.read_html(
        "https://github.com/shrikant-temburwar/Wine-Quality-Dataset/blob/master/winequality-white.csv",
        thousands=";",
        header=0,
    )[0]
)

... but the result is:

Unnamed: 0 "fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
    0         NaN   7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6                                                                                                                       
    1         NaN  6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...                                                                                                                       
    2         NaN  8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...                                                                                                                       
    3         NaN  7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...                                                                                                                       
    4         NaN  7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...                                                                                                                       

Of course I can choose raw and then use read_csv, but in case of html reading, how can I fix it?

CodePudding user response:

Alright, here is an option using pd.read_html:

import pandas as pd

wine = pd.read_html(
    "https://github.com/shrikant-temburwar/Wine-Quality-Dataset/blob/master/winequality-white.csv",
    header=0
)[0]

wine.drop('Unnamed: 0', axis=1, inplace=True)
headers = wine.columns[0].replace('"', '').split(';')
wine.columns = ['data']
wine[headers] = wine.data.str.split(';', expand=True)
wine.drop('data', axis=1, inplace=True)
wine.head()

The code above will result in:

>>> wine.head()
  fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density    pH sulphates alcohol quality
0             7             0.27        0.36           20.7     0.045                  45                  170   1.001     3      0.45     8.8       6
1           6.3              0.3        0.34            1.6     0.049                  14                  132   0.994   3.3      0.49     9.5       6
2           8.1             0.28         0.4            6.9      0.05                  30                   97  0.9951  3.26      0.44    10.1       6
3           7.2             0.23        0.32            8.5     0.058                  47                  186  0.9956  3.19       0.4     9.9       6
4           7.2             0.23        0.32            8.5     0.058                  47                  186  0.9956  3.19       0.4     9.9       6
>>> 

But I would never exchange the simplicity of the following snippet for the above code:

import pandas as pd

wine = pd.read_csv(
    'https://raw.githubusercontent.com/shrikant-temburwar/Wine-Quality-Dataset/master/winequality-white.csv',
    header=0,
    sep=';'
)

CodePudding user response:

you could probably better to use the rawdatacontent address of github to remove the problem due to different html interface.

here is what you could do

import pandas as pd
import requests
import io
url = "https://raw.githubusercontent.com/shrikant-temburwar/Wine-Quality-Dataset/master/winequality-white.csv"
r = requests.get(url)
obj = io.BytesIO(r.content)
wine = pd.read_csv(obj, delimiter=";")
wine.head()
  • Related