I want to read the winequality-white.csv data using the pandas.read_html() function.
Here is my code:
import pandas as pd
wine = pd.DataFrame(
pd.read_html(
"https://github.com/shrikant-temburwar/Wine-Quality-Dataset/blob/master/winequality-white.csv",
thousands=";",
header=0,
)[0]
)
... but the result is:
Unnamed: 0 "fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0 NaN 7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1 NaN 6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2 NaN 8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
3 NaN 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...
4 NaN 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...
Of course I could switch to the raw file URL and use read_csv, but if I do want to read the HTML page, how can I fix this?
CodePudding user response:
Alright, here is an option using pd.read_html:
import pandas as pd
wine = pd.read_html(
"https://github.com/shrikant-temburwar/Wine-Quality-Dataset/blob/master/winequality-white.csv",
header=0
)[0]
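# Drop the empty index column that read_html produces.
wine.drop('Unnamed: 0', axis=1, inplace=True)
# The remaining column name holds the whole header line; split it into names.
headers = wine.columns[0].replace('"', '').split(';')
wine.columns = ['data']
# Split each row on ';' into separate columns (the values stay strings).
wine[headers] = wine.data.str.split(';', expand=True)
wine.drop('data', axis=1, inplace=True)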
wine.head()
The code above will result in:
>>> wine.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 6
1 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 6
2 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
>>>
But I would never exchange the simplicity of the following snippet for the above code:
import pandas as pd
wine = pd.read_csv(
'https://raw.githubusercontent.com/shrikant-temburwar/Wine-Quality-Dataset/master/winequality-white.csv',
header=0,
sep=';'
)
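If the single-column frame from read_html is all you have, another option is to glue the header line and the rows back into CSV text and hand that to read_csv. Unlike str.split, which leaves every value as a string, this also infers numeric dtypes. A minimal sketch, using a small made-up stand-in frame (three columns, three rows) instead of the real download:

```python
import io

import pandas as pd

# Hypothetical stand-in for the frame read_html returns after dropping
# 'Unnamed: 0': the column name is the raw header line and each cell
# is one raw, semicolon-separated line of the file.
raw_header = '"fixed acidity";"volatile acidity";"quality"'
raw = pd.DataFrame({raw_header: ["7;0.27;6", "6.3;0.3;6", "8.1;0.28;6"]})

# Rebuild the CSV text (header line followed by the data lines) ...
csv_text = "\n".join([raw.columns[0], *raw.iloc[:, 0]])

# ... and let read_csv handle quoting, splitting and type inference.
wine = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wine.dtypes)
```

The stand-in frame only mimics the shape shown in the question; with the real page you would build csv_text from whatever read_html actually returns.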
CodePudding user response:
It would probably be better to use GitHub's raw content address (raw.githubusercontent.com), which avoids the problems caused by the HTML page layout.
Here is what you could do:
import pandas as pd
import requests
import io
url = "https://raw.githubusercontent.com/shrikant-temburwar/Wine-Quality-Dataset/master/winequality-white.csv"
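r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
obj = io.BytesIO(r.content)  # wrap the downloaded bytes in a file-like object
wine = pd.read_csv(obj, delimiter=";")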
wine.head()
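Comparing the two URLs used in this thread, the raw address is the blob address with the host swapped and the "blob" path segment removed. A small sketch of that rewrite follows; blob_to_raw is a hypothetical helper name, not part of pandas or requests, and it assumes a branch name without slashes:

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url: str) -> str:
    """Turn a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    # Expected path layout: <user>/<repo>/blob/<branch>/<path to file>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Keep user and repo, skip the 'blob' segment, keep branch and file path.
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


blob_url = (
    "https://github.com/shrikant-temburwar/Wine-Quality-Dataset"
    "/blob/master/winequality-white.csv"
)
print(blob_to_raw(blob_url))
```

The result can then be passed to pd.read_csv with sep=";" as in the snippets above.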