I am trying to scrape data from the web, and while doing so unusual characters (e.g. '\r\n\r\n') are appearing in my data. The goal is to get a DataFrame containing the site data.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "https://www.hubertiming.com/results/2018MLK"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)
links = soup.find_all('a', href = True)
for link in links:
    print(link['href'])
data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
print(data)
The output I got is as follows:
[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n MAX RANDOLPH\r\n\r\n ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n 1 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 1 of 33\r\n\r\n ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n NEED NAME KAISER RUNNER\r\n\r\n ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n 2 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 2 of 33\r\n\r\n ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n DAN FRANEK\r\n\r\n ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n 3 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 1 of 27\r\n\r\n ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n PAUL TAYLOR\r\n\r\n ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n 4 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 2 of 27\r\n\r\n ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n THEO KINMAN\r\n\r\n ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n 5 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 3 of 33\r\n\r\n ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n MELISSA GIRGIS\r\n\r\n ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n 1 of 113\r\n\r\n ', 'F 21-39', '\r\n\r\n 1 of 53\r\n\r\n ', '0:07', '19:46'],...
df = pd.DataFrame(data)
print(df)
And the dataframe is as follows:
0 1 2 \
0 None None None
1 Finishers: 191 None
2 Male: 78 None
3 Female: 113 None
4 None None None
.. ... ... ...
191 187 1254 \r\n\r\n CYNTHIA HARRIS\r\n...
192 188 1085 \r\n\r\n EBONY LAWRENCE\r\n...
193 189 1170 \r\n\r\n ANTHONY WILLIAMS\r...
194 190 2087 \r\n\r\n LEESHA POSEY\r\n\r...
195 191 1216 \r\n\r\n ZULMA OCHOA\r\n\r\...
3 4 5 6 7 8 \
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
4 None None None None None None
.. ... ... ... ... ... ...
191 F 64 PORTLAND OR 21:53 1:07:51
192 F 30 PORTLAND OR 22:00 1:08:12
193 M 39 PORTLAND OR 22:19 1:09:11
194 F 43 PORTLAND OR 30:17 1:33:53
195 F 40 GRESHAM OR 33:22 1:43:27
9 10 \
0 None None
1 None None
2 None None
3 None None
4 None None
.. ... ...
191 \r\n\r\n 110 of 113\r\n\r\n... F 55
192 \r\n\r\n 111 of 113\r\n\r\n... F 21-39
193 \r\n\r\n 78 of 78\r\n\r\n ... M 21-39
194 \r\n\r\n 112 of 113\r\n\r\n... F 40-54
195 \r\n\r\n 113 of 113\r\n\r\n... F 40-54
11 12 13
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
.. ... ... ...
191 \r\n\r\n 14 of 14\r\n\r\n ... 1:19 1:09:10
192 \r\n\r\n 53 of 53\r\n\r\n ... 0:58 1:09:10
193 \r\n\r\n 33 of 33\r\n\r\n ... 0:08 1:09:19
194 \r\n\r\n 36 of 37\r\n\r\n ... 0:00 1:33:53
195 \r\n\r\n 37 of 37\r\n\r\n ... 0:00 1:43:27
[196 rows x 14 columns]
I can't seem to understand how to remove the extra characters from my data. Please advise on a way to do this.
CodePudding user response:
As also mentioned by @SergeyK, I would recommend using pandas.
It is common practice and will work in most cases (it uses lxml or bs4 under the hood), and you get your result in one line:
df = pd.read_html(url)[1]
print(df)
If you prefer to go your own way, select the rows more specifically and strip()
the cell texts, as mentioned:
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
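To see what strip() does to the cell texts without hitting the site, here is a small sketch that uses the first row copied from the output in the question. The helper name clean_cell is hypothetical and not part of bs4 or pandas. It goes one step further than strip() by also collapsing runs of whitespace inside a cell, which may or may not be wanted for your data.

```python
# First result row, copied from the output shown in the question.
raw_row = ['1', '1191', '\r\n\r\n MAX RANDOLPH\r\n\r\n ', 'M', '29',
           'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n 1 of 78\r\n\r\n ',
           'M 21-39', '\r\n\r\n 1 of 33\r\n\r\n ', '0:08', '16:56']

# str.strip() with no arguments removes leading and trailing whitespace,
# which includes '\r' and '\n' as well as spaces.
stripped_row = [cell.strip() for cell in raw_row]


def clean_cell(text):
    """Hypothetical helper: trim the ends and collapse internal whitespace runs."""
    # str.split() with no arguments splits on any whitespace and drops empty parts.
    return " ".join(text.split())


cleaned_row = [clean_cell(cell) for cell in raw_row]

print(stripped_row[2])   # MAX RANDOLPH
print(cleaned_row[9])    # 1 of 78
```

For this row both approaches give the same result, since the unwanted whitespace sits only at the start and end of each cell.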
Example
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')
data = []
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])
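If the DataFrame has already been built from the unstripped cells, as in the question, the padding could also be removed afterwards. This is only a sketch under that assumption: it uses two shortened sample rows with values copied from the question's output, and it assumes every column holds strings.

```python
import pandas as pd

# Two shortened sample rows; the values are copied from the question's output.
data = [
    ["1", "1191", "\r\n\r\n MAX RANDOLPH\r\n\r\n ", "M", "\r\n\r\n 1 of 78\r\n\r\n "],
    ["2", "1080", "\r\n\r\n NEED NAME KAISER RUNNER\r\n\r\n ", "M", "\r\n\r\n 2 of 78\r\n\r\n "],
]
df = pd.DataFrame(data)

# Apply the vectorised string method .str.strip() column by column.
# It removes leading/trailing whitespace, including '\r' and '\n'.
cleaned = df.apply(lambda col: col.str.strip())

print(cleaned)
```

Stripping while you collect the cells, as in the loop above, is usually simpler. This variant is mainly useful when the frame comes from somewhere you cannot change.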