I am learning data science with Python on Simplilearn. In the matplotlib section, they do web scraping from the URL in the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "https://www.hubertiming.com/results/2018MLK"  # open link
html = urlopen(url)  # the variable is lowercase `url`; `URL` would raise NameError
soup=BeautifulSoup(html,"lxml")
title = soup.title
print (title)
print(title.text)
links = soup.find_all('a',href=True)
for link in links:
    print(link['href'])
data =[]
allrows=soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data=data[4:]
print(data[-2:])
And this is the result:
[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
How can I get rid of the \r\n\r\n? I already tried the "replace" function, but it says "'list' object has no attribute 'replace'", and I cannot use strip either.
CodePudding user response:
You have a 2D list (a list of lists), so string methods have to be applied to each cell, not to the list itself.
What we are leveraging:
- List comprehension
- The strip() method
That's it :)
Use the code below:
text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)
Output:
[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
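As a side note on the error from the question: replace() and strip() are string methods, so calling them on a row or on the whole table (both lists) raises AttributeError. Below is a minimal sketch that reproduces the error and wraps the comprehension above in a helper. The name strip_cells and the shortened sample rows are only for illustration and are not from the original post.

```python
def strip_cells(rows):
    """Return a new 2D list with leading/trailing whitespace removed from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


# Shortened sample rows, taken from the output in the question
rows = [['190', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F'],
        ['191', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F']]

# Calling a string method on the list itself fails, as in the question
error_message = None
try:
    rows.replace('\r\n', '')
except AttributeError as exc:
    error_message = str(exc)

cleaned = strip_cells(rows)
print(error_message)
print(cleaned)
```

The helper builds a new list and leaves the original rows untouched, which can be handy if you still want to inspect the raw scraped values.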
CodePudding user response:
You only need to change cell.text to cell.text.strip() in your code, like below:
...
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
CodePudding user response:
text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
print(text)
for i in range(len(text)):
    for j in range(len(text[i])):
        text[i][j] = text[i][j].replace('\r\n', '')
print(text)
Output:
[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
[['190', '2087', ' LEESHA POSEY ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', ' 112 of 113 ', 'F 40-54', ' 36 of 37 ', '0:00', '1:33:53'], ['191', '1216', ' ZULMA OCHOA ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', ' 113 of 113 ', 'F 40-54', ' 37 of 37 ', '0:00', '1:43:27']]
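Note that the second list printed above still has padding spaces around some values, because replace('\r\n', '') only removes the line breaks. A small sketch, assuming you want the spaces gone as well: either chain strip() after replace(), or normalise all whitespace with split() and join(). The helper name clean_cell and the shortened sample row are hypothetical and only for illustration.

```python
def clean_cell(value):
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(value.split())


# Shortened sample row, taken from the output in the question
text = [['190', '\r\n\r\n LEESHA POSEY\r\n\r\n ', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54']]

# Option 1: remove the line breaks, then trim the leftover spaces
chained = [[cell.replace('\r\n', '').strip() for cell in row] for row in text]

# Option 2: normalise all whitespace in one step
normalised = [[clean_cell(cell) for cell in row] for row in text]

print(chained)
print(normalised)
```

Option 2 also collapses repeated spaces inside a value, which may or may not be what you want for names.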
CodePudding user response:
- This website has well-defined table tags. As such, the easiest solution is to use pandas.read_html, which will scrape all the tables into a list of dataframes.
  - If there are no table tags in the html, then .read_html() will not work.
- Since this reads the tables correctly, there are no extra escape codes to strip or remove, but if that were required for a column of data, something like df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') would work.
- This has the benefit of reducing the code to two lines, and the data will be easier to manipulate, analyze, and plot.
import pandas as pd
url = 'https://www.hubertiming.com/results/2018MLK'
# read the tables
df_list = pd.read_html(url)
# in this case the desired dataframe is at index 1
df = df_list[1]
# display(df.head())
   Place  Bib   Name                     Gender  Age   City        State  Chip Time  Chip Pace  Gender Place  Age Group  Age Group Place  Time to Start  Gun Time
0  1      1191  MAX RANDOLPH             M       29.0  WASHINGTON  DC     16:48      5:25       1 of 78       M 21-39    1 of 33          0:08           16:56
1  2      1080  NEED NAME KAISER RUNNER  M       25.0  PORTLAND    OR     17:31      5:39       2 of 78       M 21-39    2 of 33          0:09           17:40
2  3      1275  DAN FRANEK               M       52.0  PORTLAND    OR     18:15      5:53       3 of 78       M 40-54    1 of 27          0:07           18:22
3  4      1223  PAUL TAYLOR              M       54.0  PORTLAND    OR     18:31      5:58       4 of 78       M 40-54    2 of 27          0:07           18:38
4  5      1245  THEO KINMAN              M       22.0  NaN         NaN    19:31      6:17       5 of 78       M 21-39    3 of 33          0:09           19:40
# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])
[out]:
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
        '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
        '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
        '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
      dtype=object)
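For the case mentioned in the bullets above, where a column does still contain escape codes, here is a small sketch of the .str.strip() approach. The dataframe is built by hand for illustration (it is not read from the site), and the sample values are the raw strings from the question's output.

```python
import pandas as pd

# Hand-built dataframe standing in for a scraped table with padded names
df = pd.DataFrame({
    "Bib": [2087, 1216],
    "Name": ["\r\n\r\n LEESHA POSEY\r\n\r\n ", "\r\n\r\n ZULMA OCHOA\r\n\r\n "],
})

# .str applies the string method to every value in the column
df["Name"] = df["Name"].str.strip()

names = df["Name"].tolist()
print(names)
```

Bracket access (df["Name"]) is used here instead of df.Name; both work for this column, but bracket access is also safe for column names that contain spaces, such as "Chip Time".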