How to get rid of \r\n in Python after web scraping

Time:09-29

I am learning data science with Python on Simplilearn. In the matplotlib section they scrape the page whose URL appears in the code below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url="https://www.hubertiming.com/results/2018MLK" #OPEN LINK
html=urlopen(url)
soup=BeautifulSoup(html,"lxml")
title = soup.title
print (title)
print(title.text)
links = soup.find_all('a',href=True)
for link in links:
    print (link['href'])
data =[]
allrows=soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data=data[4:]
print(data[-2:])

And this is the result:

[['190', '2087', '\r\n\r\n                    LEESHA POSEY\r\n\r\n                ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n                    112 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    36 of 37\r\n\r\n                ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n                    ZULMA OCHOA\r\n\r\n                ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n                    113 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    37 of 37\r\n\r\n                ', '0:00', '1:43:27']]

How can I get rid of the \r\n\r\n? I already tried the replace function, but it fails with "'list' object has no attribute 'replace'", and I cannot use strip either.

CodePudding user response:

You have a 2D list (a list of lists), so string methods must be applied to each cell rather than to the list itself.

What we are leveraging:
  1. List comprehension
  2. The strip() method
  3. That's it :)

Use the code below:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)

Output:

[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
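
To connect this back to the error in the question, here is a minimal sketch of why calling replace on the list fails and how the per-cell approach avoids it. The shortened rows and the helper name clean_rows are illustrative assumptions, not part of the original code.

```python
# Shortened sample rows in the shape of the question's output (illustrative).
data = [
    ['190', '2087', '\r\n\r\n                    LEESHA POSEY\r\n\r\n                ', 'F'],
    ['191', '1216', '\r\n\r\n                    ZULMA OCHOA\r\n\r\n                ', 'F'],
]

# String methods live on str objects, not on lists, so this raises AttributeError.
try:
    data.replace('\r\n', '')
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)


def clean_rows(rows):
    """Hypothetical helper: return new rows with every cell stripped of
    leading and trailing whitespace (spaces, \\r, \\n, tabs)."""
    return [[cell.strip() for cell in row] for row in rows]


cleaned = clean_rows(data)
print(cleaned)
```

Because strip() with no arguments removes all leading and trailing whitespace, it takes care of the \r\n sequences and the padding spaces in one call, and it leaves the space inside a name untouched.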

CodePudding user response:

You only need one change: replace cell.text with cell.text.strip() in your code, like below:

...
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
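
As a possible variation on the change above, Beautiful Soup's get_text accepts strip=True, which strips the text while it is extracted. The sketch below runs on a small hypothetical HTML fragment instead of the live page, and uses the built-in "html.parser" so that it does not depend on lxml; both choices are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the whitespace-padded cells from the question.
html = (
    "<table>"
    "<tr><th>Place</th><th>Name</th></tr>"
    "<tr><td>190</td><td>\r\n\r\n        LEESHA POSEY\r\n\r\n    </td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # strip=True removes leading/trailing whitespace from the extracted text.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # header rows hold <th> cells, so they produce an empty list
        rows.append(cells)

print(rows)
```

Skipping rows that have no td cells is one way to avoid slicing the list by a fixed offset afterwards, though whether that fits depends on how the real page is laid out.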

CodePudding user response:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
print(text)
for i in range(len(text)):
    for j in range(len(text[i])):
        text[i][j] = text[i][j].replace('\r\n', '')
print(text)

Output:

[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
[['190', '2087', ' LEESHA POSEY ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', ' 112 of 113 ', 'F 40-54', ' 36 of 37 ', '0:00', '1:33:53'], ['191', '1216', ' ZULMA OCHOA ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', ' 113 of 113 ', 'F 40-54', ' 37 of 37 ', '0:00', '1:43:27']]
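
Note that in the output above, replace removes the \r\n pairs but leaves the padding spaces around each value. A short sketch of two ways to deal with that, using a single sample cell (the variable names are illustrative):

```python
cell = '\r\n\r\n LEESHA POSEY\r\n\r\n '

# replace alone removes the line breaks but keeps the surrounding spaces.
only_replace = cell.replace('\r\n', '')

# Chaining strip() removes the leftover leading and trailing spaces.
replace_then_strip = cell.replace('\r\n', '').strip()

# split() with no arguments splits on any run of whitespace, so joining the
# pieces with a single space also collapses whitespace inside the value.
normalized = ' '.join(cell.split())

print(repr(only_replace))
print(repr(replace_then_strip))
print(repr(normalized))
```

The split/join form changes internal spacing as well, so it is only a good fit when runs of whitespace inside a cell carry no meaning.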

CodePudding user response:

  • This website has well-defined table tags. As such, the easiest solution is to use pandas.read_html, which scrapes all the tables into a list of dataframes.
    • If there are no table tags in the HTML, then .read_html() will not work.
  • Since this reads the tables correctly, there is no extra text to strip or remove, but if that were required for a column of data, something like df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') would work.
  • This has the benefit of reducing your code to two lines.
import pandas as pd

url = 'https://www.hubertiming.com/results/2018MLK'

# read the tables
df_list = pd.read_html(url)

# in this case the desired dataframe is at index 1
df = df_list[1]

# display(df.head())
   Place   Bib                     Name Gender   Age        City State Chip Time Chip Pace Gender Place Age Group Age Group Place Time to Start Gun Time
0      1  1191             MAX RANDOLPH      M  29.0  WASHINGTON    DC     16:48      5:25      1 of 78   M 21-39         1 of 33          0:08    16:56
1      2  1080  NEED NAME KAISER RUNNER      M  25.0    PORTLAND    OR     17:31      5:39      2 of 78   M 21-39         2 of 33          0:09    17:40
2      3  1275               DAN FRANEK      M  52.0    PORTLAND    OR     18:15      5:53      3 of 78   M 40-54         1 of 27          0:07    18:22
3      4  1223              PAUL TAYLOR      M  54.0    PORTLAND    OR     18:31      5:58      4 of 78   M 40-54         2 of 27          0:07    18:38
4      5  1245              THEO KINMAN      M  22.0         NaN   NaN     19:31      6:17      5 of 78   M 21-39         3 of 33          0:09    19:40
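
For the case mentioned above where a column does still contain stray whitespace, here is a minimal sketch of the .str.strip() cleanup. It uses a small hand-built dataframe rather than the scraped one, so the values are assumptions chosen to mimic the question's data.

```python
import pandas as pd

# Hypothetical frame with padded names, mimicking the question's raw cells.
df = pd.DataFrame({
    "Name": ["\r\n\r\n   LEESHA POSEY\r\n\r\n  ", "\r\n ZULMA OCHOA \r\n"],
    "Age": [43, 40],
})

# The .str accessor applies the string method to every value in the column.
df["Name"] = df["Name"].str.strip()

print(df["Name"].tolist())
```

Only string columns need this treatment; numeric columns such as Age are left alone.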