How to remove escape codes from strings after scraping a website

Time:09-30

I am learning data science with Python on Simplilearn. In the matplotlib section, they scrape a race results page with the following code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url="https://www.hubertiming.com/results/2018MLK" #OPEN LINK
html=urlopen(url)
soup=BeautifulSoup(html,"lxml")
title = soup.title
print (title)
print(title.text)
links = soup.find_all('a',href=True)
for link in links:
    print (link['href'])
data =[]
allrows=soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data=data[4:]
print(data[-2:])

And this is the result:

[['190', '2087', '\r\n\r\n                    LEESHA POSEY\r\n\r\n                ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n                    112 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    36 of 37\r\n\r\n                ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n                    ZULMA OCHOA\r\n\r\n                ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n                    113 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    37 of 37\r\n\r\n                ', '0:00', '1:43:27']]

How can I get rid of the \r\n\r\n? I already tried the "replace" function, but it fails with "'list' object has no attribute 'replace'", and I cannot use strip either.

CodePudding user response:

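You have a 2D list (a list of lists), so string methods have to be applied to each cell rather than to the list itself.

What we are leveraging:
  1. List comprehension
  2. The strip() method
  3. That's it :)

Use the code below: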

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)

Output:

[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
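
As a small sketch of why the original attempt failed: replace and strip are string methods, so calling them on the list itself raises the AttributeError quoted in the question. The sample row and the helper name clean_rows below are assumptions made for illustration; the helper also skips non-string cells, which the answer above does not need but which can occur in mixed data.

```python
# Hypothetical sample row shaped like the scraped data.
rows = [["190", "\r\n\r\n   LEESHA POSEY\r\n\r\n   ", "F"]]

# Calling a string method on the list itself fails, as in the question.
try:
    rows.replace("\r\n", "")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)


def clean_rows(table):
    """Return a new 2D list with every string cell stripped of surrounding whitespace."""
    return [
        [cell.strip() if isinstance(cell, str) else cell for cell in row]
        for row in table
    ]


cleaned = clean_rows(rows)
print(cleaned)
```

The helper returns a new list and leaves the input untouched, which makes it easy to compare the data before and after cleaning.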

CodePudding user response:

You only need one change: replace cell.text with cell.text.strip() in your code, as shown below:

...
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
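
A possible variation on the same idea, sketched under assumptions: BeautifulSoup's get_text accepts strip=True, which strips whitespace from the extracted text while reading the cell. The HTML fragment below is a made-up stand-in for the real page, and "html.parser" is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the padded cells on the results page.
html = (
    "<table>"
    "<tr><td>190</td><td>\r\n\r\n        LEESHA POSEY\r\n\r\n    </td><td>F</td></tr>"
    "<tr><td>191</td><td>\r\n\r\n        ZULMA OCHOA\r\n\r\n    </td><td>F</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from each text fragment.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]
print(rows)
```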

CodePudding user response:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
print(text)
for i in range(len(text)):
    for j in range(len(text[i])):
        text[i][j] = text[i][j].replace('\r\n', '')
print(text)

Output:

[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
[['190', '2087', ' LEESHA POSEY ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', ' 112 of 113 ', 'F 40-54', ' 36 of 37 ', '0:00', '1:33:53'], ['191', '1216', ' ZULMA OCHOA ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', ' 113 of 113 ', 'F 40-54', ' 37 of 37 ', '0:00', '1:43:27']]
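
Note that the output above still has a leading and trailing space in the affected cells, because replace('\r\n', '') removes only the line breaks and not the surrounding indentation. A minimal sketch of two ways to handle that, using a made-up sample cell:

```python
# Hypothetical sample cell shaped like the scraped data.
cell = "\r\n\r\n   112 of 113\r\n\r\n   "

# replace alone removes the line breaks but keeps the padding spaces.
only_replace = cell.replace("\r\n", "")

# Chaining strip() removes the remaining leading and trailing whitespace.
replace_then_strip = cell.replace("\r\n", "").strip()

# split() with no arguments splits on any whitespace run, so joining the
# pieces also collapses repeated whitespace inside the text.
normalized = " ".join(cell.split())

print(repr(only_replace))
print(repr(replace_then_strip))
print(repr(normalized))
```

For this data strip() on its own is enough; the split/join form is only useful when whitespace inside the cell text also needs normalizing.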

CodePudding user response:

  • This website has well-defined table tags. As such, the easiest solution is to use pandas.read_html, which scrapes all the tables into a list of dataframes.
    • If there are no table tags in the HTML, then .read_html() will not work.
  • Since this reads the tables correctly, there are no extra escape codes to strip or remove, but if that were required for a column of data, something like df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') would work.
  • This has the benefit of reducing the code to two lines, and the data will be easier to manipulate, analyze, and plot.
import pandas as pd

url = 'https://www.hubertiming.com/results/2018MLK'

# read the tables
df_list = pd.read_html(url)

# in this case the desired dataframe is at index 1
df = df_list[1]

# display(df.head())
   Place   Bib                     Name Gender   Age        City State Chip Time Chip Pace Gender Place Age Group Age Group Place Time to Start Gun Time
0      1  1191             MAX RANDOLPH      M  29.0  WASHINGTON    DC     16:48      5:25      1 of 78   M 21-39         1 of 33          0:08    16:56
1      2  1080  NEED NAME KAISER RUNNER      M  25.0    PORTLAND    OR     17:31      5:39      2 of 78   M 21-39         2 of 33          0:09    17:40
2      3  1275               DAN FRANEK      M  52.0    PORTLAND    OR     18:15      5:53      3 of 78   M 40-54         1 of 27          0:07    18:22
3      4  1223              PAUL TAYLOR      M  54.0    PORTLAND    OR     18:31      5:58      4 of 78   M 40-54         2 of 27          0:07    18:38
4      5  1245              THEO KINMAN      M  22.0         NaN   NaN     19:31      6:17      5 of 78   M 21-39         3 of 33          0:09    19:40

# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])
[out]: 
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
        '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
        '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
        '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
      dtype=object)
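
The column-level cleanup mentioned in the bullets can be sketched as follows. The small dataframe is made up for illustration and is not read from the website; it only imitates the padding seen in the scraped cells.

```python
import pandas as pd

# Hypothetical frame with the same kind of padding seen in the scraped cells.
df = pd.DataFrame(
    {
        "Name": [
            "\r\n\r\n   LEESHA POSEY\r\n\r\n   ",
            "\r\n\r\n   ZULMA OCHOA\r\n\r\n   ",
        ],
        "Gender": ["F", "F"],
    }
)

# The .str accessor applies the string method to every value in the column.
df["Name"] = df["Name"].str.strip()
print(df["Name"].tolist())
```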