How to remove '\r\n\r\n' characters from a list containing various strings while web scraping


I am trying to scrape data from the web, and while doing so, unusual characters (i.e. '\r\n\r\n') appear in my data. The goal is to get a dataframe containing the site data.

This is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.hubertiming.com/results/2018MLK"  
html = urlopen(url)    

soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)

links = soup.find_all('a', href = True)
for link in links:
    print(link['href'])

data = []
allrows = soup.find_all("tr")          # every table row on the page
for row in allrows:
    row_list = row.find_all("td")      # all cells in this row
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)      # raw text, still contains \r\n and padding
    data.append(dataRow)
    
print(data)

The output I got is as follows:

[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n                    MAX RANDOLPH\r\n\r\n                ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n                    1 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    1 of 33\r\n\r\n                ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n                    NEED NAME KAISER RUNNER\r\n\r\n                ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n                    2 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    2 of 33\r\n\r\n                ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n                    DAN FRANEK\r\n\r\n                ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n                    3 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    1 of 27\r\n\r\n                ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n                    PAUL TAYLOR\r\n\r\n                ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n                    4 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    2 of 27\r\n\r\n                ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n                    THEO KINMAN\r\n\r\n                ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n                    5 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    3 of 33\r\n\r\n                ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n                    MELISSA GIRGIS\r\n\r\n                ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n                    1 of 113\r\n\r\n                ', 'F 21-39', '\r\n\r\n                    1 of 53\r\n\r\n                ', '0:07', '19:46'],...

df = pd.DataFrame(data)
print(df)

And the dataframe is as follows:

              0     1                                                  2  \
0          None  None                                               None   
1    Finishers:   191                                               None   
2         Male:    78                                               None   
3       Female:   113                                               None   
4          None  None                                               None   
..          ...   ...                                                ...   
191         187  1254  \r\n\r\n                    CYNTHIA HARRIS\r\n...   
192         188  1085  \r\n\r\n                    EBONY LAWRENCE\r\n...   
193         189  1170  \r\n\r\n                    ANTHONY WILLIAMS\r...   
194         190  2087  \r\n\r\n                    LEESHA POSEY\r\n\r...   
195         191  1216  \r\n\r\n                    ZULMA OCHOA\r\n\r\...   

        3     4         5     6      7        8  \
0    None  None      None  None   None     None   
1    None  None      None  None   None     None   
2    None  None      None  None   None     None   
3    None  None      None  None   None     None   
4    None  None      None  None   None     None   
..    ...   ...       ...   ...    ...      ...   
191     F    64  PORTLAND    OR  21:53  1:07:51   
192     F    30  PORTLAND    OR  22:00  1:08:12   
193     M    39  PORTLAND    OR  22:19  1:09:11   
194     F    43  PORTLAND    OR  30:17  1:33:53   
195     F    40   GRESHAM    OR  33:22  1:43:27   

                                                     9       10  \
0                                                 None     None   
1                                                 None     None   
2                                                 None     None   
3                                                 None     None   
4                                                 None     None   
..                                                 ...      ...   
191  \r\n\r\n                    110 of 113\r\n\r\n...    F 55    
192  \r\n\r\n                    111 of 113\r\n\r\n...  F 21-39   
193  \r\n\r\n                    78 of 78\r\n\r\n  ...  M 21-39   
194  \r\n\r\n                    112 of 113\r\n\r\n...  F 40-54   
195  \r\n\r\n                    113 of 113\r\n\r\n...  F 40-54   

                                                    11    12       13  
0                                                 None  None     None  
1                                                 None  None     None  
2                                                 None  None     None  
3                                                 None  None     None  
4                                                 None  None     None  
..                                                 ...   ...      ...  
191  \r\n\r\n                    14 of 14\r\n\r\n  ...  1:19  1:09:10  
192  \r\n\r\n                    53 of 53\r\n\r\n  ...  0:58  1:09:10  
193  \r\n\r\n                    33 of 33\r\n\r\n  ...  0:08  1:09:19  
194  \r\n\r\n                    36 of 37\r\n\r\n  ...  0:00  1:33:53  
195  \r\n\r\n                    37 of 37\r\n\r\n  ...  0:00  1:43:27  

[196 rows x 14 columns]

I can't figure out how to remove the extra characters from my data. Please advise a way to do this.

CodePudding user response:

As also mentioned by @SergeyK, I would recommend using pandas.read_html() — it is common practice, works in most cases (bs4 under the hood), and you get your result in one line:

df = pd.read_html(url)[1]
print(df)
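For completeness, a short sketch of that route — the index 1 assumes the individual results are the second table on the page, and the replace() call is only a safety net in case any whitespace survives the parse:

import pandas as pd

url = "https://www.hubertiming.com/results/2018MLK"
tables = pd.read_html(url)                  # one DataFrame per <table> on the page
df = tables[1]                              # assumed: the individual-results table
df = df.replace(r"\s+", " ", regex=True)    # collapse any leftover \r\n runs
print(df.head())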

If you prefer to stick with your approach, select more specifically and strip() the texts as mentioned:

for row in soup.select('#individualResults tr:has(td)'):   # only data rows of the results table
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())                   # strip() drops the \r\n padding
    data.append(dataRow)
Example
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')
data = []

for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
    
pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])
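If you would rather keep the data you have already scraped, you can also normalise the strings after the fact. A minimal sketch, assuming data is the nested list built by your original loop: strip() removes the leading/trailing padding and the regex collapses the internal \r\n runs to a single space.

import re
import pandas as pd

cleaned = [
    [re.sub(r"\s+", " ", cell).strip() for cell in row]   # collapse \r\n runs and padding
    for row in data
]
df = pd.DataFrame(cleaned)
print(df.head())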