How to remove patterns and whitespace from scraped data?

Time:12-11

I have scraped some data but am having difficulty removing the tags and whitespace so that each string can be iterated over and then added as a value paired with a key.

I converted the result to a single string using map and join, then removed the leading whitespace with a regular expression. The result still contained some unwanted patterns (the ... lines), and I can't iterate over the individual values because it is now one string, even though I tried converting it back to a list.

datetime_end_list = []
datetime = soup.find_all(class_="cloture-line")
for dt in datetime:
    df_text = dt.getText()
    datetime_end_list.append(df_text)
print(datetime_end_list)
'\r\n                                            17/04/202311:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            28/02/202310:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            02/02/202311:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            01/02/202309:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            30/01/202310:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            25/01/202312:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            25/01/202309:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            24/01/202312:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            24/01/202311:00\r\n                                        ', '\n...\n\n\n\n\n\n', '\r\n                                            24/01/202310:00\r\n                                        ', '\n...\n\n\n\n\n\n'

Then I started cleaning:

datetime_clean = ' '.join(map(str,datetime_end_list))
datetime_clean2 = re.sub(r'^\s+', '', datetime_clean, flags=re.MULTILINE)
print(datetime_clean2)

17/04/202311:00
...
28/02/202310:00
...
02/02/202311:00
...
01/02/202309:00
...
30/01/202310:00
...
25/01/202312:00
...
25/01/202309:00
...
24/01/202312:00
...
24/01/202311:00
...
24/01/202310:00
...

CodePudding user response:

Without additional information about the scraped elements it is hard to give an exact answer, but based on your input this should point in a direction.

To remove the whitespace and newline characters, pass strip=True to getText(). To drop the ... entries, check for them and append only the strings that do not contain them:

for dt in datetime:
    df_text = dt.getText(strip=True)
    if '...' not in df_text:
        datetime_end_list.append(df_text)

I would probably choose the selection of the elements differently, but this would require knowledge of the HTML structure.
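
If the raw strings have already been collected as in the question, the same idea can be applied afterwards with the standard library alone. The following is only a minimal sketch: the sample values are copied from the printed output in the question, the helper name clean_items and the key "closing_dates" are hypothetical, and the format DD/MM/YYYY followed directly by HH:MM is an assumption based on those samples.

```python
from datetime import datetime

# Sample values copied from the question's printed output (shortened).
raw_items = [
    '\r\n                                            17/04/202311:00\r\n                                        ',
    '\n...\n\n\n\n\n\n',
    '\r\n                                            28/02/202310:00\r\n                                        ',
    '\n...\n\n\n\n\n\n',
]


def clean_items(items):
    """Strip surrounding whitespace and drop empty or '...' placeholder entries.

    Hypothetical helper, not part of BeautifulSoup.
    """
    cleaned = []
    for item in items:
        text = item.strip()
        if text and '...' not in text:
            cleaned.append(text)
    return cleaned


cleaned = clean_items(raw_items)
print(cleaned)  # ['17/04/202311:00', '28/02/202310:00']

# Assumed format: day/month/year immediately followed by hour:minute.
parsed = [datetime.strptime(text, "%d/%m/%Y%H:%M") for text in cleaned]

# The result stays a list, so it can be iterated or stored under a key.
# "closing_dates" is a hypothetical key name.
record = {"closing_dates": parsed}
for closing in record["closing_dates"]:
    print(closing.isoformat())
```

Keeping the values in a list, rather than joining them into one string, is what makes them iterable afterwards. Note that a variable named datetime, as used in the question, would shadow this import, so a different name for the scraped elements is safer.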
