I am scraping a website and need to remove all the \n and \t characters from my strings.
I have tried the following code:
item.post_category = [];
Array.from($doc.find('h6.link')).forEach(function(link){
  console.log(link.textContent.replace(/\t \n /gm, ""));
  item.post_category.push(link.textContent);
});
//this removes the linebreaks but not the tabs
Here are several sample arrays I have to iterate over:
["\n\t\t\t\t\tJune 15, 2021 • \n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\tFamily,\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\tGender Equality,\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\tIn the News\n\t\t\t\t"]
["\n\t\t\t\t\tJune 13, 2020 • \n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\tIn the News\n\t\t\t\t"]
["\n\t\t\t\t\tJuly 5, 2021 • \n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\tNews\n\t\t\t\t"]
Ideally, I want my arrays to look like this, with the date as well as the \n and \t characters removed:
["Family,Gender Equality,In the News"]
["In the News"]
["News"]
CodePudding user response:
There are many ways to do this; you could use a regex or split, depending on your needs.
Here is one possible solution:
let str = "\n\t\t\t\t\tJune 15, 2021 • \n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\tFamily,\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\tGender Equality,\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\tIn the News\n\t\t\t\t"
// Remove all newlines and tabs with a regex. Add '\r' to the character class
// if the input may contain Windows line endings.
str = str.replace(/[\n\t]/g, '');
// Here we assume that your string always contains
// the date followed by this character: •.
// So we split on that character and take the second item
// of the resulting array, which is the text without the date.
let result = str.split('•')[1].trim();
console.log(result) // prints 'Family,Gender Equality,In the News'
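To apply the same idea to every scraped string rather than a single one, the two steps can be wrapped in a small function. This is only a sketch: `cleanCategory` is a hypothetical helper name, the sample strings are shortened versions of the ones in the question, and it assumes the date (when present) always ends with the • character. Unlike `split('•')[1]`, it falls back to the whole string when no • is found.

```javascript
// Hypothetical helper (not part of the original answer): strips newlines,
// tabs, and carriage returns, then drops everything up to and including "•".
function cleanCategory(text) {
  // Remove every newline, tab, and carriage return.
  const flat = text.replace(/[\n\t\r]/g, '');
  // Keep only what follows the bullet; use the whole string if it is absent.
  const idx = flat.indexOf('•');
  const withoutDate = idx === -1 ? flat : flat.slice(idx + 1);
  return withoutDate.trim();
}

// Shortened stand-ins for the sample strings in the question.
const samples = [
  "\n\t\tJune 15, 2021 • \n\t\t\tFamily,\n\t\t\n\t\t\tGender Equality,\n\t\tIn the News\n\t",
  "\n\tJune 13, 2020 • \n\t\t\n\tIn the News\n\t",
  "\n\tJuly 5, 2021 • \n\t\t\n\tNews\n\t"
];

const cleaned = samples.map(cleanCategory);
console.log(cleaned);
```

Inside the loop from the question, this would presumably be used as `item.post_category.push(cleanCategory(link.textContent));`, so that the cleaned value is what gets stored rather than only logged.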