I am scraping a webpage as a fun little project while learning puppeteer, and I've run into a problem when it comes to cleaning a string to get useful data. I have come up with some easy methods of pulling out the data I want, but I'm running into edge cases that I don't know the best way to handle.
Take this string:
Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)
The data I want to get:
- Winner's name = Foo Bar
- Winner's School = (SchoolName)
- Loser's name = John Cena
- Loser's School = Fake School Name
"Round 1" and the "-" are useless to me, and they follow the same structure throughout the entire app. So just selecting the index where these items should be ought to be pretty easy.
The most important index in this string is where "over" is located. Once I find that, I can look at the indexes around it to find the rest of the information I need.
// assuming arr is the summary string split on whitespace
let over = arr.indexOf('over')
let winnerName = arr[over - 3].concat(' ', arr[over - 2])
let winnerSchool = arr[over - 1]
This works for the string above, at least for the left side. It grabs the winner's first and last name and concatenates them.
My question is: when the string doesn't look like the left side, how do I account for edge cases like the right side above, where a nickname in parentheses sits inside the name?
I could search for all the "(" and ")" pairs and capture everything inside them to get the school names, but then I would need a way to figure out which one is a school and which one is a nickname.
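To make that ambiguity concrete, here is a minimal sketch of the "grab everything in parentheses" idea. The helper name parenGroups is made up for illustration, and it assumes a runtime that supports String.prototype.matchAll.

```javascript
// Hypothetical helper: collect the text inside every (...) pair, in order.
function parenGroups(summary) {
  return [...summary.matchAll(/\(([^()]*)\)/g)].map(m => m[1])
}

const sample = 'Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)'
console.log(parenGroups(sample))
// [ 'SchoolName', 'JC', 'Fake School Name', 'Fall 1:19' ]
```

Nothing in the result says that JC is a nickname rather than a school, or that the last entry is the match result, which is exactly the problem.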
Any direction would be appreciated. I will also post some more examples in case someone else wants to take a crack at it.
This is a win over an opponent not specified.
Michael Macontish (Fairview) over Unknown (For.)
No round given
John Heflin (Arlington) over Random Kid (Mistview) (Fall 1:59)
No Fall Given
Round 2 - Logan Paul George (High School) over Dontae Inverse (Jackson County) (Dec 3-0)
CodePudding user response:
Use a regex with capture groups to get the bits that you are interested in; you might have to do some minor tidying up afterwards.
There are many patterns you could use (and I'm sure this isn't the best, but it's a start):
([\w\s\(\)]+)\s\(([\s\w\.]+)\)\sover\s([\w\s\(\)]+)\s\(([\s\w\.]+)\)
This matches the names as
- any alphanumeric, underscore, whitespace or parenthesis
- one or more occurrences
Followed by whitespace and an open parenthesis, followed by the school as
- any whitespace, alphanumeric, underscore or period
- one or more occurrences
Followed by a closed parenthesis, followed by whitespace, then the word "over", whitespace again, and then the name and school pattern repeats. Everything else is ignored.
Usage: When you use regular expressions in JavaScript, the capture groups end up as elements in the resulting array. The entire match is the first element (index 0), with each additional element representing the capture groups in order. There are 4 capture groups in this expression, so you'll end up with elements 1-4 representing name1, school1, name2 and school2.
const re = /([\w\s\(\)]+)\s\(([\s\w\.]+)\)\sover\s([\w\s\(\)]+)\s\(([\s\w\.]+)\)/
const input = [
'Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)',
'Michael Macontish (Fairview) over Unknown (For.)',
'John Heflin (Arlington) over Random Kid (Mistview) (Fall 1:59)',
'Round 2 - Logan Paul George (High School) over Dontae Inverse (Jackson County) (Dec 3-0)'
]
input.forEach( i => {
console.log(i.match(re))
})
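As a rough sketch of the "minor tidying up" step, the match array can be destructured into named fields and trimmed. The names resultRe and parseResult are made up for this example. The pattern is the same one as above.

```javascript
// Sketch only: parseResult is a hypothetical helper built around the pattern above.
const resultRe = /([\w\s\(\)]+)\s\(([\s\w\.]+)\)\sover\s([\w\s\(\)]+)\s\(([\s\w\.]+)\)/

function parseResult(line) {
  const m = line.match(resultRe)
  if (m === null) return null // line does not follow the expected shape
  // m[0] is the whole match; m[1] to m[4] are the capture groups
  const [, winner, winnerSchool, loser, loserSchool] = m
  return {
    winner: winner.trim(), // group 1 picks up the space after "Round 1 -"
    winnerSchool,
    loser: loser.trim(),
    loserSchool
  }
}

console.log(parseResult('Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)'))
// { winner: 'Foo Bar', winnerSchool: 'SchoolName', loser: 'John (JC) Cena', loserSchool: 'Fake School Name' }
```

Note that the loser's name still contains the "(JC)" nickname, so that would need a further clean-up step if you want "John Cena".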
CodePudding user response:
As an alternative pattern, you might broaden what is matched by using a non-greedy dot .*? to match any character, or a negated character class like [^()] to exclude what is not allowed to be matched.
Start the pattern by optionally matching the part at the beginning that ends with " - ".
To match the right part between parentheses at the end, you could assert with a negative lookahead that the part between the parentheses does not have - or : between digits.
(?:.*?\s+-\s+)?([^()]+)\s+\(([^()]+)\)\s+over\s+(.*)\s+\((?![^()]*\d[-:]\d[^()]*\))([^()]*)\)
The pattern matches:
- (?:.*?\s+-\s+)? Optionally match a part ending with " - "
- ([^()]+) Capture group 1, match 1 or more chars except ( and )
- \s+ Match 1 or more whitespace chars
- \(([^()]+)\) Match (, then capture 1 or more chars except ( and ) in group 2, and match )
- \s+over\s+ Match "over" between 1 or more whitespace chars
- (.*) Capture any chars in group 3
- \s+ Match 1 or more whitespace chars
- \( Match (
- (?![^()]*\d[-:]\d[^()]*\)) Negative lookahead, assert that between the parentheses there is no : or - between digits
- ([^()]*) If the assertion is true, capture any chars except ( and ) in group 4
- \) Match )
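To see the negative lookahead on its own, here is a small sketch that applies only the last part of the pattern to a few string endings. The name lastParens and the trailing $ anchor are additions for this demo, not part of the full pattern.

```javascript
// Sketch: only the trailing "(...)" part of the pattern, anchored with $
// (the anchor is added for this demo) so it tests just the last group.
const lastParens = /\((?![^()]*\d[-:]\d[^()]*\))([^()]*)\)$/

const endings = [
  'Cena (Fake School Name)',
  'Kid (Mistview) (Fall 1:59)',
  'Inverse (Jackson County) (Dec 3-0)',
  'Unknown (For.)'
]
for (const s of endings) {
  const m = s.match(lastParens)
  console.log(s, '->', m ? m[1] : 'rejected by the lookahead')
}
```

The endings with "Fall 1:59" and "Dec 3-0" are rejected because they contain a digit, then : or -, then a digit. In the full pattern this is what pushes the school match back to the previous pair of parentheses.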
const regex = /(?:.*?\s+-\s+)?([^()]+)\s+\(([^()]+)\)\s+over\s+(.*)\s+\((?![^()]*\d[-:]\d[^()]*\))([^()\n]*)\)/;
[
"Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)",
"Michael Macontish (Fairview) over Unknown (For.)",
"John Heflin (Arlington) over Random Kid (Mistview) (Fall 1:59)",
"Round 2 - Logan Paul George (High School) over Dontae Inverse (Jackson County) (Dec 3-0)"
].forEach(s => console.log(s.match(regex)));
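If numbered groups get hard to follow, one option is the same pattern with named capture groups plus a small clean-up for nicknames. This is only a sketch: the group names, stripNickname and parseSummary are made up here, and stripNickname assumes that any parenthesised part left inside a name is a nickname.

```javascript
// Sketch: same pattern as above with named groups (the group names are made up here).
const named = /(?:.*?\s+-\s+)?(?<winner>[^()]+)\s+\((?<winnerSchool>[^()]+)\)\s+over\s+(?<loser>.*)\s+\((?![^()]*\d[-:]\d[^()]*\))(?<loserSchool>[^()\n]*)\)/

// Hypothetical tidy-up: drop a parenthesised nickname such as "(JC)" and collapse spaces.
const stripNickname = name =>
  name.replace(/\s*\([^()]*\)/g, '').replace(/\s+/g, ' ').trim()

function parseSummary(line) {
  const m = line.match(named)
  if (!m) return null // line does not follow the expected shape
  const { winner, winnerSchool, loser, loserSchool } = m.groups
  return {
    winner: stripNickname(winner),
    winnerSchool,
    loser: stripNickname(loser),
    loserSchool
  }
}

console.log(parseSummary('Round 1 - Foo Bar (SchoolName) over John (JC) Cena (Fake School Name) (Fall 1:19)'))
// { winner: 'Foo Bar', winnerSchool: 'SchoolName', loser: 'John Cena', loserSchool: 'Fake School Name' }
```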