I'm counting how many times different words appear in a text using regular expressions in JavaScript. My problem is with quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.
(?<=\w)(')(?=\w)
This regex can identify apostrophes inside words, but not around them. The problem is that I can't use it inside a character set such as [\w].
(?<=\w)(')(?=\w)|[\w]+
This counts "it's a 'miracle' of nature" as 7 words instead of 5 (it, ' and s become 3 different words). Also, the third word should be selected simply as miracle, and not as 'miracle'.
To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.
How can I accomplish that?
CodePudding user response:
1) You can simply use the regex /[^\s]+/g, which matches each run of non-whitespace characters:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g);
console.log(result.length);
console.log(result);
2) If you only need the total number of words in a string, you can also use split:
const str = `it's a 'miracle' of nature`;
const result = str.split(/\s+/);
console.log(result.length);
console.log(result);
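One caveat with split: when the string has leading or trailing whitespace, split(/\s+/) returns empty strings at the ends, and an empty string splits into [""], so the count comes out too high. A minimal sketch that trims first, assuming that is acceptable for your input (countWords is a hypothetical helper name, not part of the answer above):

```javascript
// Hypothetical helper: count whitespace-separated tokens, ignoring
// leading/trailing whitespace so that no empty strings are counted.
function countWords(text) {
  const trimmed = text.trim();
  // "".split(/\s+/) yields [""], which would be counted as one word.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWords("  it's a 'miracle' of nature  ")); // 5
console.log(countWords("   ")); // 0
```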
3) If you want each word without a quote at the start and at the end, you can do:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g).map((s) => {
s = s[0] === "'" ? s.slice(1) : s;
s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
return s;
});
console.log(result.length);
console.log(result);
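Since the question is about counting how often each word appears, the quote-stripping step can feed a frequency map. This is only a sketch: wordFrequencies is a hypothetical name, and lowercasing the words is an assumption about how you want to group them. Punctuation other than surrounding single quotes is left attached to the token.

```javascript
// Hypothetical helper: build a word -> count Map, stripping one leading
// and one trailing single quote from each whitespace-separated token.
function wordFrequencies(text) {
  const counts = new Map();
  const tokens = text.match(/\S+/g) ?? []; // match() returns null when nothing matches
  for (const token of tokens) {
    // Assumption: words are compared case-insensitively.
    const word = token.replace(/^'|'$/g, "").toLowerCase();
    if (word === "") continue; // skip a token that was only a quote character
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

const freq = wordFrequencies("a 'miracle' is a miracle and it's rare");
console.log(freq.get("miracle")); // 2
console.log(freq.get("it's")); // 1
```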
CodePudding user response:
You might use an alternation with 2 capture groups, and then check for the values of those groups.
(?<!\S)'(\S+)'(?!\S)|(\S+)
The pattern in parts:
(?<!\S)' : negative lookbehind, assert a whitespace boundary to the left and match '
(\S+) : capture group 1, match 1+ non-whitespace chars
'(?!\S) : match ' and assert a whitespace boundary to the right
| : or
(\S+) : capture group 2, match 1+ non-whitespace chars
See a regex demo.
const regex = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
const s = "it's a 'miracle' of nature";
for (const m of s.matchAll(regex)) {
  // Group 1 holds a quoted word without its quotes; group 2 holds any other token.
  console.log(m[1] ?? m[2]);
}
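The question also asks for diacritics. One possible sketch, assuming the question's character class [A-Za-zÀ-ÖØ-öø-ÿ] defines what counts as a letter: match runs of letters that may be joined by internal apostrophes, so quotes around a word never become part of a match. The names LETTER, WORD and extractWords are hypothetical and not taken from either answer.

```javascript
// Assumption: a "letter" is whatever the question's character class accepts.
const LETTER = "[A-Za-zÀ-ÖØ-öø-ÿ]";

// One or more letters, optionally followed by groups of
// (apostrophe + one or more letters), e.g. it's or l'été.
const WORD = new RegExp(`${LETTER}+(?:'${LETTER}+)*`, "g");

// Hypothetical helper: return all words, without surrounding quotes.
function extractWords(text) {
  return text.match(WORD) ?? []; // match() returns null when nothing matches
}

console.log(extractWords("it's a 'miracle' of nature"));
// → ["it's", "a", "miracle", "of", "nature"]
console.log(extractWords("'déjà' vu, l'été"));
// → ["déjà", "vu", "l'été"]
```

Because an apostrophe is only matched when letters follow it, a trailing apostrophe as in dogs' is dropped, and digits or other punctuation split words. Whether that is acceptable depends on the text being counted.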