Regex match apostrophe inside, but not around words, inside a character set

Time:12-02

I'm counting how many times different words appear in a text using Regular Expressions in JavaScript. My problem is when I have quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.

(?<=\w)(')(?=\w)

This regex identifies apostrophes inside, but not around, words. The problem is that I can't use it inside a character set such as [\w].

(?<=\w)(')(?=\w)|[\w]+

This counts it's a 'miracle' of nature as 7 words instead of 5 (it, ' and s become 3 different words). Also, the third word should be matched simply as miracle, not as 'miracle'.

To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.

How can I accomplish that?

CodePudding user response:

1) You can simply use the regex /[^\s]+/g, which matches each run of non-whitespace characters


const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g);

console.log(result.length);
console.log(result);

2) If you only need the total number of words in a string, you can also use split:

const str = `it's a 'miracle' of nature`;
const result = str.split(/\s+/);

console.log(result.length);
console.log(result);
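
One caveat with split: leading or trailing whitespace produces empty strings in the result, and an empty input still yields one element. A minimal sketch that guards against both, assuming words are separated only by whitespace (countBySplit is a hypothetical helper name, not part of the answer above):

```javascript
// Hypothetical helper: counts whitespace-separated tokens.
function countBySplit(text) {
  // Trim first so leading/trailing whitespace does not create empty tokens.
  const trimmed = text.trim();
  // "".split(/\s+/) returns [""], which would otherwise be counted as 1 word.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(countBySplit("it's a 'miracle' of nature"));    // 5
console.log(countBySplit("  it's a 'miracle' of nature ")); // 5
console.log(countBySplit("   "));                           // 0
```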

3) If you want each word without a quote at the start and at the end, you can do:

const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g).map((s) => {
  s = s[0] === "'" ? s.slice(1) : s;
  s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
  return s;
});

console.log(result.length);
console.log(result);
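
Since the question is about counting how often each word appears, the same idea can feed a frequency table. This is only a sketch: wordFrequencies is a hypothetical name, and unlike the snippet above it strips every leading and trailing apostrophe with a regex rather than just one on each side.

```javascript
// Hypothetical helper: maps each word to the number of times it occurs.
function wordFrequencies(text) {
  const counts = new Map();
  // match() returns null when nothing matches, so fall back to an empty array.
  for (const token of text.match(/[^\s]+/g) ?? []) {
    // Drop apostrophes around the token; apostrophes inside it are kept.
    const word = token.replace(/^'+|'+$/g, "");
    if (word === "") continue; // the token consisted only of apostrophes
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

const freq = wordFrequencies("it's a 'word' and a word");
console.log(freq.get("word")); // 2
console.log(freq.get("it's")); // 1
console.log(freq.get("a"));    // 2
```

The comparison is case-sensitive as written; lowercasing each word before counting would be a reasonable extension if Word and word should be merged.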

CodePudding user response:

You might use an alternation with 2 capture groups, and then check for the values of those groups.

(?<!\S)'(\S+)'(?!\S)|(\S+)
  • (?<!\S)' Negative lookbehind: assert a whitespace boundary to the left, then match '
  • (\S+) Capture group 1: match 1+ non-whitespace characters
  • '(?!\S) Match ' and assert a whitespace boundary to the right
  • | Or
  • (\S+) Capture group 2: match 1+ non-whitespace characters

const regex = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
const s = "it's a 'miracle' of nature";

Array.from(s.matchAll(regex), m => {
  if (m[1]) console.log(m[1])
  if (m[2]) console.log(m[2])
});
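
The patterns above match any non-whitespace run, so punctuation such as a trailing period stays attached to the word. A possible alternative, sketched here under the assumption that the question's letter class [A-Za-zÀ-ÖØ-öø-ÿ] covers every character of interest and that accented letters are precomposed, is to match runs of letters joined by single inner apostrophes. The names LETTER, wordPattern and extractWords are hypothetical.

```javascript
// Letter ranges taken from the question (Latin letters plus Latin-1 diacritics).
const LETTER = "A-Za-zÀ-ÖØ-öø-ÿ";

// One or more letters, optionally followed by groups of "apostrophe + letters".
// An apostrophe is consumed only when letters follow it, so surrounding
// quotes are never part of a match and no lookbehind is required.
const wordPattern = new RegExp(`[${LETTER}]+(?:'[${LETTER}]+)*`, "g");

function extractWords(text) {
  return text.match(wordPattern) ?? [];
}

console.log(extractWords("it's a 'miracle' of nature"));
// [ "it's", 'a', 'miracle', 'of', 'nature' ]
console.log(extractWords("'café' isn't 'naïve'."));
// [ 'café', "isn't", 'naïve' ]
```

A trade-off of this sketch is that a trailing possessive apostrophe, as in dogs', is dropped, because an apostrophe with no letter after it is treated as a quote.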