Split a string by max characters length, word aware - but without capturing whitespaces

Time:10-14

The following regex (found elsewhere) splits a string into chunks by character length (e.g. 20 characters) while staying word-aware:

\b[\w\s]{20,}?(?=\s)|.+$

This means that if the length limit would cut a word in the middle, the whole word is taken instead, so a chunk may be longer than the given length:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b[\\w\\s]{${substringMaxLength},}?(?=\\s)|.+$`, 'g');

const substrings = str.match(regex);

console.log(substrings);

However, as can be seen when running the snippet above, every substring after the first begins with the whitespace that preceded it. Can that leading whitespace be left out, so that we end up with this?

[
  "this is an input example",
  "of one sentence that",
  "contains a bit of words",
  "and must be split"
]

I tried adding either [^\s], (?:\s), (?!\s) everywhere, but just couldn't achieve it.

How can it be done?

CodePudding user response:

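You can require that every match starts with \w, in both alternatives of your current regex. Because that \w already consumes one character, the quantifier's minimum becomes the length minus 1: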

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?=\\s)|.*$)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
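
As a rough sketch, the pattern above could be wrapped in a small helper. The name `splitWordAware` and its parameters are hypothetical, not part of the answer, and the sketch assumes the input contains only word characters and whitespace, since `[\w\s]` matches nothing else.

```javascript
// Hypothetical helper (name and signature are assumptions for illustration).
// Splits `text` into chunks of at least `minLength` characters (minLength >= 1),
// extending each chunk to the end of the word it lands in, with no leading whitespace.
function splitWordAware(text, minLength) {
  const regex = new RegExp(
    `\\b\\w(?:[\\w\\s]{${minLength - 1},}?(?=\\s)|.*$)`,
    'g'
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(regex) ?? [];
}

const sample = "this is an input example of one sentence that contains a bit of words and must be split";
console.log(splitWordAware(sample, 20));
```

For the sample sentence this yields the four substrings requested in the question.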

CodePudding user response:

This is how you can do it:

const regex = new RegExp(`\\b((?:[^\\s]+\\s?){${substringMaxLength},}?)(?=\\s)|.+$`, 'g');

The first alternative ends with a positive lookahead (?=\s), which keeps the trailing whitespace out of the match: the lookahead only checks that a whitespace character follows the group, without consuming it. The group itself starts at a word boundary (\b) and repeats [^\s]+\s?, i.e. a run of non-whitespace characters optionally followed by one whitespace character.

\b((?:[^\s]+\s?){20,}?)\b(?=\s)
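
A small sketch of why a trailing lookahead alone does not remove the leading whitespace: `(?=\s)` only checks the next character without consuming it, so that whitespace is still there when the next match begins. The sample string and variable names are made up for illustration.

```javascript
// Illustrative input; not taken from the question.
const text = "aaa bbb ccc";

// [\w\s]{3,}? may start on whitespace, and (?=\s) leaves the space unconsumed,
// so the second match starts at that space.
const withLeadingSpace = text.match(/[\w\s]{3,}?(?=\s)/g);

// Forcing the first character to be \w skips the unconsumed space.
const withoutLeadingSpace = text.match(/\w[\w\s]{2,}?(?=\s)/g);

console.log(withLeadingSpace);    // [ 'aaa', ' bbb' ]
console.log(withoutLeadingSpace); // [ 'aaa', 'bbb' ]
```

In both cases the final "ccc" is not followed by whitespace and is therefore not matched, which is why the patterns in this thread add an alternative for the rest of the string.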

CodePudding user response:

Your pattern can start with a word character, with the quantifier then using the length minus 1.

The negative lookahead (?!\S) asserts a whitespace boundary to the right: the next character must be whitespace or the end of the string.

The alternative matches the rest of the line, and also starts with a word character.

\b\w(?:[\w\s]{19,}?(?!\S)|.*)

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?!\\S)|.*)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
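
The difference between `(?=\s)` and `(?!\S)` shows up at the end of the input: the former needs an actual whitespace character, while the latter only requires that no non-whitespace character follows. A minimal sketch with a made-up input:

```javascript
// Illustrative input; not taken from the question.
const words = "foo bar";

// (?=\s) requires a whitespace character to follow, so the last word is missed.
const needsWhitespace = words.match(/\w+(?=\s)/g);

// (?!\S) also succeeds at the end of the string, so the last word is matched.
const allowsEnd = words.match(/\w+(?!\S)/g);

console.log(needsWhitespace); // [ 'foo' ]
console.log(allowsEnd);       // [ 'foo', 'bar' ]
```

In the pattern above, this means a final chunk that already reaches the requested length can be matched by the first alternative instead of falling through to `.*`.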
