How to extract symbols, numbers and words from a string and store each into an accordingly categorized array

How does one extract symbols, numbers, words with maximum 3 and words with at least 4 letters from a string and store each into an accordingly categorized array?

The given string is:

const string = 'There are usually 100 to 200 words   in a paragraph';

The expected response is:

const numbers = ['200', '100'];

const wordsMoreThanThreeLetters = ['There', 'words ', 'paragraph', 'usually'];

const symbols = [' '];

const words = ['are', 'to', 'in', 'a'];

CodePudding user response：

A valid approach would be to split the string at any whitespace sequence and then to run a reduce method on the array returned by split.

The reducer function will be implemented in a way that it collects and aggregates the string items (tokens) within specific arrays according to the OP's categories, supported by helper methods for e.g. digit and word tests ...

function collectWordsDigitsAndRest(collector, token) {
  // Helper tests; each one checks the value it is passed.
  const isDigitsOnly = value => (/^\d+$/).test(value);
  const isWord = value => (/^\w+$/).test(value);

  const listName = isDigitsOnly(token)
    ? 'digits'
    : (
        isWord(token)
        ? (token.length <= 3) && 'shortWords' || 'longWords'
        : 'rest'
    );
  (collector[listName] ??= []).push(token);

  return collector;
}
const {

  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],

} = 'There are usually 100 to 200 words   in a paragraph'

  .split(/\s+/)
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});

Of course one could also matchAll the required tokens with a single regular expression / RegExp which features named capturing groups and also uses Unicode escapes in order to achieve better internationalization (i18n) coverage.

The regex itself would look and work like this ...

(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)

... derived from ...

(?:\b(?<digit>\p{N}+)|(?<word>\p{L}+)\b)|(?<rest>[^\p{Z}]+)

The reducer function of the first approach has to be adapted to this second approach in order to process each captured group accordingly ...

function collectWordsDigitsAndRest(collector, { groups }) {
  const { shortWord, longWord, digit, rest } = groups;

  const listName = (shortWord
    && 'shortWords') || (longWord
    && 'longWords') || (digit
    && 'digits') || (rest
    && 'rest');

  if (listName) {
    (collector[listName] ??= []).push(shortWord || longWord || digit || rest);
  }
  return collector;
}

// Unicode Categories ... [https://www.regularexpressions.info/unicode.html#category]
// regex101.com ... [https://regex101.com/r/nCga5u/2]
const regXWordDigitRestTokens =
  /(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)/gmu;

const {

  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],

} = Array
  .from(
    'There are usually 100 to 200 words    -- ** in a paragraph.'
    .matchAll(regXWordDigitRestTokens)
  )
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});

CodePudding user response：

What you are trying to do is called tokenization. Typically this is done with regular expressions: you write one regular expression for every kind of token you want to recognize. In this input every token is surrounded by whitespace. The position between whitespace and a word is called a word boundary, which is matched by \b. The following regular expressions use Unicode character classes. Symbols are not words, so they have no word boundary.

Words with three or fewer letters: \b\p{Letter}{1,3}\b
Words with more than three letters: \b\p{Letter}{4,}\b
Numbers: \b\p{Number}+\b
Symbols: \p{Symbol}+
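
To see what each pattern matches on its own, here is a small sketch. The `patterns` object, the `sample` string and the `found` result are made up for illustration, and the `+` quantifier on the number and symbol patterns is an assumption.

```javascript
// One global, Unicode-aware pattern per token category.
const patterns = {
  shortWords: /\b\p{Letter}{1,3}\b/gu,
  longWords: /\b\p{Letter}{4,}\b/gu,
  numbers: /\b\p{Number}+\b/gu,
  symbols: /\p{Symbol}+/gu,
};

// Illustrative input, not the string from the question.
const sample = 'There are 100 + 200 words';

// With the g flag, match returns all matches or null.
const found = {};
for (const [name, pattern] of Object.entries(patterns)) {
  found[name] = sample.match(pattern) || [];
}

console.log(found);
```

Note that \p{Symbol} covers characters such as + or $, whereas - and * belong to the punctuation category \p{P} and would not be matched by it.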

In order to parse the different tokens it is useful to put the regular expressions into named capture groups: (?<anything>.*). This will match anything and will store the match in the capture group anything.

const input = 'There are usually 100 to 200 words   in a paragraph'; 

let rx = new RegExp ([
  '(?<wle3>\\b\\p{L}{1,3}\\b)',
  '(?<wgt3>\\b\\p{L}{4,}\\b)',
  '(?<n>\\b\\p{N}+\\b)',
  '(?<s>\\p{S}+)'
  ].join ('|'),
  'gmu');

let words_le_3 = [];
let words_gt_3 = [];
let numbers = [];
let symbols = [];

for (const match of input.matchAll(rx)) {
  let g = match.groups;
  switch (true) {
  case (!!g.wle3): words_le_3.push (g.wle3); break;
  case (!!g.wgt3): words_gt_3.push (g.wgt3); break;
  case (!!g.n):    numbers   .push (g.n);    break;
  case (!!g.s):    symbols   .push (g.s);    break;
  }
}

console.log (`Words with up to three letters: ${words_le_3}`);
console.log (`Words with more than three letters: ${words_gt_3}`);
console.log (`Numbers: ${numbers}`);
console.log (`Symbols: ${symbols}`);

The code becomes simpler if you store the matches in an object instead of four top-level arrays. In that case the switch statement can be replaced by a loop over the groups and an assignment.
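
A minimal sketch of that object-based variant, assuming the same alternation of named groups as above. The names `tokenRx`, `tokenize` and `tokens` are hypothetical and only used for illustration.

```javascript
// Same alternation as above; each group name doubles as a key of the result.
const tokenRx = new RegExp([
  '(?<wle3>\\b\\p{L}{1,3}\\b)',
  '(?<wgt3>\\b\\p{L}{4,}\\b)',
  '(?<n>\\b\\p{N}+\\b)',
  '(?<s>\\p{S}+)'
].join('|'), 'gu');

// Hypothetical helper: stores every match under the name of its group.
function tokenize(input) {
  const result = { wle3: [], wgt3: [], n: [], s: [] };
  for (const match of input.matchAll(tokenRx)) {
    for (const [name, value] of Object.entries(match.groups)) {
      // Groups that did not take part in the match are undefined.
      if (value !== undefined) {
        result[name].push(value);
      }
    }
  }
  return result;
}

const tokens = tokenize('There are usually 100 to 200 words   in a paragraph');
console.log(tokens);
```

Adding a new token category then only requires a new named group and a matching key in the result object.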

CodePudding user response：

const string = 'There are usually 100 to 200 words   in a paragraph';
const response = [];
for (let i = 0; i < string.length; i++) {
  response.push(string[i]);
  // console.log(response); All process of the loop
}
console.log(response);

CodePudding user response：

You can write a separate function for this case:

const txt  = 'There are usually 100 to 200 words in a paragraph';
console.log(txt);
console.log( ctrim(txt) )

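function ctrim(txt) {
  // Keep only the space-separated tokens with at most three characters.
  const tokens = txt.split(' ');
  const shortTokens = [];
  tokens.forEach((token) => {
    if (token.length <= 3) {
      shortTokens.push(token);
    }
  });
  return shortTokens;
}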