Within a text, how to count and collect word occurrence totals and how to replace every other matchi

Time:03-03

There is a string called story. We want to gather some information about the individual words and sentences in the string.

let story = `Last weekend, I took literally the most beautiful
bike ride of my life. The route is called "The 9W to Nyack"
and it actually stretches all the way from Riverside Park
in Manhattan to South Nyack, New Jersey.
is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever
since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived
not only five centuries, but also the leap into electronic
typesetting, remaining essentially unchanged. It was popularised
in the 1960s with the release of Letraset sheets containing Lorem
Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.`;
  • There is an array of words called overusedWords. These are words overused in story.
  • There is also an array of words that are unnecessary. Iterate over the unnecessaryWords array to filter out these words and calculate each word's total occurrence count within story.
let overusedWords = ['really', 'very', 'basically'];

let unnecessaryWords = ['extremely', 'literally', 'actually' ];

But the main question is: for the overused words, how does one remove every other occurrence of each word from the text?

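const storyWords = story.split(" ");
// Assumed definition (missing from the post): the story words
// without the unnecessary words.
const betterWords = storyWords.filter(word => !unnecessaryWords.includes(word));

let reallyCounter = 0;
let veryCounter = 0;
let basicallyCounter = 0;

// Count each overused word; forEach is used because nothing is returned.
betterWords.forEach(word => {
  if (word === 'really') {
    reallyCounter += 1;
  } else if (word === 'very') {
    veryCounter += 1;
  } else if (word === 'basically') {
    basicallyCounter += 1;
  }
});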

I've been trying with:

let wayBetterWords = betterWords.filter((word, i) => {
  if(reallyCounter || veryCounter || basicallyCounter > 1) {
    !overusedWords.includes(word[i + 1])
  }
})

Or this one:

uniq = [...new Set(array)];

But this solution removes every later repetition of any word. I only want to keep the first occurrence of a word and remove the next one, and only for words from a predefined array.

console.log(wayBetterWords.join(' '))

Can you give me a hand with this?

CodePudding user response:

An entirely generic approach could be based on a reduce task which does the word replacement and word counting by utilizing String.prototype.replace together with a RegExp which reflects the currently processed word.

Every word occurrence gets counted, but whether a match gets removed is determined by an optionally configurable nthWord property. Its value defaults to 1; thus, whether omitted or explicitly provided as 1, every word occurrence gets removed. An nthWord value of 2 results in the removal of every 2nd matching word, and so forth.

function removeEveryNthMatchFromTextButCollectOverallWordCount(collector, word) {
  const { text, counts, nthWord = 1 } = collector;
  let wordCount = 0;

  const removeAndCount = (match, leftCapture, rightCapture) => {
    // increment the `word` specific count with every
    // match of the currently processed `word`.
    ++wordCount;

    // assure the correct replacement string,
    // either the match itself (no replacement)
    // or an empty string or the shortest
    // possible whitespace sequence.
    return ((wordCount % nthWord) === 0)
      ? `${ leftCapture }${ rightCapture }`.replace(/\s+/g, ' ')
      : match;
  };
  // create regex which reflects / relates to
  // the currently processed `word`.
  const regX = RegExp(`(\\s+|\\b)${ word }(\\s+|\\b)`, 'g');

  return {
    text: text.replace(regX, removeAndCount),
    counts: Object.assign(counts, {
      [word]: wordCount,
    }),
    nthWord,
  };
}

const story = `Last weekend, I took literally the most beautiful
bike ride of my life. The route is called "The 9W to Nyack"
and it actually stretches all the way from Riverside Park
in Manhattan to South Nyack, New Jersey. literally. extremely.
actually. extremely. literally.
is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever
since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived
not only five centuries, but also the leap into electronic
typesetting, remaining essentially unchanged. It was popularised
in the 1960s with the release of Letraset sheets containing Lorem
Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.`;

const overusedWords = ['really', 'very', 'basically'];
const unnecessaryWords = ['extremely', 'literally', 'actually' ];

const overusedWordCounts = overusedWords
  .reduce(removeEveryNthMatchFromTextButCollectOverallWordCount, {

    text: story,
    counts: {},

  }).counts;

console.log(
  'result on `overusedWords` ...',
  { overusedWordCounts }
);

const {

  counts: {
    extremely: extremelyCount,
    literally: literallyCount,
    actually: actuallyCount,
  },
  text: alteredStory,

} = unnecessaryWords
  .reduce(removeEveryNthMatchFromTextButCollectOverallWordCount, {

    text: story,
    nthWord: 2,
    counts: {},
  });

console.log(
  'result on `unnecessaryWords` ...', {

  extremelyCount,
  literallyCount,
  actuallyCount,

  alteredStory,
});
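
To look at the nthWord rule in isolation, here is a minimal sketch for a single word. The helper name removeEveryNth and the sample sentence are made up for illustration and are not part of the answer above; the sketch also assumes the word contains no regex metacharacters.

```javascript
// Hypothetical helper: removes every nth whole-word match of `word`
// from `text` and reports how many matches were seen in total.
function removeEveryNth(text, word, nth = 2) {
  let count = 0;
  // `\b` keeps the match on word boundaries; the trailing `\s*`
  // lets a removed word take its following whitespace with it.
  // Assumption: `word` has no characters that are special in a regex.
  const pattern = new RegExp(`\\b${word}\\b\\s*`, 'g');
  const result = text.replace(pattern, match => {
    count += 1;
    // Every nth match is replaced by nothing, all others are kept as-is.
    return count % nth === 0 ? '' : match;
  });
  return { text: result, count };
}

const sample = 'very good, very bad, very odd, very old';
const { text: trimmed, count } = removeEveryNth(sample, 'very');
console.log(count);   // 4
console.log(trimmed); // "very good, bad, very odd, old"
```

With nth left at 2, the second and fourth "very" are dropped while the count still reflects all four matches.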

CodePudding user response:

If I were doing this, I would split the story into an array of words and iterate over it. Each time I see an overused word, I would increase that word's count. If the count reaches two (the word has been seen twice), I would reset the count and leave that word out when rebuilding the story. Similar to this:

let reallyCounter = 0;
let veryCounter = 0;
let basicallyCounter = 0;
const storyWords = story.split(" ");
let betterStory = "";
for (let x = 0; x < storyWords.length; x++) {
  if (storyWords[x] === "really")
    reallyCounter++;
  if (storyWords[x] === "very")
    veryCounter++;
  if (storyWords[x] === "basically")
    basicallyCounter++;
  if (basicallyCounter === 2) {
    // ignore this word and reset basicallyCounter
    basicallyCounter = 0;
    continue;
  } else if (reallyCounter === 2) {
    // ignore this word and reset reallyCounter
    reallyCounter = 0;
    continue;
  } else if (veryCounter === 2) {
    // ignore this word and reset veryCounter
    veryCounter = 0;
    continue;
  } else {
    // not a second occurrence of an overused word: keep it,
    // re-adding the space that split(" ") removed
    betterStory += storyWords[x] + " ";
  }
}
betterStory = betterStory.trim();

I'm not sure this is the most efficient way, but it does cover all of your cases. I wouldn't recommend this approach if you have more than about five overused words, because every word needs its own counter and its own branch.
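
For a longer word list, one possible way around the per-word variables is a single object keyed by word. This is only a sketch: the function name dropEverySecond and the demo sentence are made up, and it keeps running totals instead of resetting each counter at two.

```javascript
// Hypothetical generalisation: one counter per overused word, stored in
// an object, so the loop body does not grow with the word list.
function dropEverySecond(text, overused) {
  const seen = {};   // word -> number of occurrences so far
  const kept = [];
  for (const token of text.split(' ')) {
    // Compare without case or attached punctuation ("Really," -> "really").
    const key = token.toLowerCase().replace(/[^a-z]/g, '');
    if (overused.includes(key)) {
      seen[key] = (seen[key] || 0) + 1;
      // Skip the 2nd, 4th, ... occurrence of this particular word.
      if (seen[key] % 2 === 0) continue;
    }
    kept.push(token);
  }
  return { text: kept.join(' '), counts: seen };
}

const demo = dropEverySecond(
  'really nice, really good, very very basically really',
  ['really', 'very', 'basically']
);
console.log(demo.text);   // "really nice, good, very basically really"
console.log(demo.counts); // { really: 3, very: 2, basically: 1 }
```

Because whole tokens are dropped, any punctuation attached to a removed word disappears with it.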

CodePudding user response:

You're on the right track. We can create a boolean that alternates on every overused word, so that one occurrence is kept and the next one is removed (I used a closure to scope the variable), then use Array.prototype.filter to remove the excess.

let story = 'Last weekend, I took literally the most beautiful bike ride of my life. The route is called "The 9W to Nyack" and it actually stretches all the way from Riverside Park in Manhattan to South Nyack, New Jersey. is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

let overusedWords = ['really', 'very', 'basically'];
let unnecessaryWords = ['extremely', 'literally', 'actually'];

const words = story.split(" ");

const removeUnnecessary = word => !unnecessaryWords.includes(word.toLowerCase());
const removeOverused = () => {
  let lastRemoved = false;
  return word => overusedWords.includes(word.toLowerCase()) ? (lastRemoved = !lastRemoved) : true;
}

const result = words.filter(removeUnnecessary).filter(removeOverused()).join(" ");
console.log(result);
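
Note that the single lastRemoved flag above is shared by all overused words, so "really very really" toggles on every one of the three words. If each word should alternate on its own, a per-word toggle is one option. The following is a sketch under that assumption; makeOverusedFilter and the sample input are hypothetical names, not part of the answer.

```javascript
// Hypothetical variant of `removeOverused`: one keep/drop toggle per
// word instead of a single flag shared by every overused word.
const makeOverusedFilter = overused => {
  // Each overused word starts in the "keep the next occurrence" state.
  const keepNext = new Map(overused.map(w => [w, true]));
  return word => {
    const key = word.toLowerCase();
    if (!keepNext.has(key)) return true; // not an overused word: always keep
    const keep = keepNext.get(key);
    keepNext.set(key, !keep);            // alternate keep / drop for this word only
    return keep;
  };
};

const input = 'really very really very basically'.split(' ');
const output = input
  .filter(makeOverusedFilter(['really', 'very', 'basically']))
  .join(' ');
console.log(output); // "really very basically"
```

As with the original filter, a fresh predicate has to be created for each pass, since the toggle state lives in the closure.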
