Home > Back-end >  Counting a number of keywords in Excel within a range of specific keywords
Counting a number of keywords in Excel within a range of specific keywords

Time:09-28

I know how to count a series of keywords in Excel. I use this formula:

=SUMPRODUCT(--ISNUMBER(SEARCH($CE$2:$CE$43,(G2:AP2))))

However, what would be the Excel formula if I want to count the number of keywords that exist only within /-3 words around "risk" in the selected rows?

Consider this sentence: "Political uncertainty generates economic risk which stagnates economic activities." If my keywords are political, uncertainty, stagnates, and economic, the total number of keywords within /- 3 words around "risk" will be 3, i.e., uncertainty, stagnates, and economic. political will be excluded since it is out of range.

CodePudding user response:

You may try the following formula if you have Microsoft-365 with most recent release.

=LET(x,TOCOL(TEXTSPLIT(A1," ")),y,MATCH("risk",x,0),z,INDEX(x,SEQUENCE(y 2,1,y-3)),COUNT(XMATCH(D1:D4,z,0)))

enter image description here

CodePudding user response:

You could try:

enter image description here

Formula in E1:

=SUM(--ISNUMBER(MATCH(FILTERXML("<t><s>"&SUBSTITUTE("a a a "&A1," ","</s><s>")&"</s></t>","//s[following::s[4]='risk']/following::*[position()<8]")&"*",C1:C4,0)))

This would in order:

  • "<t><s>"&SUBSTITUTE("a a a "&A1," ","</s><s>")&"</s></t>" - Create a valid xml-string to process with xpath;
  • "//s[following::s[4]='risk']/following::*[position()<8]" - A valid xpath 1.0 expression to retrieve only upto seven nodes with risk in the middle. In more detail: //s[following::s[4]='risk'] will get the node where the following sibling on 4 indices to it's right will equal risk (case-sensitive), we then take all following siblings from that position with /following::* where we limit the returned nodes to a max of seven with [position()<8]. Now it would also make sense why we concatenate a a a with the string from A1 since risk could appear at the start of the string or less then three words in;
  • MATCH() - Will then check if any of the returned nodes start with any of the words in C1:C4 using a wildcard. This is to prevent possible punctuation to avoid matching;
  • SUM() and ISNUMBER() prepended with double unary will lastly make a summation of hits.

Note: The answer is not 3 but 4! Since 'economic' is to be counted twice.

  • Related