Home > Software engineering >  How to extract out specific substrings?
How to extract out specific substrings?

Time:12-12

I have these long strings that have multiple substrings in them all separated by periods. The good news is I've found out how to extract most of the substrings on the left or right of the strings by using functions like left, mid, right, regexextract, find, len, and substitute, but I just can't figure out this last problem.

The problem with these substrings is sometimes some are there, sometimes none are there (most I've seen at once is, I think, 3). And other than being in all caps, which some of the other substrings that I don't want are also in, I don't think there's any regex pattern one could use except something like string1|string2|string3, etc all the way up to maybe string30.

I first thought it would be best to just have a formula look at the string, compare it to a range on another sheet, and if there was something in the range that was in the string, then show it. But I was lost on how to do that. Then I figured just put the whole range list in a regex and somehow extract any substrings that were in the string.

And that worked, but it would only extract the first substring it found whereas I wanted it to extract all the substrings it found. And while I think I'd prefer the substrings to be put into different columns (not rows) by using the Split function, I'd settle for them all being put in the same cell via the Textjoin function.

The farthest I've gotten is

=split(REGEXextract(A2,"\b(?:string1|string2|string3)\b")," ")

but like I said that only spits out the first substring it finds. And I've seen some people use REGEXreplace with Split and ArrayFormula and sometimes double REGEX functions, but I just can't seem to make those work for my purposes.

I'm doing this in GoogleSheets, but even an Excel or LibreOffice answer will probably be helpful as I can probably turn them into a GS solution. I realize I could just make a simple regexextract in 30 or so columns, but I'd really rather not do that. Thanks in advance, even if you just give me an idea of what direction to head in.

CodePudding user response:

You could try something like this, that would filter all values that match your desired list of substrings. Replace F1:F2 with the range where you save the values you want to appear, and A3 with the cell of the substring. If you need you can set this as an array with Map or BYROW, for example

=filter(split(A3,"."),INDEX(REGEXMATCH(SPLIT(A3,"."),JOIN("|",F1:F2))))
  • Related