I am trying to create a regex that finds ticker symbols in bodies of text. However, it has been a struggle to get one that does everything I need.
Example:
This is a $test to show what I would LIKE to match. If $YOU look below you will FIND the list of simulated tickers ($STOck symbols) I would like to match.
So in this case I would like to match the following from the above:
- test
- LIKE
- YOU
- FIND
- STOck
I am trying to get:
- any word after a "$" sign (not including the $), case insensitive
- any word that is ALL CAPS and between 3-6 characters long
I've tried:
\b[A-Z]{3,6}\b
but that matches pretty much every word
\$[^3-6\s]\S*
but that includes the $ and also ignores any ALL CAPS without a dollar sign
CodePudding user response:
Would you please try the following:
import re
s = 'This is a $test to show what I would LIKE to match. If $YOU look below you will FIND the list of simulated tickers ($STOck symbols) I would like to match.'
print(re.findall(r'(?<=\$)\w+|[A-Z]{3,6}', s))
Output:
['test', 'LIKE', 'YOU', 'FIND', 'STOck']
(?<=\$)
is a lookbehind assertion which matches a leading dollar sign without including the match in the result.
(Precisely speaking, it matches the boundary just after the dollar sign rather than the character itself.)
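To make the role of the lookbehind concrete, here is a minimal sketch that compares it with a pattern that consumes the dollar sign. The helper name find_tickers and the \b anchors around [A-Z]{3,6} are assumptions added for illustration, not part of the pattern above; the anchors are one way to keep a longer all-caps word from being cut down to its first six letters.

```python
import re

# Assumption: \b anchors are added here so that the 3-6 length limit
# applies to whole words rather than to a prefix of a longer word.
TICKER_RE = re.compile(r'(?<=\$)\w+|\b[A-Z]{3,6}\b')


def find_tickers(text):
    """Hypothetical helper: return $-prefixed words (without the $)
    and standalone all-caps words of 3 to 6 letters."""
    return TICKER_RE.findall(text)


s = ('This is a $test to show what I would LIKE to match. If $YOU look below '
     'you will FIND the list of simulated tickers ($STOck symbols) I would like to match.')

# The lookbehind checks for the $ but leaves it out of the match.
print(find_tickers(s))            # ['test', 'LIKE', 'YOU', 'FIND', 'STOck']

# Consuming the $ instead puts it in the result.
print(re.findall(r'\$\w+', s))    # ['$test', '$YOU', '$STOck']

# Without \b, an 8-letter all-caps word yields its first six letters.
print(re.findall(r'[A-Z]{3,6}', 'ABCDEFGH'))      # ['ABCDEF']
print(re.findall(r'\b[A-Z]{3,6}\b', 'ABCDEFGH'))  # []
```

Whether an over-long all-caps word should be skipped or truncated depends on the data, so treat the \b anchors as optional.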