Regex to match SHA1 but must contain HEX characters

Time:08-11

I have this regex to find SHA1 hashes in a Kusto column:

\b[a-fA-F0-9]{40}\b

However, I am getting lots of matches for strings that are not really hex (they contain only the decimal digits 0-9). How can I ensure that the match contains at least one hex letter (a-f)?

Kusto doesn't support lookarounds, according to this question: "Does Kusto not support regex lookarounds?"

CodePudding user response:

Perhaps you can first match 40 decimal digits between word boundaries to get them out of the way, and then use an alternation (|) with a capture group, ([a-fA-F0-9]{40}), to capture only what you want to keep with extract_all:

\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b

See a regex demo with the capture group value.
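
The alternation could be applied in Kusto roughly as follows. This is only a sketch: the sample rows and the names Content, candidates and candidate are made up for illustration, and it assumes that a match taken by the digits-only branch comes back from extract_all as an empty or null element, which the isnotempty() filter then drops.

```kusto
// Hypothetical sample data; the column name Content follows the question.
datatable(Content:string)
[
    "a 273d3fd2f0cf934569319b10e85a9dfadcff113c b 6791012659213568246582140340987435098743",
    "Only digits: 6791012659213568246582140340987435098743"
]
// The first branch consumes all-digit strings without capturing them;
// only the second branch fills the capture group.
| extend candidates = extract_all(@'\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b', Content)
// One row per array element, converted to string
| mv-expand candidate = candidates to typeof(string)
// Drop the elements produced by the digits-only branch (assumed empty or null)
| where isnotempty(candidate)
| project Content, candidate
```

The order of the branches matters: the digits-only branch has to come first so that an all-digit string is consumed before the capturing branch gets a chance to match it.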

CodePudding user response:

I made my query more efficient and was able to resolve this later in the Kusto query instead of changing the regex. I will not mark this as the answer, because the original question is about how to accomplish this with the regex itself, and it would be interesting to have that answer.

This is what I did:

...
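// Keep only rows that contain at least one 40-character hex candidate
| where Content matches regex @'\b[a-fA-F0-9]{40}\b'
| extend match = extract_all(@'(\b[a-fA-F0-9]{40}\b)', Content)
// One row per candidate
| mv-expand match
// Drop candidates made up only of decimal digits
| where not (match matches regex @'\b[0-9]{40}\b')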
...

The last line removes the matches that consist entirely of decimal digits.

CodePudding user response:

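Based on extract_all(), compare the number of 40-character hex strings to the number of 40-character decimal strings. A row is kept only if it has more of the former.

Please note that with this method only the number of matches matters, so the patterns end in an empty capture group, (), and each match extracts nothing but an empty string.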

datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
   ,"Too short: f0cf934569319b10e85a9d"
   ,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
   ,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| where array_length(extract_all(@"\b[[:xdigit:]]{40}\b()", text)) > coalesce(array_length(extract_all(@"\b\d{40}\b()", text)), 0)
text
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02
888ead874a7c562ef1642e83cca05f2f920a2399
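
To see what the filter compares, the two counts can be projected as columns instead of being used directly in the where clause. This is a sketch for illustration only; the column names hex_count and dec_count are not part of the original answer, and the sample is cut down to two rows.

```kusto
datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
]
// Every 40-character hex string counts here, including all-digit ones
| extend hex_count = array_length(extract_all(@"\b[[:xdigit:]]{40}\b()", text))
// Only the all-digit 40-character strings count here
| extend dec_count = coalesce(array_length(extract_all(@"\b\d{40}\b()", text)), 0)
| project text, hex_count, dec_count
```

Because every all-digit string is counted by both patterns, hex_count can only exceed dec_count when at least one of the 40-character strings contains a letter a-f.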



For educational purposes, or in case you want to extract the SHA1 values:

datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
   ,"Too short: f0cf934569319b10e85a9d"
   ,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
   ,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| extend hex = extract_all(@"\b([[:xdigit:]]{40})\b", text), dec = extract_all(@"\b(\d{40})\b", text)
| extend sha1 = set_difference(hex, dec)
text | hex | dec | sha1
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02 | ["273d3fd2f0cf934569319b10e85a9dfadcff113c","6791012659213568246582140340987435098743","e59c299bc9b181240c546464a93ac2d4d001ce02"] | ["6791012659213568246582140340987435098743"] | ["273d3fd2f0cf934569319b10e85a9dfadcff113c","e59c299bc9b181240c546464a93ac2d4d001ce02"]
Only digits: 6791012659213568246582140340987435098743 | ["6791012659213568246582140340987435098743"] | ["6791012659213568246582140340987435098743"] | []
Too short: f0cf934569319b10e85a9d | | |
Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123 | | |
888ead874a7c562ef1642e83cca05f2f920a2399 | ["888ead874a7c562ef1642e83cca05f2f920a2399"] | | ["888ead874a7c562ef1642e83cca05f2f920a2399"]
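
If one row per hash is more convenient than an array per input row, the sha1 array can be expanded afterwards. This is a sketch built on the query above, with the sample cut down to two rows; the final mv-expand, where and project steps are an addition and not part of the original answer.

```kusto
datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
]
| extend hex = extract_all(@"\b([[:xdigit:]]{40})\b", text), dec = extract_all(@"\b(\d{40})\b", text)
// Hex candidates minus the all-digit ones
| extend sha1 = set_difference(hex, dec)
// One row per remaining hash, converted to string
| mv-expand sha1 to typeof(string)
// Rows whose sha1 array was empty or null carry no hash and are dropped
| where isnotempty(sha1)
| project sha1
```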

