Make a model to identify a string

Time:10-03

I have a string like this

ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8

The first part of the string is a random 18-digit number in base64 format, the second is a Unix timestamp, also in base64, and the last is an HMAC.

I want to make a model to recognize a string like this.

How can I do it?

CodePudding user response:

While I did not necessarily think deeply about it, this would be what comes to my mind first.

You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for a problem like this but may even give worse results, depending on the approach.

Here, an exact solution can be achieved, simply by understanding the problem.

One common way to match strings with a certain structure is with so-called regular expressions, or RegExp.

Regular expressions allow you to match string patterns of varying complexity.

To give a simple example in Python:

import re

your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
re.findall(regexp_pattern, your_string)

>>> [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]

Now one problem with this is how you know where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that each string you wanted to match is preceded by the word Token: , you could include that in your RegExp pattern: r"Token: (.+)\.(.+)\.(.+)".
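To make the anchor idea concrete, here is a minimal sketch. The surrounding text and the "Token: " prefix are hypothetical; only the token itself comes from the question. It compares the unanchored pattern with the anchored one on the same input.

```python
import re

# Hypothetical input: the "Token: " prefix is an assumed anchor, not something
# the question guarantees.
text = "Token: ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"

# Without an anchor, the first group swallows everything before the first dot,
# including the prefix.
unanchored = re.findall(r"(.+)\.(.+)\.(.+)", text)[0]

# With the anchor in the pattern, the first group starts right after "Token: ".
anchored = re.findall(r"Token: (.+)\.(.+)\.(.+)", text)[0]

print(unanchored[0])
print(anchored[0])
```

The anchored version only helps at the start of the string. If other text can follow the token, the last group would still need a stricter character class or a length limit.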

Another way to avoid mismatches is to define the pattern requirements more precisely. Right now we simply match any number of characters, with two . separating them into three sequences. If you know which implementation of base64 you are using, you can limit the allowed characters from . (any character) to the alphabet of that implementation. Suppose, purely for illustration, that the alphabet were abcdefgh1234; the pattern could then be refined to r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)". The same applies to the HMAC code.

Furthermore, you could specify the allowed length of each substring. For instance, you said you have 18 random digits. Most likely each digit is encoded as 1 byte, which gives 18*8 = 144 bits, which in base64 translates to 24 characters (each encodes a sextet, thus 6 bits of information). The same can be done with the timestamp: assuming a 32-bit timestamp, you would need 6 base64 characters (representing 36 bits, because 32 is not a multiple of 6 and has to be rounded up to the next full sextet).
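The length arithmetic above can be checked against the example string. This is only a sketch under three assumptions that the question does not confirm: the URL-safe base64 alphabet, stripped "=" padding, and a big-endian timestamp. The helper names b64_length and decode_unpadded are made up for illustration.

```python
import base64
import math

def b64_length(n_bytes: int) -> int:
    """Number of base64 characters needed for n_bytes, without '=' padding."""
    return math.ceil(n_bytes * 8 / 6)

def decode_unpadded(part: str) -> bytes:
    """Decode base64 after restoring the '=' padding (assumed to be stripped)."""
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

sample = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
first, second, _ = sample.split(".")

# 18 one-byte digits -> 24 characters; a 4-byte timestamp -> 6 characters
print(b64_length(18), len(first))
print(b64_length(4), len(second))

# The first part decodes to ASCII digits, the second to a 4-byte integer
# (big-endian byte order is an assumption).
digits = decode_unpadded(first)
timestamp = int.from_bytes(decode_unpadded(second), "big")
print(digits, timestamp)
```

For this sample, the measured lengths agree with the 24 and 6 derived above.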

With this information, you could further refine the pattern:

r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"

In addition, the same could be applied to the HMAC code.
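Putting the pieces together, a complete check might look like the sketch below. The alphabet (URL-safe base64) and the HMAC length of 27 characters are assumptions read off the single example string, so adjust them to whatever your implementation actually produces. The names TOKEN_RE and looks_like_token are hypothetical.

```python
import re

# Assumed alphabet: URL-safe base64 (letters, digits, "-" and "_").
B64 = r"[A-Za-z0-9_-]"

# 24 characters for the 18 digits, 6 for the timestamp, and 27 for the HMAC
# (the 27 is taken from the one example string, not from a specification).
TOKEN_RE = re.compile(rf"({B64}{{24}})\.({B64}{{6}})\.({B64}{{27}})")

def looks_like_token(candidate: str) -> bool:
    """Return True if the whole string has the expected three-part shape."""
    return TOKEN_RE.fullmatch(candidate) is not None

sample = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
print(looks_like_token(sample))
print(looks_like_token("abc.def.ghi"))
```

Using fullmatch means the entire string has to fit the pattern, which sidesteps the question of where the token starts and stops. To find tokens inside longer text, TOKEN_RE.search or TOKEN_RE.findall could be used instead.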

I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.
