String tokenizer method

Time:07-03

Consider strings with this format:

id-string1-string2-string3.extension

where id, string1, string2 and string3 can be strings of variable length, and extension is an image file extension.

For example, two possible strings could be:

Il2dK-Ud2d9-Kod2d-d9dwo.jpg

j54fwf3da-7jrg-9eujodww-kio98ujk.png

I need a tokenizer method in JavaScript for an Express/Node.js API that takes these strings as input and outputs an object with this format:

{a: id-string1-string2, b: string3, c: extension}

For the example strings this tokenizer should then output:

{a: Il2dK-Ud2d9-Kod2d, b: d9dwo, c: jpg}

{a: j54fwf3da-7jrg-9eujodww, b: kio98ujk, c: png}

I think this can be done with regex. I tried match(/[^-]+/g), but this tokenizes every substring. I need a way to skip the first two "-" characters, but couldn't figure it out.

Do you have any ideas? Or could you provide me a better solution instead of using regex? Thanks very much!

CodePudding user response:

You can achieve this using split:

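const str = 'Il2dK-Ud2d9-Kod2d-d9dwo.jpg';
// Separate the extension from the rest of the name.
const [restStr, c] = str.split('.');
// Split at the last hyphen; the capture group keeps the final segment in the result.
// The i flag lets the final segment contain uppercase letters too.
const [a, b] = restStr.split(/-([a-z0-9]+$)/i);
const result = { a, b, c };
console.log(result); // { a: 'Il2dK-Ud2d9-Kod2d', b: 'd9dwo', c: 'jpg' }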
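
Since the question also asks for an alternative to regex, here is a minimal sketch that finds the last period and the last hyphen before it with lastIndexOf. The function name tokenize and the choice to return null for malformed input are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: the name `tokenize` and the null return are assumptions.
function tokenize(str) {
  // The extension starts after the last period.
  const dot = str.lastIndexOf('.');
  if (dot === -1) return null;
  // Look for the last hyphen located before that period.
  const dash = str.lastIndexOf('-', dot);
  if (dash === -1) return null;
  return {
    a: str.slice(0, dash),       // everything before the last hyphen
    b: str.slice(dash + 1, dot), // the final segment
    c: str.slice(dot + 1),       // the extension
  };
}

console.log(tokenize('Il2dK-Ud2d9-Kod2d-d9dwo.jpg'));
```

This version does not check which characters the segments contain; it only relies on the positions of the two delimiters.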

CodePudding user response:

You might use a pattern with capture groups:

^(?<a>[^\s-]+(?:-[^\s-]+)*)-(?<b>[^\s.-]+)\.(?<c>\w+)$

Explanation

  • ^ Start of string
  • (?<a>[^\s-]+(?:-[^\s-]+)*) Named group a, match 1+ chars other than a whitespace char or -, then optionally repeat - followed by 1+ chars other than a whitespace char or -
  • - Match literally
  • (?<b>[^\s.-]+) Named group b, match 1+ chars other than . - or a whitespace char
  • \. Match .
  • (?<c>\w+) Named group c, match 1+ word chars for the extension
  • $ End of string


const regex = /^(?<a>[^\s-]+(?:-[^\s-]+)*)-(?<b>[^\s.-]+)\.(?<c>\w+)$/;
[
  "id-string1-string2-string3.extension",
  "Il2dK-Ud2d9-Kod2d-d9dwo.jpg",
  "j54fwf3da-7jrg-9eujodww-kio98ujk.png",
  "a-b-c",
  "a.b"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m.groups);
  }
});

Without named groups, you can use capture groups and create the objects:

const regex = /^([^\s-]+(?:-[^\s-]+)*)-([^\s.-]+)\.(\w+)$/;
[
  "id-string1-string2-string3.extension",
  "Il2dK-Ud2d9-Kod2d-d9dwo.jpg",
  "j54fwf3da-7jrg-9eujodww-kio98ujk.png",
  "a-b-c",
  "a.b"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log({
      "a": m[1],
      "b": m[2],
      "c": m[3]
    });
  }
});
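
For use inside an API handler, the named-group pattern could be wrapped in a small function, roughly as sketched below. The names FILE_NAME and tokenizeWithRegex are hypothetical, and returning null for strings that do not match is an assumption about how invalid input should be reported.

```javascript
// Named-group pattern: a = all segments but the last, b = last segment, c = extension.
const FILE_NAME = /^(?<a>[^\s-]+(?:-[^\s-]+)*)-(?<b>[^\s.-]+)\.(?<c>\w+)$/;

// Hypothetical wrapper; returns null when the string does not match.
function tokenizeWithRegex(str) {
  const m = FILE_NAME.exec(str);
  // m.groups has a null prototype, so copy it into a plain object.
  return m ? { ...m.groups } : null;
}

console.log(tokenizeWithRegex('j54fwf3da-7jrg-9eujodww-kio98ujk.png'));
console.log(tokenizeWithRegex('a-b-c')); // no extension, so no match
```

A caller can then treat a null result as invalid input, for example by rejecting the request.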

CodePudding user response:

To split at the last hyphen or any period:

const res = str.split(/-(?![^-]*-)|\./);


After each hyphen, a negative lookahead (a zero-length assertion) checks that no further hyphen follows, with any number of non-hyphen characters in between, so only the last hyphen matches. The alternative \. matches any period.
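
As a rough illustration of how the split result maps to the requested object, the sketch below destructures the three parts. The function name tokenizeBySplit is hypothetical, and the length check is an added assumption: because the pattern splits at every period, a name with more than one period yields more than three parts.

```javascript
// Hypothetical helper built on the split pattern above.
function tokenizeBySplit(str) {
  // Split at the last hyphen or at any period.
  const parts = str.split(/-(?![^-]*-)|\./);
  // Anything other than exactly three parts is treated as malformed here.
  if (parts.length !== 3) return null;
  const [a, b, c] = parts;
  return { a, b, c };
}

console.log(tokenizeBySplit('Il2dK-Ud2d9-Kod2d-d9dwo.jpg'));
```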
