regex for this "001/Cnt.A/2021/EX.Dng" pattern

Time:11-22

I have some strings like the ones below:

0015/Cnt.A/2021/EX. Mmj tech
021/Cnt.B/2021/EX.Mm logs
31/ Cgt.A / 2020 / PK Jap
453/ Nnt.A / 2020 / WK Jap pom sc
13/Wnt.A/2021/ LO.Mm pom
1911/Cno.A/2021/PQ Mm ris dMn

and I want to extract output like this:

0015/Cnt.A/2021/EX. Mmj
021/Cnt.B/2021/EX.Mm
31/ Cgt.A / 2020 / PK Jap
453/ Nnt.A / 2020 / WK Jap
13/Wnt.A/2021/ LO.Mm
1911/Cno.A/2021/PQ Mm

I have tried the pattern [0-9]{1,}\/[a-zA-Z.\s-]{1,}\/[0-9\s]{1,}\/[a-zA-Z\s]+[\.\s]+[a-zA-Z]{1,} but it can't handle the 4th and 6th strings. Can anyone fix that pattern, and maybe make it more efficient?

Edit: the strings follow this rule -> number/letters with dot or space/year/letters with dot or space

CodePudding user response:

A pattern that gets all text up to the last slash and then only two words separated by whitespace or a . is either of

.*\/\s*[a-zA-Z]+[\s.]+[a-zA-Z]+
.*\/\s*\w+[\s.]+\w+

If you need to keep the initial regex part for stricter validation, use

[0-9]+\/[a-zA-Z.\s-]+\/[0-9\s]+\/\s*\w+[\s.]+\w+

Details for the first pattern:

  • .*\/ - any zero or more chars other than line break chars, as many as possible, followed by a / char (so the match runs up to the last / after which the rest of the pattern can still match)
  • \s* - zero or more whitespaces
  • [a-zA-Z]+ - one or more ASCII letters
  • [\s.]+ - one or more whitespaces/dots
  • [a-zA-Z]+ - one or more ASCII letters.

\w+ would match one or more letters, digits, or underscores.
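As a quick check, here is a minimal Python sketch that runs the loose pattern and the stricter variant over the sample strings from the question. The names SAMPLES, LOOSE, STRICT and extract are made up for this illustration. The \/ escapes are kept only to mirror the patterns above; Python does not need them.

```python
import re

# Inputs and desired outputs copied from the question.
SAMPLES = {
    "0015/Cnt.A/2021/EX. Mmj tech": "0015/Cnt.A/2021/EX. Mmj",
    "021/Cnt.B/2021/EX.Mm logs": "021/Cnt.B/2021/EX.Mm",
    "31/ Cgt.A / 2020 / PK Jap": "31/ Cgt.A / 2020 / PK Jap",
    "453/ Nnt.A / 2020 / WK Jap pom sc": "453/ Nnt.A / 2020 / WK Jap",
    "13/Wnt.A/2021/ LO.Mm pom": "13/Wnt.A/2021/ LO.Mm",
    "1911/Cno.A/2021/PQ Mm ris dMn": "1911/Cno.A/2021/PQ Mm",
}

# Loose form: everything up to the last slash, then two letter-only words.
LOOSE = re.compile(r".*\/\s*[a-zA-Z]+[\s.]+[a-zA-Z]+")

# Stricter form: keeps the asker's prefix validation, \w+ for the two words.
STRICT = re.compile(r"[0-9]+\/[a-zA-Z.\s-]+\/[0-9\s]+\/\s*\w+[\s.]+\w+")


def extract(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


for text, expected in SAMPLES.items():
    print(repr(extract(LOOSE, text)), repr(extract(STRICT, text)), repr(expected))
```

Both patterns stop after the second word because the final [a-zA-Z]+ or \w+ cannot consume the whitespace that follows it.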

Now, to accommodate the number/letter with dot or space/year/letter with dot or space rule:

\d+\/\s*[a-zA-Z]+(?:\.[a-zA-Z]+)*\s*\/\s*[0-9]{4}\s*\/\s*\w+[\s.]+\w+

Details:

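  • \d+ - one or more digits
  • \/ - a / char
  • \s* - zero or more whitespaces
  • [a-zA-Z]+(?:\.[a-zA-Z]+)* - one or more ASCII letters, then zero or more repetitions of a . followed by one or more ASCII letters
  • \s*\/\s* - zero or more whitespaces, a / char, zero or more whitespaces
  • [0-9]{4} - four digits
  • \s*\/\s* - zero or more whitespaces, a / char, zero or more whitespaces
  • \w+[\s.]+\w+ - one or more word chars, one or more whitespaces/dots, one or more word chars.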
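The breakdown above can be written as a commented pattern. The following is a sketch in Python using re.VERBOSE, with the hypothetical names RULE and extract_all; it assumes the input is one block of text containing several records, as in the question.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part carries
# its own comment. Slashes are left unescaped because Python allows that.
RULE = re.compile(
    r"""
    \d+                          # number: one or more digits
    /\s*                         # slash, optional whitespace
    [a-zA-Z]+(?:\.[a-zA-Z]+)*    # letters, optionally followed by .letters groups
    \s*/\s*                      # slash with optional whitespace around it
    [0-9]{4}                     # year: exactly four digits
    \s*/\s*                      # slash with optional whitespace around it
    \w+[\s.]+\w+                 # two words separated by whitespace and/or dots
    """,
    re.VERBOSE,
)


def extract_all(text):
    """Return every substring of text that matches RULE, in order."""
    return [match.group(0) for match in RULE.finditer(text)]


text = """0015/Cnt.A/2021/EX. Mmj tech
021/Cnt.B/2021/EX.Mm logs
31/ Cgt.A / 2020 / PK Jap
453/ Nnt.A / 2020 / WK Jap pom sc
13/Wnt.A/2021/ LO.Mm pom
1911/Cno.A/2021/PQ Mm ris dMn"""

for found in extract_all(text):
    print(found)
```

One caveat: \s also matches line breaks, so if a record had only a single word after the last slash, [\s.]+ could reach into the next line. Matching line by line, for example over text.splitlines(), avoids that.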