Regex to match Strings which contain non Chinese characters between two Chinese Characters

Time:11-18

I'm trying to figure out how to write a regex that matches this pattern:

测试1003##$%#测试

That is: Chinese characters, then non-Chinese characters, then Chinese characters. The non-Chinese part can be anything, and the Chinese characters at the head and tail are always the same (测试).

I know we can use ^((?!\p{Han}).)*$ to match non-Chinese characters, but I'm not sure how to make sure the head and tail are always the same Chinese characters (测试 in this case).

CodePudding user response:

Use

^(\p{Han}+)\P{Han}*\g{1}$


EXPLANATION

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \p{Han}+                 Chinese (Han) characters
                             (1 or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \P{Han}*                 any character other than a Chinese (Han)
                           character (0 or more times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  \g{1}                    matches the same text as most recently matched
                           by the 1st capturing group
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
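
As a rough illustration of the backreference approach above, here is a minimal Python sketch. Python's built-in `re` module does not support `\p{Han}`, so the sketch assumes the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as a stand-in; that range is narrower than the full Han script. The names `HAN`, `NOT_HAN`, `SAME_ENDS` and `has_same_ends` are made up for this example, and `\1` is used where the answer writes `\g{1}`.

```python
import re

# Assumption: stdlib `re` has no \p{Han}, so the basic CJK Unified
# Ideographs block is used as an approximation of the Han script.
HAN = r"[\u4e00-\u9fff]"
NOT_HAN = r"[^\u4e00-\u9fff]"

# Same shape as ^(\p{Han}+)\P{Han}*\g{1}$, using Python's \1 backreference.
SAME_ENDS = re.compile(rf"^({HAN}+){NOT_HAN}*\1$")


def has_same_ends(text: str) -> bool:
    """True if text is a Han run, then non-Han characters, then the same Han run."""
    return SAME_ENDS.match(text) is not None
```

With this sketch, `has_same_ends("测试1003##$%#测试")` should succeed, while a string whose tail differs from its head, such as `"测试123试测"`, should not. Because the middle part is `*`, a string like `"测试测试"` with nothing in between also matches.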

If the prefix and suffix are always 测试, then use

^测试\P{Han}*测试$

Or, if the suffix and prefix can include more Chinese characters:

^测试\p{Han}*\P{Han}*\p{Han}*测试$
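
The two fixed-prefix patterns can be tried out in Python along these lines. This is only a sketch: the stdlib `re` module has no `\p{Han}`, so the range U+4E00 to U+9FFF is assumed as an approximation of the Han script, and the names `STRICT` and `LOOSE` are invented for the example.

```python
import re

# Assumption: U+4E00..U+9FFF approximates \p{Han} (stdlib `re` lacks \p{...}).
HAN = r"[\u4e00-\u9fff]"
NOT_HAN = r"[^\u4e00-\u9fff]"

# ^测试\P{Han}*测试$ : only non-Han characters between the two 测试.
STRICT = re.compile(rf"^测试{NOT_HAN}*测试$")

# ^测试\p{Han}*\P{Han}*\p{Han}*测试$ : extra Han characters are allowed
# right after the prefix and right before the suffix.
LOOSE = re.compile(rf"^测试{HAN}*{NOT_HAN}*{HAN}*测试$")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Small helper so each check reads as a plain boolean."""
    return pattern.match(text) is not None
```

Under these assumptions, `STRICT` rejects `"测试中文测试"` because 中文 sits in the middle, whereas `LOOSE` accepts `"测试中1003文测试"`.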

CodePudding user response:

If there should be at least one character other than \p{Han} in between, you can match a single \P{Han} first.

Capture the \p{Han} chars in capture group 1, and add a backreference at the end to group 1.

^(\p{Han}+)\P{Han}.*\1$
  • ^ Start of string
  • (\p{Han}+) Capture group 1, match 1 or more chars in the Han script
  • \P{Han} Match a single char other than \p{Han}
  • .* Match the rest of the string
  • \1$ Match a backreference to group 1 at the end of the string
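
A possible Python rendering of this pattern is sketched below. It assumes the range U+4E00 to U+9FFF as a substitute for `\p{Han}`, since the stdlib `re` module does not support Unicode script properties; `AT_LEAST_ONE` and `has_separator` are hypothetical names.

```python
import re

# Assumption: U+4E00..U+9FFF approximates \p{Han} (stdlib `re` lacks \p{...}).
HAN = r"[\u4e00-\u9fff]"
NOT_HAN = r"[^\u4e00-\u9fff]"

# Same shape as ^(\p{Han}+)\P{Han}.*\1$ : one mandatory non-Han character,
# then anything (except newlines), then the captured Han run again.
AT_LEAST_ONE = re.compile(rf"^({HAN}+){NOT_HAN}.*\1$")


def has_separator(text: str) -> bool:
    """True if the same Han run opens and closes text with something in between."""
    return AT_LEAST_ONE.match(text) is not None
```

Since `.*` accepts any character, this version also accepts Han characters in the middle as long as the first character after the prefix is non-Han, for example `"测试1中文测试"`. It rejects `"测试测试"`, which has no non-Han character at all.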


To also match a string that consists of 测试 alone, make the middle part and the backreference optional:

^(\p{Han}+)(?:\P{Han}.*\1)?$
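
The optional-group variant could be checked in Python roughly as follows. As before, this is a sketch that assumes U+4E00 to U+9FFF in place of `\p{Han}`, because the stdlib `re` module has no script properties; `OPTIONAL_TAIL` and `accepts` are names invented here.

```python
import re

# Assumption: U+4E00..U+9FFF approximates \p{Han} (stdlib `re` lacks \p{...}).
HAN = r"[\u4e00-\u9fff]"
NOT_HAN = r"[^\u4e00-\u9fff]"

# Same shape as ^(\p{Han}+)(?:\P{Han}.*\1)?$ : the "separator + repeat"
# part sits in an optional non-capturing group.
OPTIONAL_TAIL = re.compile(rf"^({HAN}+)(?:{NOT_HAN}.*\1)?$")


def accepts(text: str) -> bool:
    """True for a bare Han run, or a Han run repeated after a non-Han separator."""
    return OPTIONAL_TAIL.match(text) is not None
```

One consequence worth noting: because the tail is optional, any string made only of Han characters matches, so `"测试测试"` is accepted here, while `"测试1"` (a separator with no repeated suffix) is not.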

