How to understand snippet of Regex

Time:11-06

I am attempting to understand what this snippet of code does:

passwd1=re.sub(r'^.*? --', ' -- ', line)
password=passwd1[4:]

I understand that the top line uses regex to remove the " -- ", and I think the bottom line removes something as well? I came back to this code after a while and need to improve it, but to do that I need to understand it again. I've been reading the regex docs to no avail. What is the r'^.*? at the beginning of the regex?

CodePudding user response:

To break r'^.*? --' into pieces:

  • The r in front of the string makes it a raw string literal: Python leaves backslashes in it alone, so you don't have to do a bunch of confusing character escaping. It is commonly used for regex patterns, but it is not regex-specific.
  • The ^ anchors the match to the beginning of the string.
  • .*? matches as few characters as possible (any character except a newline), up to...
  • ' --', a space followed by two hyphens, which is matched literally.

Put together, the pattern matches from the beginning of the string up to and including the first ' --'. Since this is re.sub(), the matched part of the string is replaced with ' -- '.

This is why something like 'Google -- MyPassword' becomes ' --  MyPassword': the replacement ' -- ', followed by the space that was already after the dashes, followed by the password.
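To check this yourself, here is a minimal sketch of the two lines wrapped in a function. The name extract_password and the sample line are made up for illustration; only the two statements inside come from the question.

```python
import re

def extract_password(line):
    # Hypothetical wrapper around the two lines from the question.
    # Step 1: replace everything from the start up to the first ' --' with ' -- '.
    passwd1 = re.sub(r'^.*? --', ' -- ', line)
    # Step 2: drop the first four characters, i.e. the ' -- ' just inserted.
    password = passwd1[4:]
    return passwd1, password

passwd1, password = extract_password("Google -- MyPassword")
print(repr(passwd1))   # ' --  MyPassword'
print(repr(password))  # ' MyPassword'
```

Note the leading space in the result: the pattern stops right after the two dashes, so the space that followed them in the input is kept. Calling .strip() on the result would remove it, if that is what you want.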

The second line is a simple string slice that drops the first four characters of the string, which are the ' -- ' that was just substituted in. That makes the pair of lines more roundabout than necessary: you could substitute the match with an empty string instead, like this:

passwd1 = re.sub(r'^.* --', '', line)

For a line with a single ' --', this achieves the same result. Note I've dropped the ?, which switches the match from lazy to greedy. That only makes a difference when the line contains more than one ' --'.

On its own, ? matches zero or one of the preceding element, but placed directly after another quantifier such as * it makes that quantifier lazy instead. The * matches zero or more of the preceding element, in this case a ., which is 'any character except a newline'. .* is what is known as a greedy quantifier, and .*? a lazy quantifier. That is, the greedy quantifier will match as much as possible, and the lazy one will match as little as possible. The difference between ^.*? -- and ^.* -- is what is matched in this case:

Something something -- mypassword -- yourpassword

In the greedy case, everything up to and including the last ' --' ('Something something -- mypassword --') is matched and deleted. In the lazy case, only 'Something something --' is deleted. Most passwords don't include spaces, never mind ' -- ', so you probably want to use the greedy version, which still works if the text before the password happens to contain ' --'.
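The difference is easy to see by running both patterns on that sample line. This is just a sketch; the variable names lazy and greedy are mine.

```python
import re

line = "Something something -- mypassword -- yourpassword"

# Lazy: .*? stops at the first ' --' it can reach.
lazy = re.sub(r'^.*? --', '', line)

# Greedy: .* runs to the end, then backs up to the last ' --'.
greedy = re.sub(r'^.* --', '', line)

print(repr(lazy))    # ' mypassword -- yourpassword'
print(repr(greedy))  # ' yourpassword'
```

Both results keep the space that followed the dashes, because the pattern ends at the second hyphen.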

CodePudding user response:

You can use a site like regex101 to input your regular expression and get some analysis of it. It will tell you whether your regular expression matches some test cases, and also explain what each character in the regular expression means. In this case it matches everything up to and including the first instance of ' --' in your string, and replaces it with just the characters ' -- '.

The second line is slicing the string. It takes a substring, skipping over the first four characters and then continuing to the end of the string.

Effectively, given a string which has ' --' somewhere in it, this pair of lines will take everything after that substring. However, if that substring is not found in line, re.sub() returns line unchanged and you will simply be discarding its first four characters. If line has fewer than four characters you will not get an error; the slice just returns an empty string.
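A quick way to see those edge cases is to run the two lines on inputs without the delimiter. Slicing past the end of a Python string yields an empty string rather than raising. This is only a sketch: the helper names and sample strings are invented, and the str.partition variant is an alternative I'm suggesting, not something in the original code.

```python
import re

def original_extract(line):
    # The two lines from the question, combined (helper name is hypothetical).
    return re.sub(r'^.*? --', ' -- ', line)[4:]

def partition_extract(line):
    # Alternative sketch: split on the first ' -- ' and
    # return None when the delimiter is absent.
    before, sep, after = line.partition(' -- ')
    return after if sep else None

print(repr(original_extract("no delimiter here")))      # 'elimiter here'
print(repr(original_extract("abc")))                    # ''
print(repr(partition_extract("Google -- MyPassword")))  # 'MyPassword'
print(repr(partition_extract("no delimiter here")))     # None
```

The partition version makes the "delimiter not found" case explicit instead of silently chopping four characters off an unrelated line. It does assume the delimiter is exactly ' -- ' with a space on both sides.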

CodePudding user response:

  • r means raw string (backslashes are not treated as escapes by Python)

  • ^ means Starts with

  • . means Any character (except newline character)

  • * means Zero or more occurrences

  • ? means Zero or one occurrences, but directly after * it makes the * lazy (match as few characters as possible)

In other words, it matches the shortest run of characters from the beginning of the string up to and including the first ' --'. sub replaces the part of the string that matches with ' -- '.

The second command just sets a variable to the string minus its first 4 characters, which are the newly inserted ' -- '.
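On the r prefix specifically: it is Python's raw string literal, which only changes how backslashes are read. A small sketch (the sample strings are arbitrary) shows that it makes no difference for this particular pattern, since the pattern contains no backslashes:

```python
import re

# In a normal string '\n' is one newline character;
# in a raw string it is a backslash followed by 'n'.
print(len('\n'))    # 1
print(len(r'\n'))   # 2

# No backslashes in this pattern, so the r prefix changes nothing here.
print(r'^.*? --' == '^.*? --')  # True

# Where it helps: regex escapes like \d can be written without doubling.
print(re.findall(r'\d+', 'abc 123 de 45'))  # ['123', '45']
```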
