I'm trying to write a regular expression with Python's re module to extract this pattern of text:
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08' contentId: 'a887526b-ff19-4409-91ff-e1679e418922'
Each content ID is 36 characters long: a mix of lowercase letters and digits, with dashes at positions 8, 13, 18, and 23 (counting from 0).
Any help with this would be much appreciated, as I just can't seem to get the right results at the moment.
r1 = re.findall(r'^[a-zA-Z0-9~@#$^*()_ =[\]{}|\\,.?: -]*{36}$',f.read())
print(r1)
Below is the file I'm trying to pull from:
Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
return [
{
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
style: { height: 0.5 * t }
},
{
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
style: { height: t }
}
];
},
Answer:
Is there a typo in the regex in your question? The *{36} after the bracket ] that closes the character class causes an error: multiple repeat. Did you mean r'^[a-zA-Z0-9~@#$^*()_ =[\]{}|\\,.?: -]{36}$'?
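As a quick check, the error can be reproduced by compiling both versions of the pattern. This is only a sketch; the variable names are made up for illustration, and the exact wording of the error message may vary between Python versions.

```python
import re

# The pattern from the question, with *{36} stacked after the character class.
broken = r'^[a-zA-Z0-9~@#$^*()_ =[\]{}|\\,.?: -]*{36}$'
# The same pattern with the stray * removed.
fixed = r'^[a-zA-Z0-9~@#$^*()_ =[\]{}|\\,.?: -]{36}$'

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # Two quantifiers in a row (* followed by {36}) are rejected.
    error_message = str(exc)

print(error_message)

# The corrected pattern compiles without raising.
fixed_pattern = re.compile(fixed)
```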
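Fixing that, you still get no results: without the re.MULTILINE flag, ^ anchors the match to the start of the string and $ to its end (with re.MULTILINE, to the start and end of each line), so you'd only get results if an ID made up the entire input, or sat alone on a line of its own.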
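A small sketch of the anchoring behaviour, using a simplified character class and a made-up two-line sample (both are assumptions for illustration, not your actual pattern or file):

```python
import re

# Simplified character class, for illustration only.
id_chars = r'[a-z0-9-]{36}'

# Hypothetical sample: one ID embedded in a line, one ID alone on its own line.
sample = (
    "contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',\n"
    "a887526b-ff19-4409-91ff-e1679e418922\n"
)

# Without flags, ^ only matches at the very start of the string,
# and the string does not begin with 36 ID characters.
whole_string = re.findall('^' + id_chars + '$', sample)

# With re.MULTILINE, ^ and $ also match at line boundaries,
# so only the ID that sits alone on a line is found.
per_line = re.findall('^' + id_chars + '$', sample, re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['a887526b-ff19-4409-91ff-e1679e418922']
```

Neither variant finds the ID that is embedded in the middle of a line, which is the situation in your file.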
Removing these anchors, we get lots of matches, because the pattern now matches any run of 36 of those characters:
r1 = re.findall(r'[a-zA-Z0-9~@#$^*()_ =[\]{}|\\,.?: -]{36}',t)
r1: ['var t = r(d[0])(r(d[1])), n = r(d[0]',
')(r(d[2])), o = r(d[0])(r(d[3])), c ',
'= r(d[0])(r(d[4])), l = r(d[0])(r(d[',
'2301ae56-3b9c-4653-963b-2ad84d06ba08',
' style: { height: 0.5',
'a887526b-ff19-4409-91ff-e1679e418922',
' style: { height: t }']
To only match your ids, only look for alphanumeric characters or dashes.
r1 = re.findall(r'[a-zA-Z0-9\-]{36}',t)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
To make it even more specific, you could specify the positions of the dashes:
r1 = re.findall(r'[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}', t, re.IGNORECASE)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
Specifying the re.IGNORECASE flag removes the need to look for both upper- and lower-case characters.
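To see why fixing the dash positions can matter, here is a hedged comparison of the two patterns on a made-up string. The token value and the upper-case ID are invented for illustration; your file may never contain anything like them.

```python
import re

# Any 36 alphanumeric-or-dash characters in a row.
LOOSE = re.compile(r'[a-zA-Z0-9-]{36}')

# Dashes required after 8, 4, 4 and 4 characters; case-insensitive.
STRICT = re.compile(
    r'[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}',
    re.IGNORECASE,
)

# Hypothetical input: an upper-case ID plus an unrelated 36-letter token.
text = (
    "contentId: '2301AE56-3B9C-4653-963B-2AD84D06BA08', "
    "token: '" + "abcdef" * 6 + "'"
)

loose_hits = LOOSE.findall(text)    # picks up the token as well
strict_hits = STRICT.findall(text)  # only the dash-separated ID

print(loose_hits)
print(strict_hits)  # ['2301AE56-3B9C-4653-963B-2AD84D06BA08']
```

The strict pattern still accepts the upper-case ID because of re.IGNORECASE, while the 36-letter token is rejected for having no dashes.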
Note:
You should read the file into a variable and use that variable if you're going to use its contents more than once, since f.read() won't return anything after the first .read() unless you call f.seek(0) first.
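The read-once behaviour can be sketched without touching the disk. Here io.StringIO is an assumption standing in for the open file f from your question; a file opened with open() in text mode behaves the same way for read() and seek(0).

```python
import io
import re

# io.StringIO stands in for the open file f (an assumption for this sketch).
f = io.StringIO("contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',\n")

contents = f.read()     # first read returns the whole content
second_read = f.read()  # the position is now at the end, so this is ''

f.seek(0)               # rewind to the start
after_seek = f.read()   # the content is available again

# Reuse the variable instead of calling f.read() repeatedly.
ids = re.findall(r'[a-z0-9-]{36}', contents)

print(repr(second_read))  # ''
print(ids)                # ['2301ae56-3b9c-4653-963b-2ad84d06ba08']
```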
To avoid creating a new file on disk with those contents, I just defined
t = """Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
return [
{
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
style: { height: 0.5 * t }
},
{
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
style: { height: t }
}
];
},"""
and used t in place of f.read() from your question.