Using Regex to add an id attribute to an HTML IMG tag containing just the file name from the src attribute

I am using a text editor (TextMate, not that it makes much difference) with a Regex search facility. I'm afraid I'm not good with Regex (as you'll see!)

So, if I have an HTML IMG tag like this:

<img src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />

I would like to replace it with one like this:

<img id="file-name" src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />

That is to say, I want to add an id attribute that contains just the file name without the extension.

I've searched for hours on the net, but the closest I have been able to come up with is:

search term: (<img[^>]+)(src=")([^"]+)([^>]+>)

replacement term: $1 id="$3" $2$3$4

The capture number 3 captures the entire contents of the src attribute for the id attribute, but all I want is the file name without the extension.

Any help would be much appreciated. Thank you.

CodePudding user response：

Let me preface this by saying parsing HTML with RegEx is a bad idea. However, it seems like in this case you have complete control over the source being provided to you, so it may be appropriate. Just keep in mind it might not work in all cases.

That being said, you're on the right track but I think you're overcomplicating things. Let's walk through my thought process.

Since you're wanting to reconstruct the element just with the addition of the id attribute, you'll need four capture groups. The first to match everything before where you want to insert the id, the second to match everything after the insert point up to the file name, the third to match the file name itself, and the fourth to match everything after the file name.

For the first group, we probably want the insert point of the id attribute to be right after <img but right before src= because this attribute is generally the first. In order for this to be syntactically correct, there also needs to be a space between the tag name and the attributes. This space will need to be included in either the first or second group. It doesn't really matter which, but for this example I will be including it in the first. The first capture group should look like this:

(<img )

For the second capture group, in order to match everything before the file name but not including the file name, we need to figure out a way to isolate the file name. Luckily, in this case it's pretty easy because we know there is a / right before the file name and a . right after it. This also doesn't apply to anything else in the src attribute, so we can use this to our advantage. We do need to make sure to only match the src attribute though. So, with this knowledge, in order to match everything up to the file name, we just need to match everything until we hit a / character. We also need to be careful not to match the closing double quotes here to make sure nothing outside of the src attribute is matched. We can use [^"]* to do this. The square brackets by themselves tell the engine to match any single character within them, the caret (^) at the beginning tells it to match everything except what's in the square brackets, and the * character at the end tells it to match between 0 and unlimited times. This also means it will work whether or not there is something before the /. We can then put the / character right after this to tell it to stop when it finds one. Keep in mind * is greedy, so it will always match up to the last occurrence, not the first one. Also, you may or may not need to escape the / character with a \, so I will do it in the example just in case. Putting this together, the second group will look like this:

(src="[^"]*\/)
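As a quick check of the greedy behaviour described above, here is a minimal JavaScript sketch that applies the second group on its own to the sample tag. The variable names are illustrative only, and JavaScript's regex flavor is assumed; the engine in your editor may differ in small details.

```javascript
// Sample tag from the question.
const tag = '<img src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />';

// Second capture group on its own: the greedy [^"]* backtracks to the LAST "/"
// that still lies inside the quoted src value.
const prefix = tag.match(/(src="[^"]*\/)/)[1];

console.log(prefix); // src="../arbitrary/file/path/
```

Because [^"] cannot cross the closing quote, the match can never spill over into the alt attribute.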

The third capture group is quite simple. All you need to do is match all the characters between the / which was matched in the last capture group, and the . which will be matched in the next capture group. For this reason, we can just match everything. We can use .+ in this case. The . character will match any single character, and the + character tells it to match between 1 and unlimited times. It would look like this:

(.+)

And finally, the fourth capture group. This group needs to ensure the period is there, and match everything from it to the end of the element. We know the match begins with a ., and we know the element ends with a >. This means we can just match everything between . and >. We can again use .+ in this case. We do however want to prevent it from doing something funky if there are 2 elements on the same line, so we can add a ? right after the + to make it lazy instead of greedy, so it will match the first > rather than the last. The . character needs to be escaped with a \ in order to be treated literally; escaping the > is optional but harmless. This group would look like this:

(\..+?\>)
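The difference between lazy and greedy matching is easiest to see with two elements on one line. The following is a small JavaScript sketch; the two-tag test string is made up for illustration and is not from the question.

```javascript
// Two tags on one line, to compare lazy and greedy matching of the tail.
const line = '<img src="a/one.jpg" alt="x" /><img src="b/two.png" alt="y" />';

// Lazy: stops at the first ">" after the period.
const lazy = line.match(/(\..+?>)/)[1];

// Greedy: runs on to the last ">" on the line.
const greedy = line.match(/(\..+>)/)[1];

console.log(lazy);   // .jpg" alt="x" />
console.log(greedy); // .jpg" alt="x" /><img src="b/two.png" alt="y" />
```

With the greedy version, the second element would be swallowed into the first match and never get its own id.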

The final expression should look like this:

(<img )(src="[^"]*\/)(.+?)(\..+?\>)

Here is a sample on regex101

The replacement string will be the same as what you have, except you don't really need the first space before id because we already know there's a space at the end of group 1.

$1id="$3" $2$3$4
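To try the whole search and replace outside the editor, here is a minimal JavaScript sketch. It assumes JavaScript's String.prototype.replace, which happens to use the same $1 to $4 placeholders; the variable names are illustrative.

```javascript
// Final pattern from the answer; the "g" flag replaces every <img> found.
const pattern = /(<img )(src="[^"]*\/)(.+?)(\..+?>)/g;

const input = '<img src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />';
const output = input.replace(pattern, '$1id="$3" $2$3$4');

console.log(output);
// <img id="file-name" src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />
```

Two limits are worth knowing: a src with no / in it (for example src="file.jpg") does not match and is left unchanged, and because the third group is lazy, a name such as jquery.min.js yields the id jquery.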

CodePudding user response：

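The following will work for your case:

// Take the first <img> on the page and read its src attribute.
const img = document.getElementsByTagName('img')[0];
const src = img.getAttribute('src');
// Match from the last "/" up to the last ".", e.g. "/file-name."
const result = src.match(/(\x2F)(?!.*\1).+\./g);
const fileName = result.toString();
// Strip the leading "/" and trailing "." before setting the id.
img.setAttribute('id', fileName.substring(1, fileName.length - 1));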

<img src="../arbitrary/file/path/file-name.jpg" alt="arbitrary alt tag" />
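The DOM calls above only run in a browser, so the file-name extraction can be tried separately as a plain function. This is a sketch using the same regex idea; fileStem is a hypothetical helper name, not part of the answer, and it uses a capture group instead of substring.

```javascript
// Hypothetical helper: returns the file name without its extension,
// or null when the src has no "/" followed by a name and a ".".
function fileStem(src) {
  // Last "/" (one with no further "/" after it), then everything up to the last ".".
  const result = src.match(/\/(?!.*\/)(.+)\./);
  return result ? result[1] : null;
}

console.log(fileStem('../arbitrary/file/path/file-name.jpg')); // file-name
```

Returning null instead of throwing makes it easy to skip images whose src does not fit the expected shape.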