I have a data table called Characteristics with 561 records, that has a column called "FullDescription," each row of which is full of paragraphs featuring one or more URLs that were entered incorrectly. In each row for this column, I need to capture a very specific string, a 6-digit number that exists in each URL, and then replace each entire URL with a very specific URL that ends with that 6 digit numeric value. Each URL to be replaced has 42 characters, and the 6-digit code occupies locations 32 through 37 relative to the URL. Unfortunately, each result contains several URLs, and because it's in paragraph form, there's no way to tell the numeric position of each URL relative to the result it's in.
As an example, one result might have 3 URLs in it. This is not the specific data, but each row will have different words in it. Here is an example of what this query might return:
SELECT FullDesciption
FROM Characteristics
WHERE ID = 400
Here are some words and
http://replace.this.org/number/123456.html
and then they say some more words and you get another URLhttp://replace.this.org/number/234567.html
and then it keeps going and more URLs appearhttp://replace.this.org/number/999999.html
the end.
I need to find a way to capture each 6-digit number right before the .html for every URL in a result, replace each instance of a URL with a 34 character URL where the first 28 characters are all the same, and the 6-digit number will occupy the last 6 locations, 29 through 34. When all is said and done, after the change, the paragraph above should look like this:
Here are some words and
https://www.this.org/number/123456
and then they say some more words and you get another URLhttps://www.this.org/number/234567
and then it keeps going and more URLs appearhttps://www.this.org/number/999999
the end.
I've tried using wildcards that don't seem to work in the search function, but everything I've tried isn't working, and perhaps it required functions I haven't used yet.
Thank you!
CodePudding user response:
Ignoring the fact that doing this somewhere other than SQL Server using some regex-replace might be preferable, looking at the challenge of doing it in SQL Server the obvious approach is to break the source data into rows with one URL per row, perform the string replacements, and then re-aggregate.
Making use of OpenJson to split the string (it provides for dependable ordering) and String_Agg to reaggregate yields the following (assuming a fully supported version of SQL Server):
select String_Agg(NewDescription, '') within group (order by seq)
from t
cross apply (
select IsNull(
Replace(
Replace(
j.[value], Left(j.[value], NullIf(CharIndex('.', j.[value]),0) -1), 'https://www'
), '.html',''
), j.[value]
) NewDescription,
Convert(int, j.[Key]) Seq
from OpenJson(Concat('["',replace(String_Escape(FullDescription, 'json'), 'http://', '","'), '"]')) j
)u;
This will almost certainly require some work to deal with edge cases and other content in your source data, but works with what you have provided.