PDF URL from old site not redirecting to PDF location in new site - RegEx

Time:10-10

I am rebuilding an old .asp site in WordPress and am having a real problem with the URL structure of PDF files. I know this is probably long-winded; I am just trying to paint an accurate picture in hopes of finding a resolution.

Here is the background information.

A few sample URLs for a document on the old site:

https://oldsite.example/folder/sub-folder/sub-folder/mypdf.pdf
https://oldsite.example/folder/sub-folder/mypdf.pdf

The PDF docs from the old site have already been uploaded into WordPress with the folder structure of:

https://newsite.example/wp-content/uploads/2022/08/mypdf.pdf

OR

https://newsite.example/wp-content/uploads/2022/09/mypdf.pdf

...depending on the month it was uploaded

Here is the problem. Many of these PDFs have URLs embedded in the body of the PDF that use the structure of the old site.

So a link clicked within a PDF such as https://oldsite.example/folder/sub-folder/mypdf.pdf will result in a 404 because that folder structure no longer exists.

What I am trying to figure out is a RegEx that matches any PDF request using the old site's URL structure and redirects it to the PDF with the same file name in one of the new site's /wp-content/uploads/20xx/xx/ folders.

I am using the Redirection Plugin

Note: This uses PHP’s regular expressions (commonly known as PCRE) and may not be exactly the same as other regular expression libraries.

I can write a standard redirect that works fine:

Source URL: /folder/sub-folder/sub-folder/somepdf.pdf
Target URL: /wp-content/uploads/2022/07/somepdf.pdf

But writing 500 standard redirects is not practical

Where I am having issues is capturing any path of the old PDF URL and matching it to the correct folder in the new folder structure. (i.e. RegEx)

Any time I try a RegEx, I end up with either the source folder structure appended to the target URL structure (giving a 404)

Source URL: `^/([^\s/]+\.pdf)`
Target URL: `/wp-content/uploads/^/\d{4}/\d{2}/$1`
RegEx checked.

OR I end up with an endless redirect loop

Source URL: /(.*).pdf
Target URL: /wp-content/uploads/2022/07/$1.pdf
RegEx checked.
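The loop in the second attempt can be reproduced outside WordPress. The sketch below uses Python's `re` module as a stand-in for the plugin's PCRE matching (an assumption; the two engines treat a pattern this simple the same way), and `apply_rule` is a hypothetical helper, not part of the Redirection plugin. It shows that the target URL itself matches the source pattern, so each redirect qualifies for another redirect.

```python
import re

# Source pattern and target template from the second attempt above.
SOURCE = r"/(.*).pdf"
TARGET = "/wp-content/uploads/2022/07/{0}.pdf"


def apply_rule(path):
    """Hypothetical stand-in for one pass of the redirect rule.

    Returns the redirected path, or None when the rule does not match.
    """
    match = re.match(SOURCE, path)
    if match is None:
        return None
    return TARGET.format(match.group(1))


# Follow the rule three times, starting from an old-site URL path.
path = "/folder/sub-folder/mypdf.pdf"
hops = []
for _ in range(3):
    path = apply_rule(path)
    hops.append(path)

for hop in hops:
    print(hop)
```

Two things stand out. The capture group `(.*)` takes the whole old folder path, not just the file name, so the old folders end up inside the new URL. And because nothing excludes `/wp-content/`, the rule never stops matching. Anchoring the pattern and adding a negative lookahead such as `^/(?!wp-content/)` would stop the loop, but it still could not supply the correct year and month folder.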

Is RegEx even the way to go? Is an .htaccess redirect a better option? A PHP function maybe?

CodePudding user response:

Regular expression redirects only work when all the information about the new location can be derived from the URL of the old location. That isn't the case here. The new URLs contain new information: the year and month in which the document was uploaded, which the old URLs never had. There is no way to write a single regular expression rule that redirects all the URLs.

You have a couple of options:

  • Redirect each PDF URL individually
  • Upload your PDFs to your new site in their original locations
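The first option does not have to mean typing 500 rules by hand. As a rough sketch, the individual redirects could be generated by looking up each old file name among the uploaded files. Everything here is an assumption for illustration: `build_redirects` is a hypothetical helper, the path lists are sample data, and matching by file name only works if WordPress kept the original names on upload and each name is unique.

```python
import posixpath


def build_redirects(old_paths, uploaded_paths):
    """Pair each old URL path with the uploaded file of the same name.

    Returns (redirects, unresolved). redirects maps old path -> new path.
    unresolved lists old paths with no match or with several candidates,
    which need a manual decision.
    """
    # Index the uploaded files by lower-cased file name.
    by_name = {}
    for new_path in uploaded_paths:
        name = posixpath.basename(new_path).lower()
        by_name.setdefault(name, []).append(new_path)

    redirects, unresolved = {}, []
    for old_path in old_paths:
        candidates = by_name.get(posixpath.basename(old_path).lower(), [])
        if len(candidates) == 1:
            redirects[old_path] = candidates[0]
        else:
            unresolved.append(old_path)
    return redirects, unresolved


# Sample data only; real lists would come from the old site and the uploads folder.
old_paths = [
    "/folder/sub-folder/sub-folder/mypdf.pdf",
    "/folder/sub-folder/mypdf.pdf",
    "/folder/missing.pdf",
]
uploaded_paths = [
    "/wp-content/uploads/2022/08/mypdf.pdf",
    "/wp-content/uploads/2022/09/other.pdf",
]

redirects, unresolved = build_redirects(old_paths, uploaded_paths)
for source, target in redirects.items():
    print(f"{source},{target}")
print("needs manual review:", unresolved)
```

The printed source and target pairs could then be entered as plain (non-regex) redirects, or adapted to whatever bulk import format the redirect tool accepts.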

The second option is more feasible than you might think. You can upload files to a WordPress site over FTP at arbitrary paths. The WordPress front controller only handles URLs for which no file exists on the file system. If you put the PDFs in their original directories alongside the WordPress files, the old URLs will work in the majority of WordPress installations.
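If the PDFs are already on the server under the uploads folder, copying them into place is an alternative to re-uploading over FTP. The following is a minimal sketch under stated assumptions: `copy_to_original_paths` is a hypothetical helper, the mapping of old to new paths is sample data, and the demo runs against a temporary directory standing in for the WordPress root.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_original_paths(site_root, mapping):
    """Copy each uploaded PDF to the path it had on the old site.

    mapping: old URL path -> current URL path, both relative to site_root.
    Existing destination files are left untouched.
    Returns the list of files that were created.
    """
    root = Path(site_root)
    created = []
    for old_path, new_path in mapping.items():
        source = root / new_path.lstrip("/")
        dest = root / old_path.lstrip("/")
        if dest.exists() or not source.is_file():
            continue  # never overwrite, and skip missing sources
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, dest)
        created.append(dest)
    return created


# Demo in a throwaway directory that stands in for the WordPress root.
with tempfile.TemporaryDirectory() as tmp:
    upload = Path(tmp, "wp-content/uploads/2022/08/mypdf.pdf")
    upload.parent.mkdir(parents=True)
    upload.write_bytes(b"%PDF-1.4 placeholder")

    sample_mapping = {
        "/folder/sub-folder/mypdf.pdf": "/wp-content/uploads/2022/08/mypdf.pdf",
    }
    created = copy_to_original_paths(tmp, sample_mapping)
    restored_ok = Path(tmp, "folder/sub-folder/mypdf.pdf").is_file()
    print("created:", len(created), "restored:", restored_ok)
```

Copying rather than moving keeps the Media Library entries intact, at the cost of storing each file twice.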
