replacing every instance of a link with another link

Time:03-01

I am scraping and modifying content from a website. The website contains broken images that I need to fix. My JSON looks something like this:

[
  {
    "post_title": "post 1",
    "post_link": "link 1",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna <a href=\"somelink.com\">aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.<a href=\"url1.jpg\"><img src=\"brokenURL1.jpg\" alt=\"\"></a><a href=\"url2.jpg\"><img src=\"brokenURL2.jpg\" alt=\"\"></a><a href=\"url3.jpg\"><img src=\"brokenURL3.jpg\" alt=\"\"></a><a href=\"url4.jpg\"><img src=\"brokenURL4.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 2",
    "post_link": "link 2",
    "post_date": "@1550725200",
    "post_content": [
      "<p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, <a href=\"somelink.com\">similique</a> sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.<a href=\"url5.jpg\"><img src=\"brokenURL5.jpg\" alt=\"\"></a><a href=\"url6.jpg\"><img src=\"brokenURL6.jpg\" alt=\"\"></a><a href=\"url7.jpg\"><img src=\"brokenURL7.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 3",
    "post_link": "link 3",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. <a href=\"url8.jpg\"><img src=\"brokenURL8.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 4",
    "post_link": "link 4",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis <a href=\"somelink.com\">doloribus asperiores repellat</a>.<a href=\"url9.jpg\"><img src=\"brokenURL9.jpg\" alt=\"\"></a><a href=\"url10.jpg\"><img src=\"brokenURL10.jpg\" alt=\"\"></a><a href=\"url11.jpg\"><img src=\"brokenURL11.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  }
]

The correct image URLs are already present in the href of the a tag wrapping each image, so I want to replace every img src with the href of its enclosing a tag. The end result would look something like this:

[
  {
    "post_title": "post 1",
    "post_link": "link 1",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna <a href=\"somelink.com\">aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.<a href=\"url1.jpg\"><img src=\"url1.jpg\" alt=\"\"></a><a href=\"url2.jpg\"><img src=\"url2.jpg\" alt=\"\"></a><a href=\"url3.jpg\"><img src=\"url3.jpg\" alt=\"\"></a><a href=\"url4.jpg\"><img src=\"url4.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 2",
    "post_link": "link 2",
    "post_date": "@1550725200",
    "post_content": [
      "<p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, <a href=\"somelink.com\">similique</a> sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.<a href=\"url5.jpg\"><img src=\"url5.jpg\" alt=\"\"></a><a href=\"url6.jpg\"><img src=\"url6.jpg\" alt=\"\"></a><a href=\"url7.jpg\"><img src=\"url7.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 3",
    "post_link": "link 3",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. <a href=\"url8.jpg\"><img src=\"url8.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 4",
    "post_link": "link 4",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis <a href=\"somelink.com\">doloribus asperiores repellat</a>.<a href=\"url9.jpg\"><img src=\"url9.jpg\" alt=\"\"></a><a href=\"url10.jpg\"><img src=\"url10.jpg\" alt=\"\"></a><a href=\"url11.jpg\"><img src=\"url11.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  }
]

Some posts (1, 2 and 4) also contain ordinary links that are not associated with images, and those should stay unchanged. Is there any way to do this with JavaScript?

Thanks

CodePudding user response:

Consider using the DOMParser utility to create a document from your HTML string. You can then use the usual DOM methods (notably querySelectorAll) to find the relevant elements and perform the replacement. The selector "a > img" only matches images that are direct children of a link, so plain text links are left untouched.

For example, something like:

const parser = new DOMParser();
// html_string_here is a placeholder for one of your post_content strings
const doc = parser.parseFromString(html_string_here, "text/html");
doc.querySelectorAll("a > img").forEach(img => img.setAttribute("src", img.parentElement.getAttribute("href")));

Using your example data, you might write it like so:

const example_data = [{
    "post_title": "post 1",
    "post_link": "link 1",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna <a href=\"somelink.com\">aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.<a href=\"url1.jpg\"><img src=\"brokenURL1.jpg\" alt=\"\"></a><a href=\"url2.jpg\"><img src=\"brokenURL2.jpg\" alt=\"\"></a><a href=\"url3.jpg\"><img src=\"brokenURL3.jpg\" alt=\"\"></a><a href=\"url4.jpg\"><img src=\"brokenURL4.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
  {
    "post_title": "post 2",
    "post_link": "link 2",
    "post_date": "@1550725200",
    "post_content": [
      "<p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, <a href=\"somelink.com\">similique</a> sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.<a href=\"url5.jpg\"><img src=\"brokenURL5.jpg\" alt=\"\"></a><a href=\"url6.jpg\"><img src=\"brokenURL6.jpg\" alt=\"\"></a><a href=\"url7.jpg\"><img src=\"brokenURL7.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
  {
    "post_title": "post 3",
    "post_link": "link 3",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. <a href=\"url8.jpg\"><img src=\"brokenURL8.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
  {
    "post_title": "post 4",
    "post_link": "link 4",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis <a href=\"somelink.com\">doloribus asperiores repellat</a>.<a href=\"url9.jpg\"><img src=\"brokenURL9.jpg\" alt=\"\"></a><a href=\"url10.jpg\"><img src=\"brokenURL10.jpg\" alt=\"\"></a><a href=\"url11.jpg\"><img src=\"brokenURL11.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  }
]

const parser = new DOMParser();

const fixed_links_data = example_data.map(item => {
  return {
    ...item,
    post_content: item.post_content.map(html => {
      const doc = parser.parseFromString(html, "text/html");
      doc.querySelectorAll("a > img").forEach(img => img.setAttribute("src", img.parentElement.getAttribute("href")));
      return doc.body.innerHTML;
    }),
  }
});

console.log("Final Result:", fixed_links_data)
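If the scraped data starts out as JSON text rather than a JavaScript array, the same mapping can be wrapped in a parse/serialize round trip. This is only a minimal sketch: fixPostsJson and its fixHtml parameter are hypothetical names introduced here for illustration, and fixHtml stands for whatever per-string fix you choose (for instance the DOMParser logic above).

```javascript
// Hypothetical helper (not part of any library): applies fixHtml to every
// post_content string in a JSON document and returns the result as JSON text.
// Assumption: fixHtml takes one HTML string and returns the corrected string.
function fixPostsJson(jsonText, fixHtml) {
  const posts = JSON.parse(jsonText);
  const fixed = posts.map(post => ({
    ...post, // keep post_title, post_link, custom, etc. unchanged
    post_content: post.post_content.map(html => fixHtml(html)),
  }));
  return JSON.stringify(fixed, null, 2);
}
```

Keeping the HTML fix as a parameter means the same wrapper works whether the fix uses DOMParser in the browser or another parser elsewhere.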

CodePudding user response:

var arr = [
  {
    "post_title": "post 1",
    "post_link": "link 1",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna <a href=\"somelink.com\">aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.<a href=\"url1.jpg\"><img src=\"brokenURL1.jpg\" alt=\"\"></a><a href=\"url2.jpg\"><img src=\"brokenURL2.jpg\" alt=\"\"></a><a href=\"url3.jpg\"><img src=\"brokenURL3.jpg\" alt=\"\"></a><a href=\"url4.jpg\"><img src=\"brokenURL4.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 2",
    "post_link": "link 2",
    "post_date": "@1550725200",
    "post_content": [
      "<p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, <a href=\"somelink.com\">similique</a> sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.<a href=\"url5.jpg\"><img src=\"brokenURL5.jpg\" alt=\"\"></a><a href=\"url6.jpg\"><img src=\"brokenURL6.jpg\" alt=\"\"></a><a href=\"url7.jpg\"><img src=\"brokenURL7.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 3",
    "post_link": "link 3",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. <a href=\"url8.jpg\"><img src=\"brokenURL8.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  },
{
    "post_title": "post 4",
    "post_link": "link 4",
    "post_date": "@1550725200",
    "post_content": [
      "<p>Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis <a href=\"somelink.com\">doloribus asperiores repellat</a>.<a href=\"url9.jpg\"><img src=\"brokenURL9.jpg\" alt=\"\"></a><a href=\"url10.jpg\"><img src=\"brokenURL10.jpg\" alt=\"\"></a><a href=\"url11.jpg\"><img src=\"brokenURL11.jpg\" alt=\"\"></a></p>"
    ],
    "custom": {
      "image": "thumbnail.jpg"
    }
  }
];

// Assumes brokenURL<N>.jpg should become url<N>.jpg; adjust the pattern if your names differ.
arr.forEach(function(item) {
    item.post_content[0] = item.post_content[0].replace(/(<img src=")[^"\d]*(\d+\.jpg)/g, '$1url$2');
});
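A regex can also copy the href itself instead of relying on the file naming pattern. The following is a rough sketch under some assumptions: copyHrefToImgSrc is a hypothetical name, attributes use double quotes, and each img immediately follows its opening a tag, as in the sample data. For anything less regular, an HTML parser is the safer choice.

```javascript
// Hypothetical helper: for every <a ... href="X" ...> directly followed by
// <img ... src="Y" ...>, replace Y with X. Links not followed by an image
// are not matched and stay as they are.
function copyHrefToImgSrc(html) {
  return html.replace(
    /(<a\s[^>]*?href=")([^"]*)("[^>]*>\s*<img\s[^>]*?src=")[^"]*(")/g,
    '$1$2$3$2$4' // $2 (the href value) is written a second time as the src value
  );
}

const sample = '<a href="somelink.com">text</a><a href="url1.jpg"><img src="brokenURL1.jpg" alt=""></a>';
console.log(copyHrefToImgSrc(sample));
// <a href="somelink.com">text</a><a href="url1.jpg"><img src="url1.jpg" alt=""></a>
```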

Alternatively, use cheerio (https://www.npmjs.com/package/cheerio) to parse and modify the HTML.
