Bash: Extract URL from markdown format

Time:08-02

I have a set of Markdown posts for a Jekyll site, each of which contains a Markdown link. For example:

---
layout: post
title: "The Title"
date: 2022-07-31
categories:
- CategoryX
- CategoryY
author: AuthorName, SecondAuthor
tags: [tag1,tag2,tag3]
---

Some text that might contain (brackets] or other symbols.

[Visit Link](https://www.linkhere.net/somepage){:target="_blank" rel="noopener"}

I'd like to extract just the full URL from each file in the _posts directory and write them all to a new file.

This is my script so far, with my failed attempts left in as comments:

#!/bin/bash

# configuration
jekyll_post_dir="<jekyll_dir>/_posts"


for file in $jekyll_post_dir/*
do
    #link=$(sed -n -e '/[Visit Link]/,/{:target/p' $file)

    #link=$(sed -n '/[Visit Link]/,/target/{ /html>/d; p }' $file)

    #link=$(awk '/[Visit Link]/,/target/' $file)

    #link=$(sed -n 's/[^{]*\({[^}]*}\).*/\1/g' $file)

    #link=$(sed 's/.*Link](\(.*\))/\1/' $file)

    #link=$(awk -F"[()]" '{print $2}' $file )

    #while IFS="](){" read a b; do echo "$b"; done < $file

    #link=$(sed -n '/\](/,/)\{:/p' $file)

    #echo $link >> linklist.txt

done

All my attempts have either selected unwanted text or failed completely. I am not familiar with regex or similar definitions so I would appreciate some guidance. I'm happy to use any bash-supported solution.

Thanks for reading/helping...

CodePudding user response:

The command below extracts the expected URL:

sed -nre '/:target=/ s/.*[]][(]([^)]+)[)][{]:target=.*/\1/p' test.txt

Result

https://www.linkhere.net/somepage
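
To show how the pattern breaks down, here is a small sketch that runs the same expression against the sample line from the question. The variable names `line` and `url` are only for illustration, and it uses `-E`, which both GNU and BSD sed accept in place of `-r`.

```shell
# Sample link line copied from the question.
line='[Visit Link](https://www.linkhere.net/somepage){:target="_blank" rel="noopener"}'

# /:target=/          only act on lines that contain ":target="
# .*[]][(]            skip everything up to the literal "]("
# ([^)]+)             capture one or more characters that are not ")"
# [)][{]:target=.*    require "){:target=" right after the URL
# \1 ... p            replace the whole line with the capture and print it
url=$(printf '%s\n' "$line" |
      sed -nE '/:target=/ s/.*[]][(]([^)]+)[)][{]:target=.*/\1/p')

echo "$url"   # prints https://www.linkhere.net/somepage
```

Putting each special character in its own bracket expression (`[]]`, `[(]`, `[{]`) makes it literal without backslashes, which is why the attempts using `/[Visit Link]/` did not work: there the brackets define a set of characters rather than matching the text "[Visit Link]".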

Alternative command, escaping the special characters with backslashes instead of bracket expressions:

sed -nre '/:target=/ s/.*\]\(([^)]+)\)\{:target=.*/\1/p' test.txt
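
To apply this to every post, the loop from the question could be filled in roughly as follows. This is a sketch, not a tested drop-in: the function name `extract_links` and its two arguments are made up for illustration, and it assumes each link line has the `](URL){:target=` shape shown in the question.

```shell
#!/bin/bash

# extract_links POST_DIR OUT_FILE  (hypothetical helper)
# Writes every URL found in a "](URL){:target=" link, one per line,
# from each regular file in POST_DIR into OUT_FILE.
extract_links() {
    local post_dir=$1 out_file=$2 file

    : > "$out_file"                      # create or empty the output file

    for file in "$post_dir"/*; do
        [ -f "$file" ] || continue       # skip directories and an unmatched glob
        # -E is used here in place of -r; GNU and BSD sed both accept it.
        sed -nE '/:target=/ s/.*[]][(]([^)]+)[)][{]:target=.*/\1/p' "$file" >> "$out_file"
    done
}
```

It would be called as `extract_links "<jekyll_dir>/_posts" linklist.txt`. Quoting `"$post_dir"` and `"$file"` keeps the loop working when paths contain spaces, and appending sed's output directly avoids the intermediate `$link` variable.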
