Split Markdown text file by regular expression that defines headings

Time:09-17

I am trying to use a command-line program to split a large text file into chunks, with:

  • split on defined regex pattern
  • filenames defined by a capturing group in that regex pattern

The text file is of the format:

# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here

Output should be these two files with the following filenames and contents:

2020-01-01.md ↓

# 2020-01-01

Multi-line content
goes here

2020-01-02.md ↓

# 2020-01-02

Other multi-line content
goes here

I can't seem to get all the criteria right.

The regex pattern to split on (separator) is simple enough, something along the lines of ^# (2020-.*)$

Either I can't set up a multi-line regex pattern that spans \n newlines and stops at the next occurrence of the separator pattern, or I can split with csplit on the regex pattern but can't name the files after what is captured in (2020-.*).

The same goes for awk's split() or match(): I can't get either to work entirely.

I'm looking for a general solution whose parameters are the regex patterns that define where a chunk begins (e.g. # 2020-01-01) and where it ends (e.g. the next date heading, # 2020-01-02, or EOF).

CodePudding user response:

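This Perl one-liner slurps the whole file (-0777) and matches each dated heading together with everything up to the next dated heading or the end of the file. The heading text is captured in $1 and used as the file name; the "# " marker is written back in front of it so that each chunk keeps its heading:

perl -0777 -nE 'while (/^\h*#\h*(2020.*)([\s\S]*?(?:(?=(^\h*#\h*2020.*))|\z))/gm) {
    open(my $fh, ">", "$1.md") or die $!;
    print $fh "# $1$2";
    close $fh;
}' file 

Result:

head 2020*
==> 2020-01-01.md <==
# 2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
# 2020-01-02

Other multi-line content
goes here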

CodePudding user response:

Using any awk in any shell on every Unix box:

$ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file

$ head *.md
==> 2020-01-01.md <==
# 2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
# 2020-01-02

Other multi-line content
goes here

If /^# [0-9]/ isn't an adequate regexp, change it to whatever you like; e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive. FWIW, though, I wouldn't have used a regexp at all for this if you hadn't asked for one. I'd have used the following, where c++ is 0 (false) on the first heading and so skips # Title:

awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file
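
Since the question asks for the heading pattern to be a parameter, here is a minimal sketch of the same awk approach wrapped in a shell function. The function name split_md, the sample file notes.md and the temporary directory are made up for illustration and are not part of either answer. It assumes, as above, that the file name is the second whitespace-separated field of the heading line and that a chunk ends at the next matching heading or at EOF.

```shell
#!/bin/sh
# Hypothetical helper (not from the answers above): split_md FILE REGEX
#   FILE  = Markdown file to split
#   REGEX = ERE matching the heading lines that start a new chunk
split_md() {
    awk -v re="$2" '
        # A matching heading closes the previous chunk and opens a new one,
        # named after the second field of the heading line.
        $0 ~ re   { if (out != "") close(out); out = $2 ".md" }
        # Lines before the first matching heading are skipped (out is empty).
        out != "" { print > out }
    ' "$1"
}

# Demo on the sample from the question, run in a throwaway directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '# Title' '' \
    '# 2020-01-01' '' 'Multi-line content' 'goes here' '' \
    '# 2020-01-02' '' 'Other multi-line content' 'goes here' > notes.md

split_md notes.md '^# [0-9][0-9][0-9][0-9]-'
ls
```

One caveat: awk processes backslash escapes in values passed with -v, so a pattern containing backslashes may not arrive as typed. Bracket expressions such as [0-9] avoid that problem.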