I am trying to use a command-line program to split a larger text file into chunks, with:
- the split points defined by a regex pattern
- the filenames defined by a capturing group in that regex pattern
The text file is of the format:
# Title
# 2020-01-01
Multi-line content
goes here
# 2020-01-02
Other multi-line content
goes here
Output should be these two files with the following filenames and contents:
2020-01-01.md ↓
# 2020-01-01
Multi-line content
goes here
2020-01-02.md ↓
# 2020-01-02
Other multi-line content
goes here
I can't seem to get all the criteria right.
The regex pattern to split on (the separator) is simple enough, something along the lines of ^# (2020-.*)$
Either I can't set up a multi-line regex pattern that spans \n newlines and stops at the next occurrence of the separator pattern,
or I can split with csplit on the regex pattern, but I can't name the files with what is captured in (2020-.*)
The same goes for awk split() or match(): I can't get it to work entirely.
I'm looking for a general solution, parameterized by the regex patterns that define the chunk beginnings (e.g. # 2020-01-01) and endings (e.g. the next date heading # 2020-01-02, or EOF).
CodePudding user response:
Using this regex, here is a Perl one-liner to do that:
perl -0777 -nE 'while (/^\h*#\h*(2020.*)([\s\S]*?(?:(?=(^\h*#\h*2020.*))|\z))/gm) {
open(my $fh, ">", $1.".md") or die $!;
print $fh "# $1";
print $fh $2;
close $fh;
}' file
result:
head 2020*
==> 2020-01-01.md <==
# 2020-01-01
Multi-line content
goes here
==> 2020-01-02.md <==
# 2020-01-02
Other multi-line content
goes here
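For comparison, the csplit route mentioned in the question can also produce these files if the pieces are renamed afterwards from their own first line. The following is only a sketch: it assumes GNU csplit (the -z option and the '{*}' repeat count are GNU extensions), and the names input.txt and chunk- are placeholders chosen for illustration.

```shell
# Recreate the sample input from the question in a scratch directory.
cd "$(mktemp -d)" || exit 1
cat > input.txt <<'EOF'
# Title
# 2020-01-01
Multi-line content
goes here
# 2020-01-02
Other multi-line content
goes here
EOF

# Split before every date heading. Pieces are named chunk-00, chunk-01, ...
# -s: do not print byte counts, -z: drop empty pieces, -f: piece name prefix.
csplit -s -z -f chunk- input.txt '/^# 2020-/' '{*}'

# Rename each piece after its heading line; discard pieces that do not
# start with a date heading (here, the one holding "# Title").
for f in chunk-*; do
  [ -e "$f" ] || continue
  IFS= read -r first < "$f"
  case $first in
    "# 2020-"*) mv -- "$f" "${first#"# "}.md" ;;
    *)          rm -- "$f" ;;
  esac
done
```

The rename loop is what stands in for the missing capture group: csplit only decides where to cut, and the shell derives the file name from the heading afterwards.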
CodePudding user response:
Using any awk in any shell on every Unix box:
$ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file
$ head *.md
==> 2020-01-01.md <==
# 2020-01-01
Multi-line content
goes here
==> 2020-01-02.md <==
# 2020-01-02
Other multi-line content
goes here
If /^# [0-9]/ isn't an adequate regexp then change it to whatever you like, e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive. FWIW though, I wouldn't have used a regexp at all for this if you hadn't asked for one. I'd have used:
awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file
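Since the question asks for something general, the same idea can be parameterized by passing the patterns in with -v. This is a sketch rather than a definitive answer: the variable names start and strip and the file name input.txt are made up for illustration, and a chunk is assumed to end at the next start line or at EOF. The date pattern is spelled out without {n} interval expressions because some older awk builds do not support them.

```shell
# Recreate the sample input from the question in a scratch directory.
cd "$(mktemp -d)" || exit 1
cat > input.txt <<'EOF'
# Title
# 2020-01-01
Multi-line content
goes here
# 2020-01-02
Other multi-line content
goes here
EOF

# start: regex that marks the first line of a chunk.
# strip: regex removed from that line; what remains becomes the file name.
awk -v start='^# [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$' -v strip='^# ' '
  $0 ~ start {
    if (out != "") close(out)   # finish the previous chunk
    out = $0
    sub(strip, "", out)         # "# 2020-01-01" becomes "2020-01-01"
    out = out ".md"
  }
  out != "" { print > out }     # lines before the first heading are skipped
' input.txt
```

Because "# Title" does not match start, nothing is written until the first date heading, so no counter is needed to skip it.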