I have a number of .txt files arbitrarily named A001.txt, A002.txt, etc.
Files have the following structure:
<sps id="303544" url="https://.xyz.edu/=303544" title="Lawrence Bragg"></sps>
Lawrence Bragg
Sir William Lawrence Bragg, (31 March 1890 – 1 July 1971) was an Australian-born British physicist and X-ray crystallographer, discoverer (1912) of Bragg's law of X-ray diffraction, which is basic for the determination of crystal structure.
He was joint recipient (with his father, William Henry Bragg) of the Nobel Prize in Physics in 1915, "For their services in the analysis of crystal structure by means of X-rays";
I am trying to rename each file based on the value of the title attribute. In the example above, I want to rename the file to Lawrence Bragg.txt
I do:
find . -maxdepth 1 -name '*.txt' -exec ~/scr/rename.sh {} \;
Where rename.sh is:
#!/bin/bash
title=$(xmllint --xpath '//sps/@title' "$1" | sed -r 's/[^"]*"([^"]*).*/\1/')
mv -v "$1" "$title.txt"
The rename works only if the file has solely the first line, i.e., the file starts and ends with the <sps> tag. If there are additional lines, it does not work, of course.
How do I run this script solely for the first line of each *.txt file, i.e., ignore all the lines after the first one?
I've tried head -1 but can't seem to figure it out. Or should I modify the sed command instead?
CodePudding user response:
Parsing HTML with sed is easy; parsing HTML with sed in a foolproof way is difficult. That said, I suggest:
sed -n '1{s/.*title="\(.*\)".*/\1/;p;}'
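As a minimal sketch of how a first-line-only sed could be dropped into rename.sh: the helper name title_of is hypothetical, and the pattern is tightened to [^"]* on the assumption that titles never contain a double quote. The second expression, 1q, stops sed after line 1, so the rest of the file is never read.

```shell
#!/bin/bash
# Sketch of a rename.sh that looks only at line 1 of the file passed as $1.
# title_of is a hypothetical helper name, not part of the original script.
title_of() {
    # Line 1: substitute and print only if the substitution matched; then quit.
    sed -n -e '1s/.*title="\([^"]*\)".*/\1/p' -e '1q' "$1"
}

# Run only when a file argument is given (find -exec passes one per call).
if [ "$#" -gt 0 ]; then
    title=$(title_of "$1")
    # Leave the file alone if no title attribute was found on line 1.
    if [ -n "$title" ]; then
        mv -v -- "$1" "$title.txt"
    fi
fi
```

It would be called the same way as before, e.g. find . -maxdepth 1 -name '*.txt' -exec ~/scr/rename.sh {} \;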
CodePudding user response:
Using sed
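#!/usr/bin/env bash
# Dry run: print the mv command for each file instead of running it.
# The $(find ...) loop splits on whitespace, so input names must not contain spaces.
for filename in $(find . -name '*.txt' | sed 's|\./||'); do
    sed -n "1s/.*title=\"\([^\"]*\).*/mv '$filename' '\1.txt'/p" < "$filename"
done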
This dry run should give output like
mv 'ABC.txt' 'Lawrence Bragg.txt'
If the output looks as expected, add the e flag (a GNU sed extension) to the substitution so that each generated command is also executed:
#!/usr/bin/env bash
# GNU sed only: with the flags "pe", each mv command is printed and then executed.
for filename in $(find . -name '*.txt' | sed 's|\./||'); do
    sed -n "1s/.*title=\"\([^\"]*\).*/mv '$filename' '\1.txt'/pe" < "$filename"
done
$ cat Lawrence\ Bragg.txt
<sps id="303544" url="https://.xyz.edu/=303544" title="Lawrence Bragg"></sps>
Lawrence Bragg
Sir William Lawrence Bragg, (31 March 1890 – 1 July 1971) was an Australian-born British physicist and X-ray crystallographer, discoverer (1912) of Bragg's law of X-ray diffraction, which is basic for the determination of crystal structure.
He was joint recipient (with his father, William Henry Bragg) of the Nobel Prize in Physics in 1915, "For their services in the analysis of crystal structure by means of X-rays";
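The loops above depend on word splitting of the find output, so they assume file names without whitespace. A possible alternative, sketched here in plain bash without sed, reads just the first line of each file and matches it with a regular expression. The function name rename_by_title is hypothetical, and using mv -n (never overwrite an existing file) and skipping titles that contain a slash are assumptions, not something the question asks for.

```shell
#!/usr/bin/env bash
# Sketch: rename each *.txt in a directory after the title attribute on its first line.
# rename_by_title is a hypothetical helper name; mv -n (never overwrite) is an assumption.
rename_by_title() {
    local dir=$1 file first title
    local re='title="([^"]*)"'   # captures the text between title=" and the next quote
    # The glob is expanded once, before any rename, and keeps names with spaces intact.
    for file in "$dir"/*.txt; do
        [ -f "$file" ] || continue
        # Read only the first line; later lines are never examined.
        IFS= read -r first < "$file" || [ -n "$first" ] || continue
        [[ $first =~ $re ]] || continue
        title=${BASH_REMATCH[1]}
        # Skip empty titles and titles containing a slash, which cannot be file names.
        [[ -n $title && $title != */* ]] || continue
        mv -n -- "$file" "$dir/$title.txt"
    done
}

# Example call (uncomment to rename files in the current directory):
# rename_by_title .
```

Files whose first line has no title attribute are left untouched, so the function can be run on a directory with mixed content.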