Extracting text from a txt file

I have a txt file with records on it. The records follow this pattern:

six lines, a blank line, six lines, and so on, as in this example:

string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6 (can use several lines)
<blank space> (always a blank space when a new txt block begins)
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6

Here is a real example. I need the title (line 2) and the year (line 5):

Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X

I want to extract the string in line 2 and the year in line 5 from all blocks of text (separated by blank lines) and save them to another txt file in this format:

string line2 , yyyy

I don't have experience with the Linux shell, so I am asking here for some input to help me do this task.

Thanks

CodePudding user response：

If you don't care about the trailing commas on lines 2 and 5, just do:

 awk '{print $2, $5}' RS= FS='\n' input > output

This assumes that the blank line separating the records is indeed completely blank and does not contain any whitespace. If there is any whitespace in that line, you'll want to pre-filter the data to remove it.
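
If the separator lines do contain spaces or tabs, one possible pre-filter is sketched here. It assumes a POSIX `sed`. The file names `sample_ws.txt` and `output_ws.txt` and the sample records are made up for illustration. The `-v` form of the assignments is used only because the input arrives on a pipe.

```shell
# Hypothetical sample: the separator line holds a space and a tab instead of being empty.
printf 'Author A,\nTitle one,\nJournal A,\nVolume 1,\n2019,\nPages 1-2\n \t\nAuthor B,\nTitle two,\nJournal B,\nVolume 2,\n2021,\nPages 3-4\n' > sample_ws.txt

# Empty out whitespace-only lines, then read blank-line-separated records as before.
sed 's/^[[:space:]]*$//' sample_ws.txt |
    awk -v RS= -v FS='\n' '{print $2, $5}' > output_ws.txt

cat output_ws.txt
```

The `sed` step only touches lines made up entirely of whitespace, so the record lines themselves pass through unchanged.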

For example, with the data from the question:

$ cat input
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089
$ awk '{print $2, $5}' RS= FS='\n' input
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map, 2019,
EOCD: An ensemble optimization approach for concept drift applications, 2021,
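
If the trailing commas do matter, a small variation can strip them and print the `string line2 , yyyy` layout asked for in the question. This is only a sketch: the file names `records.txt` and `titles.txt` are placeholders, and it assumes the title and year are always lines 2 and 5 of each block.

```shell
# Sample input: the two records quoted in the question.
cat > records.txt <<'EOF'
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
EOF

# Paragraph mode (empty RS) with one field per line, as in the command above.
awk -v RS= -v FS='\n' '{
    title = $2
    year = $5
    sub(/,[ \t\r]*$/, "", title)   # remove the trailing comma from the title line
    sub(/,[ \t\r]*$/, "", year)    # remove the trailing comma from the year line
    print title " , " year
}' records.txt > titles.txt

cat titles.txt
```

Only a comma at the very end of the line is removed, so commas inside a title are kept.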

CodePudding user response：

Something like:

perl -00 -nE 'my @ln = (split /,\n/)[1,4]; say join(",", @ln)'  input.txt > output.txt

should work, at least as a starting point. It reads a paragraph at a time, splits it at each comma-plus-newline line ending, and prints the two pieces you're looking for (the second and fifth) on one line, separated by a comma.