I am trying to read a PDF file and extract its first line of text. The file is named, for example, "myfile.pdf", and I want to extract the title from its contents and rename the file after it, like this: "Test-title.pdf".
The pdf that I am working with has a structure like this:
- An image at the beginning of the file
- The title that I am trying to extract
- The author name
- The article content
Now, when I run the following command:
file -bi myfile.pdf
It outputs the following:
application/pdf; charset=binary
I have tried this command to get the first line of text:
head -1 myfile.pdf
It outputs this:
%PDF-1.7
It does not return the title I am looking for! It returns literally the first line of the file, which is the PDF header.
I am stuck trying to skip the file header information and the image data in order to get to the title.
When I run the "cat" command:
cat myfile.pdf
The raw output would be:
%PDF-1.7
%����
22 0 obj
<</Linearized 1/L 159974/O 24/E 147279/N 3/T 159414/H [ 1376 364]>>
endobj
xref
22 54
0000000016 00000 n
0000001740 00000 n
0000001867 00000 n
0000002934 00000 n
0000002959 00000 n
0000003096 00000 n
0000003232 00000 n
0000003367 00000 n
0000003504 00000 n
0000003539 00000 n
0000003652 00000 n
0000003677 00000 n
0000003985 00000 n
0000004010 00000 n
0000004450 00000 n
0000005991 00000 n
0000006132 00000 n
0000006243 00000 n
0000006268 00000 n
0000006910 00000 n
0000007300 00000 n
0000038954 00000 n
0000041603 00000 n
0000041672 00000 n
0000041756 00000 n
0000045070 00000 n
0000045346 00000 n
0000045520 00000 n
0000045589 00000 n
0000045702 00000 n
0000059118 00000 n
0000059404 00000 n
0000060098 00000 n
0000060167 00000 n
0000060275 00000 n
0000067014 00000 n
0000067284 00000 n
0000067663 00000 n
0000082615 00000 n
0000082684 00000 n
0000082788 00000 n
0000088251 00000 n
0000088530 00000 n
0000088858 00000 n
0000088883 00000 n
0000089295 00000 n
0000094900 00000 n
0000095173 00000 n
0000095536 00000 n
0000095701 00000 n
0000097340 00000 n
0000147155 00000 n
0000147222 00000 n
0000001376 00000 n
trailer
<</Size 76/Root 23 0 R/Info 21 0 R/ID[<DCD3FF39B7B75344A3163B8206E477A4><A6B399FFB4F52F46B26C3AEC47243E5D>]/Prev 159403>>
startxref
0
%%EOF
75 0 obj
<</Filter/FlateDecode/I 370/Length 270/O 354/S 171/T 308>>stream
h�b```c``�"��21 �P�����cC������
8�Dq����EG<ME#�$3��V�P�2l�hr�e��q�:=q�����$�40
�Tt@��l��e�s��/SD1�6bS��$�
CodePudding user response:
Using pdftotext:
To get the first line of the pdf:
pdftotext /path/to/myfile.pdf - | head -n 1
(YMMV with image-based PDF files: scanned pages have no text layer for pdftotext to extract.)
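To go from the extracted line to the actual rename, something along these lines might work. It is only a sketch: the function names (title_to_filename, first_text_line, rename_pdf_by_title) are made up for illustration, it assumes pdftotext from poppler-utils is installed, and it assumes the title is the first non-blank line of text on page 1. The filename cleanup only keeps ASCII letters, digits, dots and underscores, so titles with accented or non-Latin characters would need different handling.

```shell
#!/bin/sh
# Hypothetical helper: turn a title into a filename-safe string.
# Every run of characters other than ASCII letters, digits, '.' and '_'
# becomes a single '-', then leading and trailing '-' are trimmed.
title_to_filename() {
    printf '%s' "$1" | tr -cs 'A-Za-z0-9._' '-' | sed 's/^-*//; s/-*$//'
}

# Hypothetical helper: print the first non-blank text line of page 1.
# Assumes pdftotext (poppler-utils) is on PATH; '-l 1' stops after page 1
# and '-' sends the extracted text to stdout.
first_text_line() {
    pdftotext -l 1 "$1" - | awk 'NF { print; exit }'
}

# Hypothetical helper: rename a PDF after its extracted title,
# keeping it in the same directory.
rename_pdf_by_title() {
    pdf=$1
    title=$(first_text_line "$pdf")
    if [ -z "$title" ]; then
        echo "no text found in $pdf" >&2
        return 1
    fi
    new="$(dirname "$pdf")/$(title_to_filename "$title").pdf"
    # -i asks before overwriting an existing file with the same name
    mv -i -- "$pdf" "$new"
}
```

Once the functions are loaded into your shell, `rename_pdf_by_title myfile.pdf` would rename the file to something like "Test-title.pdf" if the first text line is "Test title". Skipping blank lines with awk instead of using `head -n 1` guards against the extracted text starting with an empty line where the image is.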