Linux bash how to read a pdf and get the first string of content

Time:11-12

I am trying to read a PDF file and extract its first line of text. The file is named, for example, "myfile.pdf", and I want to extract the title from its content and rename the file after it, e.g. "Test-title.pdf".

The pdf that I am working with has a structure like this:

  1. An image at the beginning of the file
  2. The title that I am trying to extract
  3. The author name
  4. The article content


Now, when I run the following command:

file -bi myfile.pdf

It outputs the following:

application/pdf; charset=binary

I have tried this command to get the first line of string:

head -1 myfile.pdf

It outputs this:

%PDF-1.7

It does not return the title I am looking for. It literally returns the first line of the file, which is the PDF header.

I am stuck trying to skip the file header information and the image data in order to get to the title.

When I run the "cat" command:

cat myfile.pdf

The raw output would be:

%PDF-1.7
%����
22 0 obj
<</Linearized 1/L 159974/O 24/E 147279/N 3/T 159414/H [ 1376 364]>>
endobj
              
xref
22 54
0000000016 00000 n
0000001740 00000 n
0000001867 00000 n
0000002934 00000 n
0000002959 00000 n
0000003096 00000 n
0000003232 00000 n
0000003367 00000 n
0000003504 00000 n
0000003539 00000 n
0000003652 00000 n
0000003677 00000 n
0000003985 00000 n
0000004010 00000 n
0000004450 00000 n
0000005991 00000 n
0000006132 00000 n
0000006243 00000 n
0000006268 00000 n
0000006910 00000 n
0000007300 00000 n
0000038954 00000 n
0000041603 00000 n
0000041672 00000 n
0000041756 00000 n
0000045070 00000 n
0000045346 00000 n
0000045520 00000 n
0000045589 00000 n
0000045702 00000 n
0000059118 00000 n
0000059404 00000 n
0000060098 00000 n
0000060167 00000 n
0000060275 00000 n
0000067014 00000 n
0000067284 00000 n
0000067663 00000 n
0000082615 00000 n
0000082684 00000 n
0000082788 00000 n
0000088251 00000 n
0000088530 00000 n
0000088858 00000 n
0000088883 00000 n
0000089295 00000 n
0000094900 00000 n
0000095173 00000 n
0000095536 00000 n
0000095701 00000 n
0000097340 00000 n
0000147155 00000 n
0000147222 00000 n
0000001376 00000 n
trailer
<</Size 76/Root 23 0 R/Info 21 0 R/ID[<DCD3FF39B7B75344A3163B8206E477A4><A6B399FFB4F52F46B26C3AEC47243E5D>]/Prev 159403>>
startxref
0
%%EOF
            
75 0 obj
<</Filter/FlateDecode/I 370/Length 270/O 354/S 171/T 308>>stream
[compressed binary stream data]

CodePudding user response:

Using pdftotext (from poppler-utils):

To get the first line of the pdf:

pdftotext /path/to/myfile.pdf - | head -n 1

(YMMV with image-based PDF files: pdftotext does no OCR, so scanned pages without a text layer produce no text.)
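
To go from that first line to the rename the question asks for, something like the following might work. It is only a sketch: it assumes pdftotext is installed, that the title is the first non-blank line of the extracted text, and that replacing spaces with hyphens is the naming scheme you want. The function names (first_nonblank_line, sanitize_title, rename_pdf_by_title) are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: rename a PDF after the first non-blank line of its extracted text.
# Assumes pdftotext (poppler-utils) is installed. All function names here
# are hypothetical helpers, not part of any existing tool.

# Print the first line of stdin that contains a non-whitespace character.
first_nonblank_line() {
    awk 'NF { print; exit }'
}

# Make a title filename-safe: runs of whitespace become a single hyphen,
# characters other than letters, digits, '.', '_' and '-' are dropped,
# and leading/trailing hyphens are trimmed.
sanitize_title() {
    tr -s '[:space:]' '-' | tr -cd '[:alnum:]._-' | sed 's/^-*//; s/-*$//'
}

# Rename "$1" to "<title>.pdf" in the same directory.
rename_pdf_by_title() {
    local pdf=$1 title
    # -l 1 limits extraction to the first page; "-" writes text to stdout.
    title=$(pdftotext -l 1 "$pdf" - | first_nonblank_line | sanitize_title)
    if [ -z "$title" ]; then
        echo "no text found in $pdf (image-only PDF?)" >&2
        return 1
    fi
    # -n refuses to overwrite an existing file with the same name.
    mv -n -- "$pdf" "$(dirname -- "$pdf")/$title.pdf"
}
```

After sourcing the snippet, `rename_pdf_by_title myfile.pdf` would attempt the rename. If the title turns out not to be the first non-blank line for your files, adjust first_nonblank_line accordingly.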
