I am using Oracle Outside In to extract the text content of a PDF document.
I pass the parameters below to the main function of CASample.c,
the Content Access sample from https://www.oracle.com/middleware/technologies/outside-in-technology-downloads.html#
C:\adobe-acrobat.pdf -u C:\adobe-acrobat.txt
This gives me text in the following format:
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 8, Character Set = 0x00030100.
Outside
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 3, Character Set = 0x00030100.
In
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 8, Character Set = 0x00030100.
Unlocks
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 9, Character Set = 0x00030100.
Business
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 10, Character Set = 0x00030100.
Documents
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 4, Character Set = 0x00030100.
for
SCCCA_TEXT: dwSubType = 0x08020002, Number of Characters = 1, Character Set = 0x00030100.
So how do I get only the text, without the metadata? Instead of all the output above, I only need: Outside In Unlocks Business Documents for
Or do I have to write my own parser to extract that text?
CodePudding user response:
The downloaded files also include a tademo.vcxproj project, which extracts the text. It is a desktop application that you can convert to a library.
https://www.oracle.com/middleware/technologies/outside-in-technology-downloads.html#
After converting it to a library, I added the following function to tademo.c.
It takes the input file path and writes the extracted text to the output file.
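int callableMain(char* inputPath, char* outputPath) {
    /* Copy the path into tademo.c's szInputPath buffer. strncpy does not
       null-terminate when the source fills the buffer, so do it explicitly. */
    strncpy(szInputPath, inputPath, PATHSIZE - 1);
    szInputPath[PATHSIZE - 1] = '\0';
    /* Initialize the Data Access module without threading support. */
    DAInitEx(SCCOPT_INIT_NOTHREADS, OI_INIT_DEFAULT);
    DoTextClose();             /* close any document left open earlier */
    dwBlockNum = 0;            /* reset the block counter */
    DoTextOpen(1);             /* open the document named in szInputPath */
    DoSaveTextAs(outputPath);  /* write the extracted text to outputPath */
    DoTextClose();
    return 1;
}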
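
If converting tademo into a library is more than you need, another option is to post-process the CASample output yourself. The following is only a minimal sketch and is not part of the Outside In SDK: is_metadata_line and strip_metadata are hypothetical helper names, and it assumes that every metadata line starts with "SCCCA_" and that each text chunk sits on its own line, as in the output shown in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the line looks like a CASample
   header such as "SCCCA_TEXT: dwSubType = ...", otherwise 0.
   Assumption: all metadata lines begin with the prefix "SCCCA_". */
static int is_metadata_line(const char *line)
{
    return strncmp(line, "SCCCA_", 6) == 0;
}

/* Copies every non-metadata, non-empty line from `in` to `out`,
   separating the chunks with a single space.
   Returns the number of text chunks written.
   Limitation: lines longer than the buffer are treated as several chunks. */
static int strip_metadata(FILE *in, FILE *out)
{
    char line[4096];
    int chunks = 0;

    assert(in != NULL && out != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        size_t len;

        if (is_metadata_line(line))
            continue;

        /* Trim the trailing newline and any trailing whitespace. */
        len = strlen(line);
        while (len > 0 && isspace((unsigned char)line[len - 1]))
            line[--len] = '\0';
        if (len == 0)
            continue;

        if (chunks > 0)
            fputc(' ', out);
        fputs(line, out);
        chunks++;
    }
    return chunks;
}
```

You would open the file that CASample wrote with fopen, open a second file for the cleaned text, and pass both to strip_metadata. If the real output contains metadata lines with a different prefix, the check in is_metadata_line has to be adjusted.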