
  • "txetxo" started this thread


Monday, February 6th 2012, 1:15am

Extracting text from PDFs

OCR stands for optical character recognition. This method allows software to recognize text and characters in scanned PDF documents (including multipage files), photographs and images captured with a digital camera. Depending on the software, images embedded in the PDF or picture may be recognized and exported together with the text in the output document; otherwise they can be copied to the clipboard and saved from the default viewer program of the OS where the task is carried out. I will describe how I exported the text and the images from page 11 of the Press Kit Nov 2011 v1.pdf on Mac OS X, but the procedure is very similar on Windows.
The idea is this: when the source is a multipage PDF and we want the text from one particular page, or when the OCR software has restrictions like "1 Page max." or "2MB max.", we need to export and process only one page at a time. To do that, open the PDF document, start "Print" in the viewer, choose the option to print to a PDF file, and set the page range to the single page we need. The exported PDF then contains only that page. On a Mac, in the Preview app, go to 'File/Print…', then from the 'PDF' button choose 'Save as PDF...' and save only the page of interest.
The next step is to process the page with OCR software. There are plenty of online services that do this for free. The exported file will contain the text and any pictures that were embedded in the original document. For this task I uploaded the single page to one such service (you need to create an account there first) and requested conversion to a file in .doc format. In the LibreOffice app the images were easy to select, copy, and paste into the default Preview app using the first option, "New from Clipboard", in the 'File' menu.
In fact, the OCR conversion can be done in the same way in Google Docs. On upload, check the options in the dialog "Set your preferences for uploading files. We'll have to apply these settings to any files you upload to Google Docs" as follows:
(Y) Convert documents, presentations, spreadsheets, and drawings to the corresponding Google Docs format
(Y) Convert text from PDF and image files to Google documents
(Y) Confirm settings before each upload.
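For anyone who prefers a script to the Print dialog, the single-page export step can be sketched in Python. This is only a rough sketch, not the method I used above: it assumes the third-party pypdf package is installed, and the names page_index, extract_page, fits_upload_limit and MAX_UPLOAD_BYTES are made up for illustration. The 2 MB figure simply mirrors the "2MB max." restriction mentioned earlier.

```python
from pathlib import Path

# Assumed limit, mirroring the "2MB max." restriction some OCR services impose.
MAX_UPLOAD_BYTES = 2 * 1024 * 1024


def page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number (as shown in a viewer) to a 0-based index."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def extract_page(src: str, dst: str, page_number: int) -> None:
    """Write one page of the PDF at src to dst as a new single-page PDF."""
    # Third-party dependency (pip install pypdf), imported here so the
    # other helpers stay usable even when pypdf is not installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index(page_number, len(reader.pages))])
    with open(dst, "wb") as out:
        writer.write(out)


def fits_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file is small enough for a size-limited service."""
    return Path(path).stat().st_size <= limit
```

With the press kit from this post, the call would look like extract_page("Press Kit Nov 2011 v1.pdf", "page-11.pdf", 11), where page-11.pdf is just a placeholder output name. Checking fits_upload_limit("page-11.pdf") before uploading shows whether the page is under the size limit.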
Signature from »txetxo« txetxo, Linguistic TechTeam (LTT) "You never change things by fighting the existing reality. To change something, build a new model that makes the existing model obsolete" Buckminster Fuller.


© Linguistic Team International 2019
Context In Motion