Rus Articles Journal

How to quickly scan a book to PDF (using ClearScan)

INTRODUCTION

In this short guide I would like to share some thoughts on scanning books to PDF and my impressions of the ClearScan technology, available in Adobe Acrobat starting with version 9.0. In my opinion, it is a remarkable technology that (at last!) makes the PDF format suitable for scanned text.

In fact, with destructive scanning (the book is cut into separate sheets and a sheet-fed scanner is used), the whole process of scanning → cleaning → conversion to PDF → OCR can be completed in about three hours for a black-and-white book of average size. If you are a "glass person", that is, if you have the patience to scan the book on the scanner glass, scanning will obviously take longer.

It must be said that a color book is harder to scan well than a black-and-white one: the scanner distorts colors, and correcting them in a graphics editor takes time and a certain skill. One can imagine a scale of difficulty. At one end are the simplest books to scan, with black-and-white text and no illustrations. Gradually the illustrations become more numerous and color is added, so that at the other end of the scale are the books hardest to scan, in which every page is a color illustration.

The ClearScan technology that I will describe is designed for text. It does not affect illustrations in any way, whether black-and-white or color. If you want to learn about scanning in more detail, and/or you are going to scan books with many color illustrations and want to be able to correct their colors, I can refer you to a manual on high-quality book scanning hosted in the Twirpx.com library, which also includes instructions for working with Photoshop:

twirpx.com/file/1437636/

My task is more modest. I assume you have a book whose pages are mainly text. It may be a textbook or a document, fiction or technical literature, but not a children's picture book and not a photo album. I assume that you want to convert such a book to PDF and get decent quality and a small file size.

HOW A BEGINNER SCANS

If you have a scanner, you want to scan something! And thank God. Look at the abundance of electronic libraries. Thanks to everyone who scanned all this and shared it with others.

Scanners today are sold with a software bundle that includes a program for conversion to PDF. In theory (and in the brochures) it looks like this: put the sheets in the scanner and get them out in electronic form, as PDF! And sometimes this is true. There are a great many miscellaneous papers (1, 2... 10 sheets), and this is how I treat them. Why fuss over them? The text is visible, and that will do. Nothing more is needed. But a book... and for those who love books... can the resulting crooked rubbish with stripes, spots, black specks and broken letters really be called a book? Where is the catch? Which option should be set, which little lever turned, to make all this look like the original?

The whole point is that there is no single lever. There is a four-step process, and each step requires the operator to make some optimal decisions. The scanner's software bundle, working "all in one fell swoop", hides this four-step process and turns it into a single operation: paper sheet → electronic equivalent. Still, one can guess that something complicated is going on. For example, the scanner has already stopped scanning but the computer is not yet ready to continue; programs open and close on it; the hard-drive activity light blinks... To scan a book well, you have to walk through the steps of this process yourself: scanning, cleaning, conversion to the required format, and text recognition (OCR).

1. SCANNING

The task of this step is to convert the paper pages of the book into corresponding files in TIFF format at a resolution of at least 300 dpi. This resolution is sufficient for book text of ordinary ("readable") size. Small print, or a wish to capture fine details of illustrations, may require a higher resolution. Dig through your scanner's settings. At the output you need image files in TIFF format. One page, one file. And no multipage TIFFs (where one TIFF file holds several pages)! No PDFs! No OCR (text recognition)!

At this step you also need to decide whether to scan in color or in shades of gray (grayscale). It is usually not recommended to scan a book in strict black-and-white (b&w) mode, even if the book is black-and-white, because the scanner then has to decide what to make black and what to make white. For instance, a crease in the page may come out black and produce black stripes and spots; worse still, these spots will cover the black text. Such "black on black" cannot be cleaned afterwards. If a spot (a stripe or other defect) is gray (or some other color, with color scanning) and the text is black (a color different from the defect), the defect can be removed at the cleaning stage by removing the color of the spot from the image. This is why books with yellowed pages are best scanned in color: the yellow can then be removed from the resulting scan. It also happens that strict black-and-white scanning thins and breaks lines and letters (as when the letter "d" comes out looking like "cl"). So, for high-quality scanning, do not scan in strict black-and-white (b&w) mode. Nothing prevents you from converting the page to a black-and-white image later, once the image is cleaned, if such a conversion is needed. As we will see, ClearScan does not require it: ClearScan works perfectly well with grayscale text at high resolution.
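The difference between the two modes can be illustrated with a toy example. This is only a sketch: the pixel values, the two thresholds, and the function names are invented for illustration and do not describe how any particular scanner or cleaning program works.

```python
# Toy one-row "scan": 0 = black, 255 = white. All values are made up.
# Text pixels are near-black (indices 1 and 4), a crease shadow is
# mid-gray (indices 3 and 5), and the paper is near-white.
GRAY_ROW = [250, 20, 245, 130, 25, 128, 250]


def clean_grayscale(row, stain_floor=100):
    """Whiten every pixel at least as light as stain_floor; keep darker (text) pixels."""
    return [255 if value >= stain_floor else value for value in row]


def scan_as_bw(row, scanner_threshold=140):
    """Simulate strict b&w mode: the scanner itself decides black or white."""
    return [0 if value < scanner_threshold else 255 for value in row]


cleaned = clean_grayscale(GRAY_ROW)  # shadow removed, text kept
bw = scan_as_bw(GRAY_ROW)            # shadow and text both become 0

print(cleaned)
print(bw)
```

In the grayscale row the shadow can still be told apart from the text and whitened. In the b&w row both have already become the same black, so nothing is left to separate them.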

For my sheet-fed scanner, scanning begins with cutting off the cover. An ordinary kitchen knife with a short blade and a comfortable handle will do nicely. For a soft cover, the knife is pushed between the cover and the first page (with the cover closed) and the cover is cut off. If the book has a hard cover, the book block is cut out of it with the cover open. The pages are then either torn off one by one or cut off. Ragged edges can be removed later by software at the cleaning stage. The main thing is that the ragged edges do not run into the text.

As I write these lines, a poem by Marshak sounds in my head:

Grishka Skvortsov
Once had some books:
Dirty, shaggy,
Torn, humpbacked...

I have books that I have loved since childhood and will never cut up. But often one has to scan manuals, often computer manuals, often thick ones, and the waste-paper bin is the best place for them. And it would be a pity to spend time scanning them "on the glass".

Once again, the basic scanner settings. Resolution: 300 dpi or more. Color mode: "shades of gray" (grayscale) or "color". File format: TIFF. Having measured the book page in millimeters, you can set the length and width. Of course, "on the glass" this can only be done approximately, since the book cannot be placed on the glass precisely. A sheet-fed scanner, however, pulls the sheets in by a straight edge (from the top or the bottom, or, if they go in sideways, the straight edge must be placed accordingly), and here everything will be accurate to the millimeter. On the sheet-fed scanner I have lately, out of inborn laziness, been selecting the "text enhancement" option, which makes the text bolder and blacker and spoils color illustrations (it exaggerates them), and also the "deskew" option, since straightened sheets are easier to process afterwards. But you can also choose no options at all other than dpi and color mode, and leave everything else to the cleaning stage.
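To see what a measured page size means in pixels at a given resolution, a small calculation helps. This is a sketch only: the function name and the 148 x 210 mm example page are my own assumptions, not settings taken from any scanner driver.

```python
MM_PER_INCH = 25.4


def page_size_in_pixels(width_mm, height_mm, dpi=300):
    """Convert a page size measured in millimeters to pixels at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


# Hypothetical example: a 148 x 210 mm page.
size_300 = page_size_in_pixels(148, 210, dpi=300)
size_600 = page_size_in_pixels(148, 210, dpi=600)

print(size_300)
print(size_600)
```

Doubling the resolution doubles each dimension, so the scan holds roughly four times as many pixels.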

2. CLEANING

The task of this step is to obtain, in the same TIFF format and in the same number, files with clean, attractive pages. This is the "typeset matter" of the future e-book. Needless to say, all (or rather almost all) of the images have to be processed in groups, i.e. in batch mode (batch processing). Apart from covers and a few other unusual pages, fiddling with each page image separately in a graphics editor is practically impossible (imagine 700 pages of text!) and unnecessary.

For cleaning, I used to use the program ScanKromsator v5.9. It can be found on the Internet.

Links to descriptions of this program:

wikipedia.org/wiki/ScanKromsator
djvu-soft.narod.ru/kromsator/
twirpx.com/file/394016/

The program is difficult, especially for a beginner, because of its unusual interface, large number of options and poor documentation. It is not always clear what the result will be. Lately I have been using a combination of Photoshop and Scan Tailor. Scan Tailor does not try to be a graphics editor, as ScanKromsator does, and is therefore easier to use. Combining the capabilities of Photoshop and Scan Tailor gives an impressive toolkit for correcting raw scans. The Scan Tailor documentation is here:

net/apps/mediawiki/scantailor/index.php?title=Main_Page

Whatever program is used, you need to:

remove page skew (deskew)
trim uneven edges
even out the illumination (remove shadows caused by uneven lighting)
remove specks and other debris (despeckle)
separately check / correct the illustrations (including the cover)
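To give a feel for what one of these operations does, here is a deliberately naive sketch of despeckling on a tiny black-and-white image. It is not how ScanKromsator or Scan Tailor implement it; the rule used here (drop a black pixel that has no black neighbour) and all names are assumptions chosen for illustration.

```python
def despeckle(image):
    """Return a copy in which isolated black pixels are turned white.

    image is a list of rows; 1 = black, 0 = white. A black pixel counts as
    isolated when none of its (up to) eight neighbours is black.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # work on a copy, keep the input intact
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_black_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                result[y][x] = 0
    return result


# A 2x2 "letter" plus one stray speck in the bottom-right corner.
PAGE = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

cleaned_page = despeckle(PAGE)
for row in cleaned_page:
    print(row)
```

The "letter" survives and the stray speck disappears. Real tools work with larger speck sizes and tunable strength, which is why their result should always be checked on a few pages before running the whole batch.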

You can also correct such defects in the margins as marginal notes (unless, of course, you want to keep them), erase pencil lines underlining the text (they will confuse the OCR program, which will take them for graphics), and remove stripes, spots, and sometimes the background too. I once scanned a book with blue letters on a light-blue background; the background came out ugly, and I simply removed it, i.e. replaced it with white, since luckily it was somewhat lighter than the text and could be eliminated by removing its colors.

From the above it is clear that cleaning is the most technically difficult step. If you have not worked with graphics editors before, do not even think of getting everything one hundred percent right the first time. Do not despair! Even a slightly improved file is a step forward on the way to a better scanned book! Next time it will be better still. And besides, Russians simply adore cleaning! Unfortunately, we even like to clean our own population. Or, as they say now, to "mop up". So many people have been purged that if progress on the road to paradise really depended on it, we would have been living in paradise long ago. How can one not recall Sergey Mironovich Kirov:

"The Cheka-GPU is a body called upon to punish, and, to put the matter simply, not merely to punish but to punish in earnest, so that the growth of population in the next world, thanks to the work of our GPU, would be noticeable."

So in the next world the population grew, and in this one it shrank. But they were all bad people, the ones who were subtracted... why not shoot them for being bad? Forgive the digression; it is just that in our striving for extremes we sometimes purge ourselves. Then we wonder: "why do we have an authoritarian regime?" Because we want quick, drastic, simple solutions to complex problems. See how many people think along the lines of "just take them all and [method of purging]", and you will agree that no regime other than an authoritarian one, i.e. one capable of "taking them all and...", is in store for us.

3. CONVERSION TO THE FINAL FORMAT

So... we convert the book to the required format. I consider only the PDF format, since only a simple, quick, drastic solution to the "format question"... wait. I have said that somewhere already. Ah, yes. Well then, there are many formats a book can be converted to, including "text" formats, that is, those in which the recognized text is separated from the book and published without it. Text recognition software makes mistakes, and such separated text needs careful proofreading. If you enjoy proofreading a book, proofread. Only do it properly, because when you download a book in a text format from the Internet, it is a sea of typos.

I will explain how to make a book in PDF using the ClearScan technology. ClearScan is an advanced technology. While the PDF format by itself is not ideal for storing scanned text (you get either a large file or, if you compress harder, a poor-quality image), with ClearScan the format comes close to ideal.

In fact, there are not that many basic options for what to do with a scanned book. You can simply leave it as TIFF files. Incidentally, these files can be kept in any case. As already mentioned, the TIFF files are the "typeset matter" of the book. Other formats can later be produced from them. I am too lazy to store them, but more than once I have kicked myself afterwards because the originals were gone. However, TIFF files are inconvenient for sharing. They take up a lot of space and have to be viewed in a graphics editor. TIFF files can be converted to JPEG so that they take up less space. But JPEG is not the best option for black-and-white text, especially when there are several hundred pages of it.

The book can be converted to a text or mixed format: TXT, RTF, DOC, or finally the HTML- and XML-based EPUB and FB2. But that means separating the text and publishing the book anew. And all or part of the book's layout may be lost in the process. Is that necessary if the book has already been published? That is for you to decide, of course. If there is little layout, republishing is feasible. But what if there is a lot of it and you want to keep it? Or you simply do not want to waste time on re-typesetting? Then the book has to be "slapped" into either DJVU or PDF (some people "slap" it into PowerPoint too, but that, pardon me, is a bit much).

In theory, before ClearScan appeared, the DJVU format was better suited to scanned books than PDF, since the files came out smaller. But in practice PDF is far more widespread (that is a fact), and the programs for reading PDF are far more attractive (that is my opinion) than what has been created for DJVU, so the choice was clear to me even before ClearScan appeared. And now, all the more so...

The essence of the ClearScan technology is that, at the OCR stage, the images of letters are replaced by a real font. This font is not some ready-made (system) font more or less similar to the original one, but a special font built by Acrobat "on the fly" for the specific letterforms of the text.

As a result, instead of a book page in image form, you get a page with (almost) real text that looks (almost) the same as the original.

A link to an article in English about the ClearScan technology:

adobe.com/acrolaw/2009/05/better_pdf_ocr_clearscan_is_smal/

As stated in that article and confirmed in practice, the best results are obtained with a high-resolution original (600 dpi) that is free of extraneous noise (debris, artifacts).

Where can one get Adobe Acrobat 9.0 or later? [A bad word] immediately starts spinning in one's head. But why should I teach you bad words? You know them without me. So, as an exotic alternative, I came up with the idea of going to some auction site, say eBay, typing in "adobe acrobat 9 pro" and seeing whether what you want can be had at a reasonable price. Let us suppose it can. And now you have Acrobat.

Having started Acrobat, we select all the TIFFs produced by the cleaning step. To do this, choose File → Combine → Merge Files into a Single PDF. A window opens in which, at the top right, we choose the Single PDF option (most likely it is already selected when the window opens). Click Add Files → Add Files and add all the TIFFs. To add all the files at once, click the first file, then hold down the Shift key and click the last file. Click Combine Files and wait patiently for the result: a single file in PDF format.

4. OCR WITH THE CLEARSCAN OPTION

This is the simplest step for us. First, the text has to be recognized (OCR) in order to replace the images of letters with a font (ClearScan). Second, once the text is recognized, keyword search becomes possible. That is convenient in textbooks and reference books, and perhaps in fiction too. OCR does not work one hundred percent and does not recognize the text entirely correctly. But we do not need it to. We are not going to separate the recognized text from the book and publish it alone; that is the business of those who chose a text format. We need accuracy only for keyword search, and for that the accuracy of OCR is usually sufficient. Imagine a section in a textbook, say, on direct current. First comes the title, "direct current". Then the definition of direct current. Then the properties of direct current. The phrase "direct current" will occur in this section many times, and even if OCR gets it wrong once, the next occurrence will not go unnoticed, and your keyword search will bring you to the right section.
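The "direct current" argument can be sketched in a few lines. The section texts below are invented, with one deliberate OCR-style error ("d1rect"); the function is a plain substring search and makes no claim about how a PDF viewer searches.

```python
# Invented section texts; "d1rect" imitates a single OCR misread.
SECTIONS = {
    "Direct current": (
        "Direct current. A d1rect current flows in one direction. "
        "Properties of direct current are listed below."
    ),
    "Alternating current": "Alternating current changes direction periodically.",
}


def find_sections(sections, phrase):
    """Return {section title: number of hits} for sections containing the phrase."""
    needle = phrase.lower()
    hits = {}
    for title, text in sections.items():
        count = text.lower().count(needle)
        if count:
            hits[title] = count
    return hits


result = find_sections(SECTIONS, "direct current")
print(result)
```

One of the three occurrences is lost to the recognition error, yet the section is still found through the other two.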

Now we run OCR in the same Adobe Acrobat. Choose Document → OCR Text Recognition → Recognize Text Using OCR, and in the window that opens click Edit in the Settings section. We set:

Primary OCR Language: specify the main language of the document
PDF Output Style: set to ClearScan
Downsample Images: Low (300 dpi) is usually fine

This last control determines the final resolution of the images that are not recognized as text. Suppose you scanned the book at 600 dpi so that the text looks its best after ClearScan. But your book contains not only text but also illustrations. They were scanned at 600 dpi too. Suppose also that you do not want such a high resolution for the illustrations, because your particular illustrations do not need it and at 600 dpi they take up a lot of space. By setting the Downsample Images control, you can lower the resolution of the illustrations in the document.
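How much downsampling saves can be estimated with simple arithmetic. This sketch counts pixels only; it does not model Acrobat's actual resampling or compression, and the 2400 x 3600 pixel illustration is a hypothetical example.

```python
def downsampled_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions of an image after lowering its resolution."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# Hypothetical illustration scanned at 600 dpi.
original = (2400, 3600)
reduced = downsampled_size(*original, source_dpi=600, target_dpi=300)

# Share of the original pixel count that remains after downsampling.
pixel_ratio = (reduced[0] * reduced[1]) / (original[0] * original[1])

print(reduced)
print(pixel_ratio)
```

Halving the resolution leaves a quarter of the pixels, which is why lowering illustrations from 600 dpi to 300 dpi shrinks the file noticeably.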

We wait even more patiently than before, or better still, go and take a break. The output is the desired PDF. Find some small letter in it and start zooming in. The letter should stay sharp at any magnification.

Done. Do not forget to save the file.

And one more thing... Do not compress this file in Acrobat for the sake of saving disk space. I will not even tell you how to do it. There is no need to spoil the quality of the file, and on mobile devices, where the processor is weaker and the PDF viewer is not so smart, viewing such a compressed book is torture.

Try putting your book on a mobile device; for me that is an iPad with the iBooks reader. How good it looks! How quickly the pages can be flipped! There is keyword search! Students! Scan your textbooks! Mothers and fathers! Please scan good children's picture books!

And do not forget to upload them to an electronic library.

Written down by comrade Ivan Ivanovich Kuznetsov,
from the words of comrade Philip Fyodorovich Petrov,
who heard all this from a gray mouse.
2012-2014.