OCR (Optical Character Recognition) from SmartLF scans
March 26th, 2008
Digital scan information derived from a text document is still just a series of dots. Although people can read it, the computer cannot interpret it, so it is termed ‘unintelligent’. The operation that converts these meaningless dots into a form a computer can reprint and use properly is called optical character recognition (OCR). OCR is often needed to complete the conversion of a scanned mechanical or architectural drawing into vector form once the arcs, lines and circles have been converted by raster-to-vector software. CAD system files in their simplest form consist only of co-ordinate number data, which is then displayed to the operator as vector information onscreen.
OCR must be provided by software additional to that supplied by Colortrac. Sometimes it is part of a larger CAD or indexing application that can itself control the SmartLF scanner.
Character recognition software generally takes one of two forms: recognition of machine-printed text (computer-generated fonts), or recognition of handwritten characters, which of course can be almost infinitely variable in shape and form. OCR for handwriting is generally more complicated.
Many small format scanners come supplied with the first type of OCR, but the technique for recognising handwritten characters is slightly different. Traditionally more costly, handwriting OCR has recently become more available and efficient, mainly due to the emergence of PDAs (Personal Digital Assistants), Palm PCs and similar devices that use touch-sensitive screens with wands or pens. A few specialist companies have become world leaders in handwritten character recognition technology, and software houses that have developed their own systems as part of their CAD-based raster-to-vector products often offer these specialist systems as upgrade options.
OCR embedded in a traditional CAD system generally requires the user to complete a series of handwriting training sessions before the recognition software is properly usable. Bought-in (externally licensed) software is often more efficient and will usually be good enough to use straight away.
Both types of system need to be directed to the areas of the scan that require converting. The recognition software then analyses the characters against its character shape library. If the software is fairly certain of a shape’s identity and the match is good, it converts the shape and replaces the raster with the ASCII (computer text) character. If the software is less sure of the match, it stops and requests assistance from the operator. The success of this type of operation is directly linked to the amount of training and direction the operator provides.
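The match-or-ask behaviour described above can be illustrated with a minimal Python sketch. It is not how any particular OCR product works: the tiny 3x5 shape library, the pixel-agreement score, the 0.9 threshold and the names `similarity`, `recognise` and `ask_operator` are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# A glyph is a small bitmap: rows of '#' (ink) and '.' (background).
Glyph = Tuple[str, ...]

# Hypothetical 3x5 shape library; a real library would be far larger.
SHAPE_LIBRARY: Dict[str, Glyph] = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def similarity(a: Glyph, b: Glyph) -> float:
    """Return the fraction of pixels that agree between two same-sized glyphs."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def recognise(
    glyph: Glyph,
    ask_operator: Callable[[Glyph, str, float], str],
    threshold: float = 0.9,  # assumed cut-off for a "good" match
) -> str:
    """Convert a raster glyph to a text character, asking for help if unsure."""
    # Score the glyph against every shape and keep the best candidate.
    best_char, best_score = max(
        ((ch, similarity(glyph, shape)) for ch, shape in SHAPE_LIBRARY.items()),
        key=lambda pair: pair[1],
    )
    if best_score >= threshold:
        return best_char  # confident: replace the raster with the character
    # Not confident: stop and let the operator decide.
    return ask_operator(glyph, best_char, best_score)
```

In use, `ask_operator` would be whatever prompt the application shows; passing a function keeps the sketch independent of any user interface. A glyph identical to a library shape is converted directly, while a damaged one falls below the threshold and is referred to the operator along with the software's best guess.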
Many old pre-CAD drawings can be successfully modified and updated by adding vector information alongside or over the top of the original raster drawing - the so-called ‘hybrid’ drawing. The benefit of this approach is that old drawings do not need to be completely re-drawn. Old scans can even be re-scaled to view alongside the real-world sized drawings of regular CAD.
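As a rough illustration of the re-scaling step, the arithmetic below converts a length measured in scan pixels into real-world millimetres. It assumes the scan resolution (dpi) and the drawing scale (for example 1:50) are known; the function name and parameters are hypothetical and not taken from any Colortrac or CAD product.

```python
MM_PER_INCH = 25.4


def pixels_to_real_mm(pixels: float, dpi: float, scale_denominator: float) -> float:
    """Convert a length in scan pixels to real-world millimetres.

    pixels / dpi gives inches on paper; multiplying by 25.4 gives
    millimetres on paper; multiplying by the scale denominator
    (50 for a 1:50 drawing) gives the real-world size.
    """
    if dpi <= 0 or scale_denominator <= 0:
        raise ValueError("dpi and scale_denominator must be positive")
    paper_mm = pixels / dpi * MM_PER_INCH
    return paper_mm * scale_denominator
```

For example, a line 1200 pixels long on a 400 dpi scan of a 1:50 drawing is 3 inches (76.2 mm) on paper, which corresponds to roughly 3810 mm at real-world size.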
For more information on the providers of raster-to-vector and OCR software see the CLASP pages at
http://www.colortrac.com/scanning_software/software_clasp.htm
