This explains a lot
Replies: 0
Posted: 24 Jun 2009 2:21PM GMT
Classification: Query
From an article on About.com:
"Because many of the newspapers of greatest interest to genealogists are old, the content is difficult to digitize due to the wide variety of inks, type faces, and font sizes. There is often less contrast apparent between text and background on the aged papers, and the old ink may have caused the letters to 'bleed' together a bit, making it harder for the OCR program to interpret the letters correctly. In addition, fading, wrinkles, ink blots, and other imperfections on the original page can interfere with OCR results.
For those of us using these historic newspapers online, this basically means that name and keyword searches (most of which rely on the results of the OCR software) will often yield less than expected results. It is important to realize that these every word databases have not been indexed by a human eye, and adjust your search strategies appropriately. Yet because hand-indexing is so time consuming and expensive, OCR offers an affordable alternative. Even a less-than-perfect index, is better than no index at all."
The article says a page-by-page search is most effective, a lesson I've learned through bitter experience.