Hacker News

Tens or hundreds of billions of pages -- but many, many fewer images. Images repeat across a site, and change much less often than text. Header and navigational images are easy to pick out and also relatively easy to OCR. (These images are not trying to be inscrutable, like CAPTCHAs.)
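The cost argument can be made concrete: a crawler that hashes each image's bytes and caches OCR output only pays for unique content, so a header image repeated on a thousand pages costs one OCR pass. A minimal sketch, where the `ocr` function is a hypothetical stand-in for a real engine such as Tesseract:

```python
import hashlib

def ocr(image_bytes: bytes) -> str:
    # Hypothetical stand-in for a real OCR engine (e.g. Tesseract).
    return "<recognized text>"

class DedupOCR:
    """Cache OCR results by content hash, so repeated site images
    (headers, nav buttons) are only processed once."""

    def __init__(self):
        self.cache = {}
        self.ocr_calls = 0  # counts actual OCR invocations

    def text_for(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self.cache:
            self.ocr_calls += 1
            self.cache[key] = ocr(image_bytes)
        return self.cache[key]

# The same header image encountered on 1000 pages triggers one OCR pass.
d = DedupOCR()
for _ in range(1000):
    d.text_for(b"same-header-image-bytes")
print(d.ocr_calls)
```

Since header and navigation images also change rarely, the cache stays warm across recrawls, which is why the image count matters far less than the page count.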

Further: Google has an intense interest in OCR. It adopted the open-source 'Tesseract' project, spent millions scanning first catalogs and then books and journals, and most recently announced it is OCRing bitmaps in PDFs:

http://googleblog.blogspot.com/2008/10/picture-of-thousand-w...

Finally: any reasoning based on Google being miserly with cycles is going to be wrong. (They invest a lot in efficiency, yes, but that's so that they can spend cycles freely to collect data.)

It's possible they ran an experiment and found text in embedded/header images was no better than inline text. (I doubt they would find such a thing, because generally indicators of sustained effort -- careful design, site longevity, good writing -- are also indicators of site quality.) But there's zero chance the CPU cost or scale deterred Google from testing the idea.



Just one additional reason, to add to your own: the original intent of YouTube, as I recall, was to OCR video for search indexing, which was a lot more complicated and processor-intensive than even OCRing pictures. Google bought YouTube; obviously this tech, lying around somewhere in their archives, came with the acquisition.


I wouldn't be surprised if they tested it. I agree with you on that.



