
Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
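A minimal sketch of what that could look like, assuming each document span has already been reduced to a numeric feature vector. The features (font size, indent, caps ratio), labels, and data below are invented for illustration, not taken from any real system:

```python
# Sketch: a small neural net standing in for hand-tuned heuristics.
# Features per span: [font_size, indent_level, uppercase_ratio] (invented).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy annotated data: label 0 = body text, 1 = heading.
X = np.array([
    [10, 0, 0.05], [10, 2, 0.02], [11, 0, 0.10], [10, 1, 0.04],
    [18, 0, 0.90], [16, 0, 0.85], [20, 0, 1.00], [17, 0, 0.80],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Scale features, then fit a tiny MLP; deterministic via random_state.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
clf.fit(X, y)
```

Once trained, `clf.predict` replaces the if/else heuristic cascade, and disagreements between the model and the heuristics point at cases worth inspecting.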


I imagine that would work pretty well given an adequate and representative body of annotated sample data. Though that is also not easy to come by.


Actually, it is easy to come up with reasonably decent heuristics that can auto-tag a corpus. From that you can look for anomalies and adjust your tagging system.
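A sketch of that bootstrapping step: a few cheap rules that auto-tag lines, which you can then spot-check for anomalies. The rules and tag names here are invented for illustration:

```python
# Sketch of heuristic auto-tagging to bootstrap a labeled corpus.
import re

def auto_tag(line: str) -> str:
    """Assign a rough structural tag to one line of text."""
    # "1.2 Results" style numbered headings.
    if re.match(r"^\d+(\.\d+)*\s+\S", line):
        return "numbered-heading"
    # Short all-caps lines are probably headings.
    if line.isupper() and len(line.split()) <= 8:
        return "heading"
    # Bulleted list items.
    if re.match(r"^\s*[-*\u2022]\s+", line):
        return "list-item"
    return "body"
```

Run this over the corpus, sample lines where neighboring tags look inconsistent, and fix either the rules or the annotations.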

The problem of getting a representative body is (surprisingly) much harder than the annotation. I know; I spent quite some time on exactly this years ago.


But if you believe in your manual heuristics enough to ship them, you must already have a body of tests that you're happy with, right?

Also seems like this is a case where generating synthetic data would be a big help. You don't have to use only real-world documents for training, just examples of the sorts of things real-world documents have in them. Make a vast corpus of semi-random documents in semi-random fonts and settings, printed from Word, Pandoc, LaTeX, etc.
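A sketch of that generation idea: emit labeled spans with randomized layout parameters, which you would then actually render through Word, Pandoc, LaTeX, etc. to get realistic training pages. The fonts, labels, and structure here are invented for illustration:

```python
# Sketch of synthetic document generation for training data.
import random

FONTS = ["Times", "Helvetica", "Computer Modern"]  # placeholder font pool

def synth_document(rng: random.Random):
    """Return a list of (text, label, font) spans for one fake document."""
    spans = []
    font = rng.choice(FONTS)
    for section in range(rng.randint(2, 5)):
        # One heading per section, numbered like real documents.
        spans.append((f"{section + 1} Heading {section + 1}", "heading", font))
        # Followed by a random number of body paragraphs.
        for _ in range(rng.randint(1, 4)):
            spans.append(("Lorem ipsum dolor sit amet.", "body", font))
    return spans
```

Because every span carries its label by construction, the annotation comes for free; the open question is whether the synthetic distribution is close enough to real documents.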



