Semi-supervised learning is a good idea in this type of situation, given that unlabeled samples are far more abundant than labeled samples, but there are gotchas to watch out for. In general, SSL helps when your model of the data is correct and hurts when it is not.
Here's an example of what can go wrong in this particular application: let's say the word 'better' is mildly positive, but when it appears in high-confidence samples, it's usually because it appears together with the words 'business' and 'bureau', as in "I just reported Company X to the Better Business Bureau", i.e., strongly negative. This means that the new self-training samples containing the word 'better' will all be negative, which will bias the corpus until eventually 'better' is treated as a strongly negative feature.
Occasional random human spot-checks of the high-confidence classifications would be useful :-) Also, self-training gives diminishing returns in accuracy, whereas the possibility for craziness remains, so turning it off after a while might be best.
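To make the feedback loop concrete, here's a minimal self-training sketch. The "classifier" is a toy word-vote scorer I made up (not a real model), and the data and 0.9 confidence threshold are invented for illustration, but the loop structure is the standard one: pseudo-label only high-confidence unlabeled samples, fold them into the training set, retrain.

```python
from collections import Counter

def train(labeled):
    """Count how often each word appears in pos vs. neg samples."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labeled:
        counts[label].update(text.split())
    return counts

def score(counts, text):
    """Toy stand-in for a classifier: label by word vote, with a confidence."""
    pos = sum(counts["pos"][w] for w in text.split())
    neg = sum(counts["neg"][w] for w in text.split())
    total = pos + neg
    if total == 0:
        return "pos", 0.0
    return ("pos" if pos >= neg else "neg"), max(pos, neg) / total

labeled = [("great service", "pos"),
           ("reported to better business bureau", "neg")]
unlabeled = ["better business bureau complaint", "much better now"]

for _ in range(2):                      # a couple of self-training rounds
    counts = train(labeled)
    for text in list(unlabeled):
        label, conf = score(counts, text)
        if conf >= 0.9:                 # only trust high-confidence guesses
            labeled.append((text, label))
            unlabeled.remove(text)

# 'better' only ever appeared in a negative sample, so even the positive
# "much better now" gets confidently pseudo-labeled negative -- exactly
# the bias described above, with no human in the loop to catch it.
```

Note that nothing in the loop itself is wrong; the corruption comes entirely from the training data, which is why random spot-checks of the high-confidence pseudo-labels are the natural defense.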
Thanks very much for the links (I wrote the article). I'm new to this area of analysis, so knowing the names for things really helps! I'll check out those links and delve further down the rabbit hole - thanks again.
The problem with semi-supervised learning (or rather with the way it's used here) is that if you don't label the low-confidence samples yourself, but just leave them alone, the model can diverge and start producing worse and worse results: it thinks it guessed correctly when in fact it didn't.
Basically, the problem is that you can't make a closed system learn from itself, without any outside feedback. The information has to come from somewhere.
It's a bit like someone giving you two Chinese phrases and their translations (without you knowing any Chinese beforehand), and then leaving you to translate a whole book. You will start guessing, based on what you already know, and by the end you'll have arrived at a (totally incorrect) interpretation of what you think each ideogram means.
Single words may not be the best way to attack this problem, though; multi-word expressions would do a better job. Then you could label occurrences of 'reported*better business bureau' as strongly negative.
Context is everything in natural language processing, and by dropping all context the problem becomes harder to solve.
Learning becomes problematic if your features get too complex, though, especially if you go as far as learning things like regexes. Simply ramping up the complexity, from e.g. word counts to word-pair counts (or other low-n n-grams), does sometimes give you gains, but there have been a number of cases where enlarging the feature space like that gave surprisingly little or no gain, which is one reason the simple models keep being used (besides simplicity and speed).
That's true. I built a 'chatbot' long ago that used simple regexes (words+wildcards) to match incoming patterns. It worked well because at the higher levels of the conversation you'd use single words to guide to a portion of the conversation tree, and lower down you could make decisions on very specific differences in the input.
For a classifier that's a less useful approach, but I think single words are too narrow. 3-grams are probably the sweet spot for something like this.
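Moving from unigrams to n-grams is a one-liner. A minimal sketch (the function name and example sentence are mine):

```python
def ngrams(text, n):
    """Return every run of n consecutive words as a tuple feature."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sent = "I just reported Company X to the Better Business Bureau"
features = ngrams(sent.lower(), 3)
# With n=3, ('better', 'business', 'bureau') becomes a single feature,
# so its strongly negative weight no longer bleeds onto 'better' alone.
```

The trade-off is exactly the one raised above: each step up in n multiplies the feature space, and most of the new features are too rare to estimate reliably from a modest corpus.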
Very cool. That seemed to work better than I'd have expected.
The explanation of the 'naive' part of Naive Bayes isn't quite right, though. Throwing out the possibility of the animal being human based on the datapoint "four legs" is orthogonal to naivety: a more sophisticated Naive Bayes system could avoid ruling out 'human' entirely on the strength of that one feature, and conversely a non-naive system, i.e. one that uses joint probabilities of the features, wouldn't necessarily be any less prone to ruling it out.
I guess the most rational way to do that would be to express and calculate the conditional probabilities with some statistical distance, like the # of standard deviations. So 100k examples of humans without a single one having the "four legs" feature would make that a very strong indicator of being non-human. And that'd work just as well with a naive algorithm as any other.
You seem to be talking about smoothing: how many non-human four-legged animals and non-four-legged humans do you have to see before you can estimate that humans are not likely to have four legs? This is reasonably well-solved: you either use Dirichlet (Laplace, add-one) smoothing or something corresponding more closely to your domain, like Good-Turing smoothing. This reweights the probabilities in your classifier to make sure that (by analogy) if you see a talking, thinking, four-legged banker, you will probably classify him as a human (since the other features overwhelmingly point in that direction). These methods all attach some measure of confidence to the features, and make sure that a feature that has been observed only once or twice will not move the class boundaries too much.
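A minimal sketch of add-one smoothing, with made-up numbers (the 1000-feature vocabulary and 100k human examples are my assumptions, not from the thread):

```python
def smoothed_prob(feature_count, class_total, vocab_size, alpha=1.0):
    """P(feature | class) with add-alpha (Laplace, for alpha=1) smoothing."""
    return (feature_count + alpha) / (class_total + alpha * vocab_size)

# 'four legs' never observed among 100k human examples, 1000-feature vocab:
p = smoothed_prob(0, 100_000, 1000)
# p is ~1e-5: very unlikely, but crucially nonzero, so a single odd feature
# can't veto the class outright -- the talking, thinking banker's other
# features can still win the overall product of probabilities.
```

This is the connection to the "# of standard deviations" idea above: 100k examples without a single "four legs" makes the smoothed estimate tiny, whereas the same zero count from only 10 examples would yield a much larger (i.e. less confident) estimate.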
A survey of semi-supervised learning: http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf