Machine learning on 330k decompiled Java methods from the Debian archive, paired with their corresponding Javadoc comments. The inputs are bytecodes; the outputs are English keywords. If you wanted an excuse to play with machine learning, it hardly gets better (and more challenging) than this. Work in progress.
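To make the bytecode-to-keywords task concrete, here is a minimal sketch of a nearest-neighbour baseline. Everything in it is an assumption made for illustration: the toy training pairs are invented and not taken from the dataset, the names `TRAIN`, `jaccard` and `predict_keywords` are hypothetical, and representing a method as a set of opcode mnemonics is only one possible encoding, not necessarily the one this project uses.

```python
from collections import Counter

# Hypothetical toy pairs: (opcode mnemonics of a method, Javadoc keywords).
# Invented for illustration only; NOT taken from the actual dataset.
TRAIN = [
    (["aload_0", "getfield", "ireturn"], ["returns", "value"]),
    (["aload_0", "iload_1", "putfield", "return"], ["sets", "value"]),
    (["aload_0", "getfield", "areturn"], ["returns", "name"]),
]


def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_keywords(opcodes, train=TRAIN, k=2):
    """Vote over the keywords of the k training methods most similar to `opcodes`."""
    # sorted() is stable, so equally similar methods keep their training order.
    ranked = sorted(train, key=lambda pair: jaccard(opcodes, pair[0]), reverse=True)
    votes = Counter()
    for _, keywords in ranked[:k]:
        votes.update(keywords)
    # Most-voted keywords first; ties are broken alphabetically.
    return sorted(votes, key=lambda word: (-votes[word], word))


if __name__ == "__main__":
    # An unseen getter-like method that returns a long instead of an int.
    print(predict_keywords(["aload_0", "getfield", "lreturn"]))
```

A set-overlap baseline like this ignores instruction order and operands, so it is best read as a starting point for comparison rather than a serious model.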
Copyright (C) 2012 Pablo Ariel Duboue. Data made available under a CC-BY license.