
Wednesday, April 14, 2010

What can we learn from Google?

Since I'm stuck here waiting in the lobby, I'll talk a bit about something I've been thinking a lot about: large-scale datasets and large-scale machine learning. This post on Google's blog a few days ago has had me thinking. A large number of biological problems are classification problems. For instance, predicting whether a given genotype or gene expression profile means you'll get a given phenotype is a heavily pursued goal in the research community. Relatedly, this is the goal of a lot of GWAS studies: find SNPs X associated with phenotype Y.

However, most of this data is walled off. Researchers don't want to share data: it takes extra time, and it may help other people get papers out of the data that the original generator could have written. In addition, there have been problems with data from different experiments not being comparable, especially with microarrays.

So I'd like to propose something that won't actually happen but would be nice if it did: a simple, easy-to-use data collection and annotation database maintained by both the authors and others. GEO has the issue that data can only be updated by the author, which can lead to really out-of-date information. ArrayWiki attempts to solve this problem, but it only deals with a somewhat antiquated technology, microarrays, and hasn't caught on widely. A real-time curated database would be a substantial investment, but it would allow us to build and leverage large-scale machine learning tools like the ones Google is currently developing, which I think would allow for substantial scientific discovery.