Wednesday, April 14, 2010

What can we learn from Google?

Since I'm stuck here waiting in the lobby, I will talk a bit about something I've been thinking a lot about: large scale datasets and large scale machine learning. This post on Google's blog a few days ago has had me thinking. A large number of biological problems are classification problems. Predicting whether a given genotype or gene expression profile will produce a given phenotype, for instance, is a hugely targeted goal in the research community. Relatedly, this is the goal of a lot of GWAS studies: find SNPs X associated with phenotype Y.

However, most of this data is walled off. Researchers don't want to share data: it takes extra time, and it may help other people get papers out of the data that the original generator could have written. In addition, there have been problems with making data comparable across experiments, especially with microarrays.

So I'd like to propose something that won't actually happen but would be nice if it did: a simple, easy to use data collection and annotation database maintained by both the authors and others. GEO has the issue that data can only be updated by the author, which can lead to really out of date information. ArrayWiki attempts to solve this problem, but it only deals with a somewhat antiquated technology, microarrays, and hasn't caught on widely. A real time curated database would be a substantial investment, but it would allow us to build and leverage large scale machine learning tools like the ones Google is currently developing, which I think would allow for substantial scientific discovery.

3 comments:

  1. From my understanding this is where biology is really lacking. We do not have our technology together. Historically I think we are way behind physicists in terms of the amount of data we are generating, and now we are going through growing pains. At the NHGRI meeting the Google guy basically said we should reimplement everything in Map/Reduce; apparently physics went through a rewrite period as well. By the same token though, it's not clear you'd get any gain from implementing BLAST with Map/Reduce, but I digress. At least Celera's plan to charge for sequence data never panned out.

  2. As I was walking back from lunch today I thought of an idea that I think I (or we? yes yes?) should implement. I call it "Crowdsourcing For Fun". The idea is to accumulate datasets and questions into one place and let anyone have a go at them. The initial idea is a place for people like me who just took a class on, say, ML and have tools but no problem to attack with them. Initially I think it would just be for people who want to play with the tools they have learned but have no idea what question to ask and no data to ask it of. But at some point you could link the people who provide the data and questions to the people playing with it and perhaps get some collaboration. But right now I really just want to get some data to play with and some questions to answer!

  3. yeah that's why i have this other plan for datasets
