Friday, April 16, 2010

Network Biology 2.0 part 5

Aviv Regev gave a talk, "Unbiased Reconstruction of Mammalian Regulatory Networks". This was definitely one of my favorite talks of the conference. She had previously done work that I really liked with Daphne Koller (like the module networks paper in Nature Genetics). She started by saying that she wanted to apply the lessons she had learned in yeast network reconstruction to mammalian models. She wanted a primary cell that actually reflects cell biology and a model where transcriptional responses play a major role in environmental responses.

She chose to work with dendritic cells because they sense large classes of pathogens via a cohort of receptors, and because a lot is known about the receptor pathways but not as much about the transcriptional response.

Her basic flow for doing this was to gather mRNA expression profiles in a time course, select candidate regulators, select a minimal signature of the most important regulated genes, perturb each candidate regulator, measure the signature genes after each perturbation, and derive a network model.

It seemed to me that the main new things relative to her previous work were choosing the optimal genes to measure and using a neat new expression technology for signatures of ~200 genes. She showed real improvement and integration of data in her work.


Eric E. Schadt gave a talk on "Moving Toward an Understanding of the Molecular Networks Underlying Biological Hydrogen Production by Bacteria". Like the first PacBio talk, this one was incredibly polished and really impressive. As was pointed out in the first talk, distinct nucleotide modifications (methylation, for instance) create distinct changes in how their SMRT system reads a base. This lets them create "kinetic signatures" for each genome. He created these kinetic signatures across the whole genome for 125 strains of R. palustris. He chose R. palustris as a bacterium that might be an efficient way to produce hydrogen.

He found that hydrogen production varied significantly from strain to strain, making a population-based systems genetics approach viable. He used the kinetic variation across the entire genome as covariates and mapped them like eQTLs. He then constructed a regulatory network from this variational data.

I found this talk (which I don't really do justice) immensely impressive and I think that PacBio's technology will definitely be something to watch out for in a huge way.

Nicholas Eriksson gave a talk on web-based parallel GWAS. He is part of 23andme.com and talked about how they not only genotype their customers but also create a social network for them. In this network they create surveys that ask their customers questions. They then run association studies on the survey answers from their base of 20,000 responding customers. They found a few novel SNPs associated with various things like curly hair and, I believe, Parkinson's.

They also stated in a follow-up panel that they were indeed likely to patent the novel genes they discovered, which I personally find completely ethically reprehensible.

Network Biology 2.0 part 4

The second day of talks at Network Biology 2.0 was yesterday and there were a lot of talks I found really insightful.

Andrea Califano gave a talk, "Interrogating Cancer Interactomes to Optimize Therapy on an Individual Basis". He talked about how the initial great hope of cancer genetics was to go from genetics to science to distinguishing and curing human cancer. This paradigm has of course turned out to be completely wrong. He argues that these things are all mediated by complex epigenetic/cell dynamic/environmental/genetic factors wrapped up in dynamic regulatory logic.

He talked briefly about ARACNE, a program for reconstructing regulatory networks from gene expression data that his group created, then MINDy, a program for reverse engineering conditional transcription factor interactions, and lastly MARINa: MAster Regulator INference Algorithm (you know, if you keep making acronyms like that it's kind of meaningless), which seems to look for the most powerful transcriptional regulator nodes. He defined a master regulator as a gene that is necessary and/or sufficient to induce a specific cellular transformation or differentiation event.
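
As far as I understand it, the core of ARACNE is to score gene pairs by mutual information and then drop likely indirect edges using the data processing inequality: in any fully connected triplet of genes, the weakest edge is assumed to be indirect. Here is a toy sketch of just that pruning step. This is not the Califano lab's implementation; the gene names and mutual information values are made up, and `dpi_prune` is a hypothetical helper name.

```python
from itertools import combinations


def dpi_prune(mi):
    """Prune edges with the data processing inequality.

    mi maps frozenset({gene_a, gene_b}) to a mutual information score.
    Returns the set of edges that survive pruning.
    """
    genes = sorted({gene for edge in mi for gene in edge})
    removed = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        # Only fully connected triplets are candidates for pruning.
        if not all(edge in mi for edge in triangle):
            continue
        weakest = min(triangle, key=lambda edge: mi[edge])
        others = [edge for edge in triangle if edge != weakest]
        # Mark the weakest edge only if it is strictly weaker than both others.
        if mi[weakest] < min(mi[edge] for edge in others):
            removed.add(weakest)
    # Edges are removed at the end, so every triplet sees the full graph.
    return set(mi) - removed


# Made-up scores for a chain TF -> G1 -> G2 with a weak indirect TF-G2 signal.
toy_mi = {
    frozenset(("TF", "G1")): 0.8,
    frozenset(("G1", "G2")): 0.7,
    frozenset(("TF", "G2")): 0.3,
}
kept = dpi_prune(toy_mi)
print(sorted(tuple(sorted(edge)) for edge in kept))
```

In the toy example the TF-G2 edge is the weakest side of the triangle, so it is the one treated as indirect. Estimating the mutual information from real expression data is the hard part and is left out here.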

He made the argument that in cancer research it was compelling to describe disease either at the level of individual epigenetic alterations, which are largely patient specific, or via expressed phenotypes, which are largely homogeneous, and proposed that a better, more useful way was to look at this master regulator level of abstraction.

Using ARACNE and MARINa he was able to reconstruct a regulatory network and identify 6 TF regulatory modules controlling ~80% of the MGES (mesenchymal?) genes.

He then introduced another program, IDEA (Interactome-based Dysregulation Enrichment Analysis), which looks at dysregulated interactions occurring more often than expected by random chance as a ranking methodology.

He then summarized with 5 points on cancer medicine:

The current emphasis on genes harboring epigenetic alterations is inappropriate
The current approach to biomarker discovery must be re-evaluated from GWAS to PWAS (Pathway Wide Association)
That statistical power cannot be sacrificed for coverage
We must fundamentally rethink clinical studies because we cannot use a sample of one
One drug for one disease paradigm needs to shift to a toolkit of target specific drugs

He also made the point that you don't need to fully silence a gene via shRNA for it to be an effective treatment, just partially, and that in fact fully silencing a gene would probably be fatal.

I was really impressed by this talk. He seemed to really understand the algorithmic underpinnings of his work and I will definitely be investigating his methodology more. His idea of pathway analysis, I believe, took some inspiration from GSEA, and rightfully so.

Wednesday, April 14, 2010

Network Biology 2.0 part 3

The next speaker was Michael Cusick, who talked about interactome networks and human diseases. He talked about basically doing yeast 2-hybrid studies crossing all proteins with all proteins. My mind still boggles at the scope of projects like this and I think it's incredible people do them. I find it interesting that they mention how their statisticians tell them they need something like ~10x coverage to fully recover the interactome but funding agencies seem to disagree. I think that these sorts of studies need the sort of tentpole funding that happened with the Human Genome Project. The amount of funding they need is still an order of magnitude less than that and their resulting work is no less important; it just doesn't have the sort of visceral appeal that the Human Genome Project had to the public.

At first he only mentioned binary models, which I took issue with, but he seemed to indicate that they are moving towards a probabilistic model, which is obviously the better choice.

He also spoke at length about literature curation and literature generated protein interaction networks, the literature of which I'm going to read in hopes of adapting it to genetic regulatory networks.

Lastly he talked about something called edgetics which is studying the phenotypic perturbations caused by removing a link between two proteins as opposed to knocking out a protein. He showed that edge perturbations gave rise to unique phenotypes which isn't so surprising, but I think it's fascinating that they were able to do this. I'm definitely going to be delving into these papers.

Robert Weinberg then talked in great detail about cancer biology and morphology, which I frankly did not follow well at all. It seemed really meaningful, and his presentation and speaking were really great.

James Collins gave a talk on network biology and drug discovery. He stated that given inputs and outputs the network inside could be recreated, which isn't exactly true. Some network that gives those outputs can be created, but there are probably numerous equivalent networks that could produce the same outputs from the given inputs.
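
A toy way to see the identifiability problem: below are two Boolean "networks" with different internal wiring that give the same output for every input. This is only an illustration under made-up assumptions (two inputs, one output, hypothetical hidden nodes), not anything shown in the talk.

```python
from itertools import product


def network_a(x, y):
    """One hidden node that is active only when both inputs are active."""
    hidden = x and y
    return hidden


def network_b(x, y):
    """Two hidden repressors; the output is active when neither repressor is."""
    repressor_1 = not x
    repressor_2 = not y
    return not (repressor_1 or repressor_2)


def truth_table(network):
    """Output of a network for every combination of the two Boolean inputs."""
    return [bool(network(x, y)) for x, y in product([False, True], repeat=2)]


table_a = truth_table(network_a)
table_b = truth_table(network_b)
print(table_a == table_b)
```

From input/output data alone you can't tell these two apart. You would need to perturb or measure the hidden nodes, which is presumably where the knockdowns and synthetic perturbations come in.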

He went on to discuss how synthetic biology could be used to perturb a cell's regulatory states to learn the regulatory interaction network. However, he did not elucidate what advantages this technique has over, say, knockdown or knockout techniques (it may be more advantageous, I just don't know enough about it to say why).

He then went on to talk about how it seems likely that antibiotics that kill bacteria tend to do so by inducing the creation of hydroxyl radicals in their cells. This led him to talk about how sublethal levels of antibiotics didn't just fail to kill some bacteria but still induced hydroxyl radical formation, which resulted in increased mutagenesis of the bacteria and increased resistance across antibiotics. Scary!

Network Biology 2.0 part 2

The next talk was by Steve Turner, the Pacific Biosciences founder, who talked about, not surprisingly, Pacific Biosciences' 3rd gen sequencing technology, SMRT. This was I think my favorite talk. He had a super polished presentation and talk, but in addition there was a lot of really awesome stuff in it.

He went through his single molecule technology, which uses these things called zero-mode waveguides to sequence. They appear to be focusing on long read length and accuracy. In sequencing the E. coli genome they had an average read length of 586bp with a max read length of over 2000bp. He stated that the main cause of sequencing termination is damage by the lasers, which I thought was interesting, and said one thing that they were doing was pulsing the laser on and off. Since they circularize the DNA strands in their technique, this results in the polymerase continuing to sequence the strand over and over again, only actually recording the reading when the pulse is on. I don't entirely understand how this is useful but it seemed to be.

The really awesome thing that he described was how the methylation of a nucleotide causes a recognizable signature. At this point they are able to identify methylation of bases via their SMRT technology, and this same principle can potentially be applied to any sort of base modification, although that appears to be a work in progress. I'm not sure if he mentioned whether the SMRT technology requires amplification. If it does, I'm wondering whether the base modifications would generally be lost in that step, sort of negating that usefulness.

He additionally mentioned that they were working on a direct RNA sequencer using their SMRT technology. He mentioned it in the context of sequencing viruses, but if you could do direct mRNA sequencing with it and detect post-transcriptional modifications, I think that would be an incredibly revolutionary point in gene expression studies.

Direct RNA sequencing while recovering all post-transcriptional effects would allow for all kinds of crazy shit like correlating modifications across genes. At this point with RNA-seq you can already do something like this with exons, but because of the typically short reads it seems like it could be problematic.

In addition this would potentially exponentially increase the feature space of a transcriptional regulatory network, hopefully the increase in the ability of sequencing technology would allow for it to still be useful, but as it seems all of this stuff is in the next 5? or so years in the future we'll just have to wait and see. Regardless this left me really excited.

He closed with using ribosomes to do translational sequencing which is still not a solved problem for them but represents an amazing opportunity undoubtedly. I was too busy geeking out thinking about the RNA sequencing so I didn't pay attention to this as well as I should.

Network Biology 2.0 part 1

The first day of the network biology symposium ended and I have a few impressions about the talks so far and I'm going to break this up into multiple blog posts across speakers.

There was a much greater emphasis on the biological results than on the algorithmic machinery to produce those results. Perhaps as a biological seminar this shouldn't have come as a surprise to anyone but I felt that the emphasis may have been a bit too much on results and less on a demonstration of the accuracy of the methodology and the novelty of the methods used.

That being said, the biological implications presented here were very cool, if not completely in my area of interest at times, and I believe some of the methodology used and things attempted have far-reaching implications.

The first speaker was George Church and he talked in broad strokes about the current state of the art in genomics and next gen sequencing. He made the point, which I'm sure has been made previously, that DNA sequencing technology is improving at a rate of 10x/year, in comparison to Moore's law, which is a rate of 1.5x/year.
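
To get a feel for how fast those two rates diverge, here is a quick compounding calculation. It takes the 10x/year and 1.5x/year figures from the talk at face value and assumes both hold steady, which is my simplification.

```python
# Yearly improvement factors as quoted in the talk.
SEQUENCING_RATE = 10.0
MOORE_RATE = 1.5


def fold_improvement(rate_per_year, years):
    """Total fold improvement after compounding a yearly rate."""
    return rate_per_year ** years


def gap_after(years):
    """How many times further sequencing has improved than Moore's law."""
    return fold_improvement(SEQUENCING_RATE, years) / fold_improvement(MOORE_RATE, years)


for years in (1, 3, 5):
    sequencing = fold_improvement(SEQUENCING_RATE, years)
    moore = fold_improvement(MOORE_RATE, years)
    print(f"{years} years: sequencing {sequencing:g}x, Moore {moore:g}x, gap {gap_after(years):.1f}x")
```

The gap itself compounds at 10/1.5, so roughly 6.7x per year, which is why data generation outruns the compute available to analyze it so quickly.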

He talked at length about the growth of various -omes that will allow for an increase in understanding of cancer, and also made a big, big point about how so far every attempt to anonymize biological data has failed, and that the answer is not to anonymize but to get the informed consent of the subjects up front. He pimped his attempt to do this, personalgenomes.org, which looks very cool.

I'm glad to see some of the really famous people in biology get behind the idea of open science. Walled gardens of data are already hurting scientific advancement and our abilities to build useful tools to analyze the data and create predictive models.

What can we learn from Google?

Since I'm stuck here waiting in the lobby I will talk a bit about something I've been thinking a lot about: large scale datasets and large scale machine learning. This post on Google's blog a few days ago has had me thinking. A large number of biological problems are classification problems. Does this genotype or this gene expression profile mean you'll get this phenotype, for instance, is a hugely targeted goal in the research community. This is, relatedly, the goal of a lot of GWAS studies: find SNPs X associated with phenotype Y.
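
For the "SNP X associated with phenotype Y" case, the simplest version of the test is a chi-square on a 2x2 table of allele counts in cases versus controls. A minimal sketch with made-up counts follows; the function names are hypothetical and a real GWAS pipeline does a lot more (quality control, population structure, multiple testing).

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / denominator


def p_value_1df(chi2):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up allele counts: the alternate allele is more common in cases.
chi2 = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
p = p_value_1df(chi2)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Because a GWAS repeats this test across hundreds of thousands of SNPs, the per-SNP threshold has to be far stricter than the usual 0.05, which is one reason large pooled datasets matter so much.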

However most of this data is walled off. Researchers don't want to share data, it takes extra time and it may help other people get papers out of the data that the original generator could have. In addition there have been problems with data across experiments being comparable, especially with microarrays.

So I'd like to propose something that won't actually happen but would be nice if it did: a simple, easy to use data collection and annotation database that is maintained by both the authors and others. GEO has the issue of the data only being updatable by the author, which can lead to really out of date information. ArrayWiki attempts to solve this problem but it only deals with a somewhat antiquated technology, microarrays, and hasn't caught on hugely in popularity. A real time curated database would be a substantial investment, but it would allow us to build and leverage large scale machine learning tools like Google is currently developing, which I think would allow for substantial scientific discovery.

Network Biology 2.0

The main impetus behind starting this blog was that I wanted to follow in the vaunted footsteps of a friend and blog a conference. Now the time is upon us, as I am sitting in the lobby of the Broad Institute waiting for the Network Biology 2.0 symposium to start. Network Biology 2.0 is a symposium sponsored by GNSbiotech and the Broad Institute. I must say, being somewhat of a country yokel in the wide world of science, I feel somewhat awed being in this building.

Tuesday, April 6, 2010

Granularity

As the title of my blog suggests, I will from time to time write about object manipulation, an umbrella term used by the juggling cognoscenti to encompass things like toss juggling, contact juggling, poi, diabolo, etc... I'm starting to give lessons at a local community center in hula hooping and poi and I've been thinking about some differences in how various communities push their object manipulation forward.

To draw some very broad strokes, I think that you can classify various types of object manipulation into two camps, fine grained manipulation or coarse grained manipulation. The distinction between the two is that with fine grained manipulation you can precisely control the object at a much lower level, and that precise control creates a discernible difference, while coarse grained manipulation requires far more effort to control and correspondingly has diminishing returns in noticeable effect.

To illustrate this, consider hula hooping. Movement in hula hooping can be somewhat incompletely partitioned into three major parts. On the body hooping consists of movement of the hoop around a part of your body by a force other than your hands. The obvious example is what everyone thinks of with respect to hula hooping: hooping around your waist, or your neck, or your chest, or your knees and ankles and so on. This is a really excellent example of coarse grained object manipulation. You have a somewhat limited number of places to hoop from, basically: ankles, knees, waist, chest, neck, shoulders, head, arms. Now, with enough practice and refinement you can control where exactly the hoop goes. Say you want to place it two inches above your belly button, or two inches below. You could probably do this with enough practice, but it would require a lot of work to create that precision and wouldn't really create much of a visual difference.

By contrast, let's look at contact juggling, particularly what I think of as the isolationist movement in contact juggling. As you can see in the video, this form of contact juggling relies on lots of precise isolations to make striking visual patterns.
Here the level of skill required to precisely move the object is lower, a consequence of the fact that the object is much lighter and can be directly manipulated by your hands, whereas a hoop in the previous context is manipulated by precisely controlling your abdominal and hip muscles.

Another sort of corollary we can draw I think from this idea of fine grained and coarse grained manipulation is that their learning curves are very different. Because coarse grained movement doesn't easily allow one to differentiate between precise control and relatively imprecise control coarse grained movement can have a learning curve that's fairly flat followed by an exponential rise in difficulty, whereas fine grained manipulation has a learning curve that seems more like a logarithm, very steep at first followed by somewhat of a plateau. I think hooping is again an excellent example of this where to learn some basic body hooping techniques is really not that difficult but to gain exact and precise control over the movement seems a herculean effort.

So the question most people are asking right now is: what the hell is this guy talking about, I don't know anything about juggling or object manipulation or whatever. I'm going to ignore those people and think about another question maybe a few people are asking right about now: so what, who cares? Well, I care, because I think this says something interesting about how the different arts develop and how their communities develop. Hooping's community (from a distance) seems largely dominated by people who are interested in pushing forward the integration of hooping with body movement and dancing and costuming (not to say there aren't several really amazing people pushing the abilities of tech hooping), whereas, say, the toss juggling community is obsessed with creating the next cool siteswap pattern or the next impressive body throw trick.

I think this also has repercussions in how various groups view each other. Coarse grained manipulation seems to focus on the interplay of body movement with their object while fine grained manipulation tends to focus on exploring the space of visual patterns able to be created by their object. This can have the coarse grained manipulators see the fine grained manipulators as "tech nerds" while the fine grained manipulators view the coarse grained manipulators as "too dancey".

Friday, April 2, 2010

Graphical Models and Algebraic Statistics

So most people who know me in an academic setting know that I have a thing for algebraic statistics. I'm more of a cheerleader for it than an expert in it. I know some algebra and have read a few chapters of various books on the subject of algebraic statistics, but I certainly have never grasped it enough to do anything meaningful. I wonder though if there is potentially any cool relationship between graphical models and algebraic statistics. My basis for this? Absolutely nothing, except that I like both of these objects and it would be neat for them to have cool intertwining properties.

I wish that algebraic statistics had more clear uses to it that would make the large amount of time I'd have to invest in it worthwhile but if someday these two things have cool properties, well I called it here first!

Thursday, April 1, 2010

GPUs and science?

I just attended an invited talk on how someone used GPU computing to process bioacoustic information and got around a 10x speed increase. Our research computing department has been kicking around the idea of if and how it should support or engage in GPU computing. Inevitably the question always arises: is this useful, and when?

I've found myself sort of drawn to GPU computing and CUDA since I first found out about it. Maybe it's the pretty bar graphs showing the increase in speed or maybe it's the idea that it's really "cutting edge" computing but throughout this whole year I've been thinking about it a lot. A lot of people have made some very good points that for doing science, it may be too early to jump on board unless your problem is easily GPU-ized and you need it answered now. The major issues with GPU adoption are:

1) It's so young. What will be the dominant API is likely not whatever is available now. Nvidia released the CUDA SDK in 2007 and they've just released an update that pretty drastically changes it with the inclusion of C++ among other things.

2) GPUs are the next math co-processor. It's only a matter of time until they are fully integrated onto CPUs. AMD has plans to do this by 2012 with Bulldozer and Intel attempted to do this recently with Larrabee. Once this happens you won't need any fancy-shmancy GPU code, everything will just happen automatically.

3) Which leads us to the last major issue. Coding for the GPU, while definitely easier than it was before CUDA and OpenCL, is still a major pain in the ass. So far I've mainly coded in high level languages like Java, Perl, Python and Ruby, with fun, convenient things like garbage collection. With GPU computing I'm suddenly much closer to the hardware. I never learned lower level coding because to me it seemed tangential to what I wanted to do, but now with the dramatic potential speed increases GPU computing offers it's really tempting to struggle with it.

Despite these major issues concerning GPU computing I'm still going to forge ahead I think. Thinking in these massively parallel ways has proven to be a lot of fun in its own right and that's the point of academia in my mind, to have fun and increase the knowledge of humanity.