Abstract:

Data in public repositories remains remarkably underused despite significant investments in open science. Making data available online turns out to be the easy part; making the data usable for science is harder, and motivates ML algorithms that help automate curation tasks and enable longitudinal, integrative analysis.

But training data is scarce and expensive due to the specialized nature of the tasks, and semantic heterogeneity between datasets makes integration difficult.

We are exploring a number of services to address these issues. We combine distant supervision and co-learning methods to provide high-quality labels with zero training data, and show that this approach outperforms even the state-of-the-art (and expensive) supervised methods. We then use statistical claims extracted from the text of scientific papers to disambiguate schema mappings across disparate datasets. Finally, we automate experiments to verify extracted claims against the integrated data, to help researchers, journal editors, and curators hold scientists accountable for weakly reproducible results.

These approaches are already beginning to have impact: computational biologists are using our curated gene expression corpus as the gold standard to search for new cancer treatments, and social scientists are using our curated corpus of scientific figures to understand and optimize how researchers use visualization to communicate (an emerging field we call "viziometrics").

Bio Bill Howe:

I am an Associate Professor in the Information School, Adjunct Associate Professor in Computer Science & Engineering, and Associate Director and Senior Data Science Fellow at the UW eScience Institute. I am a co-founder of [email protected], and with support from the MacArthur Foundation and Microsoft, I lead UW's participation in the MetroLab Network. I created a first MOOC on Data Science through Coursera, and I led the creation of the UW Data Science Masters Degree, where I serve as its first Program Director and Faculty Chair. I serve on the Steering Committee of the Center for Statistics in the Social Sciences.

My group's research aims to make the techniques and technologies of data science dramatically more accessible, particularly at scale. Our methods are rooted in database models and languages, though we sometimes work in machine learning, visualization, HCI, and high-performance computing. We are an applied, systems-oriented group, frequently sourcing projects through collaborations in the physical, life, and social sciences.