Lessons and Challenges of Building Data Repositories : Ken Buetow
NCICB/NCI/NIH/DHHS
Experience with Diverse Communities : Human gene mapping community
Cancer Genome Anatomy Project (CGAP)
Mouse Models of Human Cancer Consortium
Director’s Challenge consortium
SPORE community
Clinical trials community
Imaging community
Integrated Cancer Biology Program
Cancer Biomedical Informatics Grid (caBIG)
Lessons learned… : Know what problem you are attempting to solve…
What is your goal?
Who is your “customer”?
What do they need?
Compared with what they want…
How will they use a given feature/attribute?
Maintain focus/discipline
NCI biomedical informatics : Goal: A virtual web of interconnected data, individuals, and organizations that redefines how research is conducted, care is provided, and patients/participants interact with the biomedical research enterprise
Cancer Biomedical Informatics Grid (caBIG): the program… : Common, widely distributed infrastructure permits the cancer research community to focus on innovation
Shared vocabulary, data elements, data models facilitate information exchange
Collection of interoperable applications developed to a common standard
Raw published cancer research data is available for mining and integration
caBIG: the pilot… : Workspaces
Clinical Trials Management System
Integrated Cancer Research
Tissue Banks and Pathology
Vocabulary and Common Data Elements
Architecture
Strategic Working Groups
Data Sharing and Intellectual Capital
Training
caBIG Strategic Planning
Special Interest Groups
23 groups focused on specific topics
caBIG pilot - participation : Pilot – NCI-designated Cancer Centers
Members: 45 institutions – executed base agreements
developers
adopters
working group members
Statistics
Over 450 active participants
196 teleconferences
10 face-to-face meetings
Volunteers
academic Centers
industry
Partners
Affiliates
Lessons learned… : Open is good!
Data sharing
Open source code
Open access
“Do no harm” licenses
Lessons Learned… : Today’s tools are not likely to be tomorrow’s
Killer apps
Accessible, useful, user friendly apps critical to adoption
Not always the best approach (Eisen’s cluster analysis)
Design infrastructure that facilitates rapid exploration of new methods
Open source
Isolate data from applications
Component architecture
Components: software parts : Small parts are better for building flexible shapes
Have a uniform interface medium
Snap-together connectivity
Internals can be made from widely varying technologies
Boundaries and Interfaces : Focus on boundaries, interfaces, and how things fit together,
not on the internal details of how they’re built: assume those will be diverse and changing
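The component idea on the two slides above can be sketched in a few lines of Python. This is only an illustration, not caBIG code: the `Component` protocol, the two example components, and `run_pipeline` are hypothetical names invented here. Each part exposes the same `process` method (the uniform interface), while its internals are free to differ.

```python
import math
from typing import Dict, Iterable, List, Protocol

# A record is a simple mapping of measurement name to value (assumed shape).
Record = Dict[str, float]


class Component(Protocol):
    """Uniform interface: every component takes records and returns records."""

    def process(self, records: List[Record]) -> List[Record]:
        ...


class ThresholdFilter:
    """Drops records whose value for `key` is below `minimum`."""

    def __init__(self, key: str, minimum: float) -> None:
        self.key = key
        self.minimum = minimum

    def process(self, records: List[Record]) -> List[Record]:
        return [r for r in records if r[self.key] >= self.minimum]


class Log2Transform:
    """Replaces the value for `key` with its base-2 logarithm."""

    def __init__(self, key: str) -> None:
        self.key = key

    def process(self, records: List[Record]) -> List[Record]:
        return [{**r, self.key: math.log2(r[self.key])} for r in records]


def run_pipeline(components: Iterable[Component], records: List[Record]) -> List[Record]:
    """Snap components together; the caller only depends on the interface."""
    for component in components:
        records = component.process(records)
    return records
```

Because `run_pipeline` relies only on the boundary (`process`), a component can later be swapped for one built on a different technology without touching the rest of the chain.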
Lessons Learned… : Standards versus standardization
Data standards
Use established standards where they exist
Modify/extend existing standards wherever possible
Develop new standards “just in time”, based on practical experience of large-scale users
Create new standards as necessary – “just enough”
Standards can NOT be proprietary
caCORE – common ontologic representation environment : Metadata Infrastructure
Enterprise Vocabulary : NCI Metathesaurus (cross-maps standard vocabularies/ontologies, e.g. SNOMED, MedDRA, ICD)
Semantic integration, inter-vocabulary mapping
UMLS Metathesaurus extended with cancer-oriented vocabularies
800,000 Concepts, 2,000,000 terms and phrases
Mappings among over 50 vocabularies
NCI Thesaurus
Description logic-based
18,000 “Concepts”
Concept is the semantic unit
One or more terms describe a Concept – synonymy
Semantic relationships between Concepts
[Diagram labels: biomedical objects, common data elements, controlled vocabulary]
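The Concept-centred model described for the NCI Thesaurus (one Concept, one or more terms, typed relationships between Concepts) can be illustrated with a small sketch. The class names, the `is_a` relation label, and any concept codes passed in are assumptions made for illustration; they are not the EVS API or real thesaurus identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Concept:
    """The semantic unit: one code, one preferred name, any number of synonyms."""
    code: str                      # hypothetical identifier, not a real NCI code
    preferred_name: str
    synonyms: Set[str] = field(default_factory=set)


class Thesaurus:
    """Toy vocabulary: terms resolve to Concepts, Concepts relate to Concepts."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}
        self._term_index: Dict[str, str] = {}             # lower-cased term -> code
        self._relations: List[Tuple[str, str, str]] = []  # (source, relation, target)

    def add_concept(self, concept: Concept) -> None:
        self._concepts[concept.code] = concept
        # Synonymy: every term for the Concept points at the same code.
        for term in {concept.preferred_name, *concept.synonyms}:
            self._term_index[term.lower()] = concept.code

    def relate(self, source_code: str, relation: str, target_code: str) -> None:
        self._relations.append((source_code, relation, target_code))

    def lookup(self, term: str) -> Optional[Concept]:
        """Case-insensitive lookup of a term; returns None if the term is unknown."""
        code = self._term_index.get(term.lower())
        return self._concepts[code] if code is not None else None

    def related(self, code: str, relation: str) -> List[Concept]:
        """Concepts reachable from `code` through one `relation` edge."""
        return [self._concepts[t] for s, r, t in self._relations
                if s == code and r == relation]
```

Looking up either a preferred name or a synonym returns the same `Concept` object, which is the point of making the Concept, rather than the term, the unit of meaning.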
Common Data Elements : Structured data reporting elements
Precisely defining the questions and answers
What question are you asking, exactly?
What are the possible answers, and what do they mean?
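As a rough illustration of "precisely defining the questions and answers", a data element can be modelled as a question plus a closed set of permissible values, each with a stated meaning. The class below is a hypothetical sketch and does not reproduce the caDSR or ISO 11179 data model; the example values shown in the comment are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CommonDataElement:
    """A question together with its permissible answers and their meanings."""
    question: str
    permissible_values: Dict[str, str]  # answer value -> what it means

    def validate(self, answer: str) -> str:
        """Return the meaning of `answer`, or raise if it is not permitted."""
        if answer not in self.permissible_values:
            allowed = ", ".join(sorted(self.permissible_values))
            raise ValueError(
                f"{answer!r} is not a permissible value for "
                f"{self.question!r}; expected one of: {allowed}"
            )
        return self.permissible_values[answer]


# Example element (illustrative values only):
# CommonDataElement(
#     question="How was the specimen preserved?",
#     permissible_values={
#         "FF": "Fresh frozen tissue",
#         "FFPE": "Formalin-fixed, paraffin-embedded tissue",
#     },
# )
```

Rejecting free-text answers at entry time is what lets data from different groups be pooled later without guessing what each value meant.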
Biomedical Information Objects : Data service infrastructure developed using OMG’s Model Driven Architecture approach
Object models expressed in UML represent actual biomedical research entities such as genes, sequences, chromosomes, cellular pathways, ontologies, clinical protocols, etc.
The object models form the basis for uniform APIs (Java, SOAP, HTTP-XML, Perl) that provide an abstraction layer and interfaces for developers to access information without worrying about the back-end data stores
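The abstraction-layer idea can be sketched as a domain object plus interchangeable back ends. This is not the caBIO API: `Gene`, `DictGeneStore`, `SqliteGeneStore`, `chromosome_of`, and the `gene` table layout are all hypothetical, chosen only to show application code that stays the same when the data store changes.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Gene:
    """Domain object the application works with, independent of storage."""
    symbol: str
    chromosome: str


class DictGeneStore:
    """Back end 1: genes held in an in-memory dictionary."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = rows

    def find(self, symbol: str) -> Optional[Gene]:
        chromosome = self._rows.get(symbol)
        return Gene(symbol, chromosome) if chromosome is not None else None


class SqliteGeneStore:
    """Back end 2: genes held in a SQLite table named `gene` (assumed schema)."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection

    def find(self, symbol: str) -> Optional[Gene]:
        row = self._conn.execute(
            "SELECT symbol, chromosome FROM gene WHERE symbol = ?", (symbol,)
        ).fetchone()
        return Gene(*row) if row is not None else None


def make_demo_sqlite_store() -> SqliteGeneStore:
    """Build a throwaway in-memory database for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT)")
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [("TP53", "17"), ("BRCA2", "13")])
    return SqliteGeneStore(conn)


def chromosome_of(store, symbol: str) -> str:
    """Application code: works with any store that offers `find`."""
    gene = store.find(symbol)
    return gene.chromosome if gene is not None else "unknown"
```

`chromosome_of` never sees SQL or dictionaries, only `Gene` objects, so either store can be passed in without changing the caller.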
Standards supporting infrastructure : Enterprise Vocabulary Services (EVS)
Browsers
APIs
cancer Bioinformatics Infrastructure Objects (caBIO)
Applications
APIs
cancer Data Standards Repository (caDSR)
CDEs
Case Report Forms
Object models
ISO 11179 model
caCORE Software Development Toolkit
caBIG Compatibility Matrix :
Lessons Learned… : Quality measures are transforming
Qualitative and quantitative
Objective measures critical
Should track with the data
Lessons Learned… : The devil is in the details
Experimental inputs can be as critical as outputs
Laboratory information management systems (LIMS)
Lessons Learned… : You really are going to want to connect these results to other outcomes!
Other data types
Clinical outcomes
Slide23 : etiology, treatment, prevention
caBIG pilot products : Tissue Bank and Pathology Tools Workspace
caTISSUE architecture and use cases
Federated Tissue Data Set White Paper
Data Sharing Federation Operational Guidelines (4th quarter 2004)
caTIES beta release (1st quarter 2005)
caTISSUE Lite prototype (2nd quarter 2005)
caTISSUE prototype (2nd quarter 2005)
External module connector prototype (2nd quarter 2005)
De-identification reports tool operational (4th quarter 2005)
caBIG pilot products : Integrated Cancer Research
Gene Annotation
PIR (2nd quarter 2005)
Cancer Molecular Pages (3rd quarter 2005)
Function Express (3rd quarter 2005)
GoMiner (3rd quarter 2005)
HapMap (3rd quarter 2005)
SEED (4th quarter 2005)
Data Analysis and Statistical Tools
Distance-Weighted Discrimination (2nd quarter 2005)
Magellan (2nd quarter 2005)
VISDA (2nd quarter 2005)
Gene Pattern (4th quarter 2005)
Translational (Clinical Integration)
TrAPSS (3rd quarter 2005)
Informatics for Proteomics
LIMS (2nd quarter 2005)
Q5 (3rd quarter 2005)
RProteomics (4th quarter 2005)
Microarray Repositories
caArray (4th quarter 2004)
NCI-60 Data Sharing (2nd quarter 2005)
Zebrafish Microarray Data Sharing (2nd quarter 2004)
Pathways
Cytoscape/BioPAX/cPath (3rd quarter 2005)
QPACA (3rd quarter 2005)
Reactome (4th quarter 2005)
Interacting with caBIG : Track activities and progress on the caBIG Web site at http://caBIG.nci.nih.gov
Participate in caBIG open meetings to coordinate activities.
Work toward making your applications and solutions caBIG compatible. Current guidelines for caBIG compatibility are available on the caBIG Web site.
Use caCORE infrastructure: use EVS, CDEs, and models where defined; register metadata in caDSR (http://ncicb.nci.nih.gov/core)
Download and get familiar with the tools and applications already available on the caBIG Web site.
Submit tools and data infrastructure to caBIG repositories