Using Probabilistic Models for Data Management in Acquisitional Environments : Using Probabilistic Models for Data Management in Acquisitional Environments
Sam Madden
MIT CSAIL
With Amol Deshpande (UMD), Carlos Guestrin (CMU)
Overview : Overview Querying to monitor distributed systems
Sensor-actuator networks
Distributed databases
Probabilistic models provide a framework for dealing with all of these issues Issues
Missing, uncertain data
High acquisition, querying costs I’m not proposing
a complete
system!
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
Not your mother’s DBMS : Not your mother’s DBMS Data doesn’t exist apriori
Acquisition in DBMS
Critical issue: given limited amount of noisy, lossy data, how can users interpret answers? Insufficient bandwidth
Selective observation
Sometimes, desired data is unavailable
Must be robust to loss
Data is correlated : Data is correlated Temperature and voltage
Temperature and light
Temperature and humidity
Temperature and time of day
etc.
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
Solution: Probabilistic Models : Solution: Probabilistic Models Probability distribution (PDF) to estimate current state
Model captures correlation between variables
Directly answer queries from PDF
Incorporate new observations
Via probabilistic inference on model
Model the passage of time
Via transition model (e.g., Kalman filters)
Models learned from historical data
Architecture: Model-driven Sensornet DBMS : “SELECT nodeid,temp
FROM sensors
CONF .95 TO ± .5°”
Architecture: Model-driven Sensornet DBMS Probabilistic
Model Query New Query posterior belief Advantages vs. “Best-Effort Query-Everything”
Observe fewer attributes
Exploit correlations
Reuse information between queries
Directly deal with missing data
Answer more complex (probabilistic) queries
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
New Types of Queries : New Types of Queries Architecture enables efficient execution of many new queries
Approximate queries
“Tell me the temperature to within ± .5 degrees with 95% confidence?”
Probabilistic Query Optimization Problem : Probabilistic Query Optimization Problem What observations will satisfy confidence bounds at minimum cost?
Must define cost metric and model
Sensornets: metric = power, cost = sensing + comm
Decide if a set of observations satisfies bounds
Choose a search strategy
Choosing observation plan : P(Xi[a,b]) > 1- Choosing observation plan Is a subset S sufficient? reward Pick your favorite search strategy
More New Queries : More New Queries Outlier queries
“Report temperature readings that have a 1% or less chance of occurring.”
Extend architecture with local filters: Transmit Outliers Update Models Issues:
Bias
Inefficiency
Even More New Queries : Even More New Queries Prediction queries
“What is the expected temperature at 5PM today, given that it is very humid?”
Influence queries
“What percentage of network traffic at site A is explained by traffic at sites B and C?” Queries could not be answered without a model!
UI Issues : UI Issues How to make probability “intuitive”?
How to allow users to express queries?
Issues
Query Language
UI
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
Applications : Applications Sensor-based Building Monitoring
Often battery powered
100s-1000s of nodes
Example: HVAC Control
Tolerant of approximate answers
Reduction in energy significant
App: Distributed System Monitoring : App: Distributed System Monitoring Goal: detect/predict overload, reprovision
Many metrics that may indicate overload
Disk usage, CPU load, network load, network latency, active queries, etc.
Cost to observe
Problem: What metrics foreshadow overload?
Soln:
Train on data labeled w/ overload status
Choose obs. plan that predicts label
Other Apps : Other Apps Stream load shedding
Sensor network intrusion detection
Database statistics
See paper!
Outline : Outline Motivation
Probabilistic Models
New Queries and UI
Applications
Challenges and Concluding Remarks
Extension, Not Restriction : Extension, Not Restriction Acquisition Layer + Tabular Data Model 1 Model 2 System State Gaussians Discrete (Histograms) Integration Layer Possible to have many views of same data
Different models
Base data Number of architectural challenges
Every rose… : Every rose… Models can can fail to capture details
Models can be wrong
Models can be expensive to build
Models can be expensive to maintain
Paper suggests a number of known techniques from the ML community.
Whither hence? : Whither hence? See the paper for technical details
See other work
Probabilistic data models
Outlier and change detection
Generalize these ideas to:
New models
Non-numeric types
New environments, queries
Make some AI and stats friends
Conclusions : Conclusions Emerging data management opportunities:
Ad-hoc networks of tiny devices
Large scale distributed system monitoring
These environments are:
Acquisitional
Loss-prone
Probabilistic models are an essential tool
Tolerate missing data
Answer sophisticated new queries
Framework for efficient acquisitional execution
Questions : Questions
Example: TinyDB : Example: TinyDB Declarative queries for sensornets
SELECT roomNo, AVG(temp)
FROM sensors
GROUP BY roomNo
HAVING MAX(light) > 100 lux
SAMPLE PERIOD 1 s Queries flooded, reverse-flood aggregation
Best effort
TinyDB Limitations : TinyDB Limitations Difficult to interpret answers
Answering nodes can change between samples
Limited queries
No historical trends
No future predictions
No outlier detection
High overhead
Query flooding, full network traversal
No information sharing between sample periods
App: Value-Based Load Shedding : App: Value-Based Load Shedding User prioritizes some output values over others
May have to shed load
Issue: what inputs correspond to desired outputs?
Esp. hard for aggregates, UDFs
Can learn a probabilistic model that gives
P(output value | input tuple)
Requires source tuple references on result tuples
Use this model to decide which tuples to drop
Coping with Complexity : Coping with Complexity Graphical models
Coping with Mistakes : Coping with Mistakes Retraining models
Prior Work : Prior Work On probabilistic data models / statistics
Not addressing issue of data acquisition
On using models to decide what to capture
Tends to focus on performance issues
Our Concerns:
What data to acquire?
Interpretability given missing data