MauveDB: Supporting Model-based User Views in Data

Description

MauveDB: Supporting Model-based User Views in Database Systems Amol Deshpande, University of Maryland Samuel Madden, MIT Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day world Overwhelmingly large raw data volumes generated continuously Data must be processed in real-time The applications have strong acquisitional aspects Data may have to be actively acquired from the environment Typically imprecise, unreliable and incomplete data Inherent measurement noises (e.g. GPS) and low success rates (e.g. RFID) Communication link or sensor node failures (e.g. wireless sensor networks) Spatial and temporal biases because of measurement constraints Traditional data management tools are ill-equipped to handle these challenges

Comments
Would you like to comment?

Sign In if already a member, or Join Now for a free account.

Presentation Transcript Presentation Transcript

MauveDB: Supporting Model-based User Views in Database Systems : MauveDB: Supporting Model-based User Views in Database Systems Amol Deshpande, University of Maryland Samuel Madden, MIT

Motivation : Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day world

Motivation : Motivation Unprecedented, and rapidly increasing, instrumentation of our every-day world Overwhelmingly large raw data volumes generated continuously Data must be processed in real-time The applications have strong acquisitional aspects Data may have to be actively acquired from the environment Typically imprecise, unreliable and incomplete data Inherent measurement noises (e.g. GPS) and low success rates (e.g. RFID) Communication link or sensor node failures (e.g. wireless sensor networks) Spatial and temporal biases because of measurement constraints Traditional data management tools are ill-equipped to handle these challenges

Example: Wireless Sensor Networks : Example: Wireless Sensor Networks A wireless sensor network deployed to monitor temperature sensors select time, avg(temp) from sensors epoch 1 hour User {10am, 23.5} {11am, 24}

Example: Wireless Sensor Networks : Example: Wireless Sensor Networks A wireless sensor network deployed to monitor temperature sensors User

Typical Solution : Typical Solution Process data using a statistical/probabilistic model before operating on it Regression and interpolation models To eliminate spatial or temporal biases, handle missing data, prediction Filtering techniques (e.g. Kalman Filters), Bayesian Networks To eliminate measurement noise, to infer hidden variables etc Database insert into raw-data … Table raw-data Extract all readings into a file Run a statistical model (e.g. regression) using MATLAB Write output to a file Write data processing tools to process/aggregate the output Sensor Network User Databases typically only used as a backing store; All data processing done outside

Issues : Issues Can’t exploit commonalities, reuse/share computation No easy way to keep the model outputs up-to-date Lack of declarative languages for querying the processed data Large amount of duplication of effort Non-trivial Expert knowledge & MATLAB familiarity required ! Prevents real-time analysis of the data in most cases Why are databases not doing any of this ? We are very good at most of these things

Solution: Model-based User Views : Solution: Model-based User Views An abstraction analogous to traditional database views Provides independence from the messy measurement details

MauveDB System : MauveDB System Supports the abstraction of Model-based User Views Provides declarative language constructs for creating such views Supports SQL queries over model-based views Keeps the models up-to-date as new data is inserted into the database

MauveDB System : MauveDB System Supports the abstraction of Model-based User Views Provides declarative language constructs for creating such views Supports SQL queries over model-based views Keeps the models up-to-date as new data is inserted into the database

Outline : Outline Motivation Model-based views Details, view creation syntax, querying Query execution strategies MauveDB implementation details Experimental evaluation

Linear Regression : Linear Regression Models a dependent variable as a function of a set of independent variables Model temperature as a function of (x, y) E.g. temp = w1 + w2 * x + w3 * x2 + w4 * y + w5 * y2

Grid Abstraction : Grid Abstraction temperatures Use Regression to model temperature as: temp = w1 + w2 x + w3 x2 + w4 y + w5 y2 A Regression-based View raw-temp-data User User Consistent uniform view Apply regression; Compute “temp” at grid points

Creating a Regression-based View : Creating a Regression-based View CREATE VIEW RegView(time [0::1], x [0:100:10], y[0:100:10], temp) AS FIT temp USING time, x, y BASES 1, x, x2, y, y2 FOR EACH time T TRAINING DATA SELECT temp, time, x, y FROM raw-temp-data WHERE raw-temp-data.time = T

View Creation Syntax : View Creation Syntax Somewhat model-specific, but many commonalities CREATE VIEW IntView(t [0::1], sensorid [::1], y[0:100:10], temp) AS INTERPOLATE temp USING time, sensorid FOR EACH sensorid M TRAINING DATA SELECT temp, time, sensorid FROM raw-temp-readings WHERE raw-temp-readings.sensorid = M A Interpolation-based View

Outline : Outline Motivation Model-based views Details, view creation syntax, querying Query execution strategies MauveDB implementation details Experimental evaluation

Querying a Model-based View : Querying a Model-based View Analogous to traditional views So: select * from reg-view Lists out temperatures at all grid-points select * from reg-view where x = 15 and y = 20 Lists temperature at (15, 20) at all times …

Query Processing : Query Processing Two operators per view type that support get_next() API ScanView Returns the contents of the view one-by-one IndexView (condition) Returns tuples that match a condition e.g. return temperature where (x, y) = (10, 20) select * from locations l, reg-view r where (l.x, l.y) = (r.x, r.y) and r.time = “10am”

View Maintenance Strategies : View Maintenance Strategies Option 1: Compute the view as needed from base data For regression view, scan the tuples and compute the weights Option 2: Keep the view materialized Sometimes too large to be practical E.g. if the grid is very fine May need to be recomputed with every new tuple insertion E.g. a regression view that fits a single function to the entire data Option 3: Lazy materialization/caching Materialize query results as computed Generic options shared between all view types

View Maintenance Strategies : View Maintenance Strategies Option 4: Maintain an efficient intermediate representation Typically model-specific Regression-based Views Say temp = f(x, y) = w1 h1(x, y) + … + wk hk(x, y) Maintain the weights for f(x, y) and a sufficient statistic Two matrices (O(k2) space) that can be incrementally updated ScanView: Execute f(x, y) on all grid points IndexView: Execute f(x, y) on the specified point InsertTuple: Recompute the coefficients Can be done very efficiently using the sufficient statistic Interpolation-based Views Build and maintain a tree over the tuples in the TRAINING DATA

Outline : Outline Motivation Model-based views Details, view creation syntax, querying Query execution strategies MauveDB implementation details Experimental evaluation

MauveDB: Implementation Details : MauveDB: Implementation Details Written in the Apache Derby Java open source database system Support for Regression- and Interpolation-based views Minimal changes to the main codebase Much of the additional code (approx 3500 lines) fairly generic in nature A view manager (for bookkeeping) Query processing operators View maintenance strategies Model-specific code Intermediate representation Part of the view creation syntax

MauveDB: Experimental Evaluation : MauveDB: Experimental Evaluation Intel Lab Dataset 54-node sensor network monitoring temperature, humidity etc Approx 400,000 readings Attributes used Independent - time, sensorid, x-coordinate, y-coordinate Dependent - temperature

Spatial Regression : Spatial Regression Contour plot over the data obtained using: select * from reg-view where time = 2100

Interpolation : Interpolation Time

Comparing View Maintenance Options : Comparing View Maintenance Options 50000 tuples initially Mixed workload: insert 1000 records issue 50 point queries issue 10 average queries Brief summary: Intermediate representation typically the best Among others, dependent on the view properties, and query workload 112.6s

Ongoing and Future Work : Ongoing and Future Work Adding support for views based on dynamic Bayesian networks (e.g. Kalman Filters) A very general class of models with wide applicability Generate probabilistic data Developing APIs for adding arbitrary models Minimize the work of the model developer Query processing, query optimization, and view maintenance issues Much research still needs to be done

Conclusions : Conclusions Proposed the abstraction of model-based views Poweful abstraction that enables declarative querying over noisy, imprecise data Exploit commonalities to define, to create, and to process queries over such views MauveDB prototype implementation Using the Apache Derby open source DBMS Supports Regression- and Interpolation-based views Supports many different view maintenance strategies

Thank you !! : Thank you !! Questions ?

Related Online Classes

Copyrights © 2009 authorGEN. All rights reserved.