Pentaho ETL and Dimensional Modeling (Design Page, R&D)
OpenMRS has few tools in place allowing for easier analysis of concept, patient, location, encounter or visit data in an aggregated, dimensional manner. OLAP (Online Analytical Processing) is one technology encompassed under the umbrella of business intelligence that facilitates rapid answers to multi-dimensional querying of data.
Click on the image at right for a simple sample of what dimensional modeling looks like at a high level.
This functionality extends beyond traditional reporting in several ways:
- OLAP tends to be more adhoc in nature, although static queries can be issued to the underlying engine;
- OLAP deals specifically with querying at an aggregate level (whether that means simply counting patients meeting requested dimensional criteria or performing complex calculations on some other measure), and drilling into interesting intersections (or tuples) for more detailed data;
- OLAP requires a significantly different data model than the underlying OpenMRS data model in order to craft the dimensional models that the OLAP engine understands.
The community edition of the Pentaho Business Intelligence suite includes Pentaho Analysis, an OLAP engine (specifically ROLAP) project named Mondrian. This suite also includes the powerful and easy to use Pentaho Data Integration, an ETL engine project named KETTLE. The Pentaho BI platform is architected as a pluggable and componentized set of pillars (or engines) that can operate in the Pentaho BI Server as an integrated set of tools, or can be utilized as embedded engines in other applications individually.
The purpose of this project is to research the feasibility of integrating the Pentaho toolset with OpenMRS to provide advanced ETL and analytic modeling within an OpenMRS module.The design should take into consideration the fact that data semantics can vary significantly between OpenMRS implementations, which makes standard ETL and modeling challenging. The module will require rules definition and logic that guide the ETL processes in the movement of dimensional members and aggregation of fact data. Design considerations for this logic should include utilizing existing forms, rules engines, fuzzy matching, semantic modeling and other techniques to bridge the gap between concrete facts and data-relationship semantics.
The project will include ongoing development of a set of prototype ETL transformations and models in order to flesh out detailed requirements and validate design decisions. This requires a decent sized implementation of OpenMRS to partner with, and Andy Kanter from the Millenium Villages (Institute, Columbia University) project has agreed to participate.
There will be two sets of parallel objectives defined. The first set of objectives pertain specifically to the work that will be done with the MVP project. The second set pertain to the overall project, which aim to create functionality that will benefit the largest set of implementations of OpenMRS as possible. This approach is necessary to first understand the problem space and the requirements of a single implementation clearly in order to better understand the needs and thus the design required for the abstract (the broader set of implementers). The objectives we set for the prototypes developed for the MVP project compliment the objectives for the overall project.
MVP Project Objectives
- Analyze 5 - 10 dimensional problem cases, and model these cases using Pentaho Data Integration and Pentaho Analysis.
- Build an environment that can access an MVP OpenMRS populated data model.
- Create the ETL transformations and jobs to populate the MVP data-mart.
- Process the MVP data and transform it into the data-mart's tables
- Sufficient de-identification processing and security must be documented and agreed upon before the data can moved or visualized.
- Create the OLAP models against the data-mart schema.
- Build an environment for visualizing the OLAP models that can be accessed by Pentaho resources as well as MVP resources.
- Build a feedback mechanism into the visualization environment to allow iterative development of the models.
- Detailed design and documentation for the data schema and data relationship logic required to facilitate a standard analytics environment for the MVP project, typically termed a data-mart or data warehouse.
Overall Project Objectives
The primary objective at this time is to define a rational design for easily standing up and populating a data model for an OpenMRS installation that is conducive to analytic querying.
- Detailed design and documentation for the data schema required to facilitate a standard analytics environment for a base OpenMRS deployment, typically termed a data-mart or data warehouse.
- Investigate the requirements and document findings around dynamically expanding the warehouse or data-mart schema as new and/or unique attributes, dimensions and measures are introduced to the OpenMRS system.
- Research technologies that can facilitate the acquisition, processing and input of data relationship logic from a user interface (?) to the Pentaho Data Integration ETL engine, in order to accomplish the dynamic ETL processing necessary to populate the data-mart or warehouse according to the unique data requirements for an individual OpenMRS implementation.
- Build an OpenMRS module that will
- allow rules definition creation and editing, whether that be a user interface or other;
- process rules definitions and convert to ETL process input;
- harness the Pentaho Data Integration ETL engine for embedded execution of processes that will build and populate the schema.
- Research and document the most appropriate approach for querying the Pentaho Analysis engine and visualizing the new data model.
OpenMRS module development
Pentaho ETL and analysis skills
Pentaho ETL embedding development
Data warehouse modeling and development
The guys at Meteorite Consulting have offered hosting services for the analysis visualization prototyping; they may also be able to facilitate the MVP data-mart schemas.