Task 1 - Incorporate a process to validate de-duplication strategies.

Task Description

Configuring a de-duplication strategy to find potential duplicates is a moderately complex task. If configured incorrectly, the de-duplication process may fail, or the linkage results may be inaccurate. To ensure a properly configured linkage approach, we will incorporate a validation process that highlights errors in the linkage configuration. Examples of invalid matching configurations include undefined blocking field(s), undefined matching fields, etc.

Solution

Validate each criterion and help the user fix any issues before the matching criteria are saved.

Approach

For new strategies

A checklist will be added to the left of the "Save" button at the bottom of the page. It will list errors and warnings, with errors displayed in red and warnings in green. The list will be updated dynamically after each user action. The "Save" button will be disabled initially and will stay disabled as long as the strategy contains errors. Once the user fixes all the errors, the "Save" button will be enabled and the user will be allowed to proceed.

When the user proceeds to save, a confirmation window will be shown. It will include any warnings about the strategy that remain in the checklist, along with other details (related to other tasks). If the user is happy with the strategy, he/she can save it (even if there are warnings); otherwise he/she can go back and modify the strategy further.

Mock UI for the task:

For existing strategies

Existing strategies will be checked when the module is updated, and the user will be prompted to correct any strategy that fails validation. (Needs a feasibility study.)

Additional notes

The space used by the checklist will also be used to display details related to other tasks as necessary (e.g. the number of pairs the strategy would create, as in task 2).
Not specifying at least one "Must match" field and at least one "Should match" field will be considered an error. The user needs to specify at least one of each to save a strategy.
The warning situations will be identified by task 4.
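
The checklist rules described in this task can be sketched roughly as follows. This is only an illustration, not the module's actual code: the role constants, the `Checklist` class and `validate_strategy` are hypothetical names, and the warnings are passed in from outside because they are to be defined by task 4.

```python
from dataclasses import dataclass, field

# Hypothetical role labels; the requirements only name "Must match" and "Should match".
MUST_MATCH = "MUST_MATCH"
SHOULD_MATCH = "SHOULD_MATCH"
IGNORE = "IGNORE"


@dataclass
class Checklist:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def save_enabled(self):
        # "Save" stays disabled while any error remains; warnings do not block it.
        return not self.errors


def validate_strategy(field_roles, warnings=()):
    """Build the checklist for a strategy given a mapping of field name -> role."""
    checklist = Checklist(warnings=list(warnings))
    roles = set(field_roles.values())
    if MUST_MATCH not in roles:
        checklist.errors.append('Select at least one "Must match" field')
    if SHOULD_MATCH not in roles:
        checklist.errors.append('Select at least one "Should match" field')
    return checklist


# Re-run the validation after each user action to refresh the checklist.
incomplete = validate_strategy({"family_name": MUST_MATCH, "gender": IGNORE})
complete = validate_strategy({"family_name": MUST_MATCH, "birthdate": SHOULD_MATCH})
```

In this sketch `incomplete.save_enabled` is false because no "Should match" field is selected, while `complete.save_enabled` is true.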

Task 2 - Incorporate a process to calculate total number of potential pairs formed by particular blocking strategy.

Task Description

The de-duplication module searches through "record pairs" that have a high likelihood of being duplicates. "Record pairs" are formed using "blocking strategies", which are simple approaches to finding similar records by requiring that one or more corresponding fields agree exactly between two records. Occasionally, however, the user may choose a blocking strategy that results in a very large or very small number of record pairs. Such extreme numbers of record pairs can cause unexpected results, including out-of-memory errors, excessively long runtimes, and confusing or inaccurate results.

Solution

Give the user an estimate of how many potential pairs the strategy would produce before he/she uses it.

Approach

Estimate how many record pairs the strategy would produce. The estimate will be calculated when the user saves the new strategy, and the user will be shown how many pairs would result from it. If the number of record pairs is more than 10 times (this is configurable) the original number of record pairs, the user will be warned to consider changing the strategy.
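
One way to compute such an estimate is to group the records by their blocking-field values and count the pairs inside each group: a block of n records yields n(n-1)/2 pairs. The sketch below assumes records are plain dictionaries; `estimate_pairs`, `needs_warning` and the sample data are hypothetical, and the requirements do not say that the module computes the figure this way. The `baseline` argument stands for whatever count the 10x rule is compared against.

```python
from collections import Counter


def estimate_pairs(records, blocking_fields):
    """Count the record pairs whose blocking fields all agree exactly."""
    block_sizes = Counter()
    for record in records:
        key = tuple(record.get(f) for f in blocking_fields)
        # Assumption: a record with a missing blocking value joins no block.
        if None in key:
            continue
        block_sizes[key] += 1
    # A block of n records yields n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


def needs_warning(pair_count, baseline, factor=10):
    """Apply the configurable "more than 10 times" rule."""
    return pair_count > factor * baseline


# Made-up sample data for illustration only.
SAMPLE = [
    {"family_name": "Silva", "gender": "F"},
    {"family_name": "Silva", "gender": "F"},
    {"family_name": "Silva", "gender": "F"},
    {"family_name": "Perera", "gender": "M"},
    {"family_name": "Perera", "gender": "M"},
    {"family_name": "Fernando", "gender": "M"},
]

by_name = estimate_pairs(SAMPLE, ["family_name"])  # 3 + 1 + 0 = 4 pairs
by_gender = estimate_pairs(SAMPLE, ["gender"])     # 3 + 3 = 6 pairs
```

Blocking on the coarser field (gender) produces more pairs than blocking on family name, which is the kind of difference the estimate is meant to surface before the strategy is run.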

The total number of records (at the time of saving the strategy) is also saved with the strategy. When the user runs the de-duplication process, a strategy will be re-estimated if the total number of records has changed by more than 10% (configurable) of the number it had when the strategy was created.

If a newly created strategy has the same blocking fields as an existing strategy, the new strategy will reuse the existing estimate instead of recalculating it.
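
The re-estimation trigger and the reuse of estimates across strategies with identical blocking fields could look roughly like this. It is a minimal sketch under stated assumptions: `Estimate`, `needs_reestimate` and `EstimateCache` are hypothetical names, and keying the cache on an unordered set of field names is a design guess rather than something the requirements specify.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    pair_count: int
    record_count: int  # total records when the estimate was made


def needs_reestimate(estimate, current_record_count, threshold=0.10):
    """True when the record total has changed by more than `threshold`."""
    if estimate.record_count == 0:
        return current_record_count > 0
    change = abs(current_record_count - estimate.record_count)
    return change / estimate.record_count > threshold


class EstimateCache:
    """Reuse one estimate for all strategies with the same blocking fields."""

    def __init__(self):
        self._by_fields = {}

    def get_or_compute(self, blocking_fields, record_count, compute):
        # Field order does not affect which records fall in the same block.
        key = frozenset(blocking_fields)
        if key not in self._by_fields:
            self._by_fields[key] = Estimate(compute(), record_count)
        return self._by_fields[key]
```

Here `compute` is any zero-argument callable that returns the pair count, so the expensive calculation runs only when no estimate exists yet for that set of blocking fields.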

Mock UI:

Task 3 - Upgrade the de-duplication reports from flat files to database persistence.

Task Description

The de-duplication module creates reports listing potentially duplicate records, which end-users can manually review and merge when necessary. Until recently, these "de-duplication reports" were stored as flat files. Unfortunately, flat files limit the ability to manage the data and hinder new creative ways to display the data. Therefore, upgrading from flat files to persisting the data in a relational database will help users and developers more meaningfully use this data.

Solution

Implement functionality that persists reports in a relational database instead of flat files.

Approach

ER Diagram

Database Schema

patientmatching_report{
    id
    name : varchar(50)
    createdBy refers user.id
    createdOn : datetime
}

Implemented by now

patientmatching_matchingset{
    reportId refers patientmatching_report.id
    groupId
    patientId refers patient.id
    state
}

The "patientmatching_matchingset" table will store the sets that are identified as duplicates. The groupId identifies the duplicate groups within a report. The state column records whether the match is "IGNORED", "MERGED", or "PENDING".

Partially implemented

patientmatching_report_configurations{
    reportId refers patientmatching_report.id
    configurationId refers patientmatching_configuration.id
}

This table will store the configurations (strategies) that are used for the reports.

Implemented

patientmatching_report_process_time{
    reportId refers patientmatching_report.id
    processSequenceNo
    process : varchar(50)
    timetaken
}

Partially implemented
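
To illustrate how the first two tables fit together, the sketch below creates them in an in-memory SQLite database and reads back the duplicate groups of one report. It is not the module's persistence code: SQLite is used only so the example is self-contained, the column types other than the two `varchar(50)` columns are assumptions, and `open_demo_db` and `duplicate_groups` are hypothetical helpers.

```python
import sqlite3

# Column types are assumed except where the schema above states them.
SCHEMA = """
CREATE TABLE patientmatching_report (
    id        INTEGER PRIMARY KEY,
    name      VARCHAR(50),
    createdBy INTEGER,   -- refers user.id (table not created in this sketch)
    createdOn DATETIME
);
CREATE TABLE patientmatching_matchingset (
    reportId  INTEGER REFERENCES patientmatching_report(id),
    groupId   INTEGER,
    patientId INTEGER,   -- refers patient.id (table not created in this sketch)
    state     VARCHAR(20) CHECK (state IN ('IGNORED', 'MERGED', 'PENDING'))
);
"""


def open_demo_db():
    """Create an in-memory database holding the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def duplicate_groups(conn, report_id):
    """Return {groupId: [patientId, ...]} for one report."""
    rows = conn.execute(
        "SELECT groupId, patientId FROM patientmatching_matchingset "
        "WHERE reportId = ? ORDER BY groupId, patientId",
        (report_id,),
    )
    groups = {}
    for group_id, patient_id in rows:
        groups.setdefault(group_id, []).append(patient_id)
    return groups
```

Storing one row per patient and group, rather than one line per group in a flat file, is what lets a report be filtered by state or regrouped with an ordinary query.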

Project Timeline

Task	Projected completion time	Comments
Task Number 1	One week (Deadline 28th May)	This is a basic task which is ideal to start off with. We allocated one week to it considering that this is the very first task, and that it lays the initial groundwork for tasks 2 and 4. Furthermore, this will be an ideal opportunity to get familiar with OpenMRS coding conventions.
Task Number 2	Two weeks (Deadline 11th June)	Completing task one should prepare the student for task two, which is more complicated. We have allocated two weeks because this task will also involve a) extensive testing for data accuracy and b) the first use of Hibernate.
Task Number 3	Two weeks (25th June)	Requirements for these tasks have not yet been assessed. Therefore, we can only present a 'projected' deadline. However, considering that this task is already half completed, and that task 2 allowed the student to get familiar with Hibernate, we assume that a two-week period would be satisfactory for now.
Task Number 4	(Projected - three weeks)	To be finalized
Task Number 5		To be finalized
Task Number 6		To be finalized
