Patient matching - Moving flat files into a database

Goals

Currently, the patientmatching module makes extensive use of flat files for storing patientmatching reports. 

Our aim is to move from a flat file system to a database.

However, to do so, we must first move all user defined strategies into a database as well.

Therefore, our task is two pronged,

  • 1. Move user defined patientmatching strategies into the database
  • 2. Store user generated reports in the database.

The aim of adding database support is to,

  • 1. Manage data in a convenient matter
  • 2. Allow users to understand their data easily
  • 3. Stop storing duplicate and unnecessary data in flat files

Points to ponder

The Current patientmatching module uses a single database table, 'persistreportinfo'. Ideally, this table should be modified, or removed completely by the end of our efforts.

Exactly what data is duplicated or unnecessary ?

Development work includes updating the patientmatching module to support hibernate, as well as enhancing many model classes to work with the hibernate data storage / retrieval process.

The work is carried out, tested and committed iteratively. 

Current status

Currently we have completed work on stage one of the project, which is to store user defined strategies in the database.
Previously, user defined strategies were stored in the Config.xml file as follows,

config.xml fragment
<run estimate="false" name="famName" npairs="0" null_EM="false" null_scoring="false"
        p="0.0"
        random-sample="true"
        sample-size="100000"
        ulocked="false">
      <row name="(Attribute) Birthplace" trinomial_EM="false">
         <BlockOrder>null</BlockOrder>
         <BlckChars>40</BlckChars>
         <Include>false</Include>
         <TAgreement>0.9</TAgreement>
         <NonAgreement>0.1</NonAgreement>
         <ScaleWeight>null</ScaleWeight>
         <Algorithm>Exact Match</Algorithm>
         <SetID>0</SetID>
      </row>
      <row name="(Attribute) Citizenship" trinomial_EM="false">
         <BlockOrder>null</BlockOrder>
         <BlckChars>40</BlckChars>
         <Include>false</Include>
         <TAgreement>0.9</TAgreement>
         <NonAgreement>0.1</NonAgreement>
         <ScaleWeight>null</ScaleWeight>
         <Algorithm>Exact Match</Algorithm>
         <SetID>0</SetID>
      </row>
      <row name="(Attribute) Civil Status" trinomial_EM="false">
         <BlockOrder>null</BlockOrder>
         <BlckChars>40</BlckChars>
         <Include>false</Include>
         <TAgreement>0.9</TAgreement>
         <NonAgreement>0.1</NonAgreement>
         <ScaleWeight>null</ScaleWeight>
         <Algorithm>Exact Match</Algorithm>
         <SetID>0</SetID>
      </row>
         -------------------
         --------------
         ----------
         ------

Given above is a sample xml file depicting how a strategy is stored. AS you can see, a single ConfigurationEntry is stored as a <row> tag attribute. Each Strategy has n-many ConfigurationEntries, which may or many not be used for the scoring.

However, not all these data columns should be persisted as many of them are merely hard coded values. Therefore, the database scheme for strategies are as follows,

sqldiff fragment
<sql>
CREATE TABLE `patientmatching_configuration` (
			`configuration_id` int AUTO_INCREMENT,
			`configuration_name` varchar(255) default NULL UNIQUE,
			`random_sample_size` int default NULL,
			`using_random_sample` smallint(6) NOT NULL DEFAULT '0',
			PRIMARY KEY (`configuration_id`)
			) ENGINE=InnoDB DEFAULT CHARSET=utf8;

			CREATE TABLE `patientmatching_configurationEntries` (
			`entry_id` int AUTO_INCREMENT PRIMARY KEY,
			`configuration_id` int default NULL,
			`field_view_name` varchar(255) default NULL,
			`field_name` varchar(255) default NULL,
			`inclusion` varchar(50) default NULL,
			`block_order` int NOT NULL DEFAULT '0',
			`flag` varchar(50) DEFAULT NULL,
			KEY `configurationEntries_id` (`configuration_id`),
			CONSTRAINT `configurationEntries_id` FOREIGN KEY (`configuration_id`)
			REFERENCES `patientmatching_configuration` (`configuration_id`)
			) ENGINE=InnoDB DEFAULT CHARSET=utf8;

		</sql>

Stage two of this effort would be to store reports in the database using the PatientMatching strategies created by the user.
We are leaving this phase open for our GSOC student !

Additional support

Students may find that the reports they generate often fail to identify any matching data. This is because their databases do not contain adequate Patient records for PatientMatching to run successfully. The successful student would be be granted access to an extensive test database for development purposes.

For now, we will paste the contents of several successful reports so that students can get a better idea of what the xml data file looks like.

1. Sample report number 1

sampe report
Group Id|Unique Id|patientmatching.org.openmrs.Patient.birthdate|patientmatching.org.openmrs.Patient.dead|patientmatching.org.openmrs.Patient.gender|patientmatching.org.openmrs.PersonName.familyName|patientmatching.org.openmrs.PersonName.givenName|patientmatching.org.openmrs.PersonName.middleName
0|10024|1991-01-01 00:00:00.0|false|f|maxine|catherine|j
0|10026|1991-01-01 00:00:00.0|false|f|j|catherine|maxine

As you can see, the above report has Identified that two patient records are possible duplicates, and has grouped them together

2. Sample report number 2

sampe report
Group Id|Unique Id|patientmatching.org.openmrs.Patient.birthdate|patientmatching.org.openmrs.Patient.dead|patientmatching.org.openmrs.Patient.gender|patientmatching.org.openmrs.PersonName.familyName|patientmatching.org.openmrs.PersonName.givenName|patientmatching.org.openmrs.PersonName.middleName
0|10024|1991-01-01 00:00:00.0|false|f|maxine|catherine|j
0|10026|1991-01-01 00:00:00.0|false|f|j|catherine|maxine
1|10005|1982-10-11 00:00:00.0|false|f|ganu|deepa|abhishek
1|10023|2009-03-12 00:00:00.0|false|f|ganu|deepa|
2|10031|1945-01-01 00:00:00.0|false|f|jane|mary|
2|10032|1991-01-01 00:00:00.0|false|f|jane|mary|j

So our question is, what is the best way to model the above xml files as sql tables ? Do note that the columns of every report is not going to be the same. Columns will change based on which Strategies are used to create the report.

Svn commit rev. Ids.