Medical data extraction from legacy databases – case study
Faculty of Postgraduate Medical Education
Abstract
We present a case study of extracting relevant medical information from a free-text database.
The documentation comprised individual records of 19 694 patients treated at the Center for
Diagnosing and Treatment of Asthma and Allergy, Medical University of Lodz, between 1995
and 2006. The database was built on a legacy engine with no export feature and only
fragmentary documentation. The aim of the study was to mine relevant clinical data for
asthmatic patients from 12 months before to 36 months after the index date. The index event
was defined as adding montelukast or salmeterol to the existing therapy or excluding salmeterol from
therapy. The results of this retrospective observational study were in agreement with previous
Introduction
In 2002, the English National Health Service (NHS) began transforming its health-care system
with information technology. Experts estimate that the costs have already doubled, reaching
$24 billion, and some consider the project to be sleepwalking toward disaster.
Nevertheless, wide adoption of databases in medicine is of paramount significance. It is not
only a matter of cutting administrative inefficiencies in healthcare but also of saving
lives by having all crucial patient information always at hand. The true big picture, however,
lies beyond the single Electronic Health Record: it is the ability to query the data of a whole
population of patients for treatment-outcome relationships on a truly Evidence-Based
Medicine basis. This could also mean detecting dangerous drug interactions that are currently
undetectable without large, targeted randomized clinical trials.
The aim of the study was to mine relevant clinical data for asthmatic patients from 12 months
before to 36 months after the index date. The index event was defined as adding montelukast or
salmeterol to the existing therapy or excluding salmeterol from therapy.
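The observation window described above can be sketched as follows. This is a minimal illustration, not the study's actual routine: the helper names `add_months`, `observation_window` and `visits_in_window` are hypothetical, and the choice to clamp the day of the month (for example 31 March minus one month gives the last day of February) is an assumption.

```python
from datetime import date
import calendar


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def observation_window(index_date: date, months_before: int = 12, months_after: int = 36):
    """Return (start, end) of the period around the index date (assumed inclusive)."""
    return add_months(index_date, -months_before), add_months(index_date, months_after)


def visits_in_window(visit_dates, index_date: date):
    """Keep only the visits that fall inside the observation window."""
    start, end = observation_window(index_date)
    return [v for v in visit_dates if start <= v <= end]


if __name__ == "__main__":
    index = date(2001, 5, 15)  # hypothetical index date
    visits = [date(2000, 4, 1), date(2000, 6, 1), date(2003, 1, 10), date(2004, 6, 1)]
    print(observation_window(index))
    print(visits_in_window(visits, index))
```

Whether the boundaries of the two periods were treated as inclusive in the study is not stated in the source, so the comparison above is only one possible reading.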
The documentation comprised individual records of 19 694 patients treated at the Center for
Diagnosing and Treatment of Asthma and Allergy, Medical University of Lodz, between 1995
and 2006. It amounted to about 70 thousand pages of clinical data collected in a textual
database. Each entry was typed in personally by a doctor working at the Center during a
patient's visit and consisted of an interview, physical examination, laboratory results and
prescribed drugs. All fields were unstructured text.
We began by importing the records into a Microsoft Access database. This was achieved
with a VBA routine tailored to the available documentation and to the database's rudimentary
relational model, which we reverse engineered. The maker of the database was no longer
reachable and the original company had ceased to exist.
Extraction of the relevant entities was linguistically based and applied heuristic rules and
shallow parsing techniques to specific parts of the text around certain keywords. It was
inspired by the work of Friedman et al. and their MedLEE extraction system. The aim was to
extract not only basic symptoms such as daytime dyspnea or wheeze but also prescribed doses,
the occurrence of certain events (for instance asthma-related hospitalizations) and pulmonary
function test values. For the scientific relevance of the acquired data it was crucial that all
entities be recalled. This emphasis on recall, however, resulted in a high percentage (30%) of
data being misclassified.
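The keyword-anchored approach can be illustrated with a short sketch. It is not the study's rule set: the sample note is invented and written in English (the wording of the real records is not shown in the source), the patterns `FEV1_PCT`, `DOSE` and `NEGATION` are hypothetical, and the 25-character negation window is an arbitrary assumption. It shows only the general idea of looking at a small span of text around a keyword.

```python
import re

# Hypothetical patterns; the real heuristics are not given in the source.
# FEV1 followed, within a short span, by a number directly attached to "%".
FEV1_PCT = re.compile(r"\bFEV1\b[^%]{0,20}?(?<![\d.,])(\d{1,3}(?:[.,]\d)?)\s*%", re.IGNORECASE)
# Drug name followed, within a few non-digit characters, by a dose and a unit.
DOSE = re.compile(
    r"\b(salmeterol|montelukast)\b\D{0,15}(\d+(?:[.,]\d+)?)\s*(mcg|ug|mg)\b",
    re.IGNORECASE,
)
SYMPTOM_KEYWORDS = ("wheeze", "dyspnea")
NEGATION = re.compile(r"\b(no|without|denies)\b")


def to_number(text: str) -> float:
    """Accept both decimal point and decimal comma."""
    return float(text.replace(",", "."))


def extract_entities(note: str) -> dict:
    """Pull a few entities out of one free-text note using keyword windows."""
    lowered = note.lower()

    # Symptoms: a keyword counts as present if any mention is not negated
    # in the 25 characters immediately before it.
    symptoms = {}
    for keyword in SYMPTOM_KEYWORDS:
        for match in re.finditer(r"\b" + re.escape(keyword), lowered):
            window = lowered[max(0, match.start() - 25):match.start()]
            negated = bool(NEGATION.search(window))
            symptoms[keyword] = symptoms.get(keyword, False) or not negated

    fev1 = FEV1_PCT.search(note)
    doses = [(drug.lower(), to_number(amount), unit.lower())
             for drug, amount, unit in DOSE.findall(note)]

    return {
        "symptoms": symptoms,
        "fev1_pct_predicted": to_number(fev1.group(1)) if fev1 else None,
        "doses": doses,
    }


if __name__ == "__main__":
    sample = ("Pt reports wheeze at night, no dyspnea. FEV1 2.1 l (78% pred). "
              "Rx: salmeterol 50 mcg 2x1, montelukast 10 mg.")
    print(extract_entities(sample))
```

Loose windows of this kind favor recall over precision, which is consistent with the trade-off reported above: few mentions are missed, but a manual check of the matches remains necessary.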
This first fully automated query produced a preliminary set narrowed down to about 250
records that met the stringent inclusion criteria. These were then manually checked case by
case for automation errors, and the study query was repeated, resulting in the final set
of 189 patients and their respective results for all the periods assessed, as schematically
shown in Figure 1.
Because the fields were text only and no strict rules were imposed on how the data were
filled in, we found it difficult to automate the extraction of information. The key
problems identified were information spanning more than one field, typos, ambiguous
abbreviations, shorthand and hyphenation. To account for the discrepancy between the
visit date and outcomes that had occurred days or months earlier, we found that a separate
layer had to be created that stores the events on a day-by-day basis. This made it possible,
for instance, to compute precisely the average prescribed daily doses (exposure).
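A day-by-day layer of this kind might look like the sketch below. It is an illustration under stated assumptions, not the structure used in the study: the function names are hypothetical, each prescription is assumed to stay in force until the next one for the same drug, and a daily dose of 0 is assumed to mean that the drug was stopped.

```python
from datetime import date, timedelta


def build_daily_layer(prescriptions, start: date, end: date) -> dict:
    """Expand dated prescriptions into one dose value per calendar day.

    prescriptions: list of (effective_date, daily_dose) for a single drug.
    Returns a dict mapping every day in [start, end] to the dose in force.
    """
    ordered = sorted(prescriptions)
    layer = {}
    current_dose = 0.0
    next_idx = 0
    day = start
    while day <= end:
        # Apply every prescription that has become effective by this day.
        while next_idx < len(ordered) and ordered[next_idx][0] <= day:
            current_dose = ordered[next_idx][1]
            next_idx += 1
        layer[day] = current_dose
        day += timedelta(days=1)
    return layer


def average_daily_dose(layer: dict) -> float:
    """Mean prescribed daily dose (exposure) over the days in the layer."""
    return sum(layer.values()) / len(layer) if layer else 0.0


if __name__ == "__main__":
    # Hypothetical example: 100 units/day from 1 Jan, reduced to 50 from 6 Jan.
    rx = [(date(2000, 1, 1), 100.0), (date(2000, 1, 6), 50.0)]
    layer = build_daily_layer(rx, date(2000, 1, 1), date(2000, 1, 10))
    print(average_daily_dose(layer))
```

The same per-day keying lets an event reported at a visit, such as an earlier hospitalization, be stored under the day it occurred rather than the day it was written down.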
During the ten-year observation period the spirometry equipment was modernized, which
resulted in a different reference range for the pulmonary function tests. We therefore found
it more reasonable to extract the equivalents expressed as a percentage of the predicted value.
Because of the time span involved, there were also subtle changes in therapy guidelines over
the years, and some proprietary drug forms left the market. All of this had to be taken
into account.
Overall we found the acquired data valid. Repeated-measures ANOVA, used to analyze the
results statistically, showed two already known trends in asthma: the presence of synergy
between salmeterol and inhaled corticosteroids, and the positive influence of montelukast on
allergic rhinitis. The results were valuable as no observational study in asthma of such length
Conclusions
We concluded that extracting relevant medical information from legacy databases is possible,
but the measures taken may be unfeasible on a larger scale because of the time and resources
involved. Also, textual sources, although offering far greater flexibility, are not particularly well
Future ease of exporting data should always be considered when deciding how to store
biomedical data. This is unfortunately rarely the case, as cheaper Database Management
Systems are chosen over more expensive solutions built to last.
References
1. Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to
capture relations in biomedical text. J Biomed Inform. 2003 Jun;36(3):145-58.
2. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical
documents based on natural language processing. J Am Med Inform Assoc. 2004 Sep-
3. Long-acting beta2-agonists versus anti-leukotrienes as add-on therapy to inhaled
corticosteroids for chronic asthma. Cochrane Database Syst Rev. 2005 Jan
Figure 1. Schematic of the final dataset creation: Natural Language Processing, pre-selection; manual reclassification, case reduction; 'day by day' layer; final dataset.