i2b2/UTHealth 2014

These are the rules and dictionaries used in the i2b2/UTHealth 2014 Challenge in Natural Language Processing for Clinical Data. The task involved identifying heart disease risk factors in longitudinal clinical notes of diabetic patients. The methodology is knowledge-driven: the system implements local lexicalised rules (based on syntactic patterns observed in the notes) combined with manually constructed dictionaries that characterise the domain. The targeted heart disease risk factors fall into eight classes, each with its own specific indicators, described below:

  • hyperlipidemia
    • mentions of the disease, high cholesterol levels and low-density lipoprotein levels
  • hypertension
    • mentions of the disease and high blood pressure
  • diabetes
    • mentions of the disease, high glucose levels and high haemoglobin A1c (HbA1c) levels
  • obesity
    • mentions of the disease, high body mass index (BMI) and waist circumference (WC)
  • coronary artery disease (CAD)
    • mentions of the disease, tests, events and related symptoms
  • medications
    • medications prescribed for any of the above diseases, covering a total of 22 drug categories (e.g., sulfonylureas, meglitinides)
  • family history
    • first-degree relatives of the patient who are/were diagnosed prematurely with CAD
  • smoking status
    • whether the patient is a “current”, “past”, “ever” or “never” smoker, or has an unknown smoking status

The rules for identifying these risk factors in clinical notes were created and implemented in MinorThird. Eight mixup files (MinorThird’s file format) contain the rules, one file for each of the targeted risk factors. You can download the files here:

  cad.mixup (24.0 KiB)

  diabetes.mixup (15.1 KiB)

  family.mixup (2.2 KiB)

  hyperlipidemia.mixup (11.5 KiB)

  hypertension.mixup (13.0 KiB)

  medications.mixup (13.9 KiB)

  obesity.mixup (12.0 KiB)

  smoking.mixup (15.6 KiB)

The rules were based on common lexical patterns that indicate the presence of the targeted factors (e.g., “male with hypertension”, “pmh: diabetes, hypertension”). They were combined with a number of task-specific dictionaries. These were manually tailored by observing how the training set (provided by the organisers of the i2b2 challenge) uses terms describing the associated risk factors and expressions related to their indicators (e.g., “blood pressure”, “high blood pressure”, “systolic blood pressure”), and by adding clinical synonyms and acronyms from the Unified Medical Language System (UMLS) for specific terms of interest. The dictionaries can be found inside the respective mixup files.
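To illustrate the general idea of pairing local lexical patterns with dictionaries, here is a minimal Python sketch. It is not the MinorThird mixup syntax and not the actual rule set: the dictionary entries, the two patterns and the names `DICTIONARIES`, `RULES` and `find_risk_factors` are all hypothetical, chosen only to mirror the two example phrases above.

```python
import re

# Hypothetical mini-dictionaries (assumption): the real dictionaries are
# larger and live inside the respective mixup files.
DICTIONARIES = {
    "hypertension": ["hypertension", "htn", "high blood pressure"],
    "diabetes": ["diabetes", "diabetes mellitus", "dm"],
}

# Reverse lookup from a dictionary term to the risk factor it signals.
TERM_TO_FACTOR = {
    term: factor for factor, terms in DICTIONARIES.items() for term in terms
}

def compile_dictionary(terms):
    """Build a regex alternation over dictionary terms, longest first,
    so that multi-word terms are preferred over their shorter prefixes."""
    ordered = sorted(terms, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"

ANY_TERM = compile_dictionary(TERM_TO_FACTOR)

# Two illustrative local rules (assumptions, modelled on the examples above).
RULES = [
    # e.g. "male with hypertension"
    re.compile(
        r"\b(?:male|female|man|woman)\s+with\s+(?P<terms>" + ANY_TERM + r")",
        re.IGNORECASE,
    ),
    # e.g. "pmh: diabetes, hypertension" (comma-separated list of terms)
    re.compile(
        r"\bpmh\s*:\s*(?P<terms>" + ANY_TERM + r"(?:\s*,\s*" + ANY_TERM + r")*)",
        re.IGNORECASE,
    ),
]

def find_risk_factors(text):
    """Return the sorted risk factors whose terms occur inside a rule match."""
    found = set()
    for rule in RULES:
        for match in rule.finditer(text):
            # Pull every dictionary term out of the matched context.
            for term in re.findall(ANY_TERM, match.group("terms"), flags=re.IGNORECASE):
                found.add(TERM_TO_FACTOR[term.lower()])
    return sorted(found)

print(find_risk_factors("62 yo male with hypertension. PMH: diabetes, HTN"))
```

In this sketch a dictionary term counts only when it appears inside one of the lexical contexts, which is the sense in which the rules are "local"; a bare mention outside any pattern is ignored.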

More details on the architecture and the performance of the tool can be found in the paper below.

  • Karystianis, G., Dehghan, A., Kovacevic, A., Keane, J.A., Nenadic, G. Using Local Lexicalized Rules to Identify Heart Disease Risk Factors in Clinical Notes. Corresponding author: Dr. Goran Nenadic (accepted 2015, JBI).
Contact George Christopher Karystianis (karystianis@gmail) with any questions, bug reports and/or suggestions.