gnTEAM | Resources / MinorThird

Summary: MinorThird is a collection of Java classes for storing, annotating and categorizing text, as well as for extracting entities. It was written primarily by William W. Cohen, a professor at Carnegie Mellon University in the Machine Learning Department of the School of Computer Science. Contributions have been made by many other colleagues and students, including Edoardo Airoldi, Vitor Rocha de Carvalho, Einat Minkov, Ensurance, Sunita Sarawagi, Kevin Steppe, Richard Wang, and Frank Lin. MinorThird is currently maintained by Frank Lin. Help is available on MT's Help Forum.

Mixup: Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of MinorThird. Mixup is based on the widely used notion of cascaded finite state transducers, but includes some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in MinorThird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of text. Within the GN team, MT is used mostly for information extraction, whose results come in the form of spans. A span can be a single token, a sentence or a group of words, depending on the rules written in Mixup.

Where? It is installed on gnode. To run it, simply invoke java with the required command. All the classpaths are set up automatically, so you do not even need to run the source scripts/setup.linux command. This is now a 64-bit OS running 64-bit Java, so you are free from 32-bit memory addressing constraints when allocating additional memory to the VM, e.g.:

java -Xmx5G edu.cmu.minorthird ....

If required, you can browse the install directory location:

/usr/local/bin/minorthird

Version: 2011-04-11. It can be downloaded from the SourceForge page.

1. To run MT, type the following on the command line:

java edu.cmu.minorthird.ui.RunMixup -labels folder_name -mixup file_name.mixup -showResult (-saveAs file_name2.labels)

-labels is followed by the name of the folder that contains the corpus of interest.
-mixup is followed by the Mixup program that you want MT to run on the corpus located in folder_name.
-saveAs is followed by the name of the file in which to save the extracted spans. This argument is optional, hence the ().
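
For scripted runs, the command line above can be assembled programmatically. The following Python sketch only builds the argument list from the flags described in this section; the helper name build_runmixup_command and the example file names are hypothetical, and the -Xmx heap setting is optional.

```python
def build_runmixup_command(labels, mixup, save_as=None, heap=None):
    """Build the argument list for RunMixup from the flags described above."""
    cmd = ["java"]
    if heap:
        cmd.append("-Xmx" + heap)      # optional JVM heap size, e.g. "5G"
    cmd += [
        "edu.cmu.minorthird.ui.RunMixup",
        "-labels", labels,             # folder that contains the corpus
        "-mixup", mixup,               # Mixup program to apply to the corpus
        "-showResult",
    ]
    if save_as:
        cmd += ["-saveAs", save_as]    # optional file for the extracted spans
    return cmd


if __name__ == "__main__":
    # Hypothetical folder and file names, for illustration only.
    print(" ".join(build_runmixup_command("corpus", "population.mixup",
                                          save_as="population.labels")))
```

The resulting list can be passed to a process launcher such as subprocess.run on a machine where MinorThird is on the classpath.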

2. To use the GUI instead of the traditional command line, please type:

java edu.cmu.minorthird.ui.RunMixup -gui

To define a span: defSpanType span_name =: ...[rules]...;

Note: everything that is inside the [] will be extracted if the rule is true.

To define a dictionary: defDict dict_name = words of interest/external file;

Example: defDict diseases = diabetes, obesity, depression;

To define a multiword dictionary: defDict dict_name =~ trie words of interest/external file;

Example: defDict multi_diseases =~ trie diabetes type II, cardiovascular disease, chronic depression;
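
To illustrate why a trie suits multiword entries, here is a minimal Python sketch of token-level trie lookup in general. It is not MinorThird's implementation; the function names, the whitespace tokenization and the sample sentence are assumptions made for illustration.

```python
END = object()  # sentinel key marking the end of a complete phrase


def build_trie(phrases):
    """Store each phrase as a path of tokens in a nested dictionary."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def find_phrases(tokens, trie):
    """Return every dictionary phrase that occurs in the token sequence."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break                      # no phrase continues with this token
            if END in node:
                matches.append(" ".join(tokens[start:end + 1]))
    return matches


if __name__ == "__main__":
    trie = build_trie(["diabetes type II", "cardiovascular disease",
                       "chronic depression"])
    sentence = "patients with diabetes type II and chronic depression"
    print(find_phrases(sentence.split(), trie))
```

Each phrase is walked token by token, so a multiword entry is recognized only when all of its tokens appear consecutively in the text.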

To include a dictionary in a span: defSpanType span_name =: ...[(rules) a(dict_name) (rules)]...;

Note: (rules) are optional.

To include a span in another span: defSpanType span_name =: ...[(rules) @span_name (rules)]...;

Note: (rules) are optional.

To match a particular token: eq('token_name').

Note: eqi('token_name') matches the token regardless of case (lowercase, uppercase or mixed). The same applies to ai(dict_name), the case-insensitive form of a(dict_name).
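
The difference between the case-sensitive and case-insensitive tests can be modelled in a few lines of Python. This is only a sketch of the behaviour described in this section, not MinorThird code; the Python function names simply mirror the Mixup primitives.

```python
def eq(token, word):
    """Model of eq('word'): exact, case-sensitive match."""
    return token == word


def eqi(token, word):
    """Model of eqi('word'): match that ignores case."""
    return token.lower() == word.lower()


def a(token, dictionary):
    """Model of a(dict): case-sensitive dictionary membership."""
    return token in dictionary


def ai(token, dictionary):
    """Model of ai(dict): dictionary membership that ignores case."""
    return token.lower() in {entry.lower() for entry in dictionary}


if __name__ == "__main__":
    diseases = {"diabetes", "obesity", "depression"}
    print(eq("Population", "population"), eqi("Population", "population"))
    print(a("Diabetes", diseases), ai("Diabetes", diseases))
```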

To include regular expressions: re('regex').

Note: Inside 'regex' type the regular expression of interest.

To match any single word inside a pattern/rule, use any: defSpanType Population =: ...eqi('in') [any eqi('population')]...;

Note 1: eqi('in') matches IN, In, in, etc.

Note 2: eqi('population') matches Population, POPULATION, population, etc.

Note 3: With any, the rule matches text such as the following, where present (the extracted span is the part inside the [], e.g. "the population"):

in the population.
in Korean population.
in a population.

Note 4: To prevent a match when particular words appear in the position of any, create a dictionary with these words and apply it like this:

defDict stopwords = the, a;
defSpanType Population =: ...eqi('in') [!ai(stopwords) eqi('population')]...;

This will return only:

in Korean population.
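
The effect of the stopword rule in Note 4 can be checked with a small Python model. This is a sketch of the rule's logic only, not of MinorThird itself: the function name and the simplified tokenizer (runs of word characters) are assumptions. Following the note on [] above, it returns only the bracketed part of each match.

```python
import re


def extract_population(text, stopwords=("the", "a")):
    """Model of ...eqi('in') [!ai(stopwords) eqi('population')]... on one text."""
    blocked = {word.lower() for word in stopwords}
    tokens = re.findall(r"\w+", text)      # simplified tokenizer (assumption)
    spans = []
    for i in range(len(tokens) - 2):
        left, word, head = tokens[i], tokens[i + 1], tokens[i + 2]
        if (left.lower() == "in"                  # eqi('in')
                and word.lower() not in blocked   # !ai(stopwords)
                and head.lower() == "population"):  # eqi('population')
            spans.append(word + " " + head)
    return spans


if __name__ == "__main__":
    for sentence in ("in the population.", "in Korean population.",
                     "in a population."):
        print(sentence, extract_population(sentence))
```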

Note 5: any on its own matches exactly one word in the text.

Note 6: To match more than one word, use any{n} or any{min,max}, where n, min and max are numbers.

Example: any{3} matches exactly 3 consecutive words.

Example: any{0,2} matches up to 2 words if possible, otherwise one or none (this form works best when the aim is to highlight particular words in semantic patterns).

To modify the above example:

defSpanType Population =: ...eqi('in') [any{0,2} eqi('population')]...;

This will result in the extraction of:

in Korean population.
in a population.
in the Greek population.
in population.
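
The bounded repetition can be modelled the same way. The Python sketch below mimics the any{0,2} rule above under stated assumptions: the function name is hypothetical, the tokenizer is simplified to runs of word characters, and every repetition count that fits is reported. It returns only the bracketed part of each match.

```python
import re


def match_population(text, min_any=0, max_any=2):
    """Model of ...eqi('in') [any{min,max} eqi('population')]... on one text."""
    tokens = re.findall(r"\w+", text)      # simplified tokenizer (assumption)
    spans = []
    for i, token in enumerate(tokens):
        if token.lower() != "in":          # eqi('in')
            continue
        # Try every allowed number of intervening words.
        for n in range(min_any, max_any + 1):
            end = i + 1 + n                # index where 'population' must be
            if end < len(tokens) and tokens[end].lower() == "population":
                spans.append(" ".join(tokens[i + 1:end + 1]))
    return spans


if __name__ == "__main__":
    for sentence in ("in Korean population.", "in the Greek population.",
                     "in population.", "in the very large population."):
        print(sentence, match_population(sentence))
```

The last sentence yields no span because three words separate "in" from "population", which exceeds the maximum of 2.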