|
MinorThird

Summary: MinorThird is a collection of Java classes for storing, annotating and categorizing text, as well as for extracting entities. It was written primarily by William W. Cohen, a professor in the Machine Learning Department of the School of Computer Science at Carnegie Mellon University. Contributions have been made by many colleagues and students, including Edoardo Airoldi, Vitor Rocha de Carvalho, Einat Minkov, Ensurance, Sunita Sarawagi, Kevin Steppe, Richard Wang, and Frank Lin. MinorThird is currently maintained by Frank Lin. Help can be found at MT's Help Forum.

Mixup: Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of MinorThird. Mixup is based on the widely used notion of cascaded finite state transducers, but adds some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in MinorThird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of text.

In the GN team, MT is mostly used for information extraction, whose results come in the form of spans. A span can be a single token of a sentence or a group of words, depending on the rules written in Mixup.

Where?
It is installed on gnode. To run it, invoke the required class directly with java. All the classpaths are set up automatically, so nothing else needs to be run first.

Version: 2011-04-11
It can be downloaded from the SourceForge page.

1. To run MT, type the following on the command line:
java edu.cmu.minorthird.ui.RunMixup -labels folder_name -mixup file_name.mixup -showResult (-saveAs file_name2.labels)
2. To use the GUI instead of the traditional command line, type:
java edu.cmu.minorthird.ui.RunMixup -gui
Note: everything inside the [] will be extracted if the rule matches.
Note: defDict diseases = diabetes, obesity, depression;
Note: defDict multi_diseases=~ trie diabetes type II, cardiovascular disease, chronic depression;
Note: (rules) are optional.
Note: eqi('token_name') matches the token regardless of case (lowercase, uppercase or mixed). The same applies to ai(dict_name).
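
As a rough illustration of what case-insensitive matching means here, the sketch below emulates eqi and ai in plain Python. It is not MinorThird code: the function names only mirror the Mixup tests, the sample sentence is made up, and the dictionary simply reuses the entries from the diseases defDict note.

```python
def eqi(token, word):
    """Case-insensitive token equality, emulating the eqi('word') test."""
    return token.lower() == word.lower()


def ai(token, dictionary):
    """Case-insensitive dictionary membership, emulating ai(dict_name)."""
    return token.lower() in {entry.lower() for entry in dictionary}


# Mirrors: defDict diseases = diabetes, obesity, depression;
diseases = ["diabetes", "obesity", "depression"]

# Hypothetical input text, split on whitespace as a stand-in for tokenization.
tokens = "Obesity and DIABETES in the population".split()

# Keep every token that the dictionary test accepts, whatever its case.
matches = [t for t in tokens if ai(t, diseases)]
print(matches)
```

Real Mixup tokenization is more involved than splitting on whitespace, so treat this only as a mental model of the case handling.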
Note: Inside 'regex', put the regular expression of interest.
Note 1: eqi('in') looks for -----> IN, In, in
Note 2: eqi('population') looks for -----> Population, POPULATION, population etc.
Note 3: If present in the text, any will return spans like these:
Note 4: To stop the extraction when a particular word appears in the position of any, create a dictionary with these words and apply it like this:
defDict stopwords = the, a;
defSpanType Population =: ...eqi('in') [!ai(stopwords) eqi('population')]...;
This will return only:
Note 5: any on its own refers to one word of the text.
Note 6: To extract more than one word, use any{[0-9],[0-9]}.
Example: any{3} returns 3 words in a sequence.
Example: any{0,2} returns at most 2 words where possible; otherwise it returns one or zero (this form is best when the aim is to highlight particular words in semantic patterns).
To modify the above example:
defSpanType Population =: ...eqi('in') [any{0,2} eqi('population')]...;
This will result in the extraction of:
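
To get a feel for how these rules behave, the following is a toy Python emulation applied to a made-up sentence. It is not MinorThird code and is no substitute for running RunMixup. The function name extract, its parameters and the sample text are all assumptions made for illustration, and the function hard-codes the 'in' ... 'population' pattern from the notes above.

```python
def extract(tokens, lo, hi, stopwords=()):
    """Emulate ...eqi('in') [filler{lo,hi} eqi('population')]... on a token list.

    The filler behaves like `any` when stopwords is empty, and like
    `!ai(stopwords)` otherwise. Returns the bracketed spans as strings.
    """
    stop = {w.lower() for w in stopwords}
    spans = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "in":              # eqi('in'): left context, not extracted
            continue
        for k in range(lo, hi + 1):          # try each allowed filler length
            end = i + 1 + k                  # position where 'population' must sit
            if end >= len(tokens) or tokens[end].lower() != "population":
                continue
            filler = tokens[i + 1:end]
            if any(f.lower() in stop for f in filler):
                continue                     # a stopword in the filler blocks the match
            spans.append(" ".join(tokens[i + 1:end + 1]))
    return spans


# Hypothetical sentence, split on whitespace as a stand-in for tokenization.
tokens = "Obesity is common in the general population and in rural Population".split()

print(extract(tokens, 0, 2))                            # like [any{0,2} eqi('population')]
print(extract(tokens, 1, 1, stopwords=["the", "a"]))    # like [!ai(stopwords) eqi('population')]
```

In this sketch, lo=hi=1 with a stopword list plays the role of the stopword rule from Note 4, and lo=0, hi=2 without one plays the role of the any{0,2} rule.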
|