LinnaeusDocumentParser

Project wiki page for discussion, notes, etc. about the development of the Linnaeus Document Parser.


This is a subset of the full set of Linnaeus files from Martin. It can be used to load both PMC and MEDLINE document sets into databases on gnode. Present contributors: Chengkun (MEDLINE) and Geraint (PMC).


SVN access. I've set this up to go through your gnode1 ssh accounts. The setup isn't perfect, but should be sufficient.

Access (checkout) through the command line is via:
svn co svn+ssh://<gnode_username>@gnode1.mib.man.ac.uk/linnaeus

I presume you all have a reasonable idea of how to use svn (commit/add/del/etc.). There are equivalent ways to get svn over ssh working for e.g. TortoiseSVN and Eclipse (see e.g. here for a starting place).

[See here for a brief SVN usage tutorial]

I will, however, add that in its present (default) setup, it'll ask for your gnode account password on every svn access (add/del/commit/etc.). If you don't feel like entering this all the time, you can set up a public/private key pair for automatic access. See the previous link as a starting place for this.


Changes.

  • Initial version 2.0.
  • Version 2.1 changes:
    1. Edited PMC parser to read PMC IDs from the XML file rather than from the file names.
    2. Edited PMC parser to remove LaTeX and table tags through the DOM tree instead of via a regex (which was failing in nested cases).
  • Version 2.2 changes:
    1. Edited Database access to force UTF-8 encoding to try to keep Greek and other non-ASCII characters.
    2. Edited Document to use SQL.set(pstmt, 6, body.replaceAll("[\\ud800-\\udfff]", "")); to avoid a truncation error caused by 4-byte Unicode characters. Change this back once using MySQL 6+.
  • Version 2.3 changes:
    1. Edited EntityTagger to supply a logger for correct logging info.
    2. Edited DocumentIterator to avoid bad capture of an Exception stack-trace.
  • Version 2.4 changes:
    1. Edited Medline to correctly read abstracts with sub-headings (e.g. PMID: 17525486).
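
The Version 2.2 workaround aims to keep 4-byte UTF-8 characters out of the database. That intent can also be sketched without a regex, by walking the text one code point at a time and keeping only characters in the Basic Multilingual Plane. This is an illustrative alternative, not the code in Document: the class name BmpOnly and the method stripNonBmp are hypothetical, and the sample string is made up.

```java
// Hypothetical helper, not part of the Linnaeus code base.
class BmpOnly {

    /**
     * Returns a copy of text without code points above U+FFFF
     * (the ones that need 4 bytes in UTF-8) and without unpaired surrogates.
     */
    static String stripNonBmp(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Reads a full surrogate pair as one code point when present.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            boolean supplementary = Character.isSupplementaryCodePoint(cp);
            boolean loneSurrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;

            if (!supplementary && !loneSurrogate) {
                out.append((char) cp); // safe cast: cp is a BMP character here
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Greek alpha (BMP, kept) followed by U+1D49C (supplementary, dropped).
        String body = "\u03b1-helix \uD835\uDC9C end";
        String cleaned = stripNonBmp(body);
        System.out.println(body.length() + " chars in, " + cleaned.length() + " chars out");
    }
}
```

Greek letters and other BMP characters pass through unchanged, so this stays compatible with the forced UTF-8 encoding from item 1 of the same version.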
Page last modified on June 10, 2014 at 11:33