EMBL - Bork Group

About
  • LSAT was generated semi-automatically using a two-step procedure.
  • In the Information Retrieval step, an SVM classifier was trained using inductive learning to identify sentences from MEDLINE describing generation of alternative transcripts.
  • In the Information Extraction step, information including gene names, tissues, species, specificity, number of isoforms, and experimental methods was extracted.
  • LSAT entries contain identifiers from databases such as PubMed, SwissProt, RefSeq, GenBank and Ensembl.
  • We used SVMlight, the Bow Toolkit, PASBio, NLProt and the Stanford lexical parser while generating LSAT.
  • Sentences extracted by the SVM classifier and subsequently tagged by entity taggers are available here.
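
The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the LSAT implementation: the hand-picked weights only stand in for a trained linear SVM (the real classifier was trained with SVMlight on MEDLINE sentences), and the small lexicons and the regular expression are hypothetical stand-ins for the entity taggers. Gene name tagging, which LSAT delegated to dedicated tools, is omitted.

```python
import re

# ASSUMPTION: illustrative weights, NOT learned from MEDLINE. They mimic the
# linear decision function w.x + b that a trained linear SVM would provide.
WEIGHTS = {
    "alternative": 1.0, "alternatively": 1.0,
    "splicing": 1.0, "spliced": 1.0,
    "isoform": 0.8, "isoforms": 0.8,
    "transcripts": 0.6, "variants": 0.5,
}
BIAS = -1.5

# ASSUMPTION: toy lexicons standing in for the real entity taggers.
TISSUES = {"brain", "liver", "testis", "heart", "kidney"}
SPECIES = {"human", "mouse", "rat"}
METHODS = {"rt-pcr", "northern blot", "western blot"}
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5}

# A count followed by at most two words and then an isoform-like noun.
COUNT_PATTERN = re.compile(
    r"\b(\d+|two|three|four|five)\s+(?:\w+\s+){0,2}"
    r"(?:isoforms|transcripts|variants)\b"
)


def tokenize(sentence):
    """Lowercase the sentence and split it into word-like tokens."""
    return re.findall(r"[a-z0-9][a-z0-9\-]*", sentence.lower())


def is_relevant(sentence):
    """Step 1 (retrieval): keep sentences with a positive linear score."""
    tokens = set(tokenize(sentence))  # binary bag-of-words features
    return sum(WEIGHTS.get(t, 0.0) for t in tokens) + BIAS > 0


def extract(sentence):
    """Step 2 (extraction): look up entities in a retained sentence."""
    lowered = sentence.lower()
    tokens = tokenize(sentence)
    record = {
        "tissues": sorted(TISSUES.intersection(tokens)),
        "species": sorted(SPECIES.intersection(tokens)),
        "methods": sorted(m for m in METHODS if m in lowered),
        "isoform_count": None,
    }
    match = COUNT_PATTERN.search(lowered)
    if match:
        value = match.group(1)
        record["isoform_count"] = (
            int(value) if value.isdigit() else NUMBER_WORDS[value]
        )
    return record


def run_pipeline(sentences):
    """Filter sentences, then extract a record from each one that is kept."""
    return [(s, extract(s)) for s in sentences if is_relevant(s)]
```

Calling run_pipeline on a list of sentences returns (sentence, record) pairs only for the sentences that pass the filter, so the extraction step never sees sentences the classifier rejected.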
References
  • Shah PK, Jensen LJ, Boue S and Bork P.
    Extraction of Transcript Diversity from Scientific Literature.
    PLoS Computational Biology 1(1): e10.
  • Shah PK and Bork P.
    Learning About Alternative Transcripts in MEDLINE using Support Vector Machines.
    Bioinformatics. Submitted.