Literature Support for Alternative Transcripts
LSAT was generated semi-automatically using a two-step procedure.
In the Information Retrieval step, an SVM classifier was trained using
inductive learning to identify MEDLINE sentences that describe the generation
of alternative transcripts.
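The retrieval step can be illustrated with a minimal sketch of a linear SVM sentence classifier. This is not the SVMlight setup used for LSAT: it is a plain subgradient-descent trainer on the hinge loss, written in pure Python, and the bag-of-words features, the hyperparameters, and the four training sentences are all assumptions made up for illustration.

```python
from collections import Counter

def featurize(sentence):
    """Bag-of-words counts over lowercased whitespace tokens (an assumed feature set)."""
    return Counter(sentence.lower().split())

def raw_score(weights, bias, feats):
    """Linear decision value w.x + b for one feature vector."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items()) + bias

def train_linear_svm(examples, epochs=20, lam=0.01, lr=0.1):
    """Subgradient descent on the L2-regularised hinge loss.

    examples: list of (sentence, label) pairs with label +1 or -1.
    Returns (weights, bias).
    """
    weights, bias = {}, 0.0
    data = [(featurize(s), y) for s, y in examples]
    for _ in range(epochs):
        for feats, y in data:
            margin = y * raw_score(weights, bias, feats)
            # Regularisation step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: update only when the margin is violated.
            if margin < 1.0:
                for tok, cnt in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * cnt
                bias += lr * y
    return weights, bias

def classify(model, sentence):
    """Return +1 if the sentence looks like an alternative-transcript sentence, else -1."""
    weights, bias = model
    return 1 if raw_score(weights, bias, featurize(sentence)) > 0 else -1

# Hypothetical toy training set; a real classifier is trained on labelled MEDLINE sentences.
TRAINING = [
    ("two alternatively spliced isoforms of the gene were detected in brain", 1),
    ("alternative splicing generates three transcripts in liver", 1),
    ("the protein binds dna in the nucleus", -1),
    ("patients were treated with the drug for six weeks", -1),
]

model = train_linear_svm(TRAINING)
```

Once trained, the model is applied to unseen sentences, for example `classify(model, "alternative isoforms were detected in liver")`. This is the inductive setting: the classifier learns only from labelled examples and is then used on new text.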
In the Information Extraction step, information including gene
names, tissues, species, specificity, number of isoforms, and experimental
methods was extracted.
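The extraction step can be sketched in a similar spirit. The example below is only a dictionary-and-regex illustration, not the tagger pipeline used for LSAT: the lexicons, the `ISOFORM_COUNT` pattern, and the `extract` function are hypothetical. Gene-name and specificity extraction are left out because they need a dedicated tagger.

```python
import re

# Hypothetical toy lexicons; real entity taggers rely on far larger resources.
TISSUES = {"brain", "liver", "heart", "kidney", "testis"}
SPECIES = {"human", "mouse", "rat"}
METHODS = {"rt-pcr", "northern blot", "western blot"}
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5}

# A number (digits or a small number word), up to three intervening words,
# then a noun that denotes transcript variants.
ISOFORM_COUNT = re.compile(
    r"\b(\d+|two|three|four|five)\s+(?:\w+\s+){0,3}?(?:isoforms|transcripts|variants)\b"
)

def find_terms(text, lexicon):
    """Return the lexicon terms that occur in the text as whole words, sorted."""
    return sorted(
        term for term in lexicon
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    )

def extract(sentence):
    """Pull tissues, species, methods, and an isoform count out of one sentence."""
    text = sentence.lower()
    count = None
    match = ISOFORM_COUNT.search(text)
    if match:
        token = match.group(1)
        count = int(token) if token.isdigit() else NUMBER_WORDS[token]
    return {
        "tissues": find_terms(text, TISSUES),
        "species": find_terms(text, SPECIES),
        "methods": find_terms(text, METHODS),
        "isoform_count": count,
    }

record = extract(
    "RT-PCR detected three alternatively spliced isoforms in mouse brain and testis."
)
print(record)
```

For the sample sentence, the record lists brain and testis as tissues, mouse as species, RT-PCR as the method, and an isoform count of 3. Fields with no match come back as an empty list or `None`, so missing information is not confused with a negative finding.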
LSAT entries contain identifiers from databases such as PubMed, SwissProt, RefSeq, GenBank, and Ensembl.
We used SVMlight, Bow Toolkit, PASBio, NLProt, and the Stanford lexical parser while generating LSAT.
Sentences extracted by the SVM classifier and subsequently tagged by
entity taggers are available here.
Shah PK, Jensen LJ, Boue S and Bork P.
Extraction of Transcript Diversity from Scientific Literature.
PLoS Computational Biology 1(1): e10.
Shah PK and Bork P.
Learning About Alternative Transcripts in MEDLINE using Support Vector Machines.
Bioinformatics. Submitted.