Biomedical IE Heng Ji jih@rpi.edu.

Author: Clyde Johns

1 Biomedical IE Heng Ji

2 Text Annotation: Task-oriented annotation (application, user system development) is defined by specific tasks: specific curation tasks in specific environments; mapping of protein names to database IDs in specific text types; specific event types such as protein-protein interaction; disease-gene association for specific diseases. Task-neutral annotation (GENIA Corpus [U-Tokyo, NaCTeM]) supports development of generic tools and is defined by theories. Linguistics: tokens, POS, phrase structure, dependency structure, deep syntax (PAS). Biology: named entities of various semantic types, events. Linguistics + biology: co-references. Interoperable tools; annotated text.

3 Annotation of GENIA corpus – Term & POS: Part-of-speech annotation: 2,000 abstracts. Term (entity) annotation: abstracts.

4 Text semantic annotation: annotation of events and the named entities involved. Example: “Regulation of Transcription events” (BOOTStrep project). Two different types of annotation levels: linguistic annotation levels, and a biological annotation level in charge of marking the biological knowledge contained in the text. Linking text with biological knowledge.

5 Events and variables: Biological events can be centred on verbs (e.g. activate) or on nouns with verb-like meanings (nominalised verbs, e.g. transcription). Different parts of the sentence correspond to different types of variables in the event, e.g. what caused the event (The narL gene product activates the nitrate reductase operon); what was affected by the event (Analysis of mutants …); where the event took place (These fusions were formed on plasmid cloning vectors).

6 Verb Frame Example: “The narL gene product activates the nitrate reductase operon”. Verb: activate. Agent characteristics: protein. Theme characteristics: operon.

7 Role Name | Description | Phrase Type(s) | Clues | Example
AGENT | Drives or instigates event | Entity or event | Typically subject of verb; follows “by” in passives | The narL gene product activates the nitrate reductase operon
THEME | Affected by or results from event | | Typically object of verb; subject in passives | recA protein was induced by UV radiation
MANNER | Method or way in which event is carried out | Event (process), adverb, direction, in vitro, in vivo, etc. | by, through, via, using | cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR

8 Role Name | Description | Phrase Type(s) | Clues | Example
INSTRUMENT | Used to carry out event | Entity | with, with the aid of, via, by, through, using | EnvZ functions through OmpR to control porin gene expression in Escherichia coli K-12
LOCATION | Location of event | | in, on, near, etc. | Phosphorylation of OmpR by the osmosensor EnvZ modulates expression of the ompF and ompC genes in Escherichia coli
SOURCE | Start point of event | | from | A transducing lambda phage carrying glpD''lacZ, glpR, and malT was isolated from a strain harbouring a glpD''lacZ fusion
DESTINATION | End point of event | | to, into | Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site

9 Example 1: “The narL gene product activates the nitrate reductase operon”. Agent: The narL gene product (protein). Event: activates. Theme (what is acted upon): the nitrate reductase operon (operon).

10 Linguistically Annotated Corpora: GENIA (domain by MeSH terms: Human, Blood Cells, and Transcription Factors; annotation: POS, named entity, parse tree). Penn BioIE (the molecular genetics of oncology; the inhibition of enzymes of the CYP450 class). Yapex. GENETAG (a corpus of 20K MEDLINE® sentences for gene/protein NER).

11 Ontology-driven annotation: The GENIA annotation. Linguistic annotation reveals linguistic structures behind the text: part-of-speech annotation annotates the syntactic category of each word; syntactic tree annotation annotates the syntactic structure of sentences. Semantic annotation reveals knowledge pieces delivered by the text: term annotation annotates domain-specific terms; event annotation annotates events on biological entities.

12 What about existing resources? Ontologies are important for knowledge discovery. They form the link between terms in texts and biological databases. They can be used to add meaning through semantic annotation of texts.

13 Link between text and ontologies: Text; ontological resources (UMLS, GO, GENIA, KEGG); supporting semantics; adding new knowledge.

14 Bridging the Gap – Integrating data, text and knowledge: Databases (semantic interpretation of data); text; ontological resources (UMLS, GO, GENIA, KEGG), supporting semantics and adding new knowledge; mathematical models (semantic interpretation of models in Systems Biology).

15 Resources for Bio-Text Mining: Lexical / terminological resources: SPECIALIST lexicon, Metathesaurus (UMLS); lists of terms / lexical entries (hierarchical relations). Ontological resources: Metathesaurus, Semantic Network, GO, SNOMED CT, etc.; encode relations among entities. Bodenreider, O. “Lexical, Terminological, and Ontological Resources for Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine, pp. 43-66.

16 SPECIALIST lexicon (UMLS): Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives) and orthographic (e.g. esophagus – oesophagus) information. General language lexicon with many biomedical terms (over 180,000 records). Lexical programs include variation (spelling), base form, inflection, acronyms.

17 Lexicon record: {base=Kaposi's sarcoma spelling_variant=Kaposi sarcoma entry=E cat=noun variants=uncount variants=reg variants=glreg}. Forms: Kaposi’s sarcoma; Kaposi’s sarcomas; Kaposi’s sarcomata; Kaposi sarcoma; Kaposi sarcomas; Kaposi sarcomata. Source: The SPECIALIST Lexicon and Lexical Tools, Allen C. Browne, Guy Divita, and Chris Lu PhD, NLM Associates Presentation, 12/03/2002, Bethesda, MD.

18 Normalisation (lexical tools): Hodgkin Disease; HODGKIN DISEASE; Hodgkin’s Disease; Hodgkin’s disease; Disease, Hodgkin; ... → normalise → disease hodgkin

19 Steps of Norm (lexical tools of the UMLS): Start: Hodgkin’s Diseases. Remove genitive: Hodgkin Diseases. Replace punctuation with spaces. Remove stop words. Lowercase: hodgkin diseases. Uninflect each word: hodgkin disease. Word order sort: disease hodgkin.
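
The Norm steps on slide 19 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `uninflect` is a crude plural stripper standing in for the lexicon-based uninflection that the UMLS lexical tools actually perform. The function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used by the UMLS tools.
STOP_WORDS = {"of", "and", "with", "for", "to", "in", "by", "on", "the"}


def uninflect(word):
    """Crude stand-in for lexicon-based uninflection: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def norm(term):
    """Apply the slide's normalisation steps to one term string."""
    # Remove genitive markers (ASCII or typographic apostrophe followed by s).
    term = re.sub(r"['’]s\b", "", term)
    # Replace punctuation with spaces.
    term = re.sub(r"[^\w\s]", " ", term)
    # Lowercase, then remove stop words (lowercasing first keeps the lookup simple).
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    # Uninflect each word.
    words = [uninflect(w) for w in words]
    # Sort the words so that word order no longer matters.
    return " ".join(sorted(words))


if __name__ == "__main__":
    for variant in ["Hodgkin’s Diseases", "HODGKIN DISEASE", "Disease, Hodgkin"]:
        print(variant, "->", norm(variant))
```

With this sketch all three variants collapse to the same key, disease hodgkin, which is what lets an exact lookup succeed despite surface variation.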

20 The Gene Ontology (GO): Controlled vocabulary for the annotation of gene products. 19,468 terms, 95.3% with definitions: 10,391 biological_process; 1,681 cellular_component; 7,396 molecular_function.

21 Gene Ontology: The GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology. GO terms follow certain conventions of creation and have synonyms, such as: ornithine cycle is an exact synonym of urea cycle; cell division is a broad synonym of cytokinesis; cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity.

22 GO terms, definitions and ontologies in OBO
id: GO:
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO: ! mitochondrion organization and biogenesis

23 Metathesaurus: organised by concept; 5M names, 1M concepts, 16M relations; built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms (“source vocabularies”); common representation.

24 Are the existing knowledge resources sufficient for TM? Why? Limited lexical and terminological coverage of biological sub-domains. Resources (GO, UMLS, UniProt) focused on human specialists. Ontology concept names frequently confused with terms.

25 Naming conventions; update and curation of resources: FlyBase gene name coverage from 31% (abstracts) to 84% (full texts). Naming conventions and representation in heterogeneous resources. Term formation guidelines from formal bodies (e.g. HUGO, IPI) not uniformly used. Problems with integration of resources: dystrophin used for 18 gene products; HUGO: “Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …”

26 Term variation: Terminological variation and complexity of names. High correlation between degree of term variation and dynamic nature of biomedicine. Variation occurs in controlled vocabularies and in texts, but with a discrepancy between the two. Exact match methods fail to associate term occurrences in texts with databases.

27 What’s in a name? Breast cancer 1 (BRCA1); p53; Ribosomal protein S27; Heat shock protein 110; Mitogen activated protein kinase 15; Mitogen activated protein kinase kinase kinase 5. From K. Cohen, NAACL 2007

28 Worst gene names sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A K. Cohen NAACL 2007

30 Worst gene names sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A SEMA5A K. Cohen NAACL 2007

31 Worst gene names sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A SEMA5A Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains tie K. Cohen NAACL 2007

32 Term ambiguity: NF: Neurofibromatosis 2 [disease]; Neurofibromin 2 [protein]; Neurofibromatosis 2 gene [gene]. O. Bodenreider, MIE 2005 tutorial

33 Term ambiguity: Gene terms may also be common English words: BAD, a human gene encoding BCL-2 family of proteins (bad news, bad prediction). Gene names are often used to denote gene products (proteins): suppressor of sable is used ambiguously to refer to either genes or proteins. Existing resources lack information that can support term disambiguation. Difficult to establish equivalences between term forms and concepts.

34 Homologues: Cyclin-dependent kinase inhibitor was first introduced to represent a protein family (p27), but it is used interchangeably with p27 or p27kip1 as the name of the individual protein and not as the name of the protein family (Morgan 2003). NFKB2 denotes the name of a family of 2 individual proteins with separate IDs in Swiss-Prot. These proteins are homologues belonging to different species, Homo sapiens and chicken.

35 Terms: Term: linguistic realisation of specialised concepts, e.g. genes, proteins, diseases. Terminology: collection of terms structured (hierarchy) to denote relationships among concepts: part-whole, is-a, specific, generic, etc. Terms link text and ontologies. Mapping is not trivial (main challenge).

36 Term variation and ambiguity: TEXT: Term, Term2, Term. Term variation; term ambiguity. ONTOLOGY: Concept, concept2, concept3.

37 Term mining steps: Term recognition (Tp53). Term classification (Gene). Term mapping (Genome Database, IARC TP53 Mutation Database).

38 Term recognition techniques: ATR (automatic term recognition) extracts terms (variants) from a collection of documents and distinguishes terms from non-terms. In NER the steps of recognition and classification are merged; a classified terminological instance is a named entity. The tasks of ATR and NER share techniques but their ultimate goals are different: ATR for resource building (lexica and ontologies); NER as the first step of IE and text mining.

39 Overview papers: S. Ananiadou & G. Nenadic (2006) Automatic Terminology Management in Biomedicine, Text Mining for Biology and Biomedicine, pp; M. Krauthammer & G. Nenadic (2004) Term identification in the biomedical literature, JBI 37 (2004); J.C. Park & J. Kim (2006) Named Entity Recognition, Text Mining for Biology and Biomedicine, pp. Detailed bibliography in Bio-Text Mining: BLIMP http://blimp.cs.queensu.ca/. Book on BioText Mining: S. Ananiadou & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House. Other Bio-Text Mining tutorials: Kevin Cohen (NAACL 2007 tutorial), U. Colorado.

40 Biomedical IE/IR Systems: iHOP, EBIMed, GoPubMed, PubFinder, Textpresso

41 Acronyms: Very productive type of term variation. Acronym variation (synonymy): NF kappa B / NF kB / nuclear factor kappa B. Acronym ambiguity (polysemy), even in controlled vocabularies: GR = glucocorticoid receptor or glutathione reductase.

42 Acronym recognition: Schwartz, A. & Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text, PSB 2003, 8; Adar, E. (2004) SaRAD: a simple and robust abbreviation dictionary, Bioinformatics, 20(4); Chang, J.T. & Schütze, H. (2006) Abbreviations in biomedical text, Text Mining for Biology and Biomedicine, pp, Artech; Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A machine learning approach to automatic acronym generation, ISMB, BioLink SIG, 25-31; Okazaki, N. & Ananiadou, S. (2006) Acronym recognition based on term identification, Bioinformatics

43 The importance of acronym recognition: Acronyms are among the most productive types of term variation: 64,242 new acronyms were introduced in 2004 [Chang and Schütze 06]. Acronyms are used more frequently than full terms: 5,477 documents could be retrieved by using the acronym JNK, while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05]. There are no rules or exact patterns for the creation of acronyms from their full forms.

44 Recognition: Extracting pairs of short and long forms. Distinguishing acronyms from parenthetical expressions. Search for parentheses in text; single or more words, e.g. Ab (antibody). Limit context around ( ); limit number of words according to number of letters in acronym.

45 Recognition (heuristics): Heuristics match letters of the acronym with letters of the long form using rules and patterns: letters from the beginning of words; combining forms, e.g. carboxyfluorescein diacetate (CFDA). Acronym normalisation to allow orthographic, structural and lexical variations; morphological information, positional information. Penalise words in the long form that do not match the acronym. Accidental matching: argininosuccinate synthetase (AS), A S.

46 Letter matching: Alignment: find all matches between letters of acronyms and their long forms and calculate likelihood (Chang & Schütze). Solves the problem of acronyms containing letters not occurring in the long form (LF). Choose the best alignment based on features, e.g. position of letter. Finding the optimal weight for each feature is a challenge.

47 Acronym Recognition Okazaki, N., Ananiadou, S. (2006) Building an abbreviation dictionary using a term recognition approach. Bioinformatics. S.Ananiadou NaCTeM

48 A simple algorithm – Schwartz and Hearst (2003): Uses parenthetical expressions as a marker of a short form: … long-form ‘(’ short-form ‘)’ … All letters and digits in a short form must appear in the corresponding long form in the same order. Examples: “We used hidden markov model (HMM) to …”; “Early repolarization (ER) is an enigma.” Misrecognition of forms.
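
The letter-matching rule on slide 48 can be illustrated with a right-to-left scan in the spirit of Schwartz and Hearst (2003). The sketch below is a Python rendering written for this transcript, not the authors' code: it assumes the candidate text before the parenthesis has already been cut out, and the function name `find_long_form` is hypothetical.

```python
def find_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` whose characters contain the
    letters and digits of `short_form` in order, or None if there is no match.

    The first character of the short form must match the start of a word.
    """
    text = candidate.lower()
    s = len(short_form) - 1   # index into the short form, moving right to left
    l = len(text) - 1         # index into the candidate, moving right to left
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1            # skip hyphens and other punctuation in the short form
            continue
        # Move left until the character matches; for the first short-form
        # character, also require that the match starts a word.
        while (l >= 0 and text[l] != ch) or (
            s == 0 and l > 0 and text[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None       # ran out of candidate text: no long form
        l -= 1
        s -= 1
    # Extend the match leftwards to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


if __name__ == "__main__":
    print(find_long_form("HMM", "We used hidden markov model"))
    print(find_long_form("ER", "Early repolarization"))
    print(find_long_form("TTF-1", "thyroid transcription factor"))
```

The first two calls recover hidden markov model and Early repolarization; the third returns None because the digit 1 never occurs in the candidate text.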

49 Problems of letter-matching approach: Highly dependent on the expressions in the target text: o acquired immuno deficiency syndrome (AIDS); x acquired syndrome (AIDS); x a patient with human immunodeficiency syndrome (AIDS); ? magnetic resonance imaging unit (MRI); ! beta 2 adrenergic receptor (ADRB2); ! gamma interferon (IFN-GAMMA). (These examples are obtained from actual MEDLINE abstracts.) Naive with respect to term variations.

50 AcroMine’s approach: Extract a word or word sequence that: co-occurs frequently with an acronym (e.g., TTF-1): 1, factor 1, transcription factor 1, thyroid transcription factor 1; does not co-occur with other surrounding words: thyroid transcription factor 1. Not necessarily based on letter-matching (note that this is a difficult case for the letter-matching algorithm). Prune unlikely candidates: nested candidates (transcription factor 1); expansions (expression of thyroid transcription factor 1); insertions (thyroid specific transcription factor 1).

51 Short-form mining: Enumerate all short forms in a target text, using parentheses as a clue: … ‘(’ short-form ‘)’ … Validation rules for identifying acronyms [Schwartz and Hearst 03]: it consists of at most two words; its length is between two and ten characters; it contains at least one alphabetic letter; the first character is alphanumeric. The contextual sentence of HMM and ASR: “The present system consists of a hidden Markov model (HMM) based automatic speech recognizer (ASR), with a keyword spotting system to capture the machine sensitive words (registered in a dictionary) from the running utterances.”
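
The four validation rules on slide 51 translate almost directly into code. The following Python sketch assumes short forms are taken from non-nested parentheses; the function names are hypothetical.

```python
import re


def is_valid_short_form(text):
    """Check the four validation rules listed on the slide."""
    text = text.strip()
    if len(text.split()) > 2:                  # at most two words
        return False
    if not 2 <= len(text) <= 10:               # between two and ten characters
        return False
    if not any(c.isalpha() for c in text):     # at least one alphabetic letter
        return False
    return text[0].isalnum()                   # first character is alphanumeric


def short_forms(sentence):
    """Enumerate parenthetical expressions that pass the validation rules."""
    return [m for m in re.findall(r"\(([^()]*)\)", sentence) if is_valid_short_form(m)]


if __name__ == "__main__":
    sentence = (
        "The present system consists of a hidden Markov model (HMM) based "
        "automatic speech recognizer (ASR), with a keyword spotting system to "
        "capture the machine sensitive words (registered in a dictionary) from "
        "the running utterances."
    )
    print(short_forms(sentence))
```

On the slide's contextual sentence this keeps HMM and ASR and rejects the four-word parenthetical "registered in a dictionary".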

52 Enumerating long-form candidates for an acronym: Tokenize a contextual sentence by non-alphanumeric characters (e.g., space, hyphen, etc.). Apply Porter’s stemming algorithm [Porter 80]. Extract terms that match the following pattern: [:WORD:].*$ (empty string or words of any length). Example: “We studied the expression of thyroid transcription factor-1 (TTF-1).” Candidates: 1; factor 1; transcript factor 1; thyroid transcript factor 1; expression of thyroid transcript factor 1; studi the expression of thyroid transcript factor 1; of thyroid transcript factor 1; thyroid transcript
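
Candidate enumeration as described on slide 52 amounts to listing every word sequence that ends immediately before the parenthesised short form. A minimal Python sketch follows; it deliberately omits the Porter stemming step to stay dependency-free, so the candidates keep their surface forms, and the function name is hypothetical.

```python
import re


def long_form_candidates(sentence, short_form):
    """List word sequences that end right before '(short_form)'.

    Assumption: stemming is skipped here; the slide applies Porter's stemmer
    to the tokens before enumerating candidates.
    """
    marker = "(" + short_form + ")"
    pos = sentence.find(marker)
    if pos < 0:
        return []
    # Tokenize the left context on non-alphanumeric characters.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", sentence[:pos]) if t]
    # Every suffix of the token sequence is a candidate, shortest first.
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]


if __name__ == "__main__":
    text = "We studied the expression of thyroid transcription factor-1 (TTF-1)."
    for candidate in long_form_candidates(text, "TTF-1"):
        print(candidate)
```

Collecting these candidates over many contextual sentences gives the co-occurrence frequencies that the later scoring and pruning steps work on.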

53 Expansions for TTF-1

54 Top 20 acronyms in MEDLINE

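55 Long-form candidates for acronym ADM (columns: Length, Frequency, Score, Validity)
adriamycin: length 1, frequency 727, score 721.4, validity o
adrenomedullin: frequency 247, score 241.7
abductor digiti minimi: length 3, frequency 78, score 74.9
doxorubicin: frequency 56, score 54.6, validity x
effect of adriamycin: frequency 25, score 23.6, Expansion
adrenodemedullated: frequency 19, score 17.7
acellular dermal matrix: frequency 17, score 15.9
peptide adrenomedullin: length 2, score 15.1
effects of adrenomedullin: frequency 15, score 13.2
resistance to adriamycin
amyopathic dermatomyositis: frequency 14, score 12.8
brevis and abductor digiti minimi: length 5, frequency 11, score 9.8
minimi: frequency 83, score 5.8, Nested
digiti minimi: frequency 80, score 3.9
automated digital microscopy: 0.0, match
adrenomedullin concentration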

56 Long-form extraction: Long-form candidates are sorted by score in descending order. A long-form candidate is considered valid if: it has a score greater than 2.0; the words in the long form can be rearranged so that all alphanumeric letters appear in the same order as the short form; it is not nested in, or an expansion of, the previously chosen long forms.
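
The three validity conditions on slide 56 can be sketched as a filter over pre-scored candidates. The scoring function itself is not given in these slides, so the sketch takes scores as input; the example scores are the ones listed for ADM on slide 55. All function names are hypothetical, and the word-rearrangement check simply tries every permutation, which is only reasonable for short candidates.

```python
from itertools import permutations


def letters_in_order(short_form, text):
    """True if the alphanumeric characters of short_form occur in text in order."""
    chars = iter(text.lower())
    return all(ch in chars for ch in short_form.lower() if ch.isalnum())


def matches_after_rearrangement(short_form, long_form):
    """True if some ordering of the long form's words matches the short form."""
    words = long_form.split()
    return any(letters_in_order(short_form, " ".join(p)) for p in permutations(words))


def contains_phrase(longer, shorter):
    """True if the word sequence `shorter` occurs contiguously inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def select_long_forms(short_form, scored_candidates, threshold=2.0):
    """Apply the slide's rules to (candidate, score) pairs, best score first."""
    chosen = []
    for cand, score in sorted(scored_candidates, key=lambda pair: -pair[1]):
        if score <= threshold:
            continue  # rule 1: score must exceed the threshold
        if not matches_after_rearrangement(short_form, cand):
            continue  # rule 2: letters must match after rearranging words
        if any(contains_phrase(cand, c) or contains_phrase(c, cand) for c in chosen):
            continue  # rule 3: not nested in, or an expansion of, a chosen form
        chosen.append(cand)
    return chosen


if __name__ == "__main__":
    adm = [
        ("adriamycin", 721.4),
        ("adrenomedullin", 241.7),
        ("abductor digiti minimi", 74.9),
        ("doxorubicin", 54.6),
        ("effect of adriamycin", 23.6),
    ]
    print(select_long_forms("ADM", adm))
```

On these five candidates the filter keeps adriamycin, adrenomedullin and abductor digiti minimi, drops doxorubicin (no letter match) and drops effect of adriamycin (an expansion of an already chosen form), in line with the validity marks on slide 55.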

58 Acronym disambiguation: Local acronyms accompany their expanded forms in documents. Global acronyms appear in documents without the expanded forms stated; their correct expanded forms need to be identified. Without expansions: “Immunomodulatory effects of CT were investigated in a rat model, and the effects of CT on rat renal allograft (from Lewis rat to WKAH rat) were also examined.” With expansions: “Immunomodulatory effects of cholera toxin (CT) were investigated in a rat model, and the effects of cholera toxin (CT) on rat renal allograft (from Lewis rat to Wistar-King-Aptekman-Hokudai (WKAH) rat) were also examined.”

59 Acronym disambiguation. How it works for local abbreviations: 1) Identify local definitions of an abbreviation by looking up the Acromine dictionary, using parentheses as clues for short forms and matching the expressions before the parentheses to the entries in the Acromine dictionary. 2) If the short form is not registered in the dictionary, apply the SaRAD algorithm to find the long form before the short form. This step handles the definitions of less frequent abbreviations, which are not covered by Acromine. How it works for global abbreviations: 1) Assume words that contain more than two capital letters to be possible global abbreviations (e.g., HMM, PC, …). 2) Look up the Acromine dictionary to obtain a set of candidate definitions for the abbreviation. 3) Calculate the probability of each candidate definition. This is done by a naïve Bayes classifier trained on contextual sentences (sentences with abbreviations and their definitions expressed by MEDLINE authors). The classifier uses features based on the context words (words in the sentence where the global abbreviation appeared). 4) If a local definition of the abbreviation is found in the previous text, narrow the candidates to the local definitions. For example, on finding the sentence “The system uses hidden markov model (HMM)” before a target global abbreviation, HMM, the system prunes other definitions (e.g., heavy meromyosin) from the candidates. This is based on the assumption that the definition of an abbreviation within a document should be consistent. Sample text: Considerations in the identification of functional RNA structural elements in genomic alignments (Tomas Babak et al)
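
Steps 3 and 4 for global abbreviations can be sketched with a small naive Bayes classifier over context words. This is a toy illustration, not the Acromine implementation: the two training sentences are invented for the example (real training data would be contextual sentences from MEDLINE), add-one smoothing is an assumption, and the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumption: two invented training sentences, one per candidate definition.
TOY_EXAMPLES = [
    ("hidden markov model", "the speech recognizer is trained with a statistical model"),
    ("heavy meromyosin", "actin filaments bind the myosin fragment in muscle"),
]


class NaiveBayesDisambiguator:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # definition -> context word counts
        self.sense_counts = Counter()            # definition -> number of sentences
        self.vocab = set()

    def train(self, examples):
        for definition, sentence in examples:
            words = sentence.lower().split()
            self.sense_counts[definition] += 1
            self.word_counts[definition].update(words)
            self.vocab.update(words)

    def log_scores(self, context, candidates):
        """Log-probability of each candidate given the context words (add-one smoothing)."""
        total = sum(self.sense_counts[c] for c in candidates)
        scores = {}
        for c in candidates:
            logp = math.log((self.sense_counts[c] + 1) / (total + len(candidates)))
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in context.lower().split():
                logp += math.log((self.word_counts[c][word] + 1) / denom)
            scores[c] = logp
        return scores

    def disambiguate(self, context, candidates, local_definitions=()):
        # Step 4: a definition already given earlier in the document wins.
        local = [c for c in candidates if c in local_definitions]
        if local:
            candidates = local
        scores = self.log_scores(context, candidates)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    nb = NaiveBayesDisambiguator()
    nb.train(TOY_EXAMPLES)
    candidates = ["hidden markov model", "heavy meromyosin"]
    print(nb.disambiguate("the speech recognizer uses HMM", candidates))
```

Passing `local_definitions` restricts the candidate set before scoring, which mirrors the one-definition-per-document assumption stated on the slide.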

60 Term structuring: term clustering (linking semantically similar terms) and term classification (assigning terms to classes from a pre-defined classification scheme). Hypothesis: similar terms tend to appear in similar contexts (patterns). Combining various sources of similarity: lexical, syntactic, contextual, ontological (using external resources).

61 Term structuring: Based on term similarities. Choice of features: domain-specific → ontology → ontology-based similarity; linguistic → text → textual similarity (internal features, contextual features).

62 Using ontologies: two terms should match if they are: identified as variants; siblings in the is-a hierarchy; in the is-a or part-whole relation. The distance between the corresponding nodes in the ontology should be transformed into the matching score. ► I. Spasic presentation, MIE Tutorial

63 Using text: Number of neologisms: terms are not in the ontologies. Use of text-based techniques to calculate similarities. Edit distance (ED): the minimal number (or cost) of changes needed to transform one string into the other. Edit operations: insertion (a-c → abc), deletion (abc → a-c), replacement (abc → adc), transposition (abc → acb). Use of dynamic programming.
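
The edit distance with the four operations named on slide 63 can be computed with the usual dynamic programming table. The sketch below assumes unit cost for every operation and uses the restricted (optimal string alignment) treatment of transpositions; the slides do not say which variant or costs were intended.

```python
def edit_distance(a, b):
    """Edit distance with insertion, deletion, replacement and adjacent
    transposition, each at unit cost (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for source, target in [("ac", "abc"), ("abc", "ac"), ("abc", "adc"), ("abc", "acb")]:
        print(source, target, edit_distance(source, target))
```

Each of the slide's four example pairs is one edit apart under this definition.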

64 Pattern-matching IE: Usual limitations due to non-inclusion of semantic processing. Large number of surface grammatical structures = too many patterns (Zipf’s law). Cannot explore syntactic generalisations (active, passive voice). Systems extract phrases or entire sentences with matched patterns; restricted usefulness for subsequent mining.

65 Pattern-matching systems (1): BioIE uses patterns to extract sentences on protein families, structures, functions, etc.; presents the user with relevant information, an improvement over classic IR. BioRAT uses “deeper” analysis: tagging, applying REs over POS tags, stemming, gazetteer categories, etc.; templates are applied to extract matching phrases; primitive filters (verbs are not proteins, etc.).

66 Pattern-matching systems (2): RLIMS-P (Hu) extracts protein phosphorylation by looking for enzymes, substrates and sites assigned to the agent, theme and site roles of phosphorylation relations. POS tagger trained on newswire, chunking, semantic typing of chunks, identification of relations using pattern-matching rules. Semantic typing of NPs uses a combination of clue words, suffixes, acronyms, etc. Semantically typed sentences are matched with rules. Patterns target sentences containing phosphorylate.

67 Full parsing approaches: Link Grammar applied to protein-protein interactions; general English grammar adapted to bio-text. Link Grammar finds all possible linkages according to its grammar. Number of analyses reduced by random sampling and heuristics; processing constraints relaxed; 10,000 results permitted per sentence. 60% of protein interactions extracted. Problems: missing possessive markers and determiners, coordination of compound noun modifiers.

68 Full parsing IE (2): Not all parsing strategies are suitable for bio-text mining. Text type, abstracts: is “ungrammaticality” related to sublanguage characteristics? Ambiguity and full parsing; fragmentary phrases (titles, headings, text in table cells, etc.). The CADERIGE project used Link Grammar but in shallow parsing mode. Kim & Park (BioIE) use combinatory categorial grammar, annotated with GO concepts, to extract general biological interactions; 1,300 patterns applied to find instances of patterns with keywords.

69 Full parsing (3): Keywords indicate basic biological interactions. Patterns find potential arguments of the interaction keywords (verbs or nominalisations). Validated arguments are mapped into GO concepts. Difficult to generalise interaction keyword patterns. BioIE’s syntactic parsing performance improved after adding subcategorisation frames on verbal interaction keywords.

70 Full parsing (4): Daraselia (2004) uses full parsing and a domain-specific filter to extract protein interactions. All syntactic analyses are discovered using a CFG and a variant of LFG. Each alternative parse is mapped to its corresponding semantic representation. Output = set of semantic trees, with lexemes linked by relations indicating thematic or attributive roles. A custom-built, frame-based ontology is applied to filter the representations of each sentence. A preference mechanism controls construction of the frame tree; high precision, low recall (21%).

71 Sublanguage-driven IE (1): Language of a special community (e.g. biology). Particular set of constraints with respect to general language (GL). Constraints operate at all linguistic levels: special vocabulary (terms); specialised term formation rules; sublanguage syntactic patterns; sublanguage semantics. These constraints give rise to the informational structure of the domain (Z. Harris). See JBI 35(4), Special Issue on Sublanguage.

72 GENIES system: Employs the sublanguage (SL) approach to extract biomolecular interactions. Uses hybrid syntactic-semantic rules: syntactic and semantic constraints referred to in one rule. Able to cope with complex sentences. Frame-based representation; embedded frames. Domain-specific ontology covers both entities and events.

73 GENIES system: Default strategy: full parsing; robust due to sublanguage constraints; much ambiguity excluded. If the full parse fails, partial parsing is invoked; maintains a good level of recall. Precision: 96%, recall: 63%.

74 Ontology-driven IE: Until recently most rule-based IE systems used neither linguistic lexica nor ontologies: reliance on gazetteers; small number of semantic categories. Gazetteer approach not well suited to bio-IE. Ontology-based vs ontology-driven: passive use of ontologies (map discovered entity to concept) vs active use (ontology guides and constrains analysis, fewer rules). Examples: PASTA, GenIE (not SL); GENIES (SL and ontology-driven).

75 Summary: simple pattern matching. Over text strings: many patterns required, no generalisation possible. Over POS: some generalisation but ignores sentence structure. POS tagging, chunking, semantic pattern matching, typing: limited generalisation, some account taken of structure, limited consideration of SL patterns.

76 Summary: full parsing. Full parsing on its own, or parsing done in combination with other techniques (chunking, partial parsing, heuristics) to reduce ambiguity and filter out implausible readings. GL theories not appropriate: difficult to specialise for biotext; many analyses per sentence; missing information due to sublanguage meaning.

77 Summary: sublanguage approach. Exploits a rich SL lexicon. Describes SL verbs in detail. Syntactic-semantic grammar. Current systems would benefit from adopting an ontology-driven approach.

78 Ontology-driven: Uses event concept frames to guide processing. Integration of extracted information. Current systems would benefit from also adopting the SL approach.

79 Domain Adaptation from News to Biomedical?

80 What is domain adaptation?

81 Example: named entity recognition (persons, locations, organizations, etc.). Standard supervised learning: train (labeled) on New York Times, test (unlabeled) on New York Times. NER classifier: 85.5%.

82 Example: named entity recognition (persons, locations, organizations, etc.). Non-standard (realistic) setting: New York Times labeled data not available; train (labeled) on Reuters, test (unlabeled) on New York Times. NER classifier: 64.1%.

83 Domain difference → performance drop. Ideal setting: NER classifier, train on NYT, test on NYT: 85.5%. Realistic setting: NER classifier, train on Reuters, test on NYT: 64.1%.

84 Another NER example. Ideal setting: gene name recognizer, train on mouse, test on mouse: 54.1%. Realistic setting: gene name recognizer, train on fly, test on mouse: 28.1%.

85 Other examples: Spam filtering: public collection → personal inboxes. Sentiment analysis of product reviews: digital cameras → cell phones; movies → books. Can we do better than standard supervised learning? Domain adaptation: to design learning methods that are aware of the training and test domain difference.

86 How do we solve the problem in general?

87 Observation 1: domain-specific features. wingless, daughterless, eyeless, apexless, …

88 Observation 1: domain-specific features. wingless, daughterless, eyeless, apexless, …: describing phenotype; in fly gene nomenclature, feature “-less” weighted high. CD38, PABPC5, …: feature still useful for other organisms? No!

89 Observation 2: generalizable features. “…decapentaplegic and wingless are expressed in analogous patterns in each…”; “…that CD38 is expressed by both neurons and glial cells…”; “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

90 Observation 2: generalizable features. “…decapentaplegic and wingless are expressed in analogous patterns in each…”; “…that CD38 is expressed by both neurons and glial cells…”; “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.” Feature: “X be expressed”.

91 General idea: two-stage approach. Source domain, target domain; features: generalizable features, domain-specific features.

92 Goal Source Domain Target Domain features

93 Regular classification: Source Domain, Target Domain, features

94 Generalization: to emphasize generalizable features in the trained model (Stage 1). Source Domain, Target Domain, features

95 Adaptation: to pick up domain-specific features for the target domain (Stage 2). Source Domain, Target Domain, features

96 Regular semi-supervised learning: Source Domain, Target Domain, features

97 Experiments (Jiang and Zhai, 07): domain-adaptive SSL is more effective, especially with a small number of pseudo labels
