Marko Grobelnik, Dunja Mladenic J.Stefan Institute Slovenia

Author: Cody Charles

1 ACAI 05 Advanced Course on Knowledge Discovery
Knowledge Discovery
Marko Grobelnik, Dunja Mladenic
J. Stefan Institute, Slovenia

2 Contents
Knowledge Discovery
Large Scale Topic Ontology Population
Extraction of Semantic Networks from Text
Active Learning for Efficient Use of Human Interventions
Methods Addressing Different Aspects of Ontology Construction
Final Remarks

3 Why is Knowledge Discovery appropriate for the Semantic Web?
Idea: let a computer search for knowledge, while humans give only broad directions about where and how to search.
Knowledge discovery (KD) can be defined as a research area with several subfields: Machine Learning, Data Mining and Databases (Mitchell, 1997; Fayyad et al., 1996; Witten and Frank, 1999; Hand et al., 2001).
KD techniques are mainly about discovering structure in data, so they can serve as one of the key mechanisms for structuring knowledge into an ontological structure that is then used in the knowledge management process.
Data and the corresponding semantic structures change over time; the KD sub-field called “stream mining” deals with these kinds of problems.
The Semantic Web is ultimately concerned with real-life data on the web, which grows exponentially; scalability is one of the central issues in KD.

4 Machine Learning view to Ontology Generation

5 Knowledge Discovery Techniques
Knowledge discovery technologies can be used to support different phases and scenarios of ontology generation.
Observations:
Completely automatic construction of ontologies is in general not possible, for theoretical reasons (e.g., the information bottleneck) and practical reasons (e.g., the soft nature of the knowledge being conceptualized).
Human interventions are necessary but costly in terms of resources
…therefore the technology should help use human interventions efficiently.
Document databases are the most common data type conceptualized in the form of ontologies.

6 What is Ontology?
In most ML contexts we can refer to an ontology as a graph/network structure consisting of:
a set of concepts (vertices in a graph)
…each concept Ci is described by a membership function ci(x)
a set of relations connecting concepts (directed edges in a graph)
…each relation Ri is described by a membership function ri(Ci, Cj)
a set of instances (data records assigned to concepts or relations)
…each instance Ii is described by a set of features Fi,j

7 We have 7 concepts (C1…C7) and 3 relations (R1…R3); each concept and relation is populated by a number of instances (data records).
[Figure: directed graph over the concepts C1…C7 with edges labeled R1, R2 and R3]

8 Ontology Definition
An ontology is defined as a tuple of 5 sets of objects:
$\mathit{Ontology} = \langle \mathit{Classes}, \mathit{Relations}, \mathit{Instances}, \mathit{Class\text{-}Definitions}, \mathit{Relation\text{-}Definitions} \rangle$
…in short: $O = \langle C, R, I, CD, RD \rangle$
…where
Classes: set of labels Ci
Relations: set of labels Ri
Instances: set of instance feature vectors Ii
Class-Definitions: set of class membership functions CDi
Relation-Definitions: set of relation membership functions RDi
…the idea is to describe “ontology learning tasks” in the above terms

9 Ontology Learning
Ontology learning is a set of tasks based on the previous ontology definition.
We define ontology learning tasks as mappings between ontology components: some components are given, some are missing, and we want to induce the missing ones.
Some typical scenarios:
Inducing classes / clustering of instances: C, CD = f(I)
Ontology population: CD, RD = f(C, R, I)
Ontology generation: C, R, CD, RD = f(I) (the hardest task)
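A minimal Python sketch of the 5-tuple and of the simplest scenario, C, CD = f(I), may make the "mapping between components" view concrete. The names `Ontology` and `induce_classes` and the threshold rule are illustrative assumptions, not part of the slides; a real system would use a clustering algorithm instead of a fixed threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Instance = Sequence[float]  # an instance feature vector I_i


@dataclass
class Ontology:
    """The tuple O = (C, R, I, CD, RD); field names are illustrative."""
    classes: List[str] = field(default_factory=list)          # C: class labels
    relations: List[str] = field(default_factory=list)        # R: relation labels
    instances: List[Instance] = field(default_factory=list)   # I: feature vectors
    class_defs: Dict[str, Callable[[Instance], bool]] = field(default_factory=dict)    # CD
    relation_defs: Dict[str, Callable[[str, str], bool]] = field(default_factory=dict)  # RD


def induce_classes(instances, threshold):
    """Toy instance of C, CD = f(I): split instances on their first feature."""
    return Ontology(
        classes=["low", "high"],
        instances=list(instances),
        class_defs={
            "low": lambda x: x[0] < threshold,
            "high": lambda x: x[0] >= threshold,
        },
    )


onto = induce_classes([(0.1,), (0.9,), (0.4,)], threshold=0.5)
# Apply each membership function to every instance.
members = {c: [x for x in onto.instances if f(x)] for c, f in onto.class_defs.items()}
print(members)
```

Ontology population and ontology generation differ only in which fields are given as input and which are returned.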

10 Representational language
When learning the function f, we need to select a language for representing the membership function f.
Examples:
Linear functions (Support Vector Machines, …)
Propositional logic (decision trees, rules, …)
First-order logic (Inductive Logic Programming)
…by selecting a representation language we decide on
…the power of the descriptions
…the complexity of computation

11 Ontology Quality
For the same set of instances I we can have multiple ontologies OI.
We need a function q for measuring the quality of a given ontology OI
…the function q returns a numerical value
…the best ontology is the one with the highest quality
Possible evaluation measures:
(1) analysis of statistical properties of structured data,
(2) agreement with the properties derived from manually built ontologies,
(3) optimization of the efficiency of the user's behaviour when using an ontology,
(4) using background knowledge, and
(5) building hybrid measures (combinations of various approaches).

12 Search for the “optimal” Ontology
Given a set of instances I, we develop a series of ontologies O1, O2, O3, …
…where a set of transformation operators (refinement operators) takes us from Oi to Oi+1
A good search procedure selects the transformations that lead efficiently towards the highest quality q(Oi)
…this formulation is in line with “machine learning with structured output”
…we could keep a human in the loop by using active learning techniques
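One simple way to read this formulation is as greedy hill climbing over refinement operators. The sketch below is an assumption about how such a search could look, not the authors' procedure; the toy "ontology" is just a number of clusters k, and `q` is a made-up quality function that peaks at k = 4.

```python
def greedy_search(start, operators, q, max_steps=100):
    """Hill climbing: repeatedly apply the refinement with the highest quality."""
    current, history = start, [start]
    for _ in range(max_steps):
        candidates = [op(current) for op in operators]
        best = max(candidates, key=q, default=current)
        if q(best) <= q(current):
            break  # no operator improves the quality any further
        current = best
        history.append(current)
    return current, history


# Toy setting (hypothetical): an "ontology" is its number of clusters k.
def q(k):
    return -(k - 4) ** 2  # quality is highest at k = 4


operators = [
    lambda k: k + 1,           # split a cluster
    lambda k: max(k - 1, 1),   # merge two clusters
]

best, history = greedy_search(1, operators, q)
print(best, history)
```

Greedy search can stop in a local optimum; an active learning variant would ask a human to judge a candidate only when the quality estimates of the top candidates are close.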


14 Large Scale Topic Ontology population

15 Text categorization into large topic ontology
Categorization of documents into a large topic ontology is one of the problems in text mining:
…it needs to be scalable
…e.g. able to handle DMoz’s 600K categories and 4M docs
…it needs to be accurate
…with accuracy at the level of inter-human agreement (60-80%)
…it needs to be robust
…taking into account the nature of web pages (typically mixed-quality content and often high-quality context)

16 Approaches for handling hierarchy of categories
There are several topic ontologies (taxonomies) of textual documents: Yahoo, DMoz, Medline, …
Different people use different approaches:
…a series of hierarchically organized classifiers
…a set of independent classifiers just for the leaves
…a set of independent classifiers for all nodes

17 Yahoo! topic ontology (taxonomy)
human-constructed hierarchy of Web documents
exists in several languages
easy to access and regularly updated
captures most of the Web topics
English version includes over 2M pages categorized into 50,000 categories
contains about 250 MB of HTML files

18 Document to categorize: CFP for CoNLL-2000

19 Some predicted categories

20 System architecture
[Diagram: labeled documents from the Web (from the Yahoo! hierarchy) → feature construction (vectors of n-grams) → subproblem definition, feature selection, classifier construction → document classifier; an unlabeled document is passed to the document classifier, which outputs its category (label)]

21 Content categories
For each content category, generate a separate classifier that predicts the probability that a new document belongs to that category
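A sketch of the one-classifier-per-category idea, assuming a binary multinomial Naive Bayes model with Laplace smoothing for each category. The class name `CategoryClassifier`, the training sentences and the whitespace tokenization are illustrative assumptions; the actual system used n-gram features and heavy feature selection.

```python
import math
from collections import Counter


class CategoryClassifier:
    """Binary Naive Bayes estimating P(category | document) for one category."""

    def __init__(self, positive_docs, negative_docs):
        # Both lists must be non-empty, otherwise the log priors are undefined.
        self.counts = {True: Counter(), False: Counter()}
        for label, docs in ((True, positive_docs), (False, negative_docs)):
            for doc in docs:
                self.counts[label].update(doc.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        n_pos, n_neg = len(positive_docs), len(negative_docs)
        self.log_prior = {True: math.log(n_pos / (n_pos + n_neg)),
                          False: math.log(n_neg / (n_pos + n_neg))}

    def _log_score(self, words, label):
        # Laplace smoothing: add one count for every vocabulary word.
        total = sum(self.counts[label].values()) + len(self.vocab)
        return self.log_prior[label] + sum(
            math.log((self.counts[label][w] + 1) / total)
            for w in words if w in self.vocab)

    def probability(self, doc):
        words = doc.lower().split()
        pos = self._log_score(words, True)
        neg = self._log_score(words, False)
        return 1.0 / (1.0 + math.exp(neg - pos))


labeled = {  # made-up training data
    "Science": ["machine learning conference", "language learning workshop"],
    "Sports": ["football match results", "tennis match schedule"],
}
classifiers = {
    cat: CategoryClassifier(
        docs, [d for other, ds in labeled.items() if other != cat for d in ds])
    for cat, docs in labeled.items()
}
probs = {cat: clf.probability("call for papers learning workshop")
         for cat, clf in classifiers.items()}
print(probs)
```

Because every category has its own independent classifier, the probabilities need not sum to one across categories, and a document can be assigned to several categories.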

22 Summary of experimental results on Yahoo!

23 DMoz / ODP is the largest topic ontology on the web:
4M sites
68k editors
600k concepts

24 Categorization into DMoz
On input we take DMoz RDF taxonomy data
…from …
…we preprocess it into an efficient binary structure
…next, we build a classification model consisting of models for individual categories
We take the hierarchical nature of the taxonomy into account
Using the classification model we classify new documents into the taxonomy
On output we get, for a given document text and URL:
Set of most relevant categories from DMoz
Set of most relevant keywords calculated from DMoz category names (segments from the path names)
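The keyword output can be illustrated with a short sketch: each segment of a predicted category path becomes a keyword candidate, weighted by the scores of the categories it occurs in. The weighting scheme, the function name `keywords_from_categories`, the example paths and the scores are assumptions for illustration; the slides only say that keywords are calculated from segments of the path names.

```python
from collections import Counter


def keywords_from_categories(scored_categories, top_k=5):
    """Sum category scores over the path segments they contain."""
    weights = Counter()
    for path, score in scored_categories:
        for segment in path.split("/"):
            if segment and segment != "Top":  # skip the root segment
                weights[segment.replace("_", " ")] += score
    return [kw for kw, _ in weights.most_common(top_k)]


# Hypothetical classifier output: (category path, relevance score).
predicted = [
    ("Top/Science/Astronomy/Observatories", 0.6),
    ("Top/Science/Technology/Space", 0.3),
    ("Top/Science/Astronomy", 0.1),
]
keywords = keywords_from_categories(predicted, top_k=3)
print(keywords)
```

Segments shared by several relevant categories (here "Science" and "Astronomy") accumulate weight and rise to the top.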

25 What is used for learning?
Currently the system uses hierarchical nearest neighbor
…in the past we experimented with Naïve Bayes for the Yahoo taxonomy (http://kt.ijs.si/Dunja/yplanet.html)
…heavy feature selection was needed
…we plan to experiment with Support Vector Machine (SVM) algorithms
…we plan to use this for the ACM KDD Cup 2005 Challenge
Scalability is a problem for learning and classification when dealing with 600K classes and 4M documents
The approaches still need to be properly evaluated
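The slides do not spell out the hierarchical nearest neighbor method. One plausible reading, sketched here purely as an assumption, is to represent each category by a centroid of term counts and to descend the taxonomy by following the most similar child at each level, so that only a small part of the 600K categories is compared with any one document. The tree, the centroids and the function names are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(doc, tree, centroids, root="Top"):
    """Descend from the root, following the most similar child at every level."""
    vec = Counter(doc.lower().split())
    path, node = [root], root
    while tree.get(node):  # stop when the node has no children
        node = max(tree[node], key=lambda child: cosine(vec, centroids[child]))
        path.append(node)
    return path


# Hypothetical miniature taxonomy and category centroids.
tree = {"Top": ["Science", "Sports"], "Science": ["Astronomy", "Biology"]}
centroids = {
    "Science": Counter("telescope cell research space".split()),
    "Sports": Counter("football match team".split()),
    "Astronomy": Counter("telescope space star galaxy".split()),
    "Biology": Counter("cell gene protein".split()),
}
path = classify("Hubble space telescope images", tree, centroids)
print("/".join(path))
```

A greedy descent like this trades accuracy for speed: an early wrong turn cannot be corrected, so practical variants often keep several best children per level.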

26 Performance issues
Preprocessing of DMoz (from RDF to classification model) takes approx. 1 h
For classification into the whole of DMoz we need Win64 with at least 6 GB of memory
…subsets of DMoz run on Win32 with 2 GB
Classification into DMoz is fast
…~20 document classifications per second
…e.g. the whole of Wikipedia was classified into DMoz in several hours

27 Demos Demo software for classification into available at (~40Mb) …includes AVI file with demo movie …demo runs at Demo for classification into the whole DMoz (all 600K classes) runs at

28 Example classification of URL of a web page
[Screenshot: keywords and categories produced by classification of the Hubble telescope web page]

29 Example classification of URL + text of a web page


31 Extracting Semantic Graph from text

32 Summarization with semantic graph (Leskovec, Grobelnik, Milic-Frayling 2005)
Idea: extract a semantic network from text documents and identify the parts of the network that are relevant for representing a summary
The “semantic graph” representation is used for the summarization task (DUC Challenge)
The main research result is the finding that the topology of the extracted semantic graph helps in determining the importance of the content triples (which Subject-Predicate-Object triples are relevant)
…joint collaboration with Microsoft Research, Cambridge

33 Approach Description
Approach:
Learn a machine learning model for selecting sentences
Use information about the semantic structure of the document (concepts and relations among concepts)
Results are promising
achieved 70% recall and 25% precision on extracted Subject-Predicate-Object triples on DUC (Document Understanding Conference) data

34 Summarization
[Diagram relating the original document, the human-built document summary, their semantic networks and the automatically built document summary]
Original document:
Cracks Appear in U.N. Trade Embargo Against Iraq. Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation.
The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. 
Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. 
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. 
Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''
Summary text shown in the diagram (it appears twice, under the human-built summary and under the automatically built summary):
Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.
Diagram labels:
Manual summarization
Creation of semantic net of Subj-Pred-Obj triples (for the original document and for the summary)
Mapping between graphs learned with ML methods
Automatic summarization by selecting relevant triples: 70% recall, 40% precision of selected triples according to human-generated summaries
Nat. Lang. Generation of the automatically built document summary (not done by us)

35 Detailed Summarization Procedure
Linguistic analysis of the text
- Deep parsing of sentences
Refinement of the text parse
- Named-entity consolidation: determine that ’George Bush’ = ‘Bush’ = ‘U.S. president’
- Anaphora resolution: link pronouns with named entities
Extract Subject–Predicate–Object triples
Compose a graph from triples
Describe each triple with a set of features for learning
Learn a model to classify triples into the summary
Generate a summary graph
Use summary graph to generate textual document summary
Example:
Tom Sawyer went to town. He met a friend. Tom was happy. …
→ Tom Sawyer went to town. He [Tom Sawyer] met a friend. Tom [Tom Sawyer] was happy. …
→ Tom → go → town; Tom → meet → friend; Tom → is → happy

36 Named entities consolidation
Consolidating different surface forms that refer to the same entity; applied only to names of people, places, companies, etc.
Example: Hillary Rodham Clinton, Hillary Clinton, Hillary Rodham, Mrs. Clinton → Hillary Clinton
Heuristic based on the overlap in the surface forms of name variants
Accuracy on a subset of the data set: ~90%
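A surface-overlap heuristic of this kind can be sketched as greedy grouping on shared name tokens. This is an illustrative assumption, not the authors' exact heuristic: the title list is made up, and the sketch picks the longest surface form as the canonical name, whereas the slide's example uses "Hillary Clinton".

```python
TITLES = {"mr.", "mrs.", "ms.", "dr."}  # hypothetical list of ignored titles


def tokens(name):
    """Lower-cased name tokens without titles."""
    return {t for t in name.lower().split() if t not in TITLES}


def consolidate(names):
    """Greedy grouping: a name joins the first group it shares a token with."""
    groups = []  # list of (token set, surface forms)
    for name in names:
        toks = tokens(name)
        for group_tokens, members in groups:
            if toks & group_tokens:
                group_tokens |= toks  # grow the group's token set in place
                members.append(name)
                break
        else:
            groups.append((set(toks), [name]))
    # Map every surface form to the longest form in its group.
    return {m: max(members, key=len) for _, members in groups for m in members}


canonical = consolidate([
    "Hillary Rodham Clinton", "Hillary Clinton", "Hillary Rodham",
    "Mrs. Clinton", "Al Gore",
])
print(canonical)
```

Pure token overlap would also merge "Bill Clinton" into the same group, which is one reason such a heuristic stays below perfect accuracy.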

37 Pronominal anaphora resolution
Link pronouns with their referents
Mary likes Paul. She went to buy him a present. → Mary likes Paul. She [Mary] went to buy him [Paul] a present.
Method:
restrict to 5 pronouns: she, he, who, I, they
from the pronoun, traverse the text searching for candidate referents and assign each a score
the score is based on the distance from the pronoun and on semantic information
assume that pronouns refer only to named entities found in the document
Problem: One passenger in King's car said they had been drinking liquor.
Average accuracy on 1,500 hand-labeled pronouns: 81.2%
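The scoring step can be sketched as follows, assuming that the "semantic information" is just gender agreement and that closer candidates score higher. The scoring formula, the gender table (which includes "him" and "her" only so that the slide's example can be run) and the token positions are illustrative assumptions.

```python
PRONOUN_GENDER = {"she": "f", "her": "f", "he": "m", "him": "m"}  # assumed table


def resolve(pronoun, pronoun_pos, entities):
    """Pick the best-scoring named entity that precedes the pronoun.

    entities: list of (name, token position, gender) tuples.
    """
    wanted = PRONOUN_GENDER.get(pronoun.lower())  # None means "any gender"

    def score(entity):
        _, pos, gender = entity
        agreement = 1.0 if wanted in (None, gender) else 0.0
        return agreement + 1.0 / (1 + pronoun_pos - pos)  # closer is better

    candidates = [e for e in entities if e[1] < pronoun_pos]
    return max(candidates, key=score)[0] if candidates else None


# "Mary likes Paul. She went to buy him a present."
# Token positions: Mary=0, Paul=2, She=3, him=7.
entities = [("Mary", 0, "f"), ("Paul", 2, "m")]
she = resolve("She", 3, entities)
him = resolve("him", 7, entities)
print(she, him)
```

The problem sentence on the slide shows the limit of this scheme: "they" has no gender constraint and its true referent is not a single named entity, so distance alone decides.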

38 Anaphora resolution evaluation
Pronoun      Frequency   Frequency [%]   Accuracy [%]
He           681         45.22           86.9
They         244         16.20           67.2
It           204         13.55
I            64          4.25            82.8
You          50          3.32
We           44          2.92
That, What   27          1.79
She          24          1.59            62.5
This         22          1.46
Who          11          0.73            63.6
Total        1506        100             81.2
Accuracy on the 5 selected pronouns: 81.2% (55.2% if counting all pronouns)

39 Extracting triples
The enhanced parse tree is traversed to identify Subject–Predicate–Object triples
Example: “Conservatives embraced the nomination while liberals were cautious or hostile”
Resulting triples:
conservative → embrace → nomination
liberal → is → cautious
liberal → is → hostile


41 Training of summarization model
The model ranks Subject-Predicate-Object triples according to their importance
[Diagram: document semantic network and summary semantic network]

42 Composing a graph
The graph consists of nodes, referred to as concepts, which can be subjects or objects, and edges, which are predicates and capture relations among concepts.
Use WordNet to identify and merge synonym nodes, as they correspond to the same concept.
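A minimal sketch of composing the graph from triples. The WordNet lookup is replaced by a hypothetical hand-written `synonyms` dictionary, and the fourth triple is invented to show a synonym merge; everything else follows the Tom Sawyer example from the procedure slide.

```python
def build_graph(triples, synonyms=None):
    """Adjacency list: node -> list of (predicate, object) edges."""
    synonyms = synonyms or {}
    graph = {}
    for subj, pred, obj in triples:
        # Map synonym nodes onto one canonical concept.
        s, o = synonyms.get(subj, subj), synonyms.get(obj, obj)
        graph.setdefault(s, []).append((pred, o))
        graph.setdefault(o, [])  # objects are nodes even without out-edges
    return graph


triples = [
    ("Tom", "go", "town"),
    ("Tom", "meet", "friend"),
    ("Tom", "is", "happy"),
    ("pal", "visit", "town"),  # invented triple to show synonym merging
]
graph = build_graph(triples, synonyms={"pal": "friend"})
print(graph)
```

After merging, "pal" and "friend" are a single node, so the edge "visit" starts from "friend" and the graph has four nodes instead of five.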

43 Feature construction
Each triple used in the learning process is described by the following attributes:
Positional information
- of the sentence from which the triple was derived, relative to the document text
- of the triple, relative to the beginning of the sentence
Linguistic attributes of the nodes in the triple (NLP):
- 18 syntactic attributes
- 100 semantic attributes
14 graph attributes: PageRank, in/out degree, reachable neighbours, etc.
On the dataset this yields a total of 466 attributes, with on average 72 non-zero attributes per triple.
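Two of the graph attributes named above, in/out degree and PageRank, can be computed with a few lines of plain Python. This is a generic power-iteration sketch under common default assumptions (damping factor 0.85, rank of dangling nodes spread uniformly), not the authors' implementation; the edge list is a made-up example.

```python
def degrees(edges):
    """In-degree and out-degree of every node in a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg


def pagerank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration on a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes)
                    for n in nodes}
        for n in nodes:
            for dst in out_links[n]:
                new_rank[dst] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


edges = [("Tom", "town"), ("Tom", "friend"), ("friend", "town")]
in_deg, out_deg = degrees(edges)
rank = pagerank(edges)
print(in_deg, out_deg, rank)
```

Values such as these, computed for the subject and object node of a triple, would then become entries in that triple's feature vector.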

44 Experiments
Machine learning with a linear SVM to classify triples as relevant or not relevant for the summary
Positive examples are triples from the sentences that experts marked as summary sentences
Negative examples are all other triples
Data: 147 documents from DUC 2002 for which we had extracted summaries
Evaluation: report micro-averaged precision, recall and F1 for the extracted triples, using 10-fold cross-validation
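Micro-averaging pools the true positive, false positive and false negative counts over all folds before computing the measures, instead of averaging per-fold scores. A short sketch, with made-up counts for two folds (the function name `micro_prf` is an assumption):

```python
def micro_prf(fold_counts):
    """fold_counts: list of (true positives, false positives, false negatives)."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical counts of selected triples for two cross-validation folds.
precision, recall, f1 = micro_prf([(30, 70, 20), (20, 30, 30)])
print(precision, recall, f1)
```

With these counts the pooled totals are 50 true positives, 100 false positives and 50 false negatives, giving precision 1/3, recall 1/2 and F1 = 0.4.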

45 Performance for various attribute sets
                                    Training set                 Test set
Attribute set                       Precision  Recall  F1        Precision  Recall  F1
Sentence Position + Terms           65.87      92.48   76.94     28.87      37.08   32.46
only Position (triple + sentence)   31.21      52.49   39.15     31.05      52.58   39.05
only Graph                          27.78      57.46   37.46     27.25      56.90   36.85
only Linguistic                     29.77      61.79   40.18     22.29      47.52   30.29
Position + Linguistic               31.16      67.00   42.54     28.67      62.57   39.33
Position + Graph                    33.51      63.85   43.95     32.71      63.02   43.07
Position + Graph + Linguistic       35.82      72.69   47.99     31.41      64.88   42.33

46 Performance for various attribute sets
Baseline performance (sentence position + selected terms from the sentence), F1=32.46, is lower than in any of the other runs except for ‘only linguistic’ attributes (F1=30.29).
The ‘only linguistic’ run includes only generic syntactic and semantic labels, which are not expected to be good discriminators on their own.
(Table as on slide 45.)

47 Performance for various attribute sets
Adding generic linguistic attributes reduces precision
Position of triples and sentences → P=31.05
Adding linguistic attributes → P=28.67
but consistently increases recall
(Table as on slide 45.)

48 Performance for various attribute sets
Information about the graph structure helps
Position of triples and sentences → F1=39.05
Adding structure information → F1=43.07
(Table as on slide 45.)
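As a consistency check on the reported numbers, F1 is the harmonic mean of precision and recall. Using the Position + Graph test-set values P = 32.71 (the precision as listed on slide 47) and R = 63.02, the reported F1 = 43.07 is reproduced:

```latex
F_1 = \frac{2PR}{P + R}, \qquad
F_1^{\text{Position+Graph}} = \frac{2 \cdot 32.71 \cdot 63.02}{32.71 + 63.02}
  = \frac{4122.77}{95.73} \approx 43.07
```

The same formula applied to the only-Position row (P = 31.05, R = 52.58) gives approximately 39.05, matching the other value quoted on this slide.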

49 Insights
We determine the median and quartiles of the attribute ranks across 10 runs.
Most highly ranked features in the SVM normal (attribute, followed by 1st quartile, median and 3rd quartile of its rank):
Object – Authority weight 1 2
Object – size of weakly connected component 2.5 3
Object – degree of a node
Object – is name of a country 4 5
Subject – size of weakly connected component 6 7 9
Subject – degree of a node 10.5 12
Object – PageRank weight 11
Object – is name of a geographical location 8 13 16
Subject – Authority weight 18.5 23

50 Example of summarizationCracks Appear in U.N. Trade Embargo Against Iraq. Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. 
The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. 
Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. 
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. 
Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''

Human written summary
Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

7800 chars, 1300 words

51 Full document semantic graph

52 Automatically generated summary graph

53 Findings on summarization with semantic graphs
Experiments show that attributes that characterize the document semantic graph improve the selection of triples for summarization.
These results need to be verified on additional data sets.
A comparison with additional summarization methods is needed.
Various strategies for extracting and generating summaries from the extracted triples remain to be explored.
No combination of the examined features led to a good separation of positive and negative triples in the feature space, which leaves room for further investigation and improvement.

54 Contents: Knowledge Discovery; Large Scale Topic Ontology Population; Extraction of Semantic Networks from Text; Active Learning for Efficient Use of Human Interventions; Methods Addressing Different Aspects of Ontology Construction; Final Remarks

55 Active Learning / Dealing with unlabeled data

56 The idea of Active Learning
The idea of Active Learning is that a student who asks smart questions arrives at the required model of knowledge faster than one who asks random questions.
The goal is to use Active Learning algorithms for the semi-automatic construction of models for labeling data and for ontology learning.

57 Quick Intro to Active Learning
We use these methods whenever hand-labeled data are rare or expensive to obtain.
Interactive method: requests labeling only of “interesting” objects.
Much less human work is needed for the same result than when labeling arbitrary examples.
[Figure: a passive student receives data and labels from the teacher; an active student sends a query to the teacher and receives a label. Plot of performance against number of questions, comparing an active student asking smart questions with a passive student asking random questions.]

58 Algorithms tested
Uncertainty sampling (efficient): select the example closest to the decision hyperplane, or the one with classification probability closest to P=0.5 (Tong & Koller 2000, Stanford).
Maximum margin ratio change: select the example with the largest predicted impact on the margin size if selected (Tong & Koller 2000, Stanford).
Monte Carlo estimation of error reduction: select the example that reinforces our current beliefs (Roy & McCallum 2001, CMU).
Random sampling as baseline.
Experimental evaluation (using the F1-measure) of the four listed approaches, shown on three categories from the Reuters-2000 dataset:
average over 10 random samples of 5000 training (out of 500k) and 10k testing (out of 300k) examples;
the last two methods are rather time-consuming, so we ran them only for the first 50 unlabeled examples included;
the experiments show that active learning is especially useful for unbalanced data.
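The uncertainty sampling rule listed on this slide can be sketched in a few lines. This is only an illustration, assuming a linear model given by a weight vector and a bias; the names `decision_value` and `most_uncertain` and the toy numbers are hypothetical and are not taken from the experiments.

```python
def decision_value(weights, bias, x):
    """Signed score of x under a linear model (assumed form: w.x + b).

    Its absolute value is proportional to the distance of x from the
    decision hyperplane, so it can be used to rank examples.
    """
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def most_uncertain(weights, bias, unlabeled):
    """Return the index of the unlabeled example closest to the hyperplane."""
    return min(
        range(len(unlabeled)),
        key=lambda i: abs(decision_value(weights, bias, unlabeled[i])),
    )


# Toy model and pool (illustrative values only).
weights, bias = [1.0, -1.0], 0.0
pool = [[3.0, 0.5], [1.0, 0.9], [0.2, 2.5]]
print(most_uncertain(weights, bias, pool))  # prints 1
```

The second example is selected because its score (about 0.1) is the smallest in magnitude, so the current model is least sure about its class.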

59 Category with balanced class distribution (47% positive examples)
Limited advantage over random sampling.

60 Category with fairly unbalanced class distribution (20% positive examples)
Best performance with Uncertainty and MarginRatio; Uncertainty is simpler and much more efficient.

61 Category with very unbalanced class distribution (2.7% positive examples)
Uncertainty seems to outperform MarginRatio.

62 Illustration of Active Learning
Starting with one labeled example from each class (red and blue):
select one example for labeling (green circle);
request its label and add it to the labeled data;
re-generate the model using the extended labeled data.
Illustration of a linear SVM model using:
arbitrary selection of unlabeled examples (random);
active learning, selecting the most uncertain examples (closest to the decision hyperplane).
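The loop on this slide (select, request label, add, re-generate the model) can be sketched as follows. This is a simplified illustration and not the setup used in the slides: a nearest-centroid classifier stands in for the linear SVM, and `oracle` is a hypothetical function playing the role of the human teacher.

```python
def train_centroids(labeled):
    """Fit a nearest-centroid model: one mean vector per class (+1 and -1)."""
    centroids = {}
    for label in (+1, -1):
        points = [x for x, y in labeled if y == label]
        dim = len(points[0])
        centroids[label] = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return centroids


def score(centroids, x):
    """Positive when x is nearer the +1 centroid; zero on the decision boundary."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sq_dist(x, centroids[-1]) - sq_dist(x, centroids[+1])


def active_learning(seed, pool, oracle, n_queries):
    """Repeatedly query the most uncertain pool example and retrain."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(n_queries, len(pool))):
        model = train_centroids(labeled)                       # re-generate the model
        query = min(pool, key=lambda x: abs(score(model, x)))  # most uncertain example
        pool.remove(query)
        labeled.append((query, oracle(query)))                 # request label and add
    return train_centroids(labeled), labeled


# One labeled example per class to start with, as on the slide.
seed = [([0.0], -1), ([10.0], +1)]
pool = [[1.0], [4.5], [6.0], [9.0]]
oracle = lambda x: +1 if x[0] > 5.5 else -1   # hypothetical teacher
model, labeled = active_learning(seed, pool, oracle, n_queries=2)
print(labeled[2:])  # the two queried examples with the labels the teacher gave
```

In this toy run the two queries fall next to the true class boundary, which is the behaviour the illustration contrasts with random selection.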

63 Uncertainty sampling of unlabeled example


65 Contents: Knowledge Discovery; Large Scale Topic Ontology Population; Extraction of Semantic Networks from Text; Active Learning for Efficient Use of Human Interventions; Methods Addressing Different Aspects of Ontology Construction; Final Remarks

66 Methods Addressing Different Aspects of Ontology Construction

67 Methods addressing different aspects of ontology construction
Collecting data: focused crawling with Google and DMoz in the loop.
Dealing with different natural languages: map the documents into a language-independent semantic space.
Going directly from the data: semi-automatic creation of an ontology directly from the data under predefined conditions/scenarios.
Annotation of text.

68 Focused Crawler
A focused crawler that finds, in a relatively short time, web pages related to a given web page.
The solution uses the DMoz topic ontology to get content context, and Google to get web linkage context.
…the main idea is to browse the web graph as a bidirectional graph, using the “link:” query in Google.
Algorithm:
To obtain an initial set of candidate pages efficiently, we use Google and DMoz.
From the initial set, pages are crawled in breadth-first fashion.
…priority in the crawler queue is given to more similar pages.
…after some stopping condition is met, the crawler returns the list of candidate web pages.
Usage: serves as a technique for collecting data for the next stages of data processing, such as building and populating ontologies for the Semantic Web, or improved knowledge access.
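The crawling strategy described on this slide can be sketched on a toy in-memory web. This is an illustration under stated assumptions, not the JSI crawler: the dictionary `web` replaces real fetching, back-links computed from it replace the Google “link:” query, Jaccard word overlap replaces the DMoz-based content similarity, and a page limit serves as the stopping condition. All names and data are hypothetical.

```python
import heapq


def jaccard(a, b):
    """Word-overlap similarity between two texts (stand-in for content similarity)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def focused_crawl(web, seed, max_pages):
    """web maps url -> (text, out_links). Returns [(url, similarity)] in visit order."""
    # Back-links make the graph bidirectional (the role of the "link:" query).
    back = {url: [] for url in web}
    for url, (_, links) in web.items():
        for target in links:
            back.setdefault(target, []).append(url)

    seed_text = web[seed][0]
    queue = [(-1.0, seed)]          # max-priority queue via negated similarity
    seen, result = {seed}, []
    while queue and len(result) < max_pages:   # stopping condition: page budget
        neg_sim, url = heapq.heappop(queue)
        result.append((url, -neg_sim))
        for nxt in web[url][1] + back.get(url, []):
            if nxt in web and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(queue, (-jaccard(seed_text, web[nxt][0]), nxt))
    return result


web = {
    "telco-a": ("telecom broadband phone", ["telco-b", "news"]),
    "telco-b": ("telecom broadband mobile", []),
    "news": ("sports weather news", ["recipes"]),
    "recipes": ("cooking recipes", []),
    "isp": ("broadband internet phone", ["telco-a"]),
}
print([url for url, _ in focused_crawl(web, "telco-a", max_pages=3)])
# prints ['telco-a', 'isp', 'telco-b']
```

Note that "isp" is reached only through a back-link, and the unrelated "news" page stays at the end of the queue because more similar pages are visited first.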

69 Example Focused Crawl
Focused crawl for the BT home page (http://www.bt.com):
1. BT
2. UK's local search engine
3. AT&T: The World's Networking Company
4. Cisco Systems, Inc
5. Microsoft Corporation
6. BBC
7. HP United States
8. Broadband cable internet access
9. Deutsche Telekom
EPSRC Cable & Wireless Royal Mail Ericsson BP Global Telewest Broadband PLC Verizon Nokia BT.com At Home IBM United States SBC Communications Inc. France Telecom MCI Home Siemens AG Motorola Vodafone UK

70 Language-independent document representation
From aligned corpora we learn mappings of documents into a “language-independent representation” using the “Kernel Canonical Correlation Analysis” (KCCA) method.
…such a representation could be used for multilingual classification, multilingual IR, …
Ongoing work: learning mappings between all European languages using the CELEX corpus of European legislation in 21 languages.

71 Two views of the same data – find the direction with maximal correlation

72 Correlation = 0.17 View 1 View 2

73 Correlation = 0.44 View 1 View 2

74 Correlation = 0.97 View 1 View 2
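The search for "the direction with maximal correlation" in slides 71 to 74 is commonly written as the following objective. This is the textbook form of linear CCA, given here as a reminder rather than as the exact formulation used in the slides; the symbols $w_x$, $w_y$ (projection directions for the two views) and $C_{xx}$, $C_{yy}$, $C_{xy}$ (within-view and cross-view covariance matrices) are notation introduced for this note. KCCA applies the same idea after mapping each view through a kernel.

```latex
\rho \;=\; \max_{w_x,\, w_y}\;
\frac{w_x^{\top} C_{xy}\, w_y}
     {\sqrt{w_x^{\top} C_{xx}\, w_x}\;\sqrt{w_y^{\top} C_{yy}\, w_y}}
```

The values 0.17, 0.44 and 0.97 shown on the slides correspond to $\rho$ for three different choices of projection directions.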

75 Correlated directions found with KCCA when applied to financial news articles
ZENTRALBANK BP MILLIARDE DOLLAR / BANK BP CENTRAL DOLLAR
VERLUST EINKOMMEN FIRMA VIERTEL / LOSS INCOME COMPANY QUARTER
ZAHLUNG VOLLE GEWERKSCHAFT VERHANDLUNGSRUNDE / WAGE PAYMENT NEGOTIATIONS UNION
GESCHICHTEN MILLION SAGT BORSEN / STORIES MILLION SAYS EXCHANGES

76 Modelling directly from the data – getting semantic classes with LSI
CELLS GENE CANCER GENOMIC MOLECULAR SERVICES GRID USER MOBILIZATION CONTENT CELLS STEM_CELLS STEM VACCINES WEB CONTENT MEDIA GRID MULTIMEDIA DIGITAL ENERGY OPTICS WASTE FUEL NUCLEAR SECURITY ROBOT EMBEDDED BIOMETRICS VECTOR WEB WEB_SERVICES SEMANTIC CONTENT MEDIA ROBOT LEARNING COGNITIVE HUMAN INTERACTIVE
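For readers unfamiliar with LSI, its standard formulation is a truncated singular value decomposition of the term-document matrix. The notation below is introduced for this note and is not from the slides: $X$ is the term-document matrix, $k$ the number of latent dimensions kept, $U_k$ and $V_k$ the matrices of the leading left and right singular vectors, and $\Sigma_k$ the diagonal matrix of the $k$ largest singular values.

```latex
X \;\approx\; U_k\, \Sigma_k\, V_k^{\top}
```

Each column of $U_k$ assigns a weight to every term, and word lists like those on this slide are typically read off as the highest-weighted terms of each latent dimension.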

77 Visualization of 6FP IST project (English)

78 Modeling relationships between companies from the news

79 Annotation of text
Annotation based on examples
Annotation using clustering
Annotation based on thesaurus

80 Annotate text based on examples
Problem: annotation of text by assigning predefined labels to text fragments.
Given: examples of annotated text fragments.
Learn annotation rules from already annotated documents (.xml, ...), similar to learning IE.
Learn to classify sentences into semantic roles.

81 Annotate text using clustering
Problem: annotation of text by finding labels and assigning them to text fragments.
Given: text to annotate.
Split documents into sentences and represent each sentence as a word vector.
Cluster the sentences and label each cluster by the most characteristic words from its sentences,
e.g., using local frequency of words, or clustering with SOM and using the neural network weights of words.
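The pipeline on this slide (split into sentences, build word vectors, cluster, label by characteristic words) can be sketched as below. This is a minimal illustration under assumptions: a single-pass cosine-threshold clustering replaces the SOM mentioned on the slide, labels are simply the most frequent words within each cluster (the "local frequency" option), and all function names and the sample text are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vector(sentence):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"sum": summed word counts, "members": sentence indices}
    for i, v in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(v, c["sum"]), default=None)
        if best is not None and cosine(v, best["sum"]) >= threshold:
            best["sum"].update(v)
            best["members"].append(i)
        else:
            clusters.append({"sum": Counter(v), "members": [i]})
    return clusters


def label(c, top=2):
    """Most frequent words in the cluster; ties broken alphabetically."""
    ranked = sorted(c["sum"].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top]]


text = "Iraq sells oil. Iran buys oil from Iraq. Japan sends aid. Japan promises more aid."
clusters = cluster([vector(s) for s in sentences(text)], threshold=0.3)
for c in clusters:
    print(c["members"], label(c))
# prints:
# [0, 1] ['iraq', 'oil']
# [2, 3] ['aid', 'japan']
```

On real text a stop-word list and a better weighting (for example TF-IDF) would be needed before the most frequent words become useful labels.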

82 Annotate text based on thesaurus
Problem: annotation of text by finding labels and assigning them to text fragments.
Given: text to annotate, thesaurus.
a) Apply NLP to the text to find noun groups and map them onto concepts of a (medical) thesaurus.
b) Split the document into sentences, cluster them, and map the clusters onto concepts of a general thesaurus (WordNet).
The concepts are used as semantic labels (XML tags) for annotating documents.

83 Ontology evaluation directions
Analysis of information-theoretic properties of structured data instances.
Measure of agreement with the characteristics derived from manually built ontologies.
Optimization of the efficiency of the user's behaviour when using an ontology (e.g., minimizing the number of user clicks).

84 Contents: Knowledge Discovery; Large Scale Topic Ontology Population; Extraction of Semantic Networks from Text; Active Learning for Efficient Use of Human Interventions; Methods Addressing Different Aspects of Ontology Construction; Final Remarks

85 Ontology Learning Challenge
Academic challenge on DMoz data (Science part) for 3 tasks:
Taxonomy Population: given a taxonomy with documents, the task is to classify new documents into taxonomic categories.
Naming Categories: given taxonomic categories with documents, the task is to (semi)automatically propose names for the categories.
Constructing Taxonomy from Documents: given a set of documents, the task is to (semi)automatically propose a taxonomic structure.
The goal is to model human skills when dealing with large amounts of data.
Data:
DMoz/Science (10k concepts, 100k instances)
Tourist ontology (from KU) (70 concepts, ~1000 instances)
The challenge will be funded through the “PASCAL Network of Excellence” European project (http://www.pascal-network.org/).

86 Ideas / Future plans (1): DMoz categories as standard web meta-data dictionary
…the idea is to use DMoz categories/keywords as a standardized dictionary for meta-data labeling of general Web pages
…because of the dynamic and adaptive nature of DMoz categorization (reflecting all major topics on the web), this could be interesting as a baseline for “semantic web” style annotation
…e.g., it could be deployed as a tool for (semi)automatic generation of tags for web pages

87 Ideas / Future plans (2): DMoz classifier as an annotation tool
…the idea is to use the DMoz-classifier tool for meta-data (keyword) generation
…some other popular databases (e.g. Wikipedia) could have automatically generated DMoz categories attached
…could be accessible as a web service (e.g. SOAP interface)

88 Ideas / Future plans (3): DMoz Visualizer
…the idea is to create a tool for visualization and browsing through the DMoz structure
…browsing tools could combine other public and commercial sources (such as Wikipedia, Google, Amazon, eBay, …)
…could appear as, e.g., a web-browser toolbar

89 Ideas / Future plans (4): Analysis of DMoz Dynamics
The future research plan is to model the dynamics of the DMoz taxonomy based on data from the DMoz Archive (http://rdf.dmoz.org/rdf/archive/)
…the idea is to model the decision process of when and how the editors decide to split category nodes
…currently the repository includes 120 snapshots of DMoz from the year 2000 on

90 Ideas / Future plans (5): Focused crawling for DMoz
…the idea is to use a focused crawler for proposing new web sites for particular categories (as an editorial tool)
…at JSI we developed a focused crawler for fast and efficient crawling of focused content; it can be further extended:
…to use Google and DMoz in the loop
…to use user hints (positive & negative examples of content pages)
…based on the Corpus-Builder project at CMU

91 Ideas / Future plans (6): Classification of non-English documents
…we use string kernels to avoid problems with morphology
…paper submitted to ECML/PKDD 2005 (Fortuna & Mladenic) on classification into major Slovenian and Croatian taxonomies
…we plan to use Canonical Correlation Analysis (CCA) for efficient identification of similar content written in different languages
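To illustrate why string kernels sidestep morphology, here is a small sketch of one common string kernel, the character n-gram (spectrum) kernel. The slide does not say which string kernel was used, so this choice is an assumption, and the function names are hypothetical. Two inflected forms of the same word share most of their character n-grams, so they stay similar without any stemming or lemmatization.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n):
    """Counts of all contiguous character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b, n=3):
    """Inner product of the n-gram count vectors of a and b."""
    ca, cb = ngrams(a, n), ngrams(b, n)
    return sum(count * cb[g] for g, count in ca.items())


def normalized_kernel(a, b, n=3):
    """Spectrum kernel scaled to [0, 1] (cosine of the n-gram vectors)."""
    denom = sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0


# "knjiga" and "knjige" are two forms of the Slovenian word for "book";
# "miza" means "table".
print(normalized_kernel("knjiga", "knjige"))  # prints 0.75
print(normalized_kernel("knjiga", "miza"))    # prints 0.0
```

A kernel of this kind can be plugged into any kernel-based classifier, such as an SVM, in place of a bag-of-words inner product.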

92 Text-Garden software library (in development over the last 5 years)

93 Text-Garden data Set of C++ classes for “industrial strength” text mining problem solving
Currently organized in ~50 command line utilities covering:
Machine learning/Data mining on text
Web related functionality
Profiling, Visualization, …
Currently works on Windows, to be ported to Linux

94 Text Garden – Architecture of clustering, visualization, classification

95 Text Garden Web site www.textmining.net