Generating Preview Tables for Entity Graphs

1 Generating Preview Tables for Entity GraphsNing Yan, Ab...
Author: Ernest Nash
0 downloads 3 Views

1 Generating Preview Tables for Entity GraphsNing Yan, Abolfazl Asudeh, Chengkai Li Department of Computer Science and Engineering The University of Texas at Arlington

2 outline Background Definitions of schema graph, preview tables and preview preview scoring Algorithms Evaluation

3 Background unprecedented proliferation of big, complex entity graphs that capture entities and their relationships in many domains such as YAGO, DBpedia, Freebase… Challenging to select entity graphs for a particular need while sources often provide dataset descriptions ,cannot get a direct look at an entity graphs before fetching it. The authors propose methods to automatically produce preview tables for compact presentation of important entity types and rough preview of the data to assist users attaining a quick and rough view of the data.

4 A simple example

5 Definitions Fig. 2 is a possible preview of the entity graph in Fig. 1. It consists of two preview tables—the upper table has attributes FILM, Director and Genres, and the lower table has attributes FILM ACTOR and Award Winners. In this preview, entities of types FILM and FILM ACTOR are deemed of central importance in the entity graph. Hence, FILM and FILM ACTOR are the key attributes of the two tables, respectively, marked by the underlines beneath them. Attributes Director and Genres in the upper table are considered highly related to FILM entities. Similarly, Award Winners in the lower table is highly related to FILM ACTOR entities. The two tables contain 4 and 2 tuples, respectively. For instance, the first tuple of the upper table is t1 = Men in Black, Barry Sonnenfeld, {Action Film, Science Fiction}. The tuple indicates that entity Men in Black belongs to type FILM and has a relationship Director from Barry Sonnenfeld and has relationship Genres to both Action Film and Science Fiction.

6 Problem statement

7 Preview scoring 这个定义给与了key attribute很高的重要性,是因为key attribute处于一个preview table的核心。按照preview table的定义,non-key attribute描述的是key-attribute与它们的关系

8 Attribute scoring Key Attribute Scoring Non-Key Attribute ScoringCoverage-based Random-walk based Non-Key Attribute Scoring Coverage-based Entropy-based 有了 score 函数后,我们就可以根据score来选择合适的key attribute 和non-key attribute 加入preview table,下面介绍一下三种选择attiribute的算法

9 Algorithms- definitions for𝑃 𝑜𝑝𝑡 Concise preview Concise, tight preview Concise, diverse preview The distance between two preview tables T1 and T2 (denoted ) is the length of the shortest undirected path between their key attributes and in schema graph Gs. 𝑑𝑖𝑠𝑡( 𝑇 1 , 𝑇 2 𝑇 1 .𝑘𝑒𝑦 𝑇 2 .𝑘𝑒𝑦 在介绍算法之前,因为preview的 选择有很多,空间有限,我们只能选择最适合,也就是最好的给用户看,因此, 文章定义了三种最好的preview

10 A Brute-Force algorithmApplicable for concise, tight, diverse previews

11 A dynamic-programming algorithmApplicable only for concise preview

12 An Apriori-style algorithmTwo steps: Step1 (k-cliques problem): finding k-subsets of entity types (i.e., vertices in Gs) satisfying the distance constraint (Lines 1–14); Step2: for each qualifying k-subset of entity types, forming a preview under the size constraint, computing its score and choosing a preview with the highest score (Lines 15– 20). Step1是在一个图中找到k-个集群的问题,有很多成熟的算法都可以做到,这批文章采用的是比较著名的用在频繁项目集挖掘的先验算法,

13 Evaluation Entity graph: Freebase, 8 domainsall numeric attribute values from the data dump have been removed.

14 Accuracy of preview scoring measuresExperiment 1: ----for key attributes gold standard: 6 entity types listed in the Freebase entrance pages of a domain Measure: percentage of its top-K results that belong to the aforementioned gold standard top-6 key attributes Results: both the coverage-based and the random-walk based scoring measures had close to 0.6 in 4 out of the 5 domains.

15 Accuracy of preview scoring measuresExperiment 2: ---for non-key attributes Gold standard: the original 3 attributes for that entity type. Measure: Mean Reciprocal Rank(MRR)---the reciprocal rank is the multiplicative inverse of the rank of the first gold standard non-key attribute among its ranking results. MRR is the average reciprocal rank across all entity types with at least 5 candidate non-key attributes Results: In every domain except “film” and for both the coverage-based and the entropy-based measures, MRR is above 0.5. This means in average a gold standard non-key attribute appeared in the top-2 ranked results.

16 Accuracy of preview scoring measuresExperiment 3 : ---user study Target: measured the correlation between our scoring measures and users’ opinions with regard to key and non-key attributes ranking Measure : Pearson Correlation Coefficient(PCC) Results: at least a medium([0.3, 0.5]) positive correlation between our scoring measures and users.

17 Efficiency Results: both the Dynamic-Programming and the Apriori-style algorithms outperformed the Brute-Force algorithm by orders of magnitude in most cases

18 Sample Optimal Previews

19 thanks