Search as Research sunfeier.

Author: Edith Gilmore

1 Search as Research sunfeier

2 Abstract The amount of data generated on the Web is increasing day by day, but it is not always accessible and usable: developers have to go through numerous steps, from obtaining the data to building applications on top of it. The mission of SpazioDati is to eliminate these steps from the development process: it pulls data from disparate sources together into one big knowledge graph, Dandelion, and provides a set of APIs over it.

3 Problem now W3C’s Linked Data activity promotes interoperability and facilitates machine access to data produced by different providers. A growing pressure on Governments and Public Administrations to publish their data on the Web and to promote its reuse has generated an increasing amount of valuable open information. Despite the intrinsic value of this information, its usability is hampered by the proliferation of several formats, which often lack the explicit semantics necessary to facilitate its programmatic reuse.

4 Problem now Private and corporate information producers have emerged as key data producers: they sell or freely offer data for third-party consumption via REST APIs. Yet, each API has its own interface, data semantics and terms of use, which complicates the reuse of information.

5 Problem now A first step to initiate a growth-enhancing market process is the emergence of a cluster of data brokers performing a twofold role: they need to guarantee simple and coherent access to information, and they need to certify its quality for reuse within the enterprise. These valuable “Private Linked Data Clouds”, often complemented with tools that facilitate linkage of unstructured documents to the graph, are considered core assets and are not made accessible to third parties. We discuss existing related work.

6 Related work Existing products fall into two categories. 1. Those that focus on facilitating accessibility, making data quickly usable by means of rich APIs: Factual and DataMarket both provide access to the data through APIs, with useful visualisations built automatically on top of them; however, the data remains locked in tables. 2. Those that focus on the semantics of the data, allowing very powerful interactions and applications: DBpedia, Freebase and YAGO follow a graph approach and provide cross-domain knowledge by means of a knowledge graph; they allow horizontal applications and give developers the ability to access data that broadly covers many different parts of human knowledge. Products in the first category usually work on tabular data, e.g., Factual. In some cases they allow one to upload custom data and manipulate it, e.g., DataMarket, i.e., verticals; it requires additional effort to integrate this data with other sources.

7 Related work There are also technologies that try to merge the two worlds: the Linked Data Platform is a W3C candidate recommendation for a Linked Data architecture; Apache Marmotta is one of its open implementations, and it allows developers to build their own knowledge graph. But the whole process is complicated, and often not affordable for small companies.


9 1. Data Normalisation 1. data cleaning (OpenRefine): we apply various rules to transform data into standard representation formats. For example, to represent geographical information, we re-use standards developed by the Open Geospatial Consortium. Our Refine extension, geoXtension, enables conversion between various projections. Another Refine extension that SpazioDati is actively developing is the Named-Entity Recognition (NER) extension.
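As a concrete illustration of the kind of cleaning rule described above, the sketch below normalises a few common textual coordinate formats into an OGC-style WKT `POINT` literal. It is a hypothetical rule written for this document, not SpazioDati’s actual geoXtension code, and it omits reprojection between coordinate reference systems.

```python
import re

def normalize_point(value):
    """Normalise common textual coordinate formats to an OGC WKT POINT.

    Hypothetical cleaning rule in the spirit of the OpenRefine-based
    pipeline described above; real reprojection is omitted.
    """
    # Accept an existing "POINT(lon lat)" literal, case-insensitively...
    m = re.match(r"^\s*POINT\s*\(\s*(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s*\)\s*$",
                 value, re.IGNORECASE)
    if m:
        lon, lat = float(m.group(1)), float(m.group(2))
    else:
        # ...or a "lat, lon" pair, the other format we assume occurs often.
        parts = [p.strip() for p in value.split(",")]
        if len(parts) != 2:
            raise ValueError(f"unrecognised coordinate format: {value!r}")
        lat, lon = float(parts[0]), float(parts[1])
    return f"POINT({lon} {lat})"

print(normalize_point("46.07, 11.12"))        # "lat, lon" input
print(normalize_point("point(11.12 46.07)"))  # already WKT, mixed case
```

Both calls print the same canonical form, `POINT(11.12 46.07)`, which is the point of a normalisation step: downstream code sees one representation regardless of the source format.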

10 1. Data Normalisation 2. data harmonisation: the data is highly heterogeneous, so we rely on ontologies to resolve its semantic heterogeneity. To build ontologies we use Neologism. We implement a master ontology and local ontologies that formalise the original data sources. The purpose of the master ontology is to provide a common shared vocabulary for the original data sources.

11 1. Data Normalisation To add new data into our knowledge graph, we first formalise it in a local ontology. Second, we define mappings from the local to the master ontology. Currently, the most developed domain of the ontology is organisations. To build the vocabulary of organisations, we re-used existing ontologies and vocabularies, such as the W3C Organization Ontology and the Registered Organization Vocabulary.

12 1. Data Normalisation To overcome syntactic and structural data heterogeneity, we transform all data into RDF using the Refine extension for exporting RDF. RDF mappings implement a “table-to-class and column-to-predicate” approach: – one row corresponds to one entity of a class defined by the table; every entity in our knowledge graph is uniquely identified by a URI, called an acheneID; normally, we build acheneIDs out of unique identifiers that can be found in a data source; – columns in the table correspond to the properties of the entities; we map columns to properties of the master ontology.
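The table-to-class and column-to-predicate approach can be sketched as a small mapping function. Everything here is hypothetical, namespaces, column names and the acheneID scheme alike; the real master ontology and identifier scheme are internal to SpazioDati.

```python
# Hypothetical master-ontology namespace and acheneID base (placeholders).
MASTER = "http://example.org/master#"
ACHENE = "http://example.org/achene/"

# Column-to-predicate mapping for one imagined local source, a company
# registry table; the table itself maps to one class (table-to-class).
COLUMN_MAP = {"company_name": MASTER + "legalName",
              "vat": MASTER + "vatID"}
ROW_CLASS = MASTER + "Organisation"

def row_to_triples(row, id_column="registry_id"):
    """Map one table row to (subject, predicate, object) triples."""
    # acheneIDs are built from a unique identifier found in the source.
    subject = ACHENE + row[id_column]
    triples = [(subject, "rdf:type", ROW_CLASS)]
    for column, predicate in COLUMN_MAP.items():
        if row.get(column):
            triples.append((subject, predicate, row[column]))
    return triples

row = {"registry_id": "IT-0042", "company_name": "SpazioDati", "vat": "012345"}
for triple in row_to_triples(row):
    print(triple)
```

One row thus yields one typed entity plus one triple per mapped, non-empty column, exactly the shape a Refine RDF export produces for a tabular source.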

13 2. Entity Deduplication The next step in our data curation pipeline is entity deduplication. This is an essential step in obtaining a connected knowledge graph out of disparate independent data sources. After the entity mapping performed at the end of data normalisation, each source is transformed into a single, separate graph stored in Virtuoso. Before inserting entities into the knowledge graph, we need to make sure that two entities in two different sources that represent semantically the same object are merged into a single entity containing the data of both.

14 2. Entity Deduplication The Silk Framework is used to deduplicate entities. With Silk it is possible to quickly calculate a similarity measure on any two entities, given a set of matching rules defined by the user; such rules are also used to automatically create a blocking algorithm that reduces the number of comparisons needed to match two large data sources. Before importing new data from a source, the data itself is matched with Silk against the whole knowledge graph to find duplicated entities. As output, Silk produces a list of owl:sameAs links that are then used to merge entities when importing the new data.
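The shape of this step, blocking, then rule-based similarity, then owl:sameAs links, can be sketched without Silk itself. The matching rule (normalised-name string similarity) and the hand-written blocking key below are toy stand-ins; Silk derives its blocking automatically from the user’s rules.

```python
from difflib import SequenceMatcher

def blocking_key(entity):
    """Cheap blocking key: only entities sharing a normalised first
    letter are ever compared (a simplified stand-in for the blocking
    Silk generates from the matching rules)."""
    return entity["name"].lower().strip()[:1]

def similarity(a, b):
    """A toy matching rule: normalised-name string similarity."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def find_same_as(new_entities, graph_entities, threshold=0.9):
    """Return owl:sameAs links between new entities and the graph."""
    blocks = {}
    for e in graph_entities:
        blocks.setdefault(blocking_key(e), []).append(e)
    links = []
    for e in new_entities:
        for candidate in blocks.get(blocking_key(e), []):
            if similarity(e, candidate) >= threshold:
                links.append((e["id"], "owl:sameAs", candidate["id"]))
    return links

graph = [{"id": "achene:1", "name": "SpazioDati S.r.l."}]
incoming = [{"id": "src2:42", "name": "Spaziodati S.R.L."}]
print(find_same_as(incoming, graph))
```

The two spellings differ only in case, so the link `('src2:42', 'owl:sameAs', 'achene:1')` is emitted; at import time such links tell the loader to merge the two entities rather than create a duplicate.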

15 3. Data Storage The knowledge graph is stored in a graph database, Titan, which runs on top of a Cassandra cluster; it allows storing key-value maps on both vertices and edges, and submitting queries in Gremlin for fast traversals. Entities are not simply stored as a single vertex, because it is important to keep track of the provenance of each piece of information stored on them. Four different kinds of vertices can therefore be found in the knowledge graph.

16 3. Data Storage [figure: knowledge graph structure, showing achene (square), bristle (circular), provenance (triangular) and type (hexagonal) nodes]

17 3. Data Storage
– achene nodes (square): they represent an entity; they store no information, with the only exception of the PURL of the entity they represent;
– bristle nodes (circular): they are connected to achene nodes and store the actual data associated with an entity; an achene node may have multiple bristle nodes, one for each source from which it was imported;
– provenance nodes (triangular): they are connected to bristle nodes to represent the source that provided the information stored on each bristle;
– type nodes (hexagonal): they represent our entity taxonomy and are used to keep track of the type of each entity (e.g., Company, Person, POI, Geographical Location, etc.).

18 3. Data Storage Entities in the knowledge graph are therefore represented by one achene node and multiple bristle nodes. Bristles and edges store semantic information using the master ontology as well as other public ontologies. The entity name, for example, will be stored on the bristle as a name property, while links between type nodes will be labelled rdfs:subClassOf.
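The four-node model can be made concrete with a minimal in-memory property graph. This toy structure is only a sketch of the data model described above; the production system stores it in Titan over Cassandra and traverses it with Gremlin, none of which is reproduced here, and all identifiers are invented.

```python
class Graph:
    """Tiny in-memory property graph: nodes carry key-value maps,
    edges are labelled (source, label, target) tuples."""
    def __init__(self):
        self.nodes, self.edges = {}, []

    def add_node(self, node_id, kind, **props):
        # kind is one of: "achene", "bristle", "provenance", "type"
        self.nodes[node_id] = {"kind": kind, **props}

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def bristles_of(self, achene_id):
        """All per-source data nodes attached to one entity."""
        return [d for s, l, d in self.edges
                if s == achene_id and l == "hasBristle"]

g = Graph()
g.add_node("achene:1", "achene", purl="http://example.org/achene/1")
g.add_node("bristle:1a", "bristle", name="SpazioDati")   # data from source A
g.add_node("prov:A", "provenance", source="company-registry")
g.add_node("type:Company", "type")
g.add_edge("achene:1", "hasBristle", "bristle:1a")
g.add_edge("bristle:1a", "provenance", "prov:A")
g.add_edge("achene:1", "rdf:type", "type:Company")
print(g.bristles_of("achene:1"))
```

Because each source contributes its own bristle, provenance stays queryable: merging a second source adds a second bristle under the same achene instead of overwriting the first.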

19 4. Data Access Once data is available as a consistent and well-defined graph, we have to let developers access it, so that they can get and browse the data programmatically. As we mentioned at the beginning, this is not a simple task, because not all developers are able to deal with graph data and to build queries to browse and get what they really need. The ultimate goal, therefore, is to simplify access to the data as much as possible: we argue that the most basic data structure that can be easily integrated and digested by an average programmer is the table. Our idea is to provide simplified access to the data by means of slices of the graph. We identify a specific partition of the graph in terms of types of nodes, the set of properties for each node, and the set of properties for each linked node, usually by traversing the graph with a limited number of steps. The data is then formatted into a table where rows are nodes of the graph and columns are filled with the values of the selected properties. Properties may also be collections, or collections of objects, that contain information coming from graph traversal. We can create any number of slices by changing the way nodes are selected and the set of properties to be included. These slices are called dataGEMs: developers can access them using a REST-like web API and standard HTTP parameters to query the data in a simple manner. We argue that this model has several benefits with respect to a single and complex SPARQL endpoint.
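The slicing idea reduces to: pick nodes by type, project a fixed set of properties, emit rows. The sketch below is a toy model of a dataGEM built from that description; real dataGEMs are served through a REST-like API and can also materialise collection-valued columns from multi-step traversals, which this omits.

```python
def slice_to_table(graph_nodes, node_type, properties):
    """Materialise a 'slice' of the graph as a table: one row per node
    of the requested type, one column per selected property."""
    rows = []
    for node in graph_nodes:
        if node.get("type") != node_type:
            continue
        # Project only the requested properties; missing ones become None.
        rows.append({p: node.get(p) for p in properties})
    return rows

nodes = [
    {"type": "Company", "name": "SpazioDati", "city": "Trento"},
    {"type": "Person", "name": "Alice"},
]
print(slice_to_table(nodes, "Company", ["name", "city"]))
```

The result is a plain list of flat records, the table shape the slide argues an average programmer digests more easily than a SPARQL endpoint.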

20 Future Work We aim at simplifying access to semantically structured data, to let developers who are not Semantic Web specialists add value to their applications. In addition to this approach, we will also let developers access the graph using our text analytics API, dataTXT. dataTXT is the evolution of a state-of-the-art algorithm, and it is able to identify on-the-fly, and with high accuracy, meaningful sequences of terms in unstructured text and link them to a pertinent DBpedia resource. dataTXT resolves ambiguity and synonymy by means of the knowledge graph extracted from DBpedia, but in the future we plan to extend this approach: DBpedia will be kept as a backbone graph used to disambiguate and provide context for common topics, but thanks to the extended Dandelion knowledge graph, dataTXT will also be able to link specific entities coming from external sources. This is effectively another way of accessing the graph: given an unstructured text, dataTXT can help developers link it to the Dandelion knowledge graph, adding structure and semantics to plain text.
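To show the shape of the output such an annotation API produces, here is a deliberately naive, dictionary-based entity linker. The lookup table, function name and resources are invented for illustration; the real dataTXT uses a disambiguation algorithm over the DBpedia graph, not a surface-form lookup, and resolves ambiguity, which this sketch cannot.

```python
# Hypothetical mention -> DBpedia resource table (illustration only).
SURFACE_FORMS = {
    "trento": "http://dbpedia.org/resource/Trento",
    "linked data": "http://dbpedia.org/resource/Linked_data",
}

def annotate(text):
    """Return sorted (mention, resource) pairs found in the text."""
    found, lowered = [], text.lower()
    for surface, resource in SURFACE_FORMS.items():
        if surface in lowered:
            found.append((surface, resource))
    return sorted(found)

print(annotate("SpazioDati is based in Trento and works on Linked Data."))
```

Even in this degenerate form, the output, plain text paired with graph URIs, illustrates the “adding structure and semantics to plain text” idea: each linked mention becomes an entry point into the knowledge graph.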

21 Tag clouds Tag clouds are visual representations of social tags, displayed in a paragraph-style layout, usually in alphabetical order, where the relative size and weight of the font for each tag corresponds to the relative frequency of its use.
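The definition above translates directly into a sizing function: sort tags alphabetically and map each tag’s frequency to a font size. The square-root scaling below is one common choice (linear scaling lets a few very frequent tags dwarf the rest); it is an illustrative sketch, not a scheme prescribed by the paper.

```python
from math import sqrt

def font_sizes(tag_counts, min_px=12, max_px=36):
    """Map tag frequencies to font sizes; tags come out alphabetically,
    as in a standard paragraph-style tag cloud."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())

    def scale(n):
        if hi == lo:                      # all tags equally frequent
            return (min_px + max_px) // 2
        t = (sqrt(n) - sqrt(lo)) / (sqrt(hi) - sqrt(lo))
        return round(min_px + t * (max_px - min_px))

    return [(tag, scale(n)) for tag, n in sorted(tag_counts.items())]

print(font_sizes({"python": 50, "web": 8, "tags": 2}))
```

Here the rarest tag gets the minimum size and the most frequent the maximum, while the alphabetical order keeps position independent of popularity, exactly the property the studies below probe.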

22 Tag clouds are becoming increasingly popular as visualizations on personal and commercial web pages, blogs, and social information sharing sites such as flickr. The data used as input to tag clouds are usually social tags (the unstructured annotation of information by authors or readers of that information, using short textual labels known as “tags”), although search engine query terms, word frequencies within documents, and pre-existing category labels are also currently visualized in this manner. On the web, tag clouds are increasingly popular, but their exact purpose is unclear, especially since their ability to accurately convey information is debatable. What are designers’ intentions in creating or using tag clouds, and how do they expect their readers to interpret them? To address this question, we performed a qualitative assessment of the current use and perceived advantages and drawbacks of tag clouds.


24 Related Work Rivadeneira et al. conducted two studies. In the first they compared tag layout along three dimensions: tag size, tag proximity to a tag with a large font, and position of the tag within the display when broken into quadrants. The study included 13 participants whose task was to recall whether a tag had been seen, after completing a distractor task. They found effects for tag size and quadrant location (tags in the upper left were recalled better, as were those displayed in larger fonts). Rivadeneira et al. used these results to inform a second study with 11 participants, in which they compared the following four views (descriptions reworded from the original to improve clarity):

25 Related Work 1. A paragraph-style tag cloud with varying font size, tags shown in alphabetical order. 2. A paragraph-style tag cloud with varying font size, tags shown in descending frequency order. 3. A variation on standard tag clouds with a specialized layout that is more cloud-like and spatial (there was no fixed baseline for the tags, which differs from standard paragraph-style tag clouds), but still using varying font sizes and still somewhat alphabetically ordered. 4. A vertical single column list with no font size variation, shown in frequency order rather than alphabetical.

26 Related Work 1. Horizontal list, only one font size, order not specified. 2. Horizontal list, only one font size, alphabetical. 3. Vertical list, only one font size, order not specified. 4. Vertical list, only one font size, alphabetical. 5. Spatial layout, three different font sizes used, order not specified. 6. Spatial layout, three different font sizes used, order alphabetical.

27 Thus, although the experimental work is limited, the results trend towards the conclusion that spatial tag clouds are a poor layout compared to lists for information recognition and recall tasks. Unfortunately, these studies did not record subjective reactions to the different layouts.

28 PubMed system for bioscience literature search. The tags were words automatically extracted from the retrieved abstracts. Font size was used to indicate term frequency, and font color to indicate recency (computed as the average publication date of the documents containing the word). Only the most frequent words were shown as tags, and these were hyperlinked to the articles containing those words. In the usability study, 20 people each ran two queries with only one of the interfaces (a between-participants design). The quality of participants’ answers was higher on a descriptive task with the tag cloud interface, but less accurate on a relational task (e.g., name three genes involved in process P). Overall, the participants were slower with the tag cloud view. Participants rated the tag clouds as less “helpful” but with higher “satisfaction” than the PubMed interface.
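The frequency-to-size and recency-to-color encoding described above can be sketched as a single tag-rendering function. The year range and the grey-to-blue ramp are assumptions made up for this sketch; the study describes the encoding scheme but not these concrete values.

```python
def render_tag(word, freq, avg_year, max_freq, oldest=1990, newest=2007):
    """One HTML tag-cloud entry: font size from term frequency, colour
    from recency (average publication year). Year bounds and the blue
    ramp are illustrative assumptions, not values from the study."""
    size = 12 + round(24 * freq / max_freq)        # frequency -> px size
    t = max(0.0, min(1.0, (avg_year - oldest) / (newest - oldest)))
    blue = round(128 + 127 * t)                    # more recent = bluer
    return (f'<span style="font-size:{size}px; '
            f'color:#0000{blue:02x}">{word}</span>')

print(render_tag("gene", freq=40, avg_year=2005, max_freq=40))
```

A frequent, recent term like this one renders at the maximum size in a strong blue; an old, rare term would come out small and dark, letting a reader scan for both prominence and freshness at once.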

29 Results One of the most surprising results was that a significant proportion of interviewees did not realize that tag clouds are regularly organized in alphabetical order. One participant strongly disliked the focus on the popular, and the marginalizing of the less popular, implied by the visualization. Finally, two interviewees pointed out that tag clouds are easy to code, suggesting that might be one reason for their popularity.

30 After an initial pass over about 140 discussions, we developed a set of 28 categories, such as “Existence of (a) tag cloud(s)”. On average, we extracted 1.4 comments per posting, and assigned 1.9 categories per posting.

31 Twenty-eight of the comments that we coded simply mentioned or defined tag clouds, or pointed the reader to existing designs. Twenty-five described implementation details or ideas, or described alternative methods of presentation (e.g., suggestions about varying the colors of the tags, discussions of how to make the size scaling look better, suggestions of alternatives such as heat maps or ordering by some metric other than alphabetical). For the purposes of this discussion, we are interested in those comments that discussed what was perceived as being good or bad about tag clouds, and what they are useful for, or not useful for.

32 None explicitly commented on this being a useful aspect of tag clouds, and two implied that the alphabetical ordering specifically is not helpful. In the web page analysis, six postings talked about tag clouds as allowing for time or trend comparisons, but four of these mentioned alternative displays that better show trends. It may also be this sense that people intend when discussing tag clouds: they may help suggest the main tendencies of a person or a site in terms of what subject matter they discuss.

33 Discussion If one accepts the premise that tag clouds are used specifically for portraying human mental activity, either of an individual or of a group of people, then what might be considered design flaws from a data visualization perspective make sense in terms of what information is intended to be conveyed. As noted by interviewees and in designers’ writings, a large part of the appeal of the visual appearance of tag clouds is its fun, non-conformist view, and the feeling that it evokes of human activity.

34 Conclusion We have concluded that tag clouds are primarily a visualization used to signal the existence of tags and collaborative human activity, as opposed to a visualization used for data analysis. The flipside of this idea is the use of data analysis visualizations as settings for social activity. The Name Voyager baby names visualization tool by Wattenberg yielded surprisingly social behavior in its use. This work has inspired the new area of social data exploration, much of which uses information visualization, as exemplified by the Many Eyes system and the experiments with census data exploration of Heer et al.

35 We have attempted to characterize the current writings and thinking of web designers and information visualization experts about the tag cloud visual representation. The limited research on the usefulness of tag clouds for understanding information and for other information processing tasks suggests that they are (unsurprisingly) inferior to a more standard alphabetical listing. This could perhaps be remedied by adjusting white space, font, and other parameters, or by more fundamentally changing the layout. That said, it seems that the main value of this visualization is as a signal or marker of individual or social interaction with the contents of an information collection, and it functions more as a suggestive device than as a precise depiction of the underlying phenomenon. Designers who like them praise their fun, informal, and dynamic appearance, thinking they help characterize trends and invite exploration of and participation in the tagging community.