Processing Aggregate Queries in a Federation of SPARQL Endpoints

1 Processing Aggregate Queries in a Federation of SPARQL ...
Author: Andra Newman
0 downloads 0 Views

1 Processing Aggregate Queries in a Federation of SPARQL EndpointsDilshod IBRAGIMOV, KATJA HOSE, TORBEN BACH PEDERSEN, ESTEBAN ZIMÁNYI.

2 Motivating Example Earthquake in the Pacific in March tsunami a nuclear accident Hourly observation of radioactivity statistics at 47 prefectures Observations (March 16, 2011 – March 15, 2012) converted to RDF data (places represented by URI from GeoNames) Interesting analyses: AVG radioactivity separately for each prefecture in Japan The MIN and MAX radioactivity for each prefecture (changes within one-year observations) Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

3 Motivating Example - ObservationProcessing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

4 Motivating Example - QueryEx: Show average radioactivity values for each prefecture SELECT ?regName ( AVG (?floatRV ) AS ?average ) WHERE { ?s ev:place ?placeID ?s ev:time ?time ?s rdf:value ?radioValue . SERVICE { ?placeID gn:parentFeature ?regionID ?regionID gn:name ?regName . } BIND (xsd:float (?radioValue ) as ?floatRV ) . GROUP BY ?regName Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

5 Motivating Example - ResultsVirtuoso v , Sesame v2.7.11, and Jena Fuseki v1.0.0 (based on ARQ) timed out Network traffic analyzer showed that: Virtuoso and Fuseki query GeoNames for each radioactivity observation (more than 400,000 requests) Sesame is trying to download all triples that match the SERVICE query pattern (more than 7.8 million triples) Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

6 Outline Basic Strategies CODA Test Case Conclusion and Future WorksProcessing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

7 Basic Strategies - Mediator JoinThe mediator/federator receives the query from the user The query optimizer sends separate queries to endpoints and merges the results Strong point – parallelization Weak point – expensive for large intermediate results/datasets Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

8 Basic Strategies - Semi-JoinMain principle is to execute the subquery with the smallest result first and use the retrieved results as bindings for the join variable in other subqueries (SPARQL structure) Efficient for highly selective subqueries (with FILTER statement) SELECT ?regName ( AVG (?radioValue ) AS ?average ) WHERE { ?s ev:place ?placeID ?s ev:time ?time ?s rdf:value ?radioValue . SERVICE { ?placeID gn: parentFeature ?regionID ?regionID gn:name ?regName . } FILTER(?radioValue < 0.08) . } GROUP BY ?regName Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

9 Basic Strategies - Semi-Join (Cont)Weak point - VALUES is not yet widely adopted in existing endpoints. SPARQL compliant alternatives of UNION (or FILTER) must often be used Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

10 Basic Strategies - Partial AggregationIf results are grouped by SERVICE query variables, further optimization is possible (motivating query example) 1) First group by the observation place (?placeID) SELECT ?placeID (SUM (?floatRV ) AS ?avgSUM) (COUNT (?floatRV ) AS ?avgCNT ) WHERE { ?s ev:place ?placeID ?s ev:time ? time ?s rdf:value ? radioValue . BIND (xsd:float (?radioValue ) as ?floatRV ) . } GROUP BY ?placeID Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

11 Basic Strategies - Partial Aggregation (Cont)Then execute SERVICE query SELECT ?placeID ?regName WHERE { ?placeID gn:parentFeature ?regionID . ?regionID gn:name ?regName . VALUES (?placeID) { () …. } Final step – join the intermediate results and compute the final result (distributed/algebraic functions) Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

12 CODA – Cost-based Optimizer for Distributed Aggregate QueriesDecomposes the original query into multiple subqueries (query 𝑄 𝑀 and SERVICE queries 𝑄 𝑒1 … 𝑄 𝑒𝑁 ) Estimates query execution costs for different query execution plans Chooses the one with minimum costs Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

13 CODA - Costs Overall costs 𝐶 𝑄 𝐶 𝑄 = 𝐶 𝑃 + 𝐶 𝐶𝐶 𝑄 = 𝐶 𝑃 + 𝐶 𝐶 Communication costs 𝐶 𝐶 for subquery 𝑆 𝑖 : 𝐶 𝐶 𝑆 𝑖 = 𝐶 𝑂 + 𝑐 𝑆 𝑖 ∗ 𝐶 𝑚𝑎𝑝 ; 𝐶 𝑂 - communication establishing overhead , 𝑐 𝑆 𝑖 - result size, and 𝐶 𝑚𝑎𝑝 - single result transfer cost Processing costs 𝐶 𝑃 = 𝑐 𝑎𝑔𝑔 𝑖 ∗ 𝐶 𝐴𝐺𝐺 ; 𝑐 𝑎𝑔𝑔 𝑖 - number of aggregated observations, 𝐶 𝐴𝐺𝐺 - cost for processing a single observation Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

14 CODA - Estimating Constants𝐶 𝑚𝑎𝑝 - estimated using “SELECT * WHERE { ?s #p ?o . FILTER(?o = #o) } LIMIT #L”; different values for #L, #o and #p 𝐶 𝑂 - estimated with multiple “ASK {}” or “SELECT (1 AS ?v) {}” 𝐶 𝐴𝐺𝐺 - estimated based on multiple “SELECT COUNT(?s) WHERE {?s ?p ?o } GROUP BY ?o” Not perfectly accurate but the aim is to find out which execution plan is more efficient (not to predict the execution costs) Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

15 CODA - Result Size EstimationResult size estimation - VoID statistics (dataset, property partition, class partition) 𝑐 𝑡 - total number of triples (void:triples), 𝑐 𝑠 - total number of distinct subjects (void:distinctSubjects), 𝑐 𝑜 - total number of distinct objects (void:distinctObjects) Single patterns - 𝐶 𝑟𝑒𝑠 for (?s ?p ?o) is given by 𝑐 𝑡 , (s ?p ?o) estimated as 𝑐 𝑡 / 𝑐 𝑠 , (?s ?p o) as 𝑐 𝑡 / 𝑐 𝑜 , and (s ?p o) as 𝑐 𝑡 / (𝑐 𝑠 ∗ 𝑐 𝑜 ); FILTER influence estimates Joins - estimates depend on shape (star vs path). Formulas taken from “Resource Planning for SPARQL Query Execution on Data Sharing Platforms”. Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

16 CODA – Motivating ExampleDecomposed into 3 queries SELECT ?placeID (AVG(?floatRV) AS ?average) WHERE { ?s ev:place ?placeID . ?s rdf:value ?radioValue . BIND(xsd:float(?radioValue) AS ?floatRV) . ?s ev:time ?time . } GROUP BY ?placeID SELECT ?placeID ?floatRV WHERE { ?s ev:place ?placeID . ?s rdf:value ?radioValue . BIND(xsd:float(?radioValue) AS ?floatRV) . ?s ev:time ?time . } SELECT ?placeID ?regName WHERE { ?placeID gn:parentFeature ?regionID . ?regionID gn:name ?regName . } Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

17 CODA – Motivating ExampleEstimates for Radioact query: number of aggregated triples: estimated cost: 15 number of returned triples: estimated cost: 129 Estimates for GeoNames query: number of returned triples: estimated cost: 1956 Selected plan – Partial Aggregation Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

18 Test Case – SSB as RDF Star Schema Benchmark converted to RDF (strongly resembling SSB tabular structure) We generated data for different scale factors (1 to 5 - 6M to 30M observations, 110,5M to 547,5M triples) Different configurations two endpoints (one endpoint containing main observation data and one SERVICE endpoint containing supporting data) three endpoints (two SERVICE endpoints containing supporting data) four endpoints (three SERVICE endpoints containing supporting data) All datasets and queries are available at Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

19 SSB RDF schema (partial)Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

20 Test Case – SSB Queries Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

21 Test Case – Results (One SERVICE Endpoint)Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

22 Test Case – Results (One SERVICE Endpoint Q2.3)Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

23 Test Case – Results (One to Three SERVICE Endpoints Q4.3)Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi

24 Conclusion and Future WorkEfficiently processing aggregate queries in a federation of SPARQL endpoints Processing strategies (MedJoin, SemiJoin, PartialAgg) Cost-based Optimizer for Distributed Aggregate queries (CODA) efficient and scalable chooses the best query processing plan in different situations significantly outperforms current state-of-the art triple stores Future Work: Using more complex statistics with precomputed join result sizes and correlation information for better cardinality estimation Optimizing more complex queries, e.g., with optional patterns or complex aggregation functions Processing Aggregate Queries in a Federation of SPARQL Endpoints. D.Ibragimov, K.Hose, T.B.Pedersen, E.Zimanyi