1 Fatih Akdag and Christoph F. Eick University of Houston Interestingness Hotspot Discovery in Spatial Datasets using a Graph-based Approach Fatih Akdag and Christoph F. Eick University of Houston Department of Computer Science Organization Interestingness Hotspots Related Work Hotspot Discovery Framework Experimental Evaluation Conclusion
2 1. Interestingness HotspotsInterestingness hotspots are contiguous regions in space which are interesting based on a domain expert’s notion of interestingness captured by a (plug-in) interestingness function. We propose a novel methodology for discovering interestingness hotspots in spatial datasets using a graph-based algorithm. Traditional hotspots are typically defined using notions of density; interestingness hotspots on the other hand, are spatial regions whose interestingness is above an interestingness threshold and frequently refer to non spatial attributes when computing interestingness.
3 Example Earthquake Correlation HotspotsInterestingness Hotspots of Earthquakes whose absolute value of the correlation between earthquake depth and severity is above 0.6. The earthquakes in the green hotspot have a correlation of 0.83 and those in the orange hotspot have a correlation of
4 Interestingness FunctionsFor high correlation hotspots: For low variation hotspots:
5 Contributions We propose a methodology for finding hotspots in spatial datasets that consists of 4 steps: 1) building neighborhood graph 2) finding hotspot seeds 3) growing hotspot seeds 4) generating polygon models. We propose methods for creating a neighborhood graphs between spatial objects. A heap-based hotspot growing algorithm is proposed to find interestingness hotspots using the neighborhood graph in spatial datasets. We propose an approach to generate a polygon model for two-dimensional hotspots based on Voronoi diagram. The proposed interestingness hotspot discovery framework is evaluated in a case study involving a two-dimensional earthquake dataset.
6 2. Related work Spatial Scan Statistics: Searches for spatial circular hotspots Spatial Clustering algorithms: DBSCAN, SNN; they work distance based. Other spatial spatial clustering algorithms have been used to find interestingness hotspot in spatial datasets using interestingness functions (CLEVER and MOSAIC) In this work we present an alternative approach which grows seed regions, and capable of finding better hotspots: Objective function maximized while growing Overlapping hotspots are detected.
7 3. Hotspot Discovery FrameworkInterestingness hotspots are contiguous areas in space for which an interestingness function i assigns a reward w, indicating “news-worthy” regional associations. We assume that we have an interestingness measure i that assesses the interestingness of subsets of the objects in data set O by assigning rewards to a particular cluster H. Moreover, we assume a spatial neighboring relationship N is given that describes which objects belonging to O are neighbors. An interestingness threshold is given that defines which patterns are interesting.
8 Hotspot Discovery FrameworkWe find interestingness hotspots 𝐻 𝜖 𝑂 where 𝐻 is an interestingness hotspot with respect to i, if the following 2 conditions are met: i(H) > theta H is contiguous with respect to a neighborhood relation N; that is, for each pair of objects (o,v) with o,v in H, there has to be a path from o to v that traverses neighboring objects (w.r.t. N) belonging to H. In summary, interestingness hotspots H are contiguous regions in space that are interesting.
9 Interestingness FunctionsFor high correlation hotspots: For low variation hotspots:
10 General Hotspot Growing AlgorithmIdentify some small regions with high interestingness as seed regions Grow these seed regions by adding neighboring objects which increase the reward most when added. Eliminate overlapping hotspots
11 Graph-based hotspot growing algorithmBuild a neighborhood graph Find “small” hotspot seed regions (subgraphs) with high interestingness Find hotspots by growing the hotspot from the seed regions Generate a polygon model for the hotspots found in Phase 3 Post-process the obtained hotspots usually removing highly overlapping hotspots.
12 Popular Neighborhood GraphsGabriel graphs strike a good balance: many edges between distant points in the DT are eliminated, yet edges between close points are preserved. Thus, we use Gabriel graphs to identify neighboring objects in spatial datasets
13 Gabriel Graphs Any points P and Q in a dataset are adjacent in the Gabriel graph if P and Q are distinct and the closed disc D, of which the line segment PQ is a diameter, contains no other points. Unlike Delaunay graphs, Gabriel graphs generalize to higher dimensions, with the empty disks replaced by empty closed balls. For 2-dimensional data, the Gabriel graph can be computed from the Delaunay graph in O(n) time [12], in a total of O(nlogn) complexity. For higher dimensional data, Gabriel graph can be computed in O(n3) time by brute force; faster approximate algorithms exist! Example
14 Identifying Hotspot SeedsWe visit each vertex in the graph, and create a region consisting of the vertex and all of its neighbors The interestingness value for each so generated region is calculated by applying the plugin interestingness function on the set of objects in the region. Regions with high interestingness are identified as hotspot seeds.
15 Merge Hotspot Seeds Many seeds grow to the same or very similar hotspot; so why not eliminate some seeds before growing? Procedure: Merge neighboring seeds as long as merging does not decrease the interestingness value more than a percentage threshold; e.g. i(m)>0.9*(i(m1)+i(m2)) Start with the best merge candidates and merge as long as there are more merge candidates left.
16 Growing Hotspot seeds Starting with each seed, search the best neighbor among all neighbors in each step, and add it into the region. Continue this procedure as long as the region’s interestingness is positive. Keep a reference to the best interestingness value found and output it at the end of the procedure.
17 Optimizing hotspot growingWhen new neighbors are encountered as a result of growing the hotspot, we assign each new neighbor a fitness value by evaluating the reward gain in case the neighbor is added to the region We use a max-heap data structure to keep the list of neighbors where the neighbor with the highest fitness value is the root of the heap tree We add each new neighbor into the heap using the fitness value as the priority Root of the heap tree is added in each step, eliminating search for the best neighbor
18 Generating Polygon ModelsWe present a method to create polygon models for 2-dimensional hotspots;: Create voronoi diagram for the point set Each point in the hotspot will either be in a Voronoi polygon, if the point is on the convex hull of the dataset, it will not be enclosed by a Voronoi polygon We propose enclosing points in Voronoi diagram by intersecting the convex hull of the dataset with the Voronoi cells; non-finite cells will be bounded by edges of the convex hull.
19 Creating Polygons Find the Voronoi polygons or edges for each point in the hotspot. For each point P in the hotspot: If P is in a closed Voronoi cell (Voronoi polygon), check if it crosses with the convex hull: If the convex hull does not cross Voronoi polygon, then add this polygon into the polygon model for the hotspot. Else if the convex hull crosses the Voronoi polygon, then the convex hull splits this polygon into 2 polygons. In this case, the point will be inside one of these polygons. Add the polygon with the point into the polygon model. If the point is not in a Voronoi polygon: find the intersection of the Voronoi edges around the point and the convex hull. The intersection will create a polygon; add this polygon into the polygon model. Merge all polygons obtained for all points. Example
20 Example: Generated PolygonsConvex hull P2 P3 P1
21 Experimental EvaluationDataset: all earthquakes of magnitude 6.0 or higher in Japan and Korea region from January 1st 2000 to January 1st 2016 236 earthquakes
22 Experimental EvaluationReward Function Used: Reward(R) = interestingness(R) x size(R)β where β1 is a real number determining the degree of preference for larger regions. In the case study, we set β to 1.01. Experiment 1: Find hotspots in the area in which depth and magnitude of the earthquake is highly correlated
23 Identifying hotspot seedsUsed 0.95 as the seed threshold to identify regions with very high correlation of depth and magnitude. The correlation of these variables in the dataset is 0.029, which is very low. Out of 235 regions evaluated in the dataset (1 region around each object), 33 regions had an absolute correlation value greater than 0.95. After applying seed merge operation, we obtained 30 seed regions which were grown in the next phase.
24 Hotspot growing step Out of 30 seed regions grown, 29 of them had high positive correlation (greater than 0.75) and only 1 region had very high negative correlation (-0.93). Average positive correlation was 0.86. 3 seeds grew to the hotspots which were already discovered so they were deleted
25 Hotspot polygons 2 hotspots visualized in 2 different scales: Green hotspot: Negative correlation, Orange hotspot: Positive correlation
26 Experiment 2 We find low variation hotspots in the same geographic area in which the variance of the depth of the earthquakes is lower than 5 We used 3 as the seed interestingness threshold to find small regions with variance less than 3 There were 10 seed regions in the dataset. 2 of the seed regions were merged and the resulting 9 seed regions were grown. 2 of them grew to the already discovered hotspots 7 hotspots detected
27 Table 1. Listing of discovered low variation hotspotsExperiment 2 hotsp ot size varia nce 1 4 4.25 2 3.9 3 7 2.41 2.21 5 8 1.79 6 1.48 0.32 Table 1. Listing of discovered low variation hotspots Three low variation hotspots (2:blue, 6:green, 7:red) and their locations on a map
28 Conclusion To the best of our knowledge, this is the only algorithm that grows hotspots from seed regions using a reward function The proposed methodology is evaluated in a case study for a 2-dimensional earthquake dataset. The methodology proved to be successful in finding hotspots based on plugin interestingness and reward functions We plan to extend our framework for higher dimensional datasets in which we create higher dimensional Gabriel graphs and polygonal models.
29 Conclusion We believe that our approach has more potential to compute “better”, more interesting hotspots, as the clustering approach searches for all hotspots in parallel, being forced to make compromises, as switching one sub region from one to another cluster might increase the reward of one cluster but decrease the reward of the other cluster. We plan to compare our approach to clustering approaches in a future work.