Commit 24e70be

Merge pull request #147 from halfabrane/RELEASE-v1.0.5
Bump up version for release + Add docs
2 parents e7f91ce + ed70d96 commit 24e70be

2 files changed

+106
-24
lines changed

README.md

Lines changed: 105 additions & 23 deletions
@@ -4,48 +4,59 @@
44
[![codecov.io](http://codecov.io/github/harsha2010/magellan/coverage.svg?branch=master)](http://codecov.io/github/harsha2010/magellan?branch=maste)
55

66

7-
Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance
8-
in search and targeted advertising and an important variable in many predictive analytics applications.
9-
For example when a user searches for “canyon hotels”, without location awareness the top result
10-
or sponsored ads might be for hotels in the town “Canyon, TX”.
11-
However, if they are are near the Grand Canyon, the top results or ads should be for nearby hotels.
12-
Thus a search term combined with location context allows for much more relevant results and ads.
13-
Similarly a variety of other predictive analytics problems can leverage location as a context.
7+
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries.
148

15-
To leverage spatial context in a predictive analytics application requires us to be able
16-
to parse these datasets at scale, join them with target datasets that contain point in space information,
17-
and answer geometrical queries efficiently.
9+
The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions, while the execution engine takes care of laying data out efficiently in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.
1810

19-
Magellan is an open source library Geospatial Analytics using Spark as the underlying engine.
20-
We leverage Catalyst’s pluggable optimizer to efficiently execute spatial joins, SparkSQL’s powerful operators to express geometric queries in a natural DSL, and Pyspark’s Python integration to provide Python bindings.
11+
Magellan is the first library to extend Spark SQL to provide a relational abstraction for geospatial analytics. I see it as an evolution of geospatial analytics engines into the emerging world of big data: it provides abstractions that are developer friendly and can be leveraged by anyone who understands or uses Apache Spark, while simultaneously showcasing an execution engine that is state of the art for geospatial analytics on big data.
12+
13+
# Version Release Notes
14+
15+
You can find notes on the various released versions [here](https://github.com/harsha2010/magellan/releases).
2116

2217
# Linking
2318

24-
You can link against this library using the following coordinates:
19+
You can link against the latest release using the following coordinates:
2520

2621
groupId: harsha2010
2722
artifactId: magellan
28-
version: 1.0.4-s_2.11
23+
version: 1.0.5-s_2.11
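
As a rough sketch, the coordinates above translate into an sbt dependency line like the one below. This assumes the artifact is resolvable from a repository already configured in your build (the resolver itself is not specified here and may need to be added); only the group, artifact, and version strings come from the coordinates listed above.

```scala
// build.sbt fragment (sketch): coordinates copied from the list above.
// Assumption: a resolver that hosts this artifact is already configured.
libraryDependencies += "harsha2010" % "magellan" % "1.0.5-s_2.11"
```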
2924

3025
# Requirements
3126

32-
This library requires Spark 2.1+ and Scala 2.11
27+
v1.0.5 requires Spark 2.1+ and Scala 2.11
3328

3429
# Capabilities
3530

36-
The library currently supports the [ESRI](https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf) format files as well as [GeoJSON](http://geojson.org).
31+
The library currently supports reading the following formats:
32+
33+
* [ESRI](https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf)
34+
* [GeoJSON](http://geojson.org)
35+
* [OSM-XML](http://wiki.openstreetmap.org/wiki/OSM_XML)
36+
* [WKT](https://en.wikipedia.org/wiki/Well-known_text)
3737

3838
We aim to support the full suite of [OpenGIS Simple Features for SQL ](http://www.opengeospatial.org/standards/sfs) spatial predicate functions and operators together with additional topological functions.
3939

40-
Capabilities we aim to support include (ones currently available are highlighted):
40+
The following geometries are currently supported:
4141

42-
**Geometries**: **Point**, **LineString**, **Polygon**, **MultiPoint**, **MultiPolygon**, MultiLineString, GeometryCollection
43-
44-
**Predicates**: **Intersects**, Touches, Disjoint, Crosses, **Within**, **Contains**, Overlaps, Equals, Covers
45-
46-
**Operations**: Union, Distance, **Intersection**, Symmetric Difference, Convex Hull, Envelope, Buffer, Simplify, Valid, Area, Length
42+
**Geometries**:
43+
44+
* Point
45+
* LineString
46+
* Polygon
47+
* MultiPoint
48+
* MultiPolygon (treated as a collection of Polygons and read in as a row per polygon by the GeoJSON reader)
4749

48-
**Scala and Python API**
50+
The following predicates are currently supported:
51+
52+
* Intersects
53+
* Contains
54+
* Within
55+
56+
The following languages are currently supported:
57+
58+
* Scala
59+
4960

5061

5162

@@ -158,6 +169,77 @@ A few common packages you might want to import within Magellan
158169

159170
A Databricks notebook with similar examples is published [here](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/882779309834027/6891974485343070/latest.html) for convenience.
160171

172+
# Spatial indexes
173+
174+
Starting with v1.0.5, Magellan supports spatial indexes.
175+
The spatial indexes currently supported are the so-called [ZOrderCurves](https://en.wikipedia.org/wiki/Z-order_curve).
176+
177+
178+
Given a column of shapes, one can index the shapes to a given precision using a geohash indexer by doing the following:
179+
180+
```scala
181+
df.withColumn("index", $"polygon" index 30)
182+
```
183+
184+
This produces a new column called ```index``` which is a list of ZOrder Curves of precision ```30``` that taken together cover the polygon.
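
To give a feel for what a Z-order curve of a given precision encodes, here is a minimal, self-contained sketch in plain Scala. It is not Magellan's implementation: `ZOrderSketch` and `encode` are hypothetical names, and the sketch assumes that precision counts the number of interleaved bits, with longitude and latitude ranges halved alternately.

```scala
// Illustrative sketch only; not Magellan's implementation.
// Assumption: "precision" is the number of interleaved bits.
object ZOrderSketch {
  /** Encode a point as a bit string of length `precision` by alternately
    * halving the longitude range (even positions) and the latitude range
    * (odd positions), emitting '1' for the upper half and '0' for the lower.
    */
  def encode(lon: Double, lat: Double, precision: Int): String = {
    var (minLon, maxLon) = (-180.0, 180.0)
    var (minLat, maxLat) = (-90.0, 90.0)
    val bits = new StringBuilder
    for (i <- 0 until precision) {
      if (i % 2 == 0) {
        val mid = (minLon + maxLon) / 2
        if (lon >= mid) { bits += '1'; minLon = mid } else { bits += '0'; maxLon = mid }
      } else {
        val mid = (minLat + maxLat) / 2
        if (lat >= mid) { bits += '1'; minLat = mid } else { bits += '0'; maxLat = mid }
      }
    }
    bits.toString
  }
}
```

For example, `ZOrderSketch.encode(0.0, 0.0, 4)` yields `"1100"`. Raising the precision only appends bits to that prefix, so a coarse cell always contains its finer descendants; covering a polygon then amounts to collecting the cells at the chosen precision that overlap it.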
185+
186+
# Creating Indexes while loading data
187+
188+
The Spatial Relations (GeoJSON, Shapefile, OSM-XML) all have the ability to automatically index the geometries while loading them.
189+
190+
To turn this feature on, pass in the parameter ```magellan.index = true``` and optionally a value for ```magellan.index.precision``` (default = 30) while loading the data as follows:
191+
192+
```scala
193+
spark.read.format("magellan")
194+
.option("magellan.index", "true")
195+
.option("magellan.index.precision", "25")
196+
.load(s"$path")
197+
```
198+
199+
This creates an additional column called ```index``` which holds the list of ZOrder Curves of the given precision that cover each geometry in the dataset.
200+
201+
# Spatial Joins
202+
203+
Magellan leverages Spark SQL and supports joins by default. However, these joins are not aware by default that the columns are geometric, so a join of the form
204+
205+
```scala
206+
points.join(polygons).where($"point" within $"polygon")
207+
```
208+
209+
will be treated as a Cartesian Join followed by a predicate.
210+
In some cases, especially when the polygon dataset is small (on the order of 100 to 10,000 polygons), this is fast enough.
211+
However, when the number of polygons is much larger than that, you will need spatial joins to scale this computation.
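
The difference between a Cartesian join with a predicate and an index-assisted join can be sketched in plain Scala, independent of Spark or Magellan. Everything in this block is hypothetical and for illustration only: axis-aligned boxes stand in for polygons, and a uniform grid stands in for the spatial index. It is not how Magellan implements its join rule.

```scala
// Illustrative sketch only; boxes stand in for polygons, a grid for the index.
object JoinSketch {
  final case class Point(x: Double, y: Double)
  final case class Box(id: Int, minX: Double, minY: Double, maxX: Double, maxY: Double) {
    def contains(p: Point): Boolean =
      p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
  }

  // Baseline: test every point against every box (Cartesian product plus filter).
  def cartesianJoin(points: Seq[Point], boxes: Seq[Box]): Set[(Point, Int)] =
    (for (p <- points; b <- boxes if b.contains(p)) yield (p, b.id)).toSet

  // Grid cell containing a coordinate pair, for square cells of side `cell`.
  private def cellOf(x: Double, y: Double, cell: Double): (Long, Long) =
    (math.floor(x / cell).toLong, math.floor(y / cell).toLong)

  // Indexed variant: register each box under every cell it overlaps, then
  // test a point only against the boxes registered in the point's own cell.
  def indexedJoin(points: Seq[Point], boxes: Seq[Box], cell: Double): Set[(Point, Int)] = {
    val index: Map[(Long, Long), Seq[Box]] = boxes
      .flatMap { b =>
        val (x0, y0) = cellOf(b.minX, b.minY, cell)
        val (x1, y1) = cellOf(b.maxX, b.maxY, cell)
        for (cx <- x0 to x1; cy <- y0 to y1) yield ((cx, cy), b)
      }
      .groupBy(_._1)
      .map { case (k, v) => k -> v.map(_._2) }
    (for {
      p <- points
      b <- index.getOrElse(cellOf(p.x, p.y, cell), Seq.empty)
      if b.contains(p)
    } yield (p, b.id)).toSet
  }
}
```

Both functions return the same matches, but the indexed variant evaluates the exact predicate only for candidates that share a cell, which is the general idea behind using an index to avoid the full Cartesian product.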
212+
213+
To enable spatial joins in Magellan, add a spatial join rule to Spark by injecting the following code before the join:
214+
215+
```scala
216+
magellan.Utils.injectRules(spark)
217+
```
218+
219+
220+
Furthermore, during the join, you will need to give Magellan a hint of the precision at which to create indices for the join.
221+
222+
You can do this by annotating either of the DataFrames involved in the join with a Spatial Join Hint as follows:
223+
224+
```scala
225+
val indexedDf = df.index(30) // after load (a new name avoids a self-referential definition), or
226+
val df = spark.read.format(...).load(..).index(30) // during load
227+
```
228+
229+
Then a join of the form
230+
231+
```scala
232+
points.join(polygons).where($"point" within $"polygon") // or
233+
234+
points.join(polygons index 30).where($"point" within $"polygon")
235+
```
236+
237+
automatically uses indexes to speed up the join.
238+
239+
240+
# Developer Channel
241+
242+
Please visit [Gitter](https://gitter.im/magellan-dev/Lobby?source=orgpage) to discuss Magellan, obtain help from developers or report issues.
161243
# Magellan Blog
162244

163245
For more details on Magellan and thoughts around Geospatial Analytics and the optimizations chosen for this project, please visit my [blog](https://magellan.ghost.io)

build.sbt

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
name := "magellan"
22

3-
version := "1.0.5-SNAPSHOT"
3+
version := "1.0.5"
44

55
organization := "harsha2010"
66
