Select*from Places where Latitude between X-D and X+D and Longitude between Y-D and Y+D -- Between queries will create large amounts of intermediate data and create high costs on servers.-- GridID will greatly reduce the amount of intermediate data size.select*from Places where latitude between X-D and X+D and longitude between Y-D and Y+D and gridID in (gridIDx0,gridIDx1,gridIDx2,gridIDx3,gridIDx4,gridIDx5,gridIDx6,gridIDx7,gridIDx8);
Pros & Cons
Pros: Simple and straightforward
Cons: Some grid will be much denser than others. How to choose the optimal grid size.
Size estimation
# Size of world earth 200 M square mile
# Size of grid = 10 square mile
# Number of grids:
200 M / 10 = 20M Grids
# Each grid has maximum 500 places
# Each location has 24 bytes (string as location Id + double as longtitude + double as latitude).
# In total it will take
20 M * 500 * 24 = 24 * 10^10 = 240GB
SQL spatial datatypes
Use decimal to represent latitude/longtitude to avoid excessive precision (0.0001° is <11 m, 1′′ is <31 m)
To 6 decimal places should get you to around ~10cm of accuracy on a coordinate.
--Add locationInsert intoLocation (locationId, latitude, longtitude) values ("id1", 48.88, 2.31)--Search nearby k locations within r radiusSelect locationId fromLocationwhere48.88- radius < latitude <48.88+ radis and2.31- radius < longtitude <2.31+ radius
Implementation in SQL
Storage option 1: Store data as spatial data types
--select top distance resultsSELECT locationId,locationName,st_distance_sphere(ST_GeomFromText('POINT(39.994671 116.330788)',4326),address) AS distance-> FROM location;--find all locations within a polySET @poly ='Polygon(( '40.016712116.319618', '40.016712116.412773', '39.907024116.412773', '39.907024116.319618', '40.016712116.319618'))';SELECT locationId,locationName FROM LocationWHERE MBRContains(ST_GeomFromText(@poly,4326),address);
Storage option 2: PostgreSQL spatial query - KNN
PostgreSQL supports KNN search on top using distance operator <->
select locationId,locationName, addressfromLocationorder by pos <->point(51.516,-0.12)limit3;/* locationId | locationName | address ------------+------------------------+------------------------- 21593238 | All Bar One | (51.5163499,-0.1192746) 26848690 | The Shakespeare's Head | (51.5167871,-0.1194731) 371049718 | The Newton Arms | (51.5163032,-0.1209811)(3 rows)# evaluated on 30k rows in totalTime: 18.679 ms*/
The above query takes about 20 minutes, using KNN specific index (called GiST / SP-GiST) to speed up
createindexonLocationusing gist(address);select locationId,locationName, addressfromLocationorder byaddress<->point(51.516,-0.12) limit3;/* locationId | locationName | address ------------+------------------------+------------------------- 21593238 | All Bar One | (51.5163499,-0.1192746) 26848690 | The Shakespeare's Head | (51.5167871,-0.1194731) 371049718 | The Newton Arms | (51.5163032,-0.1209811)(3 rows)# evaluated on 30k rows in totalTime: 0.849 ms*/
definsertInTree(root,data):"""Pseudo code for inserting a point in a Quadtree"""ifnot rootcreateLeafAndInsertNode(root, data)elif root.isLeaf()and root.size()< BUCKET_SIZE: root.addNode(data)elif root.isLeaf():# Leaf node must be full root.decomposeLeafNodeAndInsert(data)# Find the appropriate sub-tree to insert nodeelif root.northwest.isValidParent(data)insertInTree(root.northwest, data)elif root.southwest.isValidParent(data)insertInTree(root.southwest, data)elif root.southeast.isValidParent(data)insertInTree(root.southeast, data)elseinsertInTree(root.northeast, data)defgetPointsInRange(root,range): points = []# If there is no intersection with the area, returnifnot root.intersect(range):return points# Return all data points on a leaf nodeif root.isLeaf(): points.append(root.getNodes())return points# Recursively append the points from the 4 quadrants points.append(getPointsInRange(root.northwest, range)) points.append(getPointsInRange(root.northeast, range)) points.append(getPointsInRange(root.southeast, range)) points.append(getPointsInRange(root.southwest, range))return points
Pros & Cons
Pros:
It suit well for unevenly distributed users.
Cons:
Sometimes the tree could be too high and result in low performance.
Size estimation
# Total number of locations is 500M.
# Each grid holds most 500 places.
* Then there are in total 1M leaf nodes.
# There are roughly 0.5M internal nodes. A squad tree will have roughly 1/2 internal nodes
* Leaf nodes space usage = 1M _ 24 _ 500 = 12000M = 1.2 GB
* Internal nodes space usage 32 bytes \* 0.5M = 16 MB
For latitude range [-90, 90] and longtitude range [-180, 180], it will be divided into section: below the average value and bigger than average value.
For example, 42.60411 will be encoded as "101111001001" and -5.59041 will be encoded as "011111000000"
After getting two binary values, they are concatenated together as a 25 bit value "01101 11111 11000 00100 00010". This 32 bit value will be encoded as "ezs42". Typically the precision of a 25 bit value is 4.9KM.
And this 5 bit value will be divided into 5 sections.
Each section corresponds to 0-31 in decimal value.
Redis impl
Redis uses a 52 bit Geohash encode, and it has a precision value of 0.6m.
Redis will SkipList to store geoHashes for faster neighbor search.