Storage
Crawled webpage schema
Webpage crawl history
    Url: string
    Domain: string (sharding key)
    Expected frequency: date
    Last crawl timestamp: date
    Content signature: string (used to calculate similarity)
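
As a rough illustration, the crawl-history record could be modeled as below. This is a sketch under assumptions, not a definitive implementation: the names `CrawlRecord`, `shard_for`, and `is_due` are hypothetical, the expected frequency is assumed to be an interval between crawls (the schema lists its type as date), and the hash-modulo sharding scheme is one possible choice among several.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CrawlRecord:
    url: str
    domain: str                     # sharding key
    expected_frequency: timedelta   # assumption: interval between crawls
    last_crawl_timestamp: datetime
    content_signature: str          # used to calculate similarity


def shard_for(domain: str, num_shards: int) -> int:
    """Map a domain to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def is_due(record: CrawlRecord, now: datetime) -> bool:
    """True when the expected interval has elapsed since the last crawl."""
    return now - record.last_crawl_timestamp >= record.expected_frequency
```

Because the shard is derived from the domain alone, every page of a domain lands on the same shard.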
DB selection
A wide-column store is preferred because multiple snapshots of the same page can be stored, and it supports three-dimensional queries by (row, column family, timestamp).
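
To make the three-dimensional lookup concrete, here is a minimal in-memory sketch. It is not the API of any real wide-column database: the class `VersionedTable` and its `put` and `get` methods are hypothetical, and using the URL as the row key and `"content"` as the column family is an assumption made for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a wide-column table addressed by (row, column family, timestamp)."""

    def __init__(self):
        # row key -> column family -> list of (timestamp, value), oldest first
        self._cells = defaultdict(lambda: defaultdict(list))

    def put(self, row, family, timestamp, value):
        """Store one snapshot; older snapshots of the same cell are kept."""
        versions = self._cells[row][family]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0])

    def get(self, row, family, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._cells.get(row, {}).get(family, [])
        for ts, value in reversed(versions):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = VersionedTable()
table.put("https://example.com/", "content", 100, "snapshot v1")
table.put("https://example.com/", "content", 200, "snapshot v2")
```

With both snapshots stored, a read at timestamp 150 returns the first snapshot, while a read without a timestamp returns the latest one.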