Storage

Crawled webpage schema

┌─ Storage ───────────────────────────┐
│  ┌───────────────────────────────┐  │
│  │Webpage crawl history          │  │
│  │                               │  │
│  │Url: string                    │  │
│  │Domain: string (sharding key)  │  │
│  │Expected frequency: date       │  │
│  │Last crawl timestamp: date     │  │
│  │Content signature: string      │  │
│  │(calculate similarity)         │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
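
The crawl history record above can be sketched as a small data type. This is a minimal illustration, not a definitive implementation: the class name `CrawlRecord`, the helper `shard_for`, and the shard count `NUM_SHARDS` are hypothetical, and the sketch assumes the expected frequency is a crawl interval (a `timedelta`) even though the schema lists its type as a date.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

NUM_SHARDS = 16  # hypothetical shard count, chosen only for illustration


@dataclass
class CrawlRecord:
    url: str
    domain: str                    # sharding key
    expected_frequency: timedelta  # assumption: interval between crawls
    last_crawl_timestamp: datetime
    content_signature: str         # compared across crawls to calculate similarity

    def next_crawl_due(self) -> datetime:
        """Earliest time at which the page should be crawled again."""
        return self.last_crawl_timestamp + self.expected_frequency


def shard_for(domain: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a domain to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


record = CrawlRecord(
    url="https://example.com/a",
    domain="example.com",
    expected_frequency=timedelta(days=1),
    last_crawl_timestamp=datetime(2024, 1, 1),
    content_signature="placeholder-signature",
)
```

Sharding on the domain keeps every page of one site on the same shard, so per-domain work such as crawl scheduling can stay local to that shard.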

DB selection

  • A wide-column store is preferred: it can keep multiple snapshots of the same page and supports three-dimensional queries keyed by (row, column family, timestamp).
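
To make the (row, column family, timestamp) lookup concrete, here is a toy in-memory model of a versioned cell store. It is a sketch under stated assumptions, not the API of any real database: the class `WideColumnTable` and its `put`/`get` methods are hypothetical, and the family name `"content"` is made up for the example.

```python
import bisect
from collections import defaultdict


class WideColumnTable:
    """Toy versioned store addressed by (row, column family, timestamp)."""

    def __init__(self):
        # row key -> column family -> (sorted timestamps, values in the same order)
        self._cells = defaultdict(lambda: defaultdict(lambda: ([], [])))

    def put(self, row, family, timestamp, value):
        """Store one snapshot, keeping versions sorted by timestamp."""
        timestamps, values = self._cells[row][family]
        i = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(i, timestamp)
        values.insert(i, value)

    def get(self, row, family, timestamp=None):
        """Return the newest snapshot at or before `timestamp` (latest if None)."""
        if row not in self._cells or family not in self._cells[row]:
            return None
        timestamps, values = self._cells[row][family]
        if timestamp is None:
            return values[-1]
        i = bisect.bisect_right(timestamps, timestamp)
        return values[i - 1] if i > 0 else None


table = WideColumnTable()
table.put("example.com/index.html", "content", 100, "snapshot v1")
table.put("example.com/index.html", "content", 200, "snapshot v2")
```

Here `table.get("example.com/index.html", "content", 150)` returns the first snapshot, while omitting the timestamp returns the most recent one.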
