Efficient Processing of Large Amounts of Data (Effiziente Verarbeitung von grossen Datenmengen)
Introduction Approaches Conclusion
Efficient Processing of Large Amounts of Data, Part II
Tristan Schneider
January 9, 2014
Contents
- Introduction: Social Graph; Problems and Motivation
- Approaches: TAO, Horton, Pregel, Trinity, Unicorn
- Conclusion: Comparison; Future Work
Social Graph
- consists of nodes and edges
- describes entities and their relations
- used by Facebook, Google, Amazon, etc.
- about 100+ million nodes and 10+ billion edges
Problems and Motivation
- the amount of data exceeds the capacity of a single machine
- data and computation must be distributed
- data access is managed by the framework
- different requirements (latency, throughput)
TAO
- developed by Facebook
- read-optimized
- fixed set of queries
Strength: low-latency access
TAO: Data Model
- data identified by a 64-bit integer
- Objects: (id) → (otype, (key → value)*)
- Associations: (id1, atype, id2) → (time, (key → value)*)
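The two signatures above can be mirrored in a short sketch. This is illustrative Python, not TAO's actual implementation; the field names follow the slide:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TaoObject:
    id: int                              # 64-bit integer identifier
    otype: str                           # object type, e.g. "user" or "post"
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*

@dataclass
class TaoAssociation:
    id1: int                             # source object id
    atype: str                           # association type, e.g. "likes"
    id2: int                             # target object id
    time: int                            # creation timestamp
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*

def assoc_key(a: TaoAssociation) -> Tuple[int, str, int]:
    # An association is uniquely identified by the triple (id1, atype, id2).
    return (a.id1, a.atype, a.id2)
```

The timestamp in the association value is what lets TAO keep edge lists ordered by recency.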
TAO: API
- fixed set of queries
- write operations: assoc_add, assoc_delete, assoc_change_type
- read operations: assoc_get, assoc_count, assoc_range, assoc_time_range
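A toy single-machine stand-in for this API might look as follows. The method names follow the TAO paper; the storage layout (a dict of time-sorted lists) is purely illustrative:

```python
from collections import defaultdict

class AssocStore:
    """Toy single-machine stand-in for TAO's association API."""

    def __init__(self):
        # (id1, atype) -> list of (time, id2, data), kept newest-first
        self._lists = defaultdict(list)

    def assoc_add(self, id1, atype, id2, time, data=None):
        self._lists[(id1, atype)].append((time, id2, data or {}))
        self._lists[(id1, atype)].sort(key=lambda e: -e[0])

    def assoc_delete(self, id1, atype, id2):
        key = (id1, atype)
        self._lists[key] = [e for e in self._lists[key] if e[1] != id2]

    def assoc_get(self, id1, atype, id2set):
        return [e for e in self._lists[(id1, atype)] if e[1] in id2set]

    def assoc_count(self, id1, atype):
        return len(self._lists[(id1, atype)])

    def assoc_range(self, id1, atype, pos, limit):
        # Newest-first ordering makes "the latest k associations" a cheap slice.
        return self._lists[(id1, atype)][pos:pos + limit]

    def assoc_time_range(self, id1, atype, high, low, limit):
        hits = [e for e in self._lists[(id1, atype)] if low <= e[0] <= high]
        return hits[:limit]
```

The fixed vocabulary is the point: every query shape is known in advance, which is what allows the aggressive caching behind TAO's low latency.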
TAO: Architecture
- data divided into shards (via hashing)
- each server handles one or more shards
- objects and their associations live in the same shard
- an object never changes its shard
TAO: Architecture
- servers divided into leaders and followers
- clients always communicate with followers
- cache misses and writes are forwarded to the leader
- slave servers support master servers if necessary
TAO: Architecture: Scheme
Source: http://www.zdnet.com/facebook-explains-how-tao-serves-social-workloads-data-requests-7000017282/
TAO: Fault Tolerance and Performance
- efficiency and availability are prioritized over consistency
- a downed server is marked as such globally
- followers are interchangeable
- slave databases are promoted to master if the master crashes
TAO: Fault Tolerance and Performance
Figure: Write Access Latencies
Source: https://www.facebook.com/download/273893712748848/atc13-bronson.pdf
Horton
- execution engine for a graph query language
- written in C#
Strength: interactive queries with low latency
Horton: Data Model
- data model similar to TAO's
- graph divided into partitions
- additional data can be attached (e.g. key-value pairs)
- directed edges are stored at both source and target
Horton: API
- queries written in the Horton query language
- initiated via a client library
Horton: Architecture
Graph Client Library: translates a query into a regular expression
Graph Coordinator: translates the regular expression into a finite state machine and picks the most efficient execution plan
Graph Partitions: execute the finite state machine and traverse the graph
Graph Manager: provides an interface for administering the graph
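A heavily simplified rendition of that pipeline, under the assumption that a query is just a sequence of node types: the sequence plays the role of the finite state machine, and matching advances one state per graph hop. The graph contents and names below are invented for illustration:

```python
# Toy graph: node id -> (node type, list of neighbour ids)
graph = {
    "p1": ("Photo", ["a", "b"]),
    "a": ("Person", ["p1"]),
    "b": ("Person", ["p1"]),
}

def run_query(type_sequence):
    # States are positions in the type sequence; a match is a path
    # whose node types spell out the sequence in order.
    paths = [[n] for n, (t, _) in graph.items() if t == type_sequence[0]]
    for expected in type_sequence[1:]:
        paths = [p + [n]
                 for p in paths
                 for n in graph[p[-1]][1]          # follow outgoing edges
                 if graph[n][0] == expected]        # keep only matching states
    return paths
```

In Horton the same idea runs distributed: each partition advances the state machine over its local vertices and ships partial paths to the partition owning the next hop.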
Pregel
- C++-based
- computation consists of parallel iterations (supersteps)
- communication via message passing
Strength: high throughput (for analysis)
Pregel: Data Model
- graph divided into partitions
- partition assignment based on node id (hash(id) mod n)
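That assignment rule is small enough to state directly (a sketch; Pregel also lets users plug in custom partitioners):

```python
def partition_of(vertex_id: int, num_partitions: int) -> int:
    # Pregel's default assignment: hash(id) mod n, where n is the
    # number of partitions.
    return hash(vertex_id) % num_partitions
```

Because the mapping is a pure function of the id, any worker can compute the owner of any vertex locally and route messages without coordination.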
Pregel: API
- users implement a Vertex subclass (the per-node task)
- overriding methods such as Compute(...) and calling SendMessageTo(...)
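A minimal sketch of the model on the classic max-value example: Pregel's real API is a C++ Vertex template, so this Python stand-in only mirrors the Compute/SendMessageTo/VoteToHalt shape, with a single-process superstep loop in place of the master/worker machinery:

```python
class MaxValueVertex:
    """Propagates the maximum value in the graph to every vertex."""

    def __init__(self, vid, value, out_edges):
        self.id = vid
        self.value = value
        self.out_edges = out_edges     # ids of outgoing neighbours
        self.active = True
        self.outbox = []               # (target id, message) pairs

    def send_message_to(self, target, msg):
        self.outbox.append((target, msg))

    def vote_to_halt(self):
        self.active = False

    def compute(self, superstep, messages):
        changed = False
        for m in messages:
            if m > self.value:
                self.value, changed = m, True
        if superstep == 0 or changed:
            for n in self.out_edges:   # propagate only when the value changed
                self.send_message_to(n, self.value)
        else:
            self.vote_to_halt()

def run(vertices):
    # Single-process stand-in for the master's superstep loop:
    # compute on every active vertex, then deliver all messages
    # for the next superstep. Halts when no vertex is active and
    # no messages are in flight.
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while True:
        active = [v for v in vertices.values() if v.active or inbox[v.id]]
        if not active:
            return superstep
        for v in active:
            v.active = True            # an incoming message reactivates a vertex
            v.compute(superstep, inbox[v.id])
        inbox = {vid: [] for vid in vertices}
        for v in vertices.values():
            for target, msg in v.outbox:
                inbox[target].append(msg)
            v.outbox = []
        superstep += 1
```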
Pregel: Architecture
- runs on a cluster management system
- persistent data lives on distributed storage (e.g. GFS or Bigtable)
Pregel: Basic Work Flow
1. the task is copied to the worker machines; one machine is promoted to master
2. the master assigns one or more partitions to each worker
3. the master invokes supersteps
4. the graph is saved after the computation
Pregel: Fault Tolerance and Performance
- workers save their progress at checkpoint supersteps
- worker failures are detected via pings
- partitions of failed workers are reassigned to available workers
- state is reloaded from the most recent available checkpoint superstep
- the process terminates if the master fails
Pregel: Fault Tolerance and Performance
Figure: varying number of workers on a 1-billion-vertex binary tree
Source: http://kowshik.github.io/JPregel/pregel_paper.pdf
Trinity
- developed by Microsoft
- flexible in both data and computation
- supports online query processing and offline computation
- runs on top of a well-connected cluster (the memory cloud)
- based on TFS (similar to HDFS)
Strength: low latency and high throughput (though not at the same time)
Trinity: Data Model
- key-value store
- one table for nodes
- one table per relation type
- relations represented as id pairs in the corresponding table
- customization possible with the Trinity Structure Language (TSL)
- data backed up to a persistent file system
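An in-memory mock of that layout (illustrative names only; real Trinity keeps these tables in the distributed memory cloud):

```python
# One key-value table for nodes, plus one table per relation type
# holding (id1, id2) pairs, as described on the slide.
nodes = {}          # node id -> attribute dict
relations = {}      # relation type -> set of (id1, id2) pairs

def add_node(nid, **attrs):
    nodes[nid] = attrs

def add_relation(rtype, id1, id2):
    relations.setdefault(rtype, set()).add((id1, id2))

def neighbours(nid, rtype):
    # Scan the relation table for pairs whose source is nid.
    return sorted(b for (a, b) in relations.get(rtype, ()) if a == nid)
```

Splitting relations into per-type tables keeps each table homogeneous, which is what TSL-style customization of the value layout relies on.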
Trinity: API
- Trinity Desktop Environment (TDE)
- supports query requests (similar to Horton/SQL)
- supports offline computation (similar to Pregel)
Trinity: Architecture
Slaves: store a part of the data; process tasks and messages.
Proxies: optional middle tier between slaves and clients; handle messages but store no data.
Clients: responsible for user interaction with the cluster.
Trinity: Architecture
Figure: Trinity Cluster Structure
Source: https://research.microsoft.com/pubs/161291/trinity.pdf
Trinity: Fault Tolerance and Performance
- no ACID support, but operations are atomic
- dead machines are replaced by live ones, which reload their memory from TFS
- a requesting machine waits until the dead machine has been replaced
- state is recovered from the most recent checkpoint superstep (similar to Pregel)
Trinity: Fault Tolerance and Performance
Figure: Response time of subgraph match queries
Source: https://research.microsoft.com/pubs/161291/trinity.pdf
Unicorn
- in-memory, social-graph-aware indexing system
- backend of Facebook's search offering
- based on Hadoop
Strength: typeahead; good performance on complex queries
Unicorn: Data Model
- sharded data (similar to Facebook's TAO)
- indices built and converted using a custom Hadoop pipeline
Unicorn: API
- queries written in the Unicorn query language
- e.g. (term likers:104076956295773) ≈ 6M likers of "Computer Science"
- apply: queries a (truncated) set of ids, then uses those ids to construct a new query
- extract: attaches matches as metadata within the forward index of the query set
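The apply operator can be mimicked with a toy inverted index. The likers id below is the one from the slide; the friend: posting lists are invented for illustration:

```python
# Toy inverted index: term -> sorted list of matching ids.
index = {
    "likers:104076956295773": [1, 2, 3],   # hypothetical "Computer Science" likers
    "friend:1": [4, 5],
    "friend:2": [5, 6],
    "friend:3": [7],
}

def term(t):
    return index.get(t, [])

def apply_op(prefix, inner_ids, limit=100):
    # (apply friend: likers:...) -- run the inner query, truncate its id
    # results, and use each id to form a term of a new, outer query.
    hits = set()
    for i in inner_ids[:limit]:
        hits.update(term(prefix + str(i)))
    return sorted(hits)
```

This composition is what turns a one-hop index into a multi-hop graph query ("friends of likers of X") without materializing the intermediate set on the client.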
Unicorn: Architecture
top-aggregator: dispatches the query to one rack-aggregator per rack, then combines and returns the results
rack-aggregator: forwards the query to all index servers in its rack (high bandwidth) and combines their results
index server: about 40-80 machines per rack; stores adjacency lists and performs the query operations
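The fan-out and merge can be sketched as follows, assuming each index server returns a hit list of (score, doc id) pairs sorted by descending score (the data layout is an assumption; only the two-level aggregation shape comes from the slide):

```python
import heapq

def merge_top_k(result_lists, k):
    # Merge several descending-sorted hit lists, keeping the best k.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(merged)[:k]

def top_aggregate(racks, k):
    # racks: list of racks; each rack is a list of per-server hit lists.
    # Rack-aggregators merge their servers' lists; the top-aggregator
    # merges the per-rack results.
    rack_results = [merge_top_k(server_lists, k) for server_lists in racks]
    return merge_top_k(rack_results, k)
```

Doing the first merge inside each rack exploits the high intra-rack bandwidth and keeps the cross-rack traffic down to one short list per rack.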
Unicorn: Fault Tolerance and Performance
- sharding and replication
- failed machines are replaced automatically
- serving incomplete results is strongly preferred over serving empty results
Unicorn: Fault Tolerance and Performance
- (apply friend: likers:104076956295773) ≈ friends of likers of "Computer Science"
Source: https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf
Comparison
Framework   Query Language   Low Latency   High Throughput
TAO         no               yes           no
Horton      yes              yes           no
Pregel      no               no            yes
Trinity     yes              yes           yes
Unicorn     yes              yes           no
Future Work
- query language vs. a fixed set of queries
- an all-in-one framework is difficult to build (Trinity is the best attempt so far)
Thank you for your attention.
- Questions?
- Sources:
1. https://research.microsoft.com/pubs/161291/trinity.pdf
2. http://research.microsoft.com/pubs/162643/icde12_demo_679.pdf
3. http://kowshik.github.io/JPregel/pregel_paper.pdf
4. https://www.facebook.com/download/273893712748848/atc13-bronson.pdf
5. https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf