Effiziente Verarbeitung von grossen Datenmengen


description

A talk by Tristan Schneider from the seminar "Personalisierung mit großen Daten" (Personalization with Big Data).

Transcript of Effiziente Verarbeitung von grossen Datenmengen

Page 1: Effiziente Verarbeitung von grossen Datenmengen

Introduction Approaches Conclusion

Effiziente Verarbeitung von grossen Datenmengen, Teil II

Tristan Schneider

January 9, 2014

Page 2: Effiziente Verarbeitung von grossen Datenmengen

Contents

Introduction: Social Graph; Problems and Motivation

Approaches: TAO; Horton; Pregel; Trinity; Unicorn

Conclusion: Comparison; Future Work

Page 3: Effiziente Verarbeitung von grossen Datenmengen

Social Graph

- Consists of nodes and edges
- Describes entities and their relations
- Used by Facebook, Google, Amazon, etc.
- About 100+ million nodes and 10+ billion edges

Page 4: Effiziente Verarbeitung von grossen Datenmengen

Problems and Motivation

- The amount of data exceeds the capacity of a single machine
- It is necessary to distribute data and computation
- Data access is managed by the framework
- Different workloads have different requirements (latency, throughput)

Page 5: Effiziente Verarbeitung von grossen Datenmengen

TAO

- Developed by Facebook
- Read-optimized
- Fixed set of queries

Strength: low-latency access

Page 6: Effiziente Verarbeitung von grossen Datenmengen

TAO: Data Model

- Data items are identified by 64-bit integer ids
- Objects: (id) → (otype, (key → value)*)
- Associations: (id1, atype, id2) → (time, (key → value)*)

Page 7: Effiziente Verarbeitung von grossen Datenmengen

TAO: API

- Fixed set of queries
- Write operations: assoc_add, assoc_delete, assoc_change_type
- Read operations: assoc_get, assoc_count, assoc_range, assoc_time_range
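
The object/association model and a few of the listed operations can be sketched in memory. This is only an illustration of the data model from the previous slide: the class and helper names (TaoStore, add_object) and the dictionary layout are my own assumptions, not TAO's implementation, which is a distributed cache over MySQL.

```python
import time

class TaoStore:
    """Illustrative in-memory sketch of TAO's objects and associations."""

    def __init__(self):
        self.objects = {}   # id -> (otype, {key: value})
        self.assocs = {}    # (id1, atype) -> {id2: (time, {key: value})}

    def add_object(self, oid, otype, data):
        self.objects[oid] = (otype, dict(data))

    def assoc_add(self, id1, atype, id2, data, ts=None):
        # Associations are keyed by their source id and type.
        ts = ts if ts is not None else time.time()
        self.assocs.setdefault((id1, atype), {})[id2] = (ts, dict(data))

    def assoc_delete(self, id1, atype, id2):
        self.assocs.get((id1, atype), {}).pop(id2, None)

    def assoc_count(self, id1, atype):
        return len(self.assocs.get((id1, atype), {}))

    def assoc_range(self, id1, atype, pos, limit):
        # Return associations ordered by time, newest first.
        items = sorted(self.assocs.get((id1, atype), {}).items(),
                       key=lambda kv: kv[1][0], reverse=True)
        return items[pos:pos + limit]

store = TaoStore()
store.add_object(1, "user", {"name": "alice"})
store.add_object(2, "page", {"title": "Computer Science"})
store.assoc_add(1, "likes", 2, {}, ts=100.0)
print(store.assoc_count(1, "likes"))  # 1
```

Note how every operation is parameterized by (id, atype): this is what makes a small fixed API sufficient for most social-graph reads.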

Page 8: Effiziente Verarbeitung von grossen Datenmengen

TAO: Architecture

- Data is divided into shards (via hashing)
- Each server handles one or more shards
- An object and its associations reside in the same shard
- An object never changes its shard
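
The placement rule above can be stated in a few lines. The shard count and the modulo standing in for the hash are assumptions for illustration; the point is that the shard is a pure function of the object id, so it never changes, and associations follow their source object.

```python
N_SHARDS = 16  # illustrative shard count

def shard_of(object_id: int) -> int:
    # The shard is derived deterministically from the id, so an
    # object stays in the same shard for its whole lifetime.
    return object_id % N_SHARDS

def assoc_shard(id1: int, atype: str, id2: int) -> int:
    # Associations live in the shard of their source object id1,
    # co-locating an object with its outgoing associations.
    return shard_of(id1)

print(assoc_shard(42, "likes", 7) == shard_of(42))  # True
```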

Page 9: Effiziente Verarbeitung von grossen Datenmengen

TAO: Architecture

- Servers are divided into leaders and followers
- Clients always communicate with followers
- Cache misses and writes are redirected to the leader
- Slave servers support master servers if necessary

Page 10: Effiziente Verarbeitung von grossen Datenmengen

TAO: Architecture: Scheme

http://www.zdnet.com/facebook-explains-how-tao-serves-social-workloads-data-requests-7000017282/

Page 11: Effiziente Verarbeitung von grossen Datenmengen

TAO: Fault Tolerance and Performance

- Efficiency and availability are valued over consistency
- Down servers are marked globally
- Followers are interchangeable
- Slave databases are promoted to master if the master crashes

Page 12: Effiziente Verarbeitung von grossen Datenmengen

TAO: Fault Tolerance and Performance

Figure: Write Access Latencies

https://www.facebook.com/download/273893712748848/atc13-bronson.pdf

Page 13: Effiziente Verarbeitung von grossen Datenmengen

Horton

- Query language and execution engine
- Written in C#

Strength: interactive queries with low latency

Page 14: Effiziente Verarbeitung von grossen Datenmengen

Horton: Data Model

- Data model similar to TAO
- Graph divided into partitions
- Additional data can be attached (e.g. key-value pairs)
- Directed edges are stored at both source and target

Page 15: Effiziente Verarbeitung von grossen Datenmengen

Horton: API

- Horton query language
- Queries are initiated via a client library

Page 16: Effiziente Verarbeitung von grossen Datenmengen

Horton: Architecture

Graph Client Library translates a query into a regular expression

Graph Coordinator translates the regular expression into a finite state machine and finds the most efficient execution plan

Graph Partitions execute the finite state machine and traverse the graph

Graph Manager provides an interface to administrate the graph
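
The core of this pipeline, running a query as a finite state machine during graph traversal, can be illustrated on a single machine. The toy graph, edge labels, and function names below are assumptions; real Horton compiles its query language to the automaton and distributes execution across the graph partitions.

```python
from collections import deque

# Toy graph: adjacency list of (edge_label, target) pairs.
graph = {
    "alice": [("manages", "bob")],
    "bob":   [("manages", "carol"), ("knows", "alice")],
    "carol": [],
}

def run_query(start, labels):
    """Run a path query as a finite state machine: state i means
    'the first i edge labels have been matched'. A breadth-first
    traversal advances the state along matching edges; nodes
    reached in the accepting state are results."""
    results, queue = [], deque([(start, 0)])  # (node, fsm state)
    while queue:
        node, state = queue.popleft()
        if state == len(labels):              # accepting state
            results.append(node)
            continue
        for label, nxt in graph.get(node, []):
            if label == labels[state]:
                queue.append((nxt, state + 1))
    return results

print(run_query("alice", ["manages", "manages"]))  # ['carol']
```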

Page 17: Effiziente Verarbeitung von grossen Datenmengen

Pregel

- C++ based
- Computation consists of parallel iterations (supersteps)
- Communication via message passing

Strength: high throughput (for analysis)

Page 18: Effiziente Verarbeitung von grossen Datenmengen

Pregel: Data Model

- Graph divided into partitions
- Partition assignment based on node id: hash(id) mod n

Page 19: Effiziente Verarbeitung von grossen Datenmengen

Pregel: API

- Implement a subclass of the Vertex class (the task)
- Define methods like Compute(...) and use SendMessageTo(...)
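
The vertex-centric API can be sketched in Python (Pregel's real API is C++). The Vertex base class, the single-machine superstep driver, and all names are illustrative; the maximum-value propagation task is the classic example from the Pregel paper.

```python
class Vertex:
    """Sketch of Pregel's vertex-centric API: users subclass Vertex
    and override compute(); the framework calls compute() on every
    active vertex once per superstep."""
    def __init__(self, vid, value, out_edges):
        self.id, self.value, self.out_edges = vid, value, out_edges
        self.active = True

    def compute(self, messages, send):
        raise NotImplementedError

    def vote_to_halt(self):
        self.active = False

class MaxValueVertex(Vertex):
    # Every vertex ends up knowing the maximum value in the graph.
    def compute(self, messages, send):
        new_max = max([self.value] + messages)
        if new_max > self.value or self.superstep == 0:
            self.value = new_max
            for target in self.out_edges:
                send(target, self.value)
        self.vote_to_halt()

def run(vertices):
    """Single-machine superstep loop standing in for the master."""
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while True:
        outbox = {vid: [] for vid in vertices}
        any_ran = False
        for v in vertices.values():
            msgs = inbox[v.id]
            if msgs or superstep == 0:   # messages reactivate a vertex
                v.active = True
            if v.active:
                v.superstep = superstep
                v.compute(msgs, lambda t, m: outbox[t].append(m))
                any_ran = True
        inbox, superstep = outbox, superstep + 1
        if not any_ran or not any(inbox.values()):
            break
    return {vid: v.value for vid, v in vertices.items()}

# Ring 0 -> 1 -> 2 -> 0 with values 3, 6, 1.
vs = {i: MaxValueVertex(i, val, [(i + 1) % 3]) for i, val in enumerate([3, 6, 1])}
print(run(vs))  # {0: 6, 1: 6, 2: 6}
```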

Page 20: Effiziente Verarbeitung von grossen Datenmengen

Pregel: Architecture

- Runs on a cluster management system
- Uses a distributed storage system (e.g. Bigtable)

Page 21: Effiziente Verarbeitung von grossen Datenmengen

Pregel: Basic Work Flow

1. Copies of the program are deployed to worker machines; one copy is promoted to master
2. The master assigns one or more partitions to each worker
3. The master invokes supersteps until the computation halts
4. The graph is saved after the computation

Page 22: Effiziente Verarbeitung von grossen Datenmengen

Pregel: Fault Tolerance and Performance

- Workers save their progress at checkpoint supersteps
- Worker failures are detected via pings
- Partitions of failed workers are reassigned to available workers
- State is reloaded from the most recent available checkpoint superstep
- The whole process terminates if the master fails

Page 23: Effiziente Verarbeitung von grossen Datenmengen

Pregel: Fault Tolerance and Performance

Figure: Runtime for a varying number of workers on a 1-billion-vertex binary tree

http://kowshik.github.io/JPregel/pregel paper.pdf

Page 24: Effiziente Verarbeitung von grossen Datenmengen

Trinity

- Developed by Microsoft
- Flexible in both data and computation
- Supports online query processing and offline computation
- Built on top of a well-connected cluster (memory cloud)
- Based on TFS (similar to HDFS)

Strength: low latency and high throughput (though not at the same time)

Page 25: Effiziente Verarbeitung von grossen Datenmengen

Trinity: Data Model

- Key-value store
- One table for nodes
- One table for each type of relation
- Relations are represented by id pairs in the corresponding table
- Customisation is possible with the Trinity Specification Language (TSL)
- Data is backed up in a persistent file system
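
A single-machine sketch of this table layout, with hand-built dictionaries standing in for what TSL would declare. All names, the relation types, and the linear scan in neighbors are illustrative assumptions, not Trinity's implementation.

```python
# One table for nodes, one table per relation type (holding id pairs).
nodes = {}      # node table: id -> node data
relations = {}  # relation tables: rtype -> set of (id1, id2) pairs

def add_node(nid, data):
    nodes[nid] = data

def add_relation(rtype, id1, id2):
    # A relation instance is just an id pair in its type's table.
    relations.setdefault(rtype, set()).add((id1, id2))

def neighbors(rtype, nid):
    # Linear scan over one relation table; a real system would index this.
    return sorted(b for a, b in relations.get(rtype, set()) if a == nid)

add_node(1, {"name": "alice"})
add_node(2, {"name": "bob"})
add_relation("friend_of", 1, 2)
print(neighbors("friend_of", 1))  # [2]
```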

Page 26: Effiziente Verarbeitung von grossen Datenmengen

Trinity: API

- Trinity Desktop Environment (TDE)
- Supports query requests (similar to Horton/SQL)
- Supports offline computation (similar to Pregel)

Page 27: Effiziente Verarbeitung von grossen Datenmengen

Trinity: Architecture

Slaves store a part of the data and process tasks and messages.

Proxies form an optional middle tier between slaves and clients; they handle messages but do not store data.

Clients are responsible for user interaction with the cluster.

Page 28: Effiziente Verarbeitung von grossen Datenmengen

Trinity: Architecture

Figure: Trinity Cluster Structure

https://research.microsoft.com/pubs/161291/trinity.pdf

Page 29: Effiziente Verarbeitung von grossen Datenmengen

Trinity: Fault Tolerance and Performance

- No ACID support, but operations are atomic
- Dead machines are replaced by live ones, which reload their memory from TFS
- A requesting machine waits until the dead machine has been replaced
- State is recovered from the most recent checkpoint superstep (similar to Pregel)

Page 30: Effiziente Verarbeitung von grossen Datenmengen

Trinity: Fault Tolerance and Performance

Figure: Response time of subgraph match queries

https://research.microsoft.com/pubs/161291/trinity.pdf

Page 31: Effiziente Verarbeitung von grossen Datenmengen

Unicorn

- In-memory, social-graph-aware indexing system
- Backend of Facebook's search offering
- Based on Hadoop

Strength: typeahead; good performance on complex queries

Page 32: Effiziente Verarbeitung von grossen Datenmengen

Unicorn: Data Model

- Sharded data (similar to Facebook's TAO)
- Indices are built and converted using a custom Hadoop pipeline

Page 33: Effiziente Verarbeitung von grossen Datenmengen

Unicorn: API

- Queries are written in the Unicorn query language
- e.g. (term likers:104076956295773) ≈ the roughly 6 million likers of "Computer Science"

apply allows querying a (truncated) set of ids and then using those to construct a new query

extract attaches matches as metadata within the forward index of the query set
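
The operator semantics can be sketched over toy posting lists. The index contents, ids, and function names below are made up; only the behaviour of term lookup and apply follows the slides (evaluate the inner query, truncate the id set, then look up the constructed terms).

```python
# Toy index: a term maps to a sorted posting list of entity ids.
index = {
    "likers:42": [1, 2, 3],
    "friend:1":  [2, 9],
    "friend:2":  [3, 9],
    "friend:3":  [7],
}

def term(t):
    """Look up one term's posting list."""
    return index.get(t, [])

def apply_op(prefix, inner_ids, limit=10):
    """Sketch of (apply friend: likers:42): truncate the inner result
    set, build one term per id, and union their posting lists."""
    result = set()
    for eid in inner_ids[:limit]:
        result.update(term(f"{prefix}{eid}"))
    return sorted(result)

# "Friends of likers of 42" over the toy index:
print(apply_op("friend:", term("likers:42")))  # [2, 3, 7, 9]
```

The truncation step is what keeps apply cheap: the inner query may match millions of ids, but only a bounded prefix is expanded into new terms.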

Page 34: Effiziente Verarbeitung von grossen Datenmengen

Unicorn: Architecture

top-aggregator dispatches the query to one rack-aggregator in each rack, then combines and returns the results

rack-aggregator forwards the query to all index servers in its rack (high bandwidth) and combines their results

index server about 40-80 machines per rack; stores adjacency lists and performs the operations

Page 35: Effiziente Verarbeitung von grossen Datenmengen

Unicorn: Fault Tolerance and Performance

- Sharding and replication
- Failed machines are replaced automatically
- Serving incomplete results is strongly preferred over serving empty results

Page 36: Effiziente Verarbeitung von grossen Datenmengen

Unicorn: Fault Tolerance and Performance

- (apply friend: likers:104076956295773) ≈ friends of likers of "Computer Science"

https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf

Page 37: Effiziente Verarbeitung von grossen Datenmengen

Comparison

Framework | Query Language | Low Latency | High Throughput
----------|----------------|-------------|----------------
TAO       | no             | yes         | no
Horton    | yes            | yes         | no
Pregel    | no             | no          | yes
Trinity   | yes            | yes         | yes
Unicorn   | yes            | yes         | no

Page 38: Effiziente Verarbeitung von grossen Datenmengen

Future Work

- Query languages vs. fixed sets of queries
- An all-in-one framework remains difficult (Trinity is the best attempt so far)

Page 39: Effiziente Verarbeitung von grossen Datenmengen

Thank you for your attention.

- Questions?
- Sources

1. https://research.microsoft.com/pubs/161291/trinity.pdf
2. http://research.microsoft.com/pubs/162643/icde12 demo 679.pdf
3. http://kowshik.github.io/JPregel/pregel paper.pdf
4. https://www.facebook.com/download/273893712748848/atc13-bronson.pdf
5. https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf
