Quick Value with Neo4j: The Panama Papers Example

Investigating the #PanamaPapers Connections with Neo4j
PartnerDay Frankfurt
Stefan Kolmar, Director Field Engineering

Source Material

Taken from
•  the ICIJ presentation
•  the Reddit AMA
•  online publications (SZ, Guardian, TNW et al.)
•  the ICIJ website
   •  https://panamapapers.icij.org/
   •  The Power Players
   •  Key Numbers & Figures

+190 journalists in more than 65 countries

12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain)

50% of the team = Data & Research Unit

raw files

metadata author; sender...

database

search and discovery

raw text

3 million files × 10 seconds per file = 347 days
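
The estimate above can be checked with a few lines of Python. This is a minimal sketch that assumes strictly sequential processing at a constant 10 seconds per file; the `workers` parameter and the figure of 30 workers are hypothetical illustrations, not numbers from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Days needed to process `files` documents, split evenly across `workers`."""
    return files * seconds_per_file / workers / SECONDS_PER_DAY


# Figures from the slide: 3 million files at 10 seconds each, one worker.
sequential_days = processing_days(3_000_000, 10)

# Hypothetical: the same workload spread over 30 parallel workers.
parallel_days = processing_days(3_000_000, 10, workers=30)

print(f"sequential: {sequential_days:.0f} days, parallel: {parallel_days:.1f} days")
```

The sequential figure is what makes parallel, automated extraction a necessity rather than an optimization.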

Investigators used Nuix's optical character recognition to make millions of scanned documents text-searchable. They used Nuix's named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.

Lucene syntax queries with proximity matching!
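
In Lucene query syntax, a proximity search is written as a quoted phrase followed by a tilde and a distance, for example `"miller governor"~3`. The sketch below illustrates the idea in plain Python. It is a deliberate simplification: it only counts the words between two terms and ignores word order, which is not how Lucene computes its slop value. The function name and sample sentence are hypothetical.

```python
import re


def within_proximity(text: str, first: str, second: str, max_gap: int) -> bool:
    """True if `first` and `second` occur with at most `max_gap` words between them."""
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, token in enumerate(tokens) if token == first.lower()]
    second_positions = [i for i, token in enumerate(tokens) if token == second.lower()]
    return any(
        abs(i - j) - 1 <= max_gap
        for i in first_positions
        for j in second_positions
        if i != j
    )


# Hypothetical sample text; three words separate "Miller" and "governor".
sample = "Payment from Miller approved by the governor"
close_enough = within_proximity(sample, "miller", "governor", max_gap=3)
too_far = within_proximity(sample, "miller", "governor", max_gap=2)
```

Proximity queries matter for investigative search because two names appearing near each other in a document are a far stronger signal than both appearing somewhere in it.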

400 users

Stack

Unstructured data extraction
●  Nuix professional OCR service
●  ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO

Structured data extraction
●  A bunch of Python

Database
●  Apache Solr (open source, Java)
●  Redis (open source, C)
●  Neo4j (open source, Java)

App
●  Blacklight (open source, Rails)
●  Linkurious (closed source, JS)

Context is King

[Diagram: isolated records without connections]

PERSON  name: "John", last: "Miller", role: "Negotiator"
PERSON  name: "Maria", last: "Osara"
PERSON  name: "Jose", last: "Pereia", position: "Governor"
PERSON  name: "Alice", last: "Smith", role: "Advisor"
name: "Some Media Ltd", value: "$70M"

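Context is King

[Diagram: the same records, now connected by relationships]

Relationships: SENT, SUPPORTS, CREATED, MENTIONS, WROTE
Relationship property: since: Jan 10, 2011

PERSON  name: "John", last: "Miller", role: "Negotiator"
PERSON  name: "Maria", last: "Osara"
PERSON  name: "Jose", last: "Pereia", position: "Governor"
PERSON  name: "Alice", last: "Smith", role: "Advisor"
name: "Some Media Ltd", value: "$70M"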

The world is a graph – everything is connected

•  people, places, events
•  companies, markets
•  countries, history, politics
•  sciences, art, teaching
•  technology, networks, machines, applications, users
•  software, code, dependencies, architecture, deployments
•  criminals, fraudsters and their behavior

Property Graph Model

Nodes
•  The entities in the graph
•  Can have name-value properties
•  Can be labeled

Relationships
•  Relate nodes by type and direction
•  Can have name-value properties

[Diagram: NODE, RELATIONSHIP, NODE, each carrying key: "value" properties]
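
The property graph model described above can be sketched as two small Python data classes. This is only an illustration of the concepts, not how Neo4j stores data; the class names, the sample people and the `since` value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph: zero or more labels plus name-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed connection from `start` to `end` with its own properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical sample data.
alice = Node({"Person"}, {"name": "Alice", "role": "Advisor"})
company = Node({"Company"}, {"name": "Some Media Ltd"})
advises = Relationship(alice, "ADVISES", company, {"since": 2011})
```

Note that the relationship carries its own properties and has a direction, which is what distinguishes this model from a plain join table.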

Whiteboard to Graph

Cypher: Find Patterns

MATCH (:Person {name: "Dan"})-[:KNOWS]->(who:Person)
RETURN who

[Diagram: (Dan)-[KNOWS]->(???), with the parts of the query annotated as NODE, LABEL, PROPERTY and ALIAS]

http://neo4j.com/developer/cypher
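
To make the meaning of the pattern concrete, here is a rough in-memory equivalent of the query above in Python. It is a sketch of the semantics only (follow outgoing relationships of one type from one start node), not of how Neo4j executes Cypher. All names other than "Dan" and `KNOWS` are hypothetical.

```python
# Relationships as (start, type, end) triples; direction runs from start to end.
relationships = [
    ("Dan", "KNOWS", "Ann"),
    ("Dan", "KNOWS", "Bob"),
    ("Ann", "KNOWS", "Dan"),       # incoming to Dan, so it must not match
    ("Dan", "WORKS_WITH", "Eve"),  # wrong relationship type, so it must not match
]


def match_outgoing(rels, start, rel_type):
    """Return the end nodes reachable from `start` via one `rel_type` relationship."""
    return [end for (source, kind, end) in rels if source == start and kind == rel_type]


who = match_outgoing(relationships, "Dan", "KNOWS")
print(who)
```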

Getting Data into Neo4j

Cypher-Based "LOAD CSV"
•  Transactional (ACID) writes
•  Initial and incremental loads of up to 10 million nodes and relationships

LOAD CSV WITH HEADERS FROM "url" AS row
MERGE (:Person {name: row.name, age: toInt(row.age)});
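
The point of using MERGE in the statement above is that repeated rows do not create repeated nodes: a node is created only when no node with the same label and property values exists yet. The Python sketch below mimics that get-or-create behavior on an in-memory CSV. The sample rows and the function name are hypothetical, and a dictionary stands in for the database.

```python
import csv
import io

# Hypothetical input; the third row repeats the first.
csv_text = "name,age\nDan,42\nAnn,37\nDan,42\n"


def merge_people(csv_source):
    """Create one Person record per distinct (name, age) pair, like MERGE would."""
    people = {}
    for row in csv.DictReader(csv_source):
        age = int(row["age"])  # CSV values arrive as strings, hence the conversion
        key = (row["name"], age)
        people.setdefault(key, {"name": row["name"], "age": age})
    return list(people.values())


people = merge_people(io.StringIO(csv_text))
print(people)
```

Because the whole pattern is the match key here, a "Dan" with a different age would produce a second record; merging on a single identifying property and setting the rest afterwards avoids that.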

Getting Data into Neo4j

Load JSON with Cypher
•  Load JSON via procedure
•  Deconstruct the document
•  Into a non-duplicated graph model

CALL apoc.load.json("url") YIELD value AS doc
UNWIND doc.items AS item
MERGE (:Contract {title: item.title, amount: toFloat(item.amount)});

Getting Data into Neo4j

CSV Bulk Loader neo4j-import
•  For initial database population
•  For loads with 10B+ records
•  Up to 1M records per second

bin/neo4j-import --into people.db --nodes:Person people.csv --nodes:Company companies.csv --relationships:STAKEHOLDER stakeholders.csv
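
The bulk loader reads CSV files whose header row marks the ID column of each node file (`:ID`) and the endpoints of each relationship (`:START_ID`, `:END_ID`). The Python sketch below builds the contents of the three files named in the command above. The column names, IDs and rows are hypothetical, and the header layout reflects my understanding of the import tool's conventions, so check it against the documentation for your Neo4j version.

```python
import csv
import io


def to_csv(header, rows):
    """Render a header and data rows as CSV text with Unix line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Node files: the label comes from the command line (--nodes:Person ...).
people_csv = to_csv(["id:ID", "name"], [["p1", "Dan"], ["p2", "Ann"]])
companies_csv = to_csv(["id:ID", "name"], [["c1", "Some Media Ltd"]])

# Relationship file: the type comes from the command line (STAKEHOLDER).
stakeholders_csv = to_csv([":START_ID", ":END_ID"], [["p1", "c1"], ["p2", "c1"]])

print(people_csv, companies_csv, stakeholders_csv, sep="\n")
```

Writing each string to `people.csv`, `companies.csv` and `stakeholders.csv` would produce input of the shape the command expects.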

The Steps Involved in the Document Analysis

1.  Acquire documents
2.  Classify documents
    •  Scan / OCR
    •  Extract document metadata
3.  Whiteboard domain and questions, determine
    •  entities and their relationships
    •  potential entity and relationship properties
    •  sources for those entities and their properties

The Steps Involved in the Document Analysis

4.  Develop analyzers, rules, parsers and named entity recognition
5.  Parse and store metadata, document and entity relationships
    •  Parse by author, named entities, dates, sources and classifications
6.  Infer entity relationships
7.  Compute similarities, transitive cover and triangles
8.  Analyze data using graph queries and visualizations
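
Step 7 can be illustrated on a tiny undirected graph. The sketch below counts triangles by brute force and scores the similarity of two nodes with the Jaccard index of their neighbor sets. Both are common choices, but the talk does not say which measures were actually used, so treat them as assumptions. The edge list is hypothetical, and the brute-force approach only suits small graphs.

```python
from itertools import combinations

# Hypothetical undirected edges between entities.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]


def neighbours(edge_list):
    """Build an adjacency map: node -> set of directly connected nodes."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def triangles(adjacency):
    """All node triples in which every pair is directly connected."""
    return [
        trio
        for trio in combinations(sorted(adjacency), 3)
        if all(v in adjacency[u] for u, v in combinations(trio, 2))
    ]


def jaccard(adjacency, a, b):
    """Shared neighbors divided by all neighbors of either node."""
    union = adjacency[a] | adjacency[b]
    return len(adjacency[a] & adjacency[b]) / len(union) if union else 0.0


adjacency = neighbours(edges)
found = triangles(adjacency)
similarity = jaccard(adjacency, "A", "B")
```

Triangles point to tightly knit groups, while a high neighbor overlap can flag two records that may describe the same real-world entity.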

We need a Data Model

Metadata Entities
•  Document, Email, Contract, DB-Record
•  Meta: Author, Date, Source, Keywords
•  Conversation: Sender, Receiver, Topic
•  Money Flows

Actual Entities
•  Person
•  Representative (Officer)
•  Address
•  Client
•  Company
•  Account

Either based on our use cases & questions, or on the entities present in our metadata and data.

The ICIJ Data Model

•  Simplistic data model with 4 entities and 5 relationships
•  We only know the published model
•  Missing
   •  Documents, metadata
   •  Family relationships
   •  Connections to public record databases
•  Contains duplicates
•  Relationship information stored on entities
•  Could use richer labeling

Example Dataset - Azerbaijan's President Ilham Aliyev

•  was already previously investigated
•  whole family involved
•  different shell companies & involvements

http://neo4j.com/graphgist/ec65c2fa-9d83-4894-bc1e-98c475c7b57a

Based on: http://neo4j.com/blog/analyzing-panama-papers-neo4j/

Visual Graph Search

For Non-Developers