Fast Value with Neo4j: The Panama Papers Example
Investigating the #PanamaPapers Connections with Neo4j
Partner Day Frankfurt, Stefan Kolmar, Director Field Engineering
Source Material
taken from
• the ICIJ presentation
• the Reddit AMA
• online publications (SZ, Guardian, TNW et al.)
• the ICIJ website
  • https://panamapapers.icij.org/
  • The Power Players
  • Key Numbers & Figures
+190 journalists in more than 65 countries
12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain)
50% of the team = Data & Research Unit
3 million files x 10 seconds per file = 347 days
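The estimate above can be checked with a few lines of Python. This is a minimal sketch that assumes the files are processed strictly one after another; the function name `processing_days` is illustrative and does not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(files: int, seconds_per_file: float) -> float:
    """Days needed to process all files sequentially."""
    return files * seconds_per_file / SECONDS_PER_DAY


# Figures from the slide: 3 million files, 10 seconds per file.
days = processing_days(3_000_000, 10)
print(f"{days:.0f} days")  # prints "347 days"
```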
Investigators used Nuix's optical character recognition to make millions of scanned documents text-searchable. They used Nuix's named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.
Stack
Unstructured data extraction
● Nuix professional OCR service
● ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO.
Structured data extraction
● A bunch of Python
Database
● Apache Solr (open source, Java)
● Redis (open source, C)
● Neo4j (open source, Java)
App
● Blacklight (open source, Rails)
● Linkurious (closed source, JS)
Context is King
[Figure: isolated records without any connections]
• PERSON name: "John", last: "Miller", role: "Negotiator"
• PERSON name: "Maria", last: "Osara"
• PERSON name: "Jose", last: "Pereia", position: "Governor"
• PERSON name: "Alice", last: "Smith", role: "Advisor"
• name: "Some Media Ltd", value: "$70M"
Context is King
[Figure: the same records, now connected by the relationships SENT, SUPPORTS, CREATED, MENTIONS and WROTE; the diagram also shows the property since: Jan 10, 2011]
The world is a graph – everything is connected
• people, places, events
• companies, markets
• countries, history, politics
• sciences, art, teaching
• technology, networks, machines, applications, users
• software, code, dependencies, architecture, deployments
• criminals, fraudsters and their behavior
Property Graph Model
Nodes
• The entities in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have name-value properties
[Figure: two NODEs joined by a RELATIONSHIP; each carries key: "value" properties]
Cypher: Find Patterns
MATCH (:Person {name: "Dan"})-[:KNOWS]->(who:Person) RETURN who
[Figure: the pattern (Dan)-[:KNOWS]->(???), with the parts of the query annotated as NODE, LABEL, PROPERTY and ALIAS]
http://neo4j.com/developer/cypher
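To make the property graph model and the pattern above concrete, here is a small in-memory sketch in Python. It is not how Neo4j stores or queries data; the `Node` and `Relationship` classes, the `match_outgoing` helper and the sample people are all hypothetical and exist only to illustrate labels, properties, relationship types and direction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


def match_outgoing(relationships, start_label, start_props, rel_type, end_label):
    """Return the end nodes of every (start)-[:rel_type]->(end) match."""
    results = []
    for rel in relationships:
        if rel.type != rel_type:
            continue
        if start_label not in rel.start.labels or end_label not in rel.end.labels:
            continue
        # Every requested property must be present on the start node.
        if all(rel.start.properties.get(k) == v for k, v in start_props.items()):
            results.append(rel.end)
    return results


# Hypothetical sample data.
dan = Node({"Person"}, {"name": "Dan"})
ann = Node({"Person"}, {"name": "Ann"})
acme = Node({"Company"}, {"name": "Acme"})
rels = [
    Relationship(dan, "KNOWS", ann),
    Relationship(dan, "WORKS_AT", acme),
    Relationship(ann, "KNOWS", dan),
]

# Rough equivalent of:
# MATCH (:Person {name: "Dan"})-[:KNOWS]->(who:Person) RETURN who
who = match_outgoing(rels, "Person", {"name": "Dan"}, "KNOWS", "Person")
print([n.properties["name"] for n in who])  # prints ['Ann']
```

Direction matters here: the relationship from Ann to Dan is not returned, because the pattern starts at the node named "Dan".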
Getting Data into Neo4j
Cypher-Based "LOAD CSV"
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
LOAD CSV WITH HEADERS FROM "url" AS row
MERGE (:Person {name: row.name, age: toInt(row.age)});
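The point of MERGE in the statement above is that a repeated row does not create a second node. The following Python sketch imitates that get-or-create behavior on an in-memory CSV. It is an illustration only, not Neo4j code; the sample rows and the `merge_people` helper are made up for this example.

```python
import csv
import io

# Hypothetical input; note the duplicated "Alice" row.
CSV_TEXT = """name,age
Alice,34
Bob,41
Alice,34
"""


def merge_people(csv_text):
    """Create one Person per distinct (name, age), similar to MERGE."""
    people = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # MERGE matches on the whole pattern, so both properties form the key.
        key = (row["name"], int(row["age"]))
        people.setdefault(key, {"label": "Person", "name": key[0], "age": key[1]})
    return list(people.values())


people = merge_people(CSV_TEXT)
print(len(people))  # prints 2
```

Because the key covers every property in the pattern, two people with the same name but different ages would still end up as two separate nodes.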
Getting Data into Neo4j
Load JSON with Cypher
• Load JSON via procedure
• Deconstruct the document
• Into a non-duplicated graph model
CALL apoc.load.json("url") YIELD value AS doc
UNWIND doc.items AS item
MERGE (:Contract {title: item.title, amount: toFloat(item.amount)});
Getting Data into Neo4j
CSV Bulk Loader neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second
bin/neo4j-import --into people.db --nodes:Person people.csv --nodes:Company companies.csv --relationships:STAKEHOLDER stakeholders.csv
The Steps Involved in the Document Analysis
1. Acquire documents
2. Classify documents
  • Scan / OCR
  • Extract document metadata
3. Whiteboard domain and questions, determine
  • entities and their relationships
  • potential entity and relationship properties
  • sources for those entities and their properties
The Steps Involved in the Document Analysis
4. Develop analyzers, rules, parsers and named entity recognition
5. Parse and store metadata, document and entity relationships
  • Parse by author, named entities, dates, sources and classifications
6. Infer entity relationships
7. Compute similarities, transitive cover and triangles
8. Analyze data using graph queries and visualizations
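Step 7 mentions triangles, meaning three entities that are all connected to one another. As a rough sketch of what such a computation involves, the Python below finds triangles in a small undirected graph. The entity names are invented for illustration and are not taken from the leaked data; the deck does not say which tool or algorithm was actually used for this step.

```python
from itertools import combinations


def find_triangles(edges):
    """Return all triangles in an undirected graph given as (a, b) pairs."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops, they cannot form a triangle
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triangles = set()
    for node, adjacent in neighbors.items():
        # Two neighbors of the same node that are also linked close a triangle.
        for u, v in combinations(sorted(adjacent), 2):
            if v in neighbors[u]:
                triangles.add(tuple(sorted((node, u, v))))
    return triangles


# Hypothetical connections between officers, companies and addresses.
edges = [
    ("Officer A", "Company X"),
    ("Officer A", "Address 1"),
    ("Company X", "Address 1"),
    ("Company X", "Officer B"),
]
print(find_triangles(edges))
# prints {('Address 1', 'Company X', 'Officer A')}
```

Each triangle is stored as a sorted tuple so that the same three entities are counted once, regardless of which node they were discovered from.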
We need a Data Model
Meta Data Entities
• Document, Email, Contract, DB-Record
• Meta: Author, Date, Source, Keywords
• Conversation: Sender, Receiver, Topic
• Money Flows
Actual Entities
• Person
• Representative (Officer)
• Address
• Client
• Company
• Account
Either based on our use cases & questions, or on the entities present in our meta-data and data.
The ICIJ Data Model
• Simplistic data model with 4 entities and 5 relationships
• We only know the published model
• Missing
  • Documents, Metadata
  • Family Relationships
  • Connections to Public Record Databases
• Contains duplicates
• Relationship information stored on entities
• Could use richer labeling
Example Dataset - Azerbaijan's President Ilham Aliyev
• was already previously investigated
• whole family involved
• different shell companies & involvements
http://neo4j.com/graphgist/ec65c2fa-9d83-4894-bc1e-98c475c7b57a