Technische Universität Dresden, Fakultät Informatik

Wikidata

Markus Krötzsch, TU Dresden

August 2014


“Imagine a world in which every single person is given free access to the sum of all human knowledge. That’s our mission.”

Markus Krötzsch: Wikidata Toolkit Kickoff


21M+ articles, 1.5B+ edits, 280+ languages


about 500 Million views per day

~480M unique visitors per month


Problem 1: Content Quality

High cost of maintenance
– Fighting spam and vandalism
– Updating old content
– Fixing errors

“But we have an army of contributors with the Wisdom of the Crowds!”

Number of Articles (English)

Source: wikistatistics.net

Number of Active Users (English)

Number of Edits (English)

The Crowds are not Enough

Amount of content grows → Maintenance effort grows

Number of contributors stabilizes

Problem 2: Language Diversity


Language diversity
– 285 languages
– English, German, French, Dutch: 1 million+
– 40 languages: 100,000+
– 112 languages: 10,000+

Quality problem
– Even basic facts do not agree across languages

Coverage problem

[Screenshots: English, French, Catalan, Italian, Greek, Russian, Chinese, English]

Problem 3: Information Access


Wikipedia has articles about …
– … all cities
– … their populations
– … their mayors

“So can I ask for a list of the world’s ten largest cities with a female mayor?”
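If cities, populations, and mayors were available as structured records rather than as article text, a question like this would reduce to a filter and a sort. A minimal sketch in Python follows; the records, the field names, and the function name are all made-up placeholders for illustration, not real data or any Wikipedia/Wikidata API.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    population: int
    mayor_gender: str  # hypothetical field; real data would link to a mayor record


# Placeholder records for illustration only; not real cities or figures.
CITIES = [
    City("City A", 9_000_000, "female"),
    City("City B", 12_000_000, "male"),
    City("City C", 3_500_000, "female"),
    City("City D", 7_200_000, "female"),
]


def largest_cities_with_female_mayor(cities, limit=10):
    """Return up to `limit` cities whose mayor is female, largest first."""
    matching = [c for c in cities if c.mayor_gender == "female"]
    return sorted(matching, key=lambda c: c.population, reverse=True)[:limit]


for city in largest_cities_with_female_mayor(CITIES):
    print(city.name, city.population)
```

The point of the sketch is only that the query is trivial once the facts are machine-readable; the hard part is collecting and maintaining the data.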


Wikipedia’s answer: Lists


“Imagine a world in which every single person is given free access to the sum of all human knowledge. That’s our mission.”

Wikidata
– Provide a database of the world’s knowledge that anyone can edit
– Collect references and quotes for millions of data items
– Engage a sustainable community that collects data from everywhere in a machine-readable way
– Increase the quality and lower the maintenance cost of Wikipedia and related projects
– Deliver software and community best practices enabling others to engage in projects of data collection and provisioning

Project Funding
– 1.5 million EUR
– 4 donors
– Wikimedia Foundation

Wikidata

Official “Wikipedia Database”
– For all 285 language editions

Very recent:
– Live since November 2012
– Enabled on all Wikipedia editions since March 2013
– Ongoing development led by Wikimedia Germany

The Content of Wikidata

Size as of 4th August 2014

– Items: 15,792,256
– Properties: 1,176
– Statements: 43,189,145
– … with references: 23,242,779
– Labels: 52,811,608
– Aliases: 8,765,542
– Descriptions: 37,636,220
– Site links: 39,356,543
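To put these counts in proportion, a few ratios can be derived from them. The short Python sketch below uses only the figures from the slide; the variable names are illustrative.

```python
# Figures copied from the slide (size as of 4 August 2014).
items = 15_792_256
statements = 43_189_145
statements_with_references = 23_242_779
labels = 52_811_608

# Derived ratios; these are simple divisions, not figures from the talk.
referenced_share = statements_with_references / statements
statements_per_item = statements / items
labels_per_item = labels / items

print(f"Share of statements with references: {referenced_share:.1%}")
print(f"Statements per item (average): {statements_per_item:.2f}")
print(f"Labels per item (average): {labels_per_item:.2f}")
```

Roughly half of all statements carry a reference, and the average item has fewer than three statements but more than three labels.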

Growth (up to Feb 2014)

Activity (Feb 2014)

– 54k contributors (5k contributors with 5+ edits in Jun 2014)
– Over 150M edits so far (up to 500k per day)

Wikidata, DBpedia, RDF, and all that

Wikidata and DBpedia: A Superficial Comparison

Wikidata
– Data related to Wikipedia
– Online since late 2012*
– Manual editing
– One multilingual dataset
– Based on statements
– About 1k properties
– Wikipedia integration
– Unique community

*) influenced by Semantic MediaWiki (started 2005)

DBpedia
– Data related to Wikipedia
– Started in 2006
– Automated extraction
– One dataset per language
– Based on triples (RDF)
– >10k properties
– Stand-alone dataset
– Unique community

Exporting Wikidata to RDF

– Define URIs for items: http://www.wikidata.org/entity/<id>
– Map MediaWiki languages to BCP 47 languages
– Select suitable vocabulary for reuse: rdfs:label, schema.org description, skos:altLabel, prov:wasDerivedFrom
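The three steps above can be sketched for an item's labels, descriptions, and aliases. This is a minimal illustration in Python that emits N-Triples lines, not the actual export code: the function names are hypothetical, the language mapping is a small assumed excerpt rather than the table used by the real export, and the literal escaping is simplified.

```python
# Vocabulary URIs for the reused properties named on the slide.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SCHEMA_DESCRIPTION = "http://schema.org/description"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"

# Assumed, incomplete excerpt of a MediaWiki-to-BCP 47 mapping (illustration only).
MEDIAWIKI_TO_BCP47 = {
    "zh-yue": "yue",
    "zh-min-nan": "nan",
    "zh-classical": "lzh",
    "als": "gsw",
}


def entity_uri(item_id):
    """Build the item URI following the pattern from the slide."""
    return f"http://www.wikidata.org/entity/{item_id}"


def bcp47(code):
    """Map a MediaWiki language code to BCP 47; unknown codes pass through."""
    return MEDIAWIKI_TO_BCP47.get(code, code)


def escape_literal(text):
    """Simplified escaping for N-Triples string literals."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def term_triples(item_id, labels, descriptions, aliases):
    """Return N-Triples lines for the terms of one item."""
    subject = f"<{entity_uri(item_id)}>"
    lines = []
    for predicate, terms in ((RDFS_LABEL, labels), (SCHEMA_DESCRIPTION, descriptions)):
        for lang, text in terms.items():
            lines.append(f'{subject} <{predicate}> "{escape_literal(text)}"@{bcp47(lang)} .')
    for lang, texts in aliases.items():
        for text in texts:
            lines.append(f'{subject} <{SKOS_ALT_LABEL}> "{escape_literal(text)}"@{bcp47(lang)} .')
    return lines


# Sample terms for Q42; the description and alias strings are illustrative.
for line in term_triples(
    "Q42",
    labels={"en": "Douglas Adams"},
    descriptions={"en": "English writer"},
    aliases={"en": ["Douglas Noel Adams"]},
):
    print(line)
```

References (prov:wasDerivedFrom) are left out here because they attach to statements rather than to labels.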

Exporting Wikidata Statements to RDF

Simpler Export of Statements

What makes RDF export complex:
– Qualifiers
– References
– Complex values

Idea:
– export only statements that have no qualifiers
– drop references
– simplify value encoding
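The simplification idea can be sketched as a filter over an item's statements. The Python below assumes a dictionary layout modelled on Wikidata's JSON format (`mainsnak`, `qualifiers`, `references`, `datavalue`); the function names are hypothetical, only two value types are handled, and the second sample statement uses placeholder property IDs.

```python
def simple_value(datavalue):
    """Reduce a datavalue to a plain string, or None if it is too complex here."""
    kind = datavalue.get("type")
    value = datavalue.get("value")
    if kind == "wikibase-entityid":
        prefix = "P" if value.get("entity-type") == "property" else "Q"
        return f"{prefix}{value['numeric-id']}"
    if kind == "string":
        return value
    return None  # times, quantities, coordinates etc. are skipped in this sketch


def simple_statements(item_id, claims):
    """Yield (item, property, value) for statements that have no qualifiers."""
    for property_id, statements in claims.items():
        for statement in statements:
            if statement.get("qualifiers"):
                continue  # export only statements without qualifiers
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "no value" / "unknown value"
            value = simple_value(snak["datavalue"])
            if value is not None:
                # References are dropped simply by never reading them.
                yield (item_id, property_id, value)


SAMPLE_CLAIMS = {
    "P31": [{
        "mainsnak": {
            "snaktype": "value",
            "property": "P31",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "numeric-id": 5},
            },
        },
        "references": [{"snaks": {}}],
    }],
    "P_EXAMPLE": [{  # placeholder property with a qualifier; will be skipped
        "mainsnak": {
            "snaktype": "value",
            "property": "P_EXAMPLE",
            "datavalue": {"type": "string", "value": "placeholder"},
        },
        "qualifiers": {"P_QUALIFIER": [{}]},
    }],
}

for triple in simple_statements("Q42", SAMPLE_CLAIMS):
    print(triple)
```

Each surviving statement becomes a single triple, which is what makes this export much smaller and easier to query than the full one.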

Classification

– Properties subclass of (P279) and instance of (P31)
– P31 is the most used property on Wikidata
– Often (but not always) used without qualifiers

Interesting class hierarchy:
– Entities used as classes: 41,868
– Subclass of: 40,192 (without qualifiers)
– Instance of: 6,169,821 (without qualifiers)
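How such a hierarchy can be used is sketched below in Python on a toy set of triples. The entity names are placeholders, and the definition of "used as a class" (object of P31, or subject or object of P279) is an assumption made for this sketch; the slide does not say how the count was obtained.

```python
from collections import defaultdict

INSTANCE_OF = "P31"
SUBCLASS_OF = "P279"

# Toy triples with placeholder names instead of real item IDs.
TRIPLES = [
    ("cat", SUBCLASS_OF, "mammal"),
    ("mammal", SUBCLASS_OF, "animal"),
    ("felix", INSTANCE_OF, "cat"),
]


def entities_used_as_classes(triples):
    """Collect entities that appear in a class position (assumed definition)."""
    classes = set()
    for subject, prop, obj in triples:
        if prop == INSTANCE_OF:
            classes.add(obj)
        elif prop == SUBCLASS_OF:
            classes.update((subject, obj))
    return classes


def all_superclasses(cls, triples):
    """Follow subclass-of links transitively; tolerates cycles."""
    parents = defaultdict(set)
    for subject, prop, obj in triples:
        if prop == SUBCLASS_OF:
            parents[subject].add(obj)
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance_of(entity, cls, triples):
    """True if the entity is a direct or inherited instance of the class."""
    direct = {o for s, p, o in triples if s == entity and p == INSTANCE_OF}
    return any(d == cls or cls in all_superclasses(d, triples) for d in direct)


print(sorted(entities_used_as_classes(TRIPLES)))
print(sorted(all_superclasses("cat", TRIPLES)))
print(is_instance_of("felix", "animal", TRIPLES))
```

With the real export, the same traversal would run over the qualifier-free P31 and P279 statements counted above.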

Available RDF Exports

RDF/OWL file exports at: http://tools.wmflabs.org/wikidata-exports/rdf/

Results for April 20, 2014:

Usage & Applications

Application Areas

Labels and descriptions

Identifiers

Data access

Advanced analytics

Third-party applications


Getting the Data

See www.wikidata.org/wiki/Wikidata:Data_access

Direct access per item (Web API, RDF/JSON/...)

Database dumps (full dumps + daily changes)

Full dumps in more convenient formats planned
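Direct per-item access can be sketched in Python as follows. The sketch assumes the `Special:EntityData` URL pattern and the `entities` / `labels` layout of the JSON it returns; the helper names and the User-Agent string are hypothetical. The printed example runs on an inline sample document so that no network access is needed.

```python
import json
from urllib.request import Request, urlopen


def entity_data_url(item_id):
    """URL of the JSON document for one item (assumed URL pattern)."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{item_id}.json"


def fetch_entity(item_id):
    """Download and parse the JSON document for one item (needs network access)."""
    request = Request(
        entity_data_url(item_id),
        headers={"User-Agent": "wikidata-slides-example/0.1"},  # hypothetical client name
    )
    with urlopen(request) as response:
        return json.load(response)


def label_of(document, item_id, language="en"):
    """Return the label in the given language, or None if there is none."""
    labels = document["entities"][item_id].get("labels", {})
    entry = labels.get(language)
    return entry["value"] if entry else None


# Trimmed sample in the assumed shape of the returned document.
SAMPLE_DOCUMENT = {
    "entities": {
        "Q42": {
            "id": "Q42",
            "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
        }
    }
}

print(entity_data_url("Q42"))
print(label_of(SAMPLE_DOCUMENT, "Q42"))
```

Against the live service, `label_of(fetch_entity("Q42"), "Q42")` would perform the same lookup. For bulk processing the database dumps are the better choice.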

Conclusions

Wikidata is developing rapidly
– Data size
– Vocabulary size
– Technical features and community processes

A platform for data integration
– Including links to many other databases

Data access is easy, both legally and technically
– Further improvements planned for exports