Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's Metadata

Post on 21-Mar-2017

Multilinguality of Metadata: Measuring the Multilingual Degree of Europeana's Metadata

Juliane Stiller (1), Péter Király (2)

(1) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin

(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

ISI 2017, March 14, 2017

Languages by eltpics

Agenda

1. Multilinguality in Europeana
2. Multilingual Score for Metadata
3. Implementation
4. Discussion & Future Work

Platform for Cultural Heritage Material

www.europeana.eu

Europeana - Facts

○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
○ Text, images, video, audio, sounds, 3D
○ Over 54 million objects
○ > 50 languages

http://statistics.europeana.eu/europeana

Thumbnail

Metadata

Link to Provider

Metadata Multilinguality

+ 40 other languages...

The Multilingual Problem

○ Mona Lisa: 456 results
○ La Gioconda: 365 results
○ La Joconde: 71 results

http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Metadata Enrichment

Quantify the Multilinguality of Data to

○ Take measures to improve multilinguality in data
○ Establish a sense of the multilingual reach of Europeana
○ Distribution of languages
○ Devise strategies for underrepresented languages

Multilingual Score for Metadata

Multilingual saturation of metadata

Text w/o language annotation (dc.subject: Germany)

Text w language annotation (dc.subject: Germany@en)

Text w several language annotations (dc.subject: Germany@en, Deutschland@de)

Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)

Calculation

Missing field: NA
Text string without language tag (language not known): 0
Text string with language tag (language known): 1
Text string with 2-3 different language tags: 2
Text string with 4-9 different language tags: 2.3
Text string with more than 10 different language tags: 2.6
Link to (multilingual) vocabulary: 3
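The scoring scheme above can be sketched as a small function. This is only an illustration of the mapping shown on the slide, not the authors' implementation: the function name `saturation_score` and its parameters are hypothetical, and because the slide does not say how exactly 10 language tags are scored, the sketch assumes they fall into the top bracket.

```python
from typing import Optional, Sequence


def saturation_score(
    present: bool,
    language_tags: Sequence[str] = (),
    links_to_vocabulary: bool = False,
) -> Optional[float]:
    """Return the multilingual saturation score of one field, or None for NA."""
    if not present:
        return None  # missing field: NA
    if links_to_vocabulary:
        return 3.0  # link to a (multilingual) vocabulary
    distinct = len(set(language_tags))
    if distinct == 0:
        return 0.0  # text string without language tag
    if distinct == 1:
        return 1.0  # text string with one language tag
    if distinct <= 3:
        return 2.0  # 2-3 different language tags
    if distinct <= 9:
        return 2.3  # 4-9 different language tags
    # The slide says "more than 10"; scoring exactly 10 the same way is an assumption.
    return 2.6


# dc.subject: Germany@en, Deutschland@de
example = saturation_score(True, ["en", "de"])
```

Counting distinct tags rather than literals means that two English values for the same field still score as one known language.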

Example score

Text w/o language annotation (dc.subject: Germany): 0

Text w language annotation (dc.subject: Germany@en): 1

Text w several language annotations (dc.subject: Germany@en, Deutschland@de): 2

Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3

Aggregation of property dc:subject

The Wittgenstein Archives at the University of Bergen: high saturation

National Library Portugal: low saturation

http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average

Good examples

Descriptive fields

dc:title: "Die Mauer muß weg!"@de, "Die Mauer muß weg! (The Wall must go!)"@en

dc:description: "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de, "Annotated images from 1989-1990 in Berlin"@en

Subject headings (Place/skos:prefLabel)

"Brandenburger Tor"@de, "Brandenburg Gate"@en

"Grenzübergang Potsdamer Platz"@de, "Postdamer Platz border crossing"@en

"Reichstag"@de, "Reichstag building"@en

Implementation

source codes: http://pkiraly.github.io/about/#source-codes

data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)

Data processing workflow

ingestion → measuring → statistical analysis → web interface

ingestion: ★ OAI-PMH ★ Europeana API ★ Hadoop ★ NoSQL → json

measuring: ★ Spark ★ Hadoop ★ Java ★ Apache Solr → csv

statistical analysis: ★ Spark ★ R → json, png

web interface: ★ PHP ★ D3.js ★ highchart.js ★ NoSQL → html, svg

Visualization

APIs, abstraction, reusing

"Place/skos:altLabel": {
  "instances": [
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    ...
    {"TRANSLATION": 2.40},
    {"STRING": 0.0}
  ],
  "score": {
    "sum": 20.40,
    "average": 1.85454545,
    "normalized": 0.649681
  }
}
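The three aggregate values in this output can be reproduced from per-instance scores. The sketch below is an assumption-laden illustration: the slide elides most of the instance list, so the list used here (nine scores of 2.0, one of 2.4, one of 0.0) is hypothetical and chosen only because it yields the same sum over 11 instances. The normalization formula, average / (1 + average), is likewise inferred from the numbers on the slide and is not stated in the source.

```python
from typing import Dict, List


def aggregate(instances: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute sum, average and a normalized score for one field."""
    scores = [score for instance in instances for score in instance.values()]
    total = sum(scores)
    average = total / len(scores) if scores else 0.0
    # Inferred normalization (assumption): maps [0, inf) into [0, 1).
    normalized = average / (1.0 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Hypothetical instance list that reproduces the slide's totals.
instances = (
    [{"TRANSLATION": 2.0}] * 9
    + [{"TRANSLATION": 2.4}, {"STRING": 0.0}]
)
result = aggregate(instances)
```

With these inputs the sum and average agree with the slide, and the normalized value matches the slide's 0.649681 up to rounding in the last digit.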

Discussion & Future Work

extension I. recalculation

The new metrics

★ Distinct languages per object
★ Language tags per object
★ Literals per language
★ Number of multilingual properties (a.k.a. fields)
★ Number of multilingual statements (a.k.a. field instances)
★ Average number of languages per property with language
★ Average number of languages per proxy
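Some of these metrics can be sketched for a single object as follows. The slides do not define the metrics precisely, so the interpretations here are assumptions: a "property with language" is taken to be a property with at least one language-tagged literal, and the record layout, the function name `language_metrics`, and the sample record are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# A literal is (text, language tag), with None when no tag is present.
Literal = Tuple[str, Optional[str]]


def language_metrics(record: Dict[str, List[Literal]]) -> Dict[str, object]:
    """Compute a few language metrics for one object (assumed definitions)."""
    tags = [lang for literals in record.values() for _, lang in literals if lang]
    languages_per_property = [
        len({lang for _, lang in literals if lang})
        for literals in record.values()
    ]
    with_language = [n for n in languages_per_property if n > 0]
    return {
        "distinct_languages": len(set(tags)),
        "language_tags": len(tags),
        "literals_per_language": dict(Counter(tags)),
        "avg_languages_per_property_with_language": (
            sum(with_language) / len(with_language) if with_language else 0.0
        ),
    }


# Hypothetical record, loosely based on values shown elsewhere in the slides.
record = {
    "dc:title": [
        ("Die Mauer muß weg!", "de"),
        ("Die Mauer muß weg! (The Wall must go!)", "en"),
    ],
    "dc:description": [("Annotated images from 1989-1990 in Berlin", "en")],
    "dc:subject": [("Germany", None)],
}
metrics = language_metrics(record)
```

Untagged literals such as the `dc:subject` value are ignored by every metric here, which mirrors their score of 0 in the saturation scheme.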

extension II. record views

ex:providerProxy
    dc:subject "special relativity"@en ;
    dc:creator <http://vocab.getty.edu/ulan/500240971> ;
    dc:type <http://udcdata.info/001684> .

ex:europeanaProxy
    dc:subject <http://dbpedia.org/resource/Physics> .

standard vocabulary:

<http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de .

standard vocabulary:

<http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en .

non-standard vocabulary:

<http://udcdata.info/001684> skos:prefLabel "Books in general"@en .

extension II. record views

source              field       link      value                     views
ex:providerProxy    dc:subject  literal   "special relativity"@en   ① ② ③ ④
ex:providerProxy    dc:creator  standard  "Einstein, Albert"@de     ① ② ③ ④
ex:providerProxy    dc:type     non-std   "Books in general"@en     ② ④
ex:europeanaProxy   dc:subject  standard  "Physics"@en              ③ ④

① data provider's proxy and standard enrichments
② data provider's proxy and enrichments
③ all proxies and standard enrichments
④ all proxies and enrichments
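The four record views can be read as two independent filters: whether a statement comes from the data provider's proxy, and whether its value is a literal or a link to a standard vocabulary. A minimal sketch under that reading follows. The `Statement` type and the `views_for` function are hypothetical names, and treating literals as part of the "standard" views is inferred from the table rather than stated on the slide.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    source: str  # "ex:providerProxy" or "ex:europeanaProxy"
    field: str
    link: str  # "literal", "standard", or "non-std"
    value: str


STATEMENTS = [
    Statement("ex:providerProxy", "dc:subject", "literal", '"special relativity"@en'),
    Statement("ex:providerProxy", "dc:creator", "standard", '"Einstein, Albert"@de'),
    Statement("ex:providerProxy", "dc:type", "non-std", '"Books in general"@en'),
    Statement("ex:europeanaProxy", "dc:subject", "standard", '"Physics"@en'),
]


def views_for(statement: Statement) -> List[int]:
    """Return the numbers of the views (1-4) that include this statement."""
    from_provider = statement.source == "ex:providerProxy"
    standard_or_literal = statement.link != "non-std"
    views = []
    if from_provider and standard_or_literal:
        views.append(1)  # provider's proxy and standard enrichments
    if from_provider:
        views.append(2)  # provider's proxy and all enrichments
    if standard_or_literal:
        views.append(3)  # all proxies and standard enrichments
    views.append(4)  # all proxies and all enrichments
    return views


membership = [views_for(s) for s in STATEMENTS]
```

Applied to the four statements in the table, this reproduces the view markers listed for each row.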

Questions

○ contact
juliane.stiller@ibi.hu-berlin.de
peter.kiraly@gwdg.de

○ Metadata Quality Assurance Framework
http://144.76.218.178/europeana-qa

○ Europeana Data Quality Committee
http://pro.europeana.eu/page/data-quality-committee

Appendix

Europeana data structure in 30 sec

provider proxy

Europeana proxy

Agent

Concept

Place

Timespan

descriptive fields

subject headings

semantic web