
Institut für Bibliotheks- und Informationswissenschaft
Humboldt-Universität zu Berlin

Abschlussarbeit im Rahmen der Laufbahnprüfung für Bibliotheksreferendarinnen und Bibliotheksreferendare

Improving access to bibliographic data: representing CERL’s Heritage of the Printed Book Database as Linked Open Data

Andreas Walker

supervised by
Prof. Dr. Vivien Petras
Dr. Christian Stein

Göttingen / May 8, 2019


Contents

1. Introduction

2. The project in context
   2.1. Context I: The Heritage of the Printed Book Database
   2.2. Context II: Linked Data in Libraries
   2.3. Context III: Visualizations of Linked Data
   2.4. Summary

3. From case studies to data model: specifying requirements
   3.1. Initial interviews
   3.2. Results
        3.2.1. Evaluation of the current HPB
        3.2.2. Relevant data
        3.2.3. Workflows and Visualizations
        3.2.4. Other aspects
   3.3. Consequences for the data model
        3.3.1. Case studies
        3.3.2. Types of desiderata
   3.4. The data model
        3.4.1. BIBFRAME 2.0 as a framework
        3.4.2. Specification of the data model

4. From data model to SPARQL endpoint: converting the data
   4.1. Implementation of the data model
        4.1.1. Implementation-specific choices
   4.2. Inspection of the original data
        4.2.1. Issues with data quality
   4.3. Enriching the data
        4.3.1. External RDF data
        4.3.2. The CERL Thesaurus
   4.4. Conversion routine
   4.5. Hosting the graph
   4.6. Results

5. From SPARQL endpoint to use case: developing and evaluating the web application
   5.1. Technical infrastructure
   5.2. Design considerations
   5.3. Prototype I: Main views
        5.3.1. The start page
        5.3.2. The search interface
        5.3.3. The list view
        5.3.4. The record view
   5.4. Intermediate Discussion
        5.4.1. Implemented features
        5.4.2. Going from RDF statements to records
        5.4.3. Using the SPARQL server vs. cached representations
        5.4.4. JavaScript and accessibility
   5.5. Feedback I: Researchers
        5.5.1. Procedure
        5.5.2. Results
        5.5.3. Consequences for future prototypes
   5.6. Feedback II: CERL Executive Committee
        5.6.1. Procedure
        5.6.2. Results
        5.6.3. Consequences for future prototypes
   5.7. Prototype II: Preview

6. Conclusions
   6.1. Insights
   6.2. Next steps
   6.3. Adjacent issues

Appendices

A. Email to participants and responses
B. Griffon: Parsing PICA+ and converting it to RDF/Turtle
C. ISO 3166: Mapping country codes to labels and Wikidata items
D. Web application: first prototype


List of Figures

1. The original HPB start page.
2. The original HPB search interface.
3. The original HPB search results list.
4. The original HPB record display.
5. Family forest of bibliographic data models. From Suominen and Hyvönen (2017), published under CC-BY 4.0.
6. The data model.
7. A single conversion step from the data model.
8. System architecture.
9. First prototype, start page.
10. First prototype, advanced search page.
11. First prototype, SPARQL interface (via YASGUI).
12. First prototype, list view.
13. First prototype, map view.
14. First prototype, list view with statistics.
15. First prototype, record view.
16. Second prototype, start page.
17. Second prototype, search interface (blank).
18. Second prototype, search interface (adding fields).
19. Second prototype, search interface (selecting entities).
20. Second prototype, search interface (Wikidata-based river search).
21. Second prototype, results view (as list).
22. Second prototype, results view (as map).
23. Second prototype, record view.
24. Second prototype, record view (with tooltip).


Would you tell me please, which way I ought to go from here?
That depends a good deal on where you want to get to, said the cat.

Lewis Carroll, Alice in Wonderland

Bibliotheksdaten gehören zu den komplizierteren Dingen, mit denen es ein Computer zu tun bekommen kann. [Library data is among the more complicated things a computer can find itself dealing with.]

Eversberg (1994)


1. Introduction

This thesis gives a contextualized description of the design process for a linked open data (LOD) web application. As such, it is methodologically complex: It draws its conclusions from this single case study, viewed against a background of the general LOD landscape in libraries, but it also embeds qualitative interviews with researchers and the practical work of writing code, constructing data models and designing a web interface. In this way, it highlights the interdisciplinary nature of library and information science, living at the crossroads between the humanities and computer science as well as moving between practice – that is, building usable systems for library users – and theory, constantly reflecting on the nature of these systems. There is also, although maybe only as an undercurrent, an autoethnographic aspect to it: The personal experience of undertaking this project also very clearly shows the exciting challenges for academic librarians in today’s world – to constantly learn and adapt new technologies in order to keep information free, accessible and usable for both researchers and the larger community. I strongly believe that this requires us to ‘get our hands dirty’, that is, become involved in the development of the software and infrastructures that we provide as a service for our users, ensuring that this infrastructure is built on the same principles of openness and accessibility. While not every librarian needs to be a software developer, especially the work of translating user requirements into software requirements, and the iterative development of prototypes together with the target audience, are processes which can only benefit from involving librarians as well as IT specialists. In this sense, you can also read this thesis as the story of an encounter between a librarian and the world of linked data.

In a very serendipitous way, this thesis also happens to be a thoroughly European one at a time when many European institutions and projects are being challenged. The Heritage of the Printed Book Database (HPB), which forms the main topic and the heart of this thesis, was built in the aftermath of the year 1989 and its geopolitical consequences for Europe, a fact which permeates the first publication of the Consortium of European Research Libraries’ CERL papers (Matheson, 1998). The spirit of cooperation and openness that led to its inception lives on in today’s attempts to achieve an ever more open science, with open access to research publications and data. In [2.1], I briefly describe the history of the HPB and how its conversion to linked open data – opening up its wealth of bibliographic data to everyone – relates to its original mission of sharing bibliographic data amongst its members.

Beyond its role as a part of open science, linked open data is also a technological project that comes with its own history. I sketch the role that linked data has played in libraries during the last ten years in [2.2] and take a look at the way linked data has been presented to humans – rather than machines, who arguably form its main user base – in [2.3], arguing that any such presentation needs to be informed by the needs of a target audience, rather than by the inherent structure of the data.

Following my own advice, [3] sets out to assess the needs of a particular target audience: in my case, researchers with a focus on the history of early European prints. In five informal, unstructured interviews, I learn about their current interactions with the HPB and their needs for future versions of it. I then apply this to the design of a data model built on an emerging standard for bibliographic data, BIBFRAME 2.0.

In [4], which may be more interesting to a technology-oriented audience, I describe how the original data of the HPB is converted from PICA+, a traditional library data format, to RDF, a web-based standard for linked open data, and enriched with external data from the LOD cloud in the process. I discuss in detail the design choices and challenges arising from the conversion process.

Finally, in [5], I present the design of a first prototype for the web application based on the converted data and the results of the initial interviews. I first outline some of the issues arising from the development process in [5.4]. This prototype was then shared with the interviewed researchers, whose feedback I discuss in [5.5], and with the CERL Executive Committee, whose feedback I analyze in [5.6]. I close by making recommendations for the next iteration of the prototype and the further development of the project. This includes a preview of the next prototype’s current state of development in [5.7].

Acknowledgments I would like to thank the SUB Göttingen, and especially its Metadata Group, for hosting and supporting this project, as well as Prof. Vivien Petras and Dr. Christian Stein for supervising it. Many thanks also to CERL for allowing me to do this work on the HPB and giving me the opportunity to present it at their meeting in Lyon. I am exceedingly grateful to Geri Della Roca de Candal, Cristina Dondi, Jeroen Salman, Rita Schlusemann and Paul Schweitzer-Martin, the researchers who kindly provided their expertise, as well as to my colleagues Alex Jahnke and Mona Kirsch who offered many insightful comments on this project.


2. The project in context

In this chapter, I describe the three main contexts in which the project developed in this thesis is embedded. [2.1] sketches the history of the Heritage of the Printed Book Database, including a description of its current interfaces. As we are interested in the conversion of this database to linked open data (LOD), [2.2] first outlines the state of LOD with a focus on its implementation in libraries. One of the main findings of this section is that there is a need for user-friendly, accessible interfaces to linked open data, and consequently [2.3] specifically reviews the literature on approaches to the visualization of LOD.

2.1. Context I: The Heritage of the Printed Book Database

The Heritage of the Printed Book Database (HPB) is an online catalogue of early European prints, covering approximately the ‘handpress period’ (c.1455–c.1830). It is managed by the Consortium of European Research Libraries (CERL), a consortium of currently 267 research libraries throughout Europe and North America (CERL, 2019). The HPB was launched as CERL’s first major project in 1994 (then under the name ‘Hand Press Book Database’), two years after CERL was founded, and was first accessible to researchers within consortium member libraries three years later, in 1997 (Stegaeva, 2016). The database was aimed both at librarians, assisting them in their own cataloging through access to other libraries’ records, and at researchers. As Welsh (2015) points out, the database is not only useful for researchers by allowing them to search for individual records, but also as an object of study in itself, allowing for large-scale quantitative analyses of European book production and dispersion. While CERL has always considered the visibility of the HPB as one of its missions (CERL, 2005, 2010; CERL Board of Directors, 2016), the database has only been made accessible to non-members recently, in January 2018. This has led to an increase in the number of searches, as well as of record downloads (CERL, 2018).

The HPB grew steadily from its initial ‘six files’ in 1997 (Hellinga, 1998) to roughly five million records between 2014 and 2016 (Stegaeva, 2016; Versprille et al., 2014) and at the date of this publication contains 7,897,521 records (author’s own count). All records are delivered by their creating institutions to the State and University Library in Göttingen, where they are processed and checked for consistency before being integrated into the HPB. This model of a unified catalog, as opposed to federated search across various library catalogs, is a conscious design decision with the aim of improving accessibility for researchers who might not have the ability or capacity for refining the heterogeneous data usually retrieved from a search across multiple sources (Matheson, 1998).

Internally, the HPB uses the PICA+ format¹ that is also used in the Gemeinsamer Verbundkatalog (GVK), the Common Library Network’s (GBV) and the Library Network Southwest Germany’s (SWB) union catalogue². It is accessible to users as an Online Public Access Catalogue (OPAC), as shown in Fig. 1–4, as well as through the Z39.50 and SRU protocols³. Reflecting the HPB’s focus on hand-printed books, the OPAC provides some indexes that go beyond the search functionality offered by standard library catalogs that mostly provide access to a modern collection, allowing searches across, e.g., fingerprints⁴, imprints or dimensions. In addition to the human-readable version, records can also be displayed in PICA+ and MARC21.

¹ PICA+ is the internal ‘sibling’ of the PICA3 format, which is used, e.g., for cataloging purposes. The main difference is that PICA3 uses punctuation rules instead of explicit subfield delimiters. PICA+ is isomorphic to PICA3, i.e., conversion between the two formats is possible without loss of information (Eversberg, 1994).
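To make the format a little more concrete, the following sketch reads a single PICA+-style field line into a Python structure. The field tag, subfield codes and values are invented for illustration, and the function deliberately ignores complications such as occurrence counters, repeated fields and escaping; it is not the Griffon converter described in Appendix B.

    def parse_picaplus_line(line: str) -> tuple[str, dict[str, str]]:
        """Split a line of the form 'TAG $a...$b...' into (tag, {code: value})."""
        tag, _, rest = line.partition(" ")
        subfields = {}
        for chunk in rest.split("$")[1:]:  # everything following each '$' delimiter
            if chunk:
                subfields[chunk[0]] = chunk[1:]
        return tag, subfields

    # Hypothetical example line, loosely modeled on a title field.
    print(parse_picaplus_line("021A $aHeritage of the Printed Book$hA subtitle"))
    # -> ('021A', {'a': 'Heritage of the Printed Book', 'h': 'A subtitle'})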

2.2. Context II: Linked Data in Libraries

Linked data is a family of technologies centered on the idea of the ‘Semantic Web’ (Berners-Lee et al., 2001), where information is supposed to be available in a way that is readable for both humans and machines. Its core idea is the adoption of a very simple, self-describing data model that does not rely on externally provided semantics for its interpretation, making it possible to link heterogeneous data sources with one another (van Hooland and Verborgh, 2014). The main technologies for the implementation of the Semantic Web are the RDF standard (as the underlying data model) and its various serializations (e.g. Turtle, JSON-LD, RDF/XML), the Description Logic-based OWL Web Ontology Language and its syntaxes (e.g. Turtle, OWL2 XML, RDF/XML), as well as SPARQL as a query language for RDF databases. An introduction to the technology of linked data is beyond the scope of this thesis. Instead, I will be presupposing a familiarity with technical linked data concepts roughly at the level of the excellent practitioner’s handbook by van Hooland and Verborgh (2014), but an understanding of the key points should easily be possible by reading one of the more cursory introductions to the topic available in the library and information science literature (e.g. Fürste, 2011; Stein, 2014).
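As a minimal illustration of these building blocks – an RDF triple, one of its serializations (Turtle), and a SPARQL query over it – consider the following sketch. It uses the Python rdflib library; the ex: vocabulary and the record URI are made up for the example and have nothing to do with the data model developed later in this thesis.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/vocab/")          # made-up vocabulary
    record = URIRef("http://example.org/hpb/record/1")   # made-up record URI

    g = Graph()
    g.bind("ex", EX)
    g.add((record, RDF.type, EX.Print))
    g.add((record, EX.placeOfPrinting, Literal("Hanover")))
    g.add((record, EX.dateOfPrinting, Literal("1652")))

    # One of several possible serializations of the same graph.
    print(g.serialize(format="turtle"))

    # SPARQL can be run directly on the graph (or, later, against a remote endpoint).
    q = """SELECT ?r WHERE { ?r a ex:Print ; ex:placeOfPrinting "Hanover" . }"""
    for row in g.query(q, initNs={"ex": EX}):
        print(row.r)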

Although the publication of data in RDF is often treated as a de facto standard today, it is also obvious that this is not, by any means, a mature technology yet. As van Hooland and Verborgh (2014) find in their survey of the literature, there are very few accessible introductions for metadata practitioners in library and information science, with the bulk of publications still residing firmly in computer science. Similarly, Smith-Yoshimura (2016), in her survey of institutions implementing linked data in the library world, finds that existing implementations are ‘primarily experimental in nature’, although she also observes an increase of services in production compared to a previous survey conducted in 2014. Primary barriers to the publication of linked data identified in Smith-Yoshimura’s survey are, amongst others, ‘lack of documentation’, ‘lack of tools’ and ‘immature software’. While more of the surveyed projects consume than publish linked data, similar barriers arise for the consumption of linked data: ‘volatility of data format’, ‘lack of tools’ and ‘service reliability’ are also within the top ten problems identified here.

² Available at https://kxp.k10plus.de/DB=2.1/DB=2.1/LNG=EN/.

³ Although both interfaces are publicly accessible, they are currently not documented and are still presented as members-only on the CERL website (CERL, 2018).

⁴ Fingerprints are a type of identifier formed from letters taken from predefined pages of the print, making it possible to differentiate similar but distinct prints from one another (Muller, 1992).


Figure 1: The original HPB start page.

Figure 2: The original HPB search interface.


Figure 3: The original HPB search results list.

Figure 4: The original HPB record display.


In another survey of 185 international participants, focused on information professionals’ personal position on linked data, McKenna et al. (2018) find similar concerns, with 63% of participants citing a ‘steep learning curve’ and 50% naming ‘[i]nadequate LD tools available’ as a barrier to publication of metadata as linked data. As Mitchell (2016) puts it, linked data may be ‘in the startup phase of a technology adoption hype cycle given the variation in standards, tools, approaches and perceived benefits’. Consequently, many follow-up questions have not been tackled yet, including the challenge of building sustainable (i.e., long-term accessible) and affordable infrastructure (Vander Sande et al., 2018).

This lack of maturity is not surprising, given that libraries only started experimenting with linked data in 2008 (Pohl and Danowski, 2013). While the past ten years have spurred a wide range of activities, they have not yet succeeded in consolidating linked data into a stable, accessible technology, at least not in the context of libraries.⁵

Given the current state of the field, the project presented in this thesis also, necessarily, has an experimental character. It is driven in equal parts by the goal to provide a usable, attractive service for researchers in the present, and by the goal to contribute to the research on linked data in library and information science in order to provide insights for future applications.

From LOD to LOUD: adding usability to the mix The extension of LOD (linked open data) to LOUD (linked open usable data) was first proposed by Sanderson (2018) in a keynote at EuropeanaTech. Sanderson makes a concrete proposal based on JSON-LD⁶, but also gives more general criteria that LOUD should fulfill (a small JSON-LD illustration follows the list):

1. The right level of abstraction for the audience.

2. Few barriers (e.g. complex ontologies, models)⁷

3. Data should be comprehensible without additional documentation.

4. Documentation should contain examples.

5. APIs should have few exceptions.
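The following sketch hints at what this can look like in practice: a small, hypothetical JSON-LD description of a single record that reads as plain JSON for a web developer and as RDF for a linked data client. The record itself is invented, the vocabulary is Dublin Core, and the snippet is not Sanderson’s actual model; parsing it as JSON-LD assumes rdflib version 6 or later, which bundles JSON-LD support.

    import json
    from rdflib import Graph

    # Hypothetical record description; @context maps the short keys to URIs.
    record = {
        "@context": {
            "title": "http://purl.org/dc/terms/title",
            "spatial": "http://purl.org/dc/terms/spatial",
            "issued": "http://purl.org/dc/terms/issued",
        },
        "@id": "http://example.org/hpb/record/42",
        "title": "An invented sample title",
        "spatial": "Hanover",
        "issued": "1652",
    }

    # For a web developer this is ordinary JSON ...
    print(json.dumps(record, indent=2))

    # ... while a linked data client can read the very same document as RDF.
    g = Graph().parse(data=json.dumps(record), format="json-ld")
    print(g.serialize(format="turtle"))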

However, the tension between producing machine-readable data and providing end-user-friendly interfaces to it has been identified much earlier (Halb et al., 2008; Sabol et al., 2014).

⁵ Cf. also the recent call for participation of the LD4 conference: ‘After a decade of experimentation and pilot projects, what are the next steps to move to large-scale production of linked data?’ (Futornick, 2018, emphasis added by the author).

⁶ JSON-LD is a W3C recommendation for serializing linked data as JSON, the JavaScript Object Notation, a data format derived from JavaScript (Sporny et al., 2014).

⁷ Note that Sanderson (2018) cites SPARQL as one such barrier. SPARQL, as a well-documented and established web standard, has often been cited as a way of overcoming barriers in the form of library-specific technologies, such as the Z39.50 or SRU protocols. This shows that the question of who the audience is (librarians, linked data developers, developers in general, domain experts vs. lay audiences, ...) and what can be assumed as their level of technical competence is a shifting target that needs to be reassessed on a regular basis.


As Mitchell (2016) points out, institutions are increasingly interested in publishing their own SPARQL endpoints. However, these endpoints are not really endpoints in the sense of providing an interface for end-users. Rather, they provide a standardized protocol on which end-user applications can be constructed. This second step of application building is often not seen as part of the linked data publication process, although the debate around LOUD shows that this attitude is changing. As Atemezing and Troncy (2013) put it: ‘The adoption of Semantic Web technologies in many institutions or government decision-makers are often based on an actual demonstration of their potential via an application. Here, the famous adage A picture is worth a thousand words could be rephrased as An application is worth a billion triples.’

2.3. Context III: Visualizations of Linked Data

Kirk (2016) defines data visualization as ‘the representation and presentation of data to facilitate understanding.’ Considering this definition, there is an obvious overlap between data visualization and fields like user interface design and usability studies, especially in the context of a data-oriented application like the HPB (cf. Cherny, 2017). Indeed, I am going to use the term in a somewhat broad sense, assuming it to include all relevant aspects of data (re)presentation, not just those strictly from the field of data visualization as it is understood, e.g., within computational statistics (Chen et al., 2008).

The visualization of data is a key factor in making it accessible to human beings and allowing for its exploration. This is especially true where data sets are both large and heterogeneous (Bikakis and Sellis, 2016), although the growing size of data sets (‘big data’) has also pushed visualization to its limits, making it necessary to think of new solutions for the reduction of complexity (Dadzie and Rowe, 2011; Leskinen et al., 2018; Dokulil and Katreniakova, 2008; Gomez-Romero et al., 2018). In that sense, the paradigm of visualisation is the human-oriented counterpart of linked data: Where linked data aims at increasing the accessibility of data for machines, visualisation does the same for human beings (Degbelo, 2017). However, if linked data is still in an experimental stage, this is true even more so for its consumption and representation to the end user (Dadzie and Rowe, 2011; Dadzie and Pietriga, 2017). As Dadzie and Pietriga point out, most publications on the topic show this both in their content (with a focus on prototypes) and their place of publication (in workshop proceedings).

The range of such publications is huge, and there are a number of distinctions that we need to make going forward. First, a high proportion of the literature focuses not on visualizing the statements within a knowledge base, but rather the underlying ontology (Antoniazzi and Viola, 2018). In technical terms, a knowledge base is usually split into two distinct parts, called TBox and ABox. TBox contains the terminology, or ontology, of the knowledge base, i.e., its classes and properties, as well as their relations with one another. ABox, on the other hand, contains the actual assertions of fact (Baader et al., 2017). While both TBox and ABox are represented in RDF, they can often be distinguished by their namespaces in actual implementations. For some reason – probably because semantic web practitioners are often concerned with the design and analysis of ontologies, or because focusing on TBox often avoids the problems with the visualization of big data sets pointed out above – the majority of visualisation tools are in fact visualisation tools for TBox only.
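A minimal sketch of this distinction, using an invented vocabulary: the first block of the Turtle snippet below defines terminology (TBox), the second asserts facts about a concrete record (ABox); both end up as plain triples in the same graph.

    from rdflib import Graph

    data = """
    @prefix ex:   <http://example.org/vocab/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    # TBox: classes and properties of the model
    ex:Print           a rdfs:Class .
    ex:placeOfPrinting a owl:DatatypeProperty ; rdfs:domain ex:Print .

    # ABox: assertions about a concrete resource
    <http://example.org/hpb/record/1> a ex:Print ;
        ex:placeOfPrinting "Hanover" .
    """

    g = Graph().parse(data=data, format="turtle")
    print(len(g), "triples")   # TBox and ABox statements live in the same RDF graph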

Dudas et al. (2018) survey a range of such tools. Because we are concerned with the visualisation of the bibliographic data itself here, and not with the visualisation of the underlying data model, these approaches are of less interest to us. Graziosi et al. (2018), on the other hand, specifically aim to give an overview of tools that provide either ABox or ABox and TBox visualization, as well as offering a proposal of their own⁸. However, they only list eight tools, a low number compared to the 37 TBox-oriented tools surveyed by Dudas et al. The tools Graziosi et al. list are mostly graph visualisations and graphical query languages as a replacement for SPARQL.

A second distinction that is necessary in surveying the field is the distinction between tools for different user groups. As Graziosi et al. note, ‘unfortunately, most of the existing tools are intended for Semantic Web expert users.’ In the same vein, Dadzie and Rowe deplore that although linked data usually follows the recommendation of Berners-Lee (2009) to ‘use HTTP URIs so that people can look up those names’, most users ‘who have no knowledge of RDF, nor ontologies, are inhibited in their ability to understand data returned when looking up a URI’ (Dadzie and Rowe, 2011). In a later paper, Dadzie and Pietriga (2017) return to this point, noting that visualisations need to be ‘tailored to specific tasks.’ This requires an understanding of an application’s target audience, their specific interests and skill sets. This is especially relevant when evaluating applications, as the participants of a study can deviate a lot from the actual target audience⁹. While more complex classifications of users exist (Shneiderman et al., 2016), the main distinction I consider relevant for our purposes is that made by Dadzie and Rowe (2011), between ‘lay users’, ‘domain experts’ and ‘tech users’. Dadzie and Rowe define lay users as computer-literate (e.g., capable of using search engines), but without knowledge of semantic web technologies, and also without specific domain knowledge. Domain experts, on the other hand, are not familiar with semantic web technologies, but have expert knowledge in the domain described by the data. Finally, tech users may lack that domain knowledge, but are familiar with semantic web technologies and data models. This differentiation of levels of expertise along two axes is also highlighted by Pesquita et al. (2018). Most of the literature drops the axis of domain knowledge, distinguishing only between tech and lay users. However, it is a crucial difference when it comes both to users’ previous experience, and to their needs in further processing the obtained data.

⁸ Note that of the two demo implementations they provide in the paper, one is already offline, despite the paper being published in 2018. This highlights the difficulty in surveying the field, as many implementations do not live beyond the duration of their research project and have little to zero uptake in the community. This is also reflected in the methodology of Dudas et al. (2018), where they survey the proceedings of visualization-oriented workshops at conferences for the past four years, ‘to increase the likelihood to cover the most recent, living tools.’

⁹ As a current example, see e.g. Benedetti et al. (2015a), who proceed from pointing out that ‘several tools have attempted to facilitate exploring and querying LOD datasets, but it still remains complex and restricted to skilled users’ to designing and evaluating their own tool. Their results show ‘excellent’ usability results; however, a closer look at the participants of the study in the more detailed description they provide external to the papers (Benedetti et al., 2015b) shows that ‘22 were enrolled from Semantic Web and IT communities and others 5 [sic] were bachelor IT students.’


As Hoefler et al. (2014) argue, new applications should aim to ‘make use of the prior knowledge the users already [possess] when it [comes] to handling data, to make them feel as comfortable as possible, and not to reinvent the wheel.’ Paradoxically, as Helmich et al. (2017) point out, linked data that follows the recommendations of Berners-Lee (2009) less closely – in particular three-star data, as opposed to five-star data, i.e., data that has been made openly accessible but not in the form of RDF – is often easier to process for (lay) end users, because it makes use of concepts and tools that they are already familiar with. In their survey of tools, Dadzie and Rowe (2011) find that all but one target tech users, whereas our project has as its main audience domain experts. It should be noted that the reprise from 2017 (Dadzie and Pietriga, 2017) mentions that the idea of evaluating usability for lay users and domain experts is ‘gaining traction.’

With these distinctions in mind, we can now proceed to an overview of criteria for good visualisations that have been formulated in the literature. Note that ‘although graphics have been used extensively in statistics for a long time, there is not a substantive body of theory about the topic [...] [The knowledge] is expressed in principles to be followed and not in formal theories’ (Unwin et al., 2008). While it would be impossible to summarize the abundant literature on good practice in data visualisation – including the entire handbook to which Unwin et al. provide the introduction (Chen et al., 2008) –, one famous and often cited such principle is the ‘information seeking mantra’ (Shneiderman, 1996): ‘Overview first, zoom and filter, then details-on-demand’, which Shneiderman describes as the essence of all observations having emerged from his practical experience in implementing user interfaces. Slightly more concrete and geared towards the visualization of linked data, Degbelo (2017) proposes a five-star catalogue of criteria for linked data visualizations, in analogy to the five-star catalogue of Berners-Lee (2009) for linked data. His five criteria are the following (a small sketch of how they might be met in practice is given after the list):

• Machine readability. Graphical formats that have machine-readable source code (e.g. SVG) are to be preferred over opaque, pixel-based formats (e.g. PNG/JPEG).

• Additional description in JSON-LD. Degbelo suggests JSON-LD because of its rising popularity, which is also confirmed by Mitchell (2016) and Sanderson (2018).

• Use of an open license.

• Use of open source libraries. Degbelo argues that this allows for further modification and re-use of visualisations.

• Explicit link to the data source.
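As a rough sketch of how some of these criteria could be met in practice – not a recommendation taken from Degbelo (2017) itself – a visualization could be written out as SVG (machine-readable source), carry an embedded JSON-LD description with an open license, and state its data source explicitly. All URLs below are placeholders.

    # Placeholder SVG with two bars; the <desc> names a (hypothetical) data
    # source and the embedded JSON-LD describes the visualization itself.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
      <title>Prints per decade (illustrative)</title>
      <desc>Data source: http://example.org/hpb/sparql (placeholder endpoint)</desc>
      <metadata>
        {"@context": "http://schema.org", "@type": "Dataset",
         "name": "Prints per decade",
         "license": "https://creativecommons.org/licenses/by/4.0/"}
      </metadata>
      <rect x="20" y="50" width="40" height="60" />
      <rect x="80" y="20" width="40" height="90" />
    </svg>"""

    with open("chart.svg", "w", encoding="utf-8") as f:
        f.write(svg)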

However, Degbelo also points out that any such catalogue rewards some design decisions over others, and that such a list should be built on the consensus of the research community, rather than on a single proposal. Indeed, Degbelo’s proposal already shows a certain bias, e.g., towards tech users in rewarding the use of machine readable graphic formats, which are often unfamiliar to lay users and domain experts and do not necessarily fit within their workflows. The same goes for the provision of additional information in JSON-LD, where popularity may not necessarily translate into good design decisions, especially when looking at long-term perspectives rather than current trends. Here, adoption by user groups and inherent technical or conceptual benefits need to be carefully weighed against one another.

Considering this, there is probably no single golden rule for designing linked data applications; rather, we need to keep in mind what Dadzie and Pietriga (2017) state: ‘It is important that LD visualisation is designed with, not just for, the user.’ This mirrors a general principle in usability engineering, where usability is defined not as a simple objective measure, but rather as something that emerges from the interaction between technology and its users, e.g. in the early definition of usability by Eason (1984), who describes it as an ‘interaction of three variables (system, task and user)’ that requires ‘multivariate and complex’ investigations.

This means that applications should be built on the basis of a careful investigation of users’ needs, and implemented in an iterative process that takes feedback from the target audience into account. Pesquita et al. (2018) propose a framework for conducting user studies in the context of the semantic web. They distinguish between three different approaches to evaluation: (a) formal (e.g., cost-based), (b) automated (e.g., focusing on computational efficiency) and (c) empirical (observation of users). Focusing on empirical evaluations, they undertake a literature survey, reviewing papers from four conference series in the years 2015–2017. They note that the majority of works surveyed studies information exploration and seeking behaviors (as opposed to navigation and information retrieval, which is probably slightly more relevant to our purposes, given the predominance of domain experts in the target audience). Furthermore, they find a study size between 10 and 29 participants. While they note that 8–12 users are often enough to detect up to 80% of the issues in qualitative usability studies (Hwang and Salvendy, 2010)¹⁰, they criticize the common recruitment of students as participants, as they often do not match the target audience. Based on their findings, they propose a set of minimum information that usability evaluation studies for semantic web contexts should provide:

• Purpose. Types of operations that the application supports. Pesquita et al. propose four types of operations: exploration, search, creation and management.

• Users. Overlap between intended audience and study participants, with respect to both technical and domain expertise.

• Tasks. The task description as given to the participants.

• Setup. This refers to the remaining experimental conditions.

• Procedure. A chronological description of the evaluation.

¹⁰ Note that Virzi (1992) gives an even lower number, finding that four to five subjects are sufficient for identifying 80% of issues, with additional subjects providing little to no additional benefit. They also note that these 80% tend to cover the most important usability issues, with additional ones often being more minor. Hwang cautions that Virzi’s results are limited to cases where subjects are likely to detect usability issues; however, publications geared towards practitioners (e.g. Richter and Flückiger, 2013) tend to agree with Virzi for practical purposes.


• Analysis. Description of the findings and their interpretation.

Neglecting the needs of users – especially non-tech users – has led to what schraefel and Karger (2006)¹¹ have dubbed ‘the Pathetic Fallacy’: As Sabol et al. (2014) point out, many visualizations of RDF have been based on the internal form of the data (i.e. a graph structure), rather than on the information stored in that form. I entirely agree with their conclusion: ‘Instead, employing visualisations suitable for the particular type of information (e.g. statistical, temporal, geographical etc.) would significantly aid the interpretation of data.’ However, not only the type of information but also the type of audience needs to be taken into account. Domain experts and lay users often have needs that run counter to what tech users and semantic web experts require or want to implement. For example, as Hoefler et al. (2014) describe: ‘When it came to working with data, analyzing and visualizing it, almost all participants mentioned that they would use Microsoft Excel and that they would probably manually collect and copy the data into the spreadsheet.’ This is far removed from the typical ‘innovative’ graph visualizations that are derived from the fact that RDF is a directed graph, rather than any usability considerations¹². Note that many alternatives exist – Bikakis and Sellis (2016), e.g., list the following categories of applications (note that these are not mutually exclusive, as for example an ontology visualization is often graph-based):

• Browsers and exploratory systems

• Generic visualization systems

• Domain, vocabulary and device-specific visualization systems

• Graph-based visualization systems

• Ontology visualization systems

• Visualization libraries

In terms of visualizations construed more narrowly, both Kirk (2016) and Chen et al. (2008) offer a wide range of possibilities, ranging from classic bar charts over treemaps to advanced high-dimensional visualization techniques.¹³ Given this range of possibilities, one should carefully consider users’ actual needs before letting the internal data structure dictate the visualization.

¹¹ Note that the lowercase here is an intentional part of the author’s name.

¹² Note that this is similar to presenting tabular data in a table. This intuitively makes sense for simple use cases; however, in more complex cases like relational databases no one would expect users to interact with the underlying data structure. Rather, it makes sense to separate the underlying logic of the data format from the presentation (or, ideally, different presentations for different audiences), to play to the respective strengths of machines and human beings in both areas.

¹³ Note that neither of these is specific to linked open data; rather, they are concerned with the visualization of data in general.


2.4. Summary

Summing up, the current state of linked data in libraries, and especially its presentation to end users, is largely experimental and explorative. The lack of established standards, and the resulting lack of users’ familiarity with any of the tools on the market, makes it especially necessary to consider usability and readability for human users in building LOD systems. This requires the building of web applications as entry points into the available data – both for end users and for potential developers of applications based on that data, as they, too, need ways of exploring the data and its structure. This is especially visible in the high numbers of applications aimed at automatically summarizing SPARQL endpoints.

Existing approaches often focus on TBox. If not, they usually either try to make LOD search more accessible by providing graphical variants of SPARQL, or by visualizing the underlying data structure, i.e. directed graphs. I consider neither of these two approaches particularly promising, as they both run counter to the expectations and pre-existing skills of users. Additionally, they are often aimed at and tested with expert users.

In this project, I make the assumption that my users are mostly domain experts: They do not necessarily have technical expertise – especially in the area of linked data –, but know the topic of the HPB data, early European prints, intimately. For this reason, I think it is especially important to include this target audience in the design process. It is likely that even a small number of these users will be able to give more valuable input on their experience with the system to be designed and its fit to their own workflows, as well as their willingness and capability to adapt their workflows, than any number of the usual study participants, i.e. students (which are usually chosen simply because they are available in most research settings) or technical experts, even when they have a lot of experience in usability design.


3. From case studies to data model: specifying requirements

In this chapter, I describe the development of the data model. I begin with a series of expert interviews in [3.1], from which, in [3.3.1], I construct a number of more abstract case studies to better understand the requirements. In [3.4], I explain the choice of BIBFRAME as the basis for my data model and discuss potential alternatives, presenting the final data model for the first prototype in [3.4.2].

3.1. Initial interviews

To assess the requirements for the new HPB version, I conducted a series of informal, unstructured interviews with the target audience of the project, that is, book historians with a focus on the history of early European prints. This format was chosen to avoid predetermining the outcome of the interviews with a librarian’s perspective, instead eliciting potential requirements that we were not already aware of. It also helped in obtaining the cooperation of the researchers, who were very motivated to help with the project but also needed to be sure that their particular interests would be represented in the conversation. The format allowed for clarifying questions and provided a researcher-oriented approach. Due to the time-consuming nature of interviews, I had to limit the number of participants. However, since the interviews are not conducted in order to provide representative quantitative data, but rather to provide a qualitative first point of entry into the design process, this drawback is a reasonable cost outweighed by the advantages (cf. Sarodnick and Brau, 2016). During the interviews, I took notes and – where necessary – stimulated the conversation with the help of the following guiding questions:

• Do you currently use the HPB, or have you used it before? If yes, what was your experience in using the HPB? If no, what kept you from using the HPB? Have you used other, comparable tools?

• What aspects of the data in the HPB are particularly relevant to you? Can you provide us with use cases or typical search queries?

• What kind of workflows would you like the HPB to support? Are there particular data formats and/or visualizations that would be helpful for your research?

Interviews were conducted with five participants, either in person (1), via video conferencing software (3), or via email (1). Interviews took place between December 2018 and January 2019. All participants agreed to also evaluate the web application prototype. Where I mention responses by single participants in the following, I will refer to them using singular ‘they’ (Ackerman, 2018) in order to avoid making them identifiable.


3.2. Results

3.2.1. Evaluation of the current HPB

All participants had already worked with the HPB, but some expressed dissatisfaction with its current state. Problems pointed out included:

• Incorrect data (e.g. language codes)

• Incomplete data (e.g. dates, physical description)

• Lack of accessibility

• Presentation based on the level of the physical item, rather than the edition

However, participants also emphasized the potential usefulness of the HPB as a tool, and some indicated that they were planning to use it in their research in the future. One participant also pointed out that their home institution integrates the HPB into their virtual research environment, allowing researchers to enrich the provided metadata.

3.2.2. Relevant data

The participants largely agreed in identifying relevant aspects of the metadata. The following items were listed as particularly interesting for search queries and research questions by at least one researcher: unique identifier, title, date of publication / date of approbation, place of publication / place of printing, printer, language, format / size, number of pages / sheets, illustrations, further physical descriptions / provenance markers, editions, available digitizations, current holdings.

3.2.3. Workflows and Visualizations

Workflows The participants largely favored Excel workbooks (.xlsx) as the primary format of their own workflows, with one researcher also naming Microsoft Word (.docx) as a potential export format. Some also indicated interest in the possibility of data manipulation directly within the HPB web application, naming both the possibility to sort and filter lists by further criteria and to manually select sublists for further manipulation as desiderata. Furthermore, there was interest in the possibility of contributing their own knowledge to the data, enriching it with additions and corrections.
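The export the participants describe is technically straightforward; a sketch of writing a result list to an Excel workbook with pandas is shown below. The columns and rows are invented, and as noted in [3.3.2], export functions were not prioritized for the first prototype.

    import pandas as pd  # writing .xlsx files additionally requires the openpyxl package

    # Invented result rows standing in for an HPB result list.
    results = pd.DataFrame(
        [
            {"title": "Sample print A", "place": "Hanover", "year": 1652, "language": "ger"},
            {"title": "Sample print B", "place": "Hanover", "year": 1687, "language": "lat"},
        ]
    )
    results.to_excel("hpb_results.xlsx", index=False)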

Visualizations The participants, with one exception, also expressed a strong interest in being able to visualize the data obtained from the HPB. They prefer the possibility to download static images that can be embedded in publications to complex, interactive formats. They named the following kinds of visualizations as particularly useful for their research:

• Maps (potentially historical)


• Timelines

One participant also mentioned network visualizations to display the relationships between various entities¹⁴.

3.2.4. Other aspects

The copy vs. edition problem Participants particularly stressed the importance of a presentation on the level of the edition, rather than the physical copy. This highlights a major problem for aggregators like the HPB: Although cataloging practices in libraries also take into account the level of the edition, matching editions that originate in two different datasets is a difficult task that is not currently implemented. Unfortunately, due to this, there is no reliable information about editions explicitly stored within the HPB, only insofar as multiple copies originate from the same data provider. However, researchers very often work at the level of the edition and depend on this information. As one of the participants pointed out, while the HPB is positioned to be a major research instrument due to its sheer size, it massively lacks in usability due to this deviation from the researchers’ practice, which most other relevant databases seem to follow.

Advanced search One participant also commented on the potential search interfaces, pointing out that researchers in their field tend to prefer ‘advanced search’ interfaces with multiple fields, citing the unavailability of this as a reason for researchers abandoning another, similar project. They also pointed out that search should ideally be unrestricted, not requiring any fields (like author or title) to be filled (‘the more search options the better.’)

Authority control Another participant pointed out the desirability of connecting particular data fields (e.g. Printer) with authority files, e.g. with the CERL Thesaurus (Jahnke, 2007). This is partially implemented in the current version of the HPB, where places of publication are linked with their respective records in the CERL Thesaurus.

Permalinks for citation One participant pointed out the necessity of being able to link to result sets and visualizations to ensure the possibility of citation.

Historical geography One participant discussed the limits of geographical facets due to changes in the extent of historical regions, e.g. in the changing relationship between France and Burgundy. One possible solution is to provide filtering based on cities, rather than countries. However, there is also the possibility of representing historical regions as sets of cities associated with a particular point in time, if the necessary data could be obtained externally.

¹⁴ Network visualizations also appear as a distinct research tradition in the historical sciences, although primarily in the study of networks of people and their connections (Düring and von Keyerlingk, 2015). Network visualizations may be more familiar to researchers from this particular research tradition, making them also more likely to embed such visualizations in their own workflows.


3.3. Consequences for the data model

3.3.1. Case studies

I follow the methodology of Al-Eryani and Rühle (2018), which stems from the field of ontology construction and metadata engineering rather than computer science, in constructing a number of case studies from the interviews. As the authors explain, these case studies do not represent single survey results, but instead compile the overall results into a manageable number of hypothetical stories. These stories describe a potential use of the system by an agent and can be analyzed to obtain a number of more abstract scenarios. Various case studies can refer to the same actions and entities. The case studies can then later be used to evaluate the finished applications, or prototypes thereof, against.

For the following three case studies, I took elements from the interviews and reassembled them into three hypothetical uses of the system by fictional agents.

Case Study 1 Mark wants to find all editions printed in Hanover between 1650 and 1700. He is studying the attitudes towards various languages at the time and wants to get the results in separate lists, one for each language. He wants to manually select lists of items from the results that he can further analyze in a software of his choice. However, he does not know in advance which languages will have been used.

Case Study 2 Erlinde is looking for copies with particular physical characteristics, like a certain number of pages, the presence of illustrations or the use of vellum as a material. She wants to find out where these copies have been produced, and where they are being held now. She wants a map of the current holdings so she can get an overview of their current geographical dispersion, as well as links to digitizations where they are available.

Case Study 3 David is interested in the careers of three different printers and wants to compare their output over time. He knows that some of the printers have been identified by different variants of their names. He also wants to visualize the results and use them in his publication.

Use cases are then further analyzed into scenarios, where the hypothetical use cases are represented as more abstract actions (i.e., functions of the system such as searching for or selecting records) and elements of the data model. The corresponding actions and data elements are the following:

Case Study 1 (Mark)

• The user searches for items based on their place of printing and date.

• The user selects their target languages from a provided list.

• The user filters the results by the selected languages.


• The user manually filters the search results further.

• The user exports the results in a familiar file format.

Case Study 2 (Erlinde)

• The user searches for items based on their physical description.

• The user views the items’ place of printing.

• The user visualizes the location of current holdings on a map.

• The user views the links to digitizations.

Case Study 3 (David)

• The user searches for items based on their printers.

– The system automatically resolves the printers’ names to their variants.

• The user visualizes the results on a timeline.

• The user exports the visualization in a familiar file format.

• The user cites a stable link to the results in his publication.

Our data model will have to support the indicated types of data, while our application will have to make it possible to perform the indicated actions.

3.3.2. Types of desiderata

The desiderata provided by researchers can be classified into three broad groups, depending on the availability of the underlying data necessary to fulfill them:

1. Available data

2. Inferable data

3. Unavailable data

Desiderata of type (1) can be implemented using the metadata already available in the HPB. This includes, e.g., the ability to filter results by data like place of publication or date of printing. Type (2) cannot be implemented in the same straightforward manner, but could be made available by automatically enriching the available data. This ranges from simpler cases (e.g. extracting information about physical format from unstructured physical description fields) to more complex ones (e.g. identifying editions of a text through clustering methods). Type (3), finally, would make it necessary to obtain data that is currently not in the record and cannot be inferred from it.


Necessarily, the lines between types (2) and (3) are somewhat fluid, depending on the available technologies and the possibility of accessing external datasets.

For the HPB it makes sense to first prioritize desiderata of type (1), as their implementation should be relatively straightforward. However, note that many of the desiderata named as crucial by researchers (especially the clustering of physical copies under their respective editions) belong to type (2). Here, it becomes a question of weighing the effort required for an implementation against its perceived usefulness to end users. Some points may also be postponed if they require additional resources that are not currently available but will become available at a later point (e.g. links to authority files).

Concretely, I prioritized those features that would allow me to present researchers with a usable prototype as fast as possible, allowing them to comment on the project as it develops. For the first prototype, this meant de-prioritizing search (which requires a lot of infrastructure investment compared to its impact on the interface) and first focusing on the display of records and result lists, as well as the means of interacting with those. Similarly, export functions were not prioritized, as these mostly concern interactions between researchers and the data away from our interface.15

3.4. The data model

3.4.1. BIBFRAME 2.0 as a framework

Using an existing ontology as a framework for our data model allows us to avoid having to construct our own ontology, which is desirable as long as the intended domain already has existing ontologies (Geipel et al., 2013; Suominen and Hyvonen, 2017; Al-Eryani and Ruhle, 2018). This increases interoperability, making the project compatible with other projects that use the same ontology to describe bibliographic resources.

BIBFRAME 2.0 is an ontology developed by the Library of Congress within the Bibliographic Framework Initiative (BIBFRAME). It is intended as a full replacement for the MARC21 (MARBI, 1996) format, to be used not only in publishing linked data on the web, but also as a native data management and storage format. While not widely used in production at the moment, it has been under development since 2011, and has undergone extensive pilot studies (cf. Acquisitions & Bibliographic Access Directorate, 2016), resulting in the redevelopment of the initial BIBFRAME 1.0 ontology. BIBFRAME uses some preexisting ontologies (OWL, RDF and RDFS, SKOS16) as primitive building blocks, but defines its own properties and classes for all relevant aspects of the data model. It is strongly linked with the development of the cataloging standard Resource Description and Access (RDA) and the underlying Functional Requirements for Bibliographic Records (FRBR) model, especially in its deviation from the record-based perspective of MARC towards a more modular perspective that importantly distinguishes between different levels of a record:

15To see which features were implemented in the first prototype, refer to [5.4.1].
16See W3C OWL Working Group (2011); Brickley and Guha (2014); Miles and Bechhofer (2009) respectively for descriptions of these ontologies.


the conceptual level of the work, its material embodiment in an instance, and the level of the actual, concrete copy in an item (McCallum, 2017; El-Sherbini, 2018; Steele, 2018)17.

Using a subset of the terms in the BIBFRAME ontology allows for the rapid development of prototypes, while still allowing for an extension to the full model in the longer term. As a data model for the full description of bibliographic data, BIBFRAME natively supports all the types of data identified in [3.3.1]18:

Case Study              BIBFRAME property                                      PICA+ field(s)
Place                   bf:provisionActivity                                   033A$p
Date                    bf:originDate                                          011@$a
Language                bf:language                                            010@$a
Physical description    bf:extent, bf:baseMaterial, bf:illustrativeContent     034D$a
Current holdings        bf:heldBy                                              009B$c
Link to digitization    bf:hasReproduction                                     009P/09$a
Name of printer         bf:provisionActivity                                   033A$n, 033J$a

This makes it a good candidate for use within the project. Note that the display in tabular form only imperfectly captures the subject-predicate-object structure of RDF, since PICA+ fields cannot be matched to a single property or statement, but often turn into a series of statements. The table above only names the most transparently named relation involved in such a series of statements. For a full view of the data model, see Fig. 6 in [3.4.2].

Comparison to other approaches One – somewhat more Germany-centric – alternative would be following the recommendations of the DINI-AG KIM (2018), which – instead of constructing their own ontology with an explicit underlying model as BIBFRAME does – use a larger number of pre-existing ontologies to select terms from. However, the lack of a coherent underlying model here still makes this proposal subject to the criticism of Svensson (2013), even if it is directed at a previous version of the recommendations (DINI-AG KIM, 2013): They are still rather record-oriented and ‘are more an application profile aiming to provide an easy-to-implement bridge from the library world into the linked data domain than an actual bibliographic model’ (Svensson, 2013).

Beyond the German context and the DINI recommendations, Suominen and Hyvonen (2017) provide an overview of potential data models for bibliographic data, which they present in a ‘family forest’ (Fig. 5), tracing the somewhat complex historical relationships between the models on the horizontal axis, while ordering them with respect to the degree to which they are rooted in a record-based model (like the traditional library formats, such as MARC and PICA+, and their underlying models as expressed in their respective cataloging rules) or in an entity-based model (the ‘natural’ linked data model).

17This feature may later be helpful in dealing with the distinction between copy and edition highlighted by researchers in the interviews.

18Note that the following table refers to PICA+ fields as they are used in the HPB; in some cases this minimally deviates from GBV cataloging guidelines (VZG, 2017b), e.g. in using 009B for holding information for technical reasons. Also, some of the fields may contain additional information that may need to be filtered before conversion, e.g. 009P/09$a, which also contains links to things other than digitizations.


As discussed above, BIBFRAME 2.0 is strongly based on entities, rather than records.

Figure 5: Family forest of bibliographic data models. From Suominen and Hyvonen (2017), published under CC-BY 4.0.

Suominen and Hyvonen (2017) also compare different models in terms of two use cases: ‘libraryish’ ones, where data is maintained natively within RDF (requiring a high degree of detail), and ‘webbish’ ones, where data is maintained in a record-based format and only converted to RDF for web applications (allowing for simplifications to facilitate reuse). Here, they consider BIBFRAME to be a ‘libraryish’ model. Given that the HPB is supposed to be maintained in PICA+ in the foreseeable future, this means that BIBFRAME may not be the ideal target model, and that developers of external applications may prefer a more simplified data model. This concern should be taken seriously, especially in the light of the LOUD discussion I sketched in [2.2]. However, given the status of BIBFRAME 2.0 as an emerging standard and the considerations above, I still consider it a viable data model. Additionally, if it later turns out that simplified models are necessary, it is always easier to produce a simpler model from a more complex one than to go in the other direction.


That is, BIBFRAME 2.0 is a good model for developing our own web application – where the simplification is not needed, but we can potentially make use of the additional information preserved in a more complex model – while in the future we may want to also produce a simplified version of our RDF data for external data consumers.

As Coyle (2015) puts it, comparing BIBFRAME to other FRBR-based RDF models, BIBFRAME ‘is taking on the unenviable task of carrying forward into RDF the centuries-old practices of traditional library cataloging’, making it a fairly conservative model. However, as is clear from Fig. 5, it still breaks with the record-based model and is – though complex – a ‘native’ inhabitant of the semantic web. These two features make it a good starting point for a translation from one world into the other.

3.4.2. Specification of the data model

I use a subset of BIBFRAME 2.0 as my data model, which is sketched in Fig. 6. All relations between a BIBFRAME class (orange) and either a PICA+ or external data source (blue and green respectively) have as their range either the set of literals (besides rdfs:label, this holds for bf:originDate, wgs84_pos:lat, wgs84_pos:long) or the set of URIs (besides rdf:value, this holds for owl:sameAs). Since data sources are indicated in this model, it also serves as a model of the conversion routine, which I will describe in more detail in [4].

As is visible in Fig. 6, the record-based PICA+ is mapped onto different entities in BIBFRAME: Information related to the work is attached to an object of type bf:Work, whereas information that relates to the instance attaches to a bf:Instance. Actual physical copies are modelled as bf:Items.

Each of these three levels has properties that are mostly derived from the original record. For example, the country code of the record – as recorded in PICA+ field 019@, subfield $a – is recorded in BIBFRAME as a property of the work in the following way: An RDF statement with the work as its subject asserts that the object of the relation bf:originPlace is a blank node of type bf:Place. This blank node in turn is the subject of a second RDF statement that asserts the content of 019@$a (a string) as the object of the relation rdf:value. In most cases, the content is derived from the PICA+ record in this way, except for the case of geographic coordinates, which are obtained from external data sources (marked in green in Fig. 6).

A full record, as an example, is shown in [4.4], after a discussion of a number of relevant implementation-specific choices that go beyond the data model in [4.1.1].


Figure 6: The data model.


4. From data model to SPARQL endpoint: converting the data

The data model in Fig. 6 doubles as a (simplified) model of the conversion routine. Literals and URIs in the BIBFRAME model are formed from PICA+ data or obtained from external data sources. In this chapter, I first discuss in somewhat more detail how particular choices in the implementation are justified in [4.1.1]. I then describe the results of inspecting the original data in [4.2], with a focus on data quality in [4.2.1], and the possibilities for enriching this data from external sources in [4.3]. After these preliminary steps, I describe the conversion routine in [4.4].

4.1. Implementation of the data model

4.1.1. Implementation-specific choices

URIs for works and instances For the first prototype, there is a one-to-one relationship between a PICA+ record and a tuple of URIs, one for the work and one for the instance. Since no merging of works is being done in the prototype, we use a simple naming scheme that forms the URI from the record's identifier (the CID) and a suffix of ‘-work’ or ‘-instance’ respectively. Since records are most closely related to the level of the instance, later implementations might make it desirable to use a different naming scheme for the work level.19
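A minimal sketch of this naming scheme is given below; the helper name is illustrative (not griffon's actual code), and the namespace is the development namespace visible in the example record in [4.4].

# Sketch of the naming scheme described above; the function name is illustrative,
# the namespace is the development namespace from the example in [4.4].
HPB_NS = 'http://141.5.107.58:5000/entities/'

def cid2uris(cid):
    """Derive the work and instance URIs for a record from its CID."""
    work_uri = '<%s%s-work>' % (HPB_NS, cid)
    instance_uri = '<%s%s-instance>' % (HPB_NS, cid)
    return work_uri, instance_uri

# cid2uris('DE-604.VK.BV013858317') yields the two URIs used in the example record.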

Blank nodes versus Hashed URIs Blank nodes are an element of RDF that provide existentially quantified variables for statements about unnamed resources that have not been assigned (and are not intended to be assigned) a URI. Because they can make entailment checking over RDF graphs intractable20, and because they cannot be referenced externally, their use is often discouraged (e.g. Heath and Bizer, 2011). However, as Hogan et al. (2014) point out, they are a convenient tool for data publishers, and are indeed prevalent on the web, with 44.9% of documents in a corpus of 8.373 million RDF documents containing at least one blank node, and 25.9% of all unique RDF terms (i.e., URIs, literals and blank nodes) appearing in the documents being blank nodes. The DINI-AG KIM discusses two options for dealing with terms that should not be assigned their own URI: first, blank nodes, and as an alternative, the assignment of a ‘hash URI’, which is constructed from the URI of the subject term and a ‘fragment identifier’ starting with the # symbol. However, note that their discussion of blank nodes (or hash URIs) is limited to cases where they are inserted into the document to avoid assigning both literals and URIs as the object of the same property, essentially using them to provide an anchor for a literal attached with the rdfs:label property (DINI-AG KIM, 2018).

19Independently, there is a discussion to what degree identifiers (and by extension URIs) should carry meaning or be semantically opaque, e.g. in the context of DOIs (International DOI Foundation, 2018). I treat this discussion as outside of the scope of this thesis, as at least for the prototype implementations no permanent decisions about naming schemes are necessary yet. Note though that this also would provide additional arguments against hashed URIs, as discussed in the next section, as hashed URIs are by definition not semantically opaque, instead expressing a kind of mother-child relation between two URIs.

20The notion of intractability, as used in computer science, describes a problem which can be solved theoretically, but which is too resource-intensive to be solved in any actual implementation.


As Hogan et al. (2014) demonstrate, these are actually unproblematic uses of blank nodes that do not add computational complexity. If there is no need to externally reference them, it is therefore semantically more appropriate to use blank nodes, especially since the use of hash URIs can actually prevent entailment checks between two different graphs that use different methods of assigning them, as well as creating problems with the persistence of hash URIs across successive versions of the same graph. Using blank nodes in these cases is indeed the recommendation of Booth (2012), which is also endorsed by Hogan et al., and which I will follow in this project21.

URIs and literals In the spirit of linked data, I aimed at using URIs rather than strings (i.e., literals) to represent the data. Since the original data format is a string-based one, this was only possible where the provided strings are standardized enough to be matched to their respective URIs in an ontology. Concretely, assigning URIs was possible for language codes (standardized according to ISO 639-2, using three-letter codes), agent roles (using three-letter MARC relator codes) and countries of origin (standardized according to ISO 3166-1, using four-letter codes). Ready-made ontologies for MARC relators and ISO 639-2 language codes are provided by the Library of Congress (2017a,b).
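The conversion code shown in [4.4] uses a helper named value2uri for this mapping; the following is only a minimal sketch of the idea, not griffon's actual implementation (which is part of Appendix B). The prefix table shown here is illustrative.

# Sketch of a string-to-URI mapping in the spirit of value2uri; the prefix table
# and the validation are illustrative, not the actual implementation.
PREFIXES = {
    'lang': 'http://id.loc.gov/vocabulary/iso639-2/',   # ISO 639-2 language codes
    'role': 'http://id.loc.gov/vocabulary/relators/',   # MARC relator codes
}

def value2uri_sketch(code, prefix):
    """Turn a standardized code such as 'lat' into a prefixed name such as 'lang:lat'.

    The prefix table above corresponds to @prefix declarations written into the
    Turtle output."""
    if prefix not in PREFIXES:
        raise ValueError('No namespace registered for prefix %s' % prefix)
    return '%s:%s' % (prefix, code)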

4.2. Inspection of the original data

In order to inspect the original data, I wrote a general-purpose PICA+ parser called griffon in Python. The main challenge here was the size of the HPB, with roughly eight million records (∼ 11.3 GB in the original PICA+ format), which could not be stored in memory on any of the available computers. For this reason, the parser reads the original PICA+ data, stores the result in structured Python objects and serializes them in chunks of 10.000 records each22. Queries over the data are then made in a distributed fashion, querying each chunk in turn and combining the results. The parser is attached to this thesis as Appendix B. On top of providing access to the original records in a systematic fashion, it also provides a number of convenience functions, e.g., listing the contents of a particular PICA+ field by occurrences or counting the number of records that contain a particular PICA+ field. I used these for a first inspection of the data.
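The chunking strategy can be sketched as follows; function and file names are illustrative, and the actual implementation is part of griffon (Appendix B).

# Sketch of chunked serialization and chunk-wise querying, assuming pickle as
# the serialization format; not griffon's actual code.
import pickle

CHUNK_SIZE = 10000  # records per chunk, configurable

def serialize_chunks(records, prefix='chunk'):
    """Write parsed records to disk in chunks so the full set never sits in memory."""
    chunk, n = [], 0
    for record in records:              # 'records' can be a generator over the PICA+ dump
        chunk.append(record)
        if len(chunk) == CHUNK_SIZE:
            with open('%s_%05d.pkl' % (prefix, n), 'wb') as f:
                pickle.dump(chunk, f)
            chunk, n = [], n + 1
    if chunk:
        with open('%s_%05d.pkl' % (prefix, n), 'wb') as f:
            pickle.dump(chunk, f)

def query_chunks(chunk_files, predicate):
    """Run a query chunk by chunk and combine the results."""
    hits = []
    for path in chunk_files:
        with open(path, 'rb') as f:
            for record in pickle.load(f):
                if predicate(record):
                    hits.append(record)
    return hits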

4.2.1. Issues with data quality

A preliminary inspection of the data with the help of griffon showed that there are multiple issues with data quality in the HPB that could potentially affect both the data conversion routine and the subsequent application. In the following section, I describe some of these problems and the solutions I chose to either resolve or work around them.

21As I noticed during the implementation process, this has the additional advantage that Apache Jena Fuseki does not play well with hashed URIs, a problem we avoid by using blank nodes.

22The script can be configured for different chunk sizes. A size of 10.000 records was determined to be optimal on the development machine, allowing me to run queries in the background while still retaining some memory for other tasks, e.g. thesis writing.


Duplicates Given the size of the HPB and its origin as an aggregation of multiple data sets, it contained a surprisingly small number (91) of genuine duplicates, i.e. records that had the same CERL ID (CID, the HPB's internal unique identifier). Griffon contains a function that identified them after the entire data set was parsed and replaced their identifiers with randomly generated UUIDs (The Internet Society, 2005), allowing them to be accessed while retaining a dictionary that maps those UUIDs back to their original CID for later investigation.
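A minimal sketch of such a deduplication step is given below; the function name and the record attribute are illustrative, not griffon's actual code.

# Sketch of the deduplication step described above, assuming each record object
# exposes its CID as 'record.cid'.
import uuid

def reassign_duplicate_cids(records):
    """Give records with an already-seen CID a random UUID and remember the original CID."""
    seen = set()
    uuid_to_cid = {}
    for record in records:
        if record.cid in seen:
            new_id = str(uuid.uuid4())
            uuid_to_cid[new_id] = record.cid   # mapping kept for later investigation
            record.cid = new_id
        else:
            seen.add(record.cid)
    return uuid_to_cid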

Invalid or missing codes in standardized fields There is a number of fields (e.g., language, country and relators) that are supposed to contain standardized codes, which makes them ideal for conversion to URIs. However, inspection of these fields turned up a large number of entries that did not conform to the standard, were coded as ‘unknown’, or were simply missing that field.23 For country codes (PICA+ field 019@a), there were 712 invalid codes, 546.478 ‘ZZ’ (=unknown) codes and 1.592.709 records without an entry in the field. For language codes (PICA+ field 010@a), there were 57.346 syntactically invalid codes (e.g. having only one or two letters), in addition to 928.164 entries coded as ‘und’ (=undefined) and 508.994 records without any language code at all. For years of publication (PICA+ field 011@a), I found that 19.968 records were coded as ‘uuuu’ (=unknown), with another 703 records having implausible (i.e. beyond the current year) or syntactically ill-formed publication dates; 14.303 records did not have an entry in the field. In many cases, ill-formed entries seem to be based on syntactically well-formed codes, where a conversion routine has accidentally cut off one or more letters. In others, the mistake was most likely already present in the original data and is a result of cataloging not adhering to the standards. For the purposes of the conversion to RDF, one could match all three cases (ill-formed, coded undefined, missing) to ‘undefined’; however, it may be possible in the future to recover some of the data lost in conversion routines, either from the original data or by systematically reconstructing the error (e.g. in the case of ‘ng’, which is very likely to have been ‘eng’, for English, originally). For this reason, I currently retain the original encoding: ill-formed codes are displayed without a URI, while entries explicitly coded as ‘undefined’ or entirely missing a language code are linked to the appropriate ISO 639 category at http://id.loc.gov/vocabulary/iso639-2/und.html. I follow the same principle for the other two cases, country codes and MARC relators24.
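The decision logic for language codes can be sketched as follows; the helper name and the set of valid codes are illustrative, not the conversion routine's actual code.

# Sketch of the handling of language codes described above: well-formed codes
# become URIs, missing or explicitly undefined codes map to 'und', ill-formed
# codes are kept as plain literals.
import re

def language_code_to_rdf(code, valid_codes):
    """Map a PICA+ 010@ $a value to either a prefixed URI or a plain literal."""
    if code is None or code == 'und':
        return 'lang:und'                      # missing or explicitly undefined
    if re.fullmatch(r'[a-z]{3}', code) and code in valid_codes:
        return 'lang:%s' % code                # well-formed, standardized code
    return '"%s"' % code                       # ill-formed: keep the original string, no URI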

Template artefacts On inspection, I also discovered a field where catalogers had apparently worked with a template, but had neglected to fill in the respective values for the record: In PICA+ field 037Aa, 2.478 records stated ‘Die Vorlage enthält insgesamt ... Werke’, with no value given25.

23Note that there is a number of possible reasons for missing codes: They can result from cataloging errors, cataloging practices that omit the field in question, mistakes during data conversion or genuine undecidability resulting from the source material.

24Note though that the process for extracting MARC relators is slightly more complex, as cataloging in this field is not very standardized, often mixing free text entries with codes, e.g. in a construction like ‘Auteur (*aut)’. Here, I employ a regular expression to try to detect three-letter codes within the text at the level of the web application, in order to leave the original data intact at conversion.


As a work-around, this particular case could be filtered out in the conversion routine. In the current version of the conversion, this field is not converted to RDF, rendering the problem irrelevant.

UTF-8 combining diacritics The original PICA+ data encodes umlauts (e.g. in German) with the non-umlaut character followed by the byte sequence 204 136 (the UTF-8 encoding of the combining diaeresis). This encoding is preserved across conversions and leads to problems when displaying the data in a web browser with a number of fonts, including the ones chosen for the web application, where the umlaut diacritics can appear ill-aligned with the vowel characters. Instead, the browser expects a UTF-8 encoded umlaut composed of the byte 195 and another number identifying the vowel. As a workaround, the conversion routine replaces the original sequence with the expected UTF-8 sequence, ensuring correct rendering in the browser.26
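One way to perform this replacement is Unicode NFC normalization from the standard library, which composes a vowel plus combining diaeresis into the precomposed character; this is a sketch of the idea, not necessarily the exact substitution the conversion routine performs.

# Sketch of composing decomposed umlauts; griffon's actual routine may differ.
import unicodedata

def compose_umlauts(text):
    """Replace decomposed vowel + combining diaeresis sequences with precomposed characters."""
    return unicodedata.normalize('NFC', text)

# compose_umlauts('Gl' + 'u' + '\u0308' + 'ck') -> 'Glück'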

4.3. Enriching the data

4.3.1. External RDF data

After inspecting our own original data, in order to enrich it, I also included some externally provided RDF data in our triple store. Hosting these external data sets in our own triple store, rather than querying them through an external SPARQL endpoint, has the disadvantage of potentially using an outdated version of the respective dataset, rather than the respective data provider's most current version. However, the advantage is that if the provider decides to make changes to the dataset, we can choose whether to also implement them on our side, or whether to continue working off the older version. This is especially important where changes would break some functionality in our web application. In addition, hosting the data ourselves improves performance over externally hosted SPARQL endpoints. It is also a necessity where the data provider does not present a SPARQL endpoint in the first place. This is the case for the ISO 639-2 language codes and the MARC relators provided by the Library of Congress (Library of Congress, 2017a,b). Here, I uploaded the provided data dumps to our own data store in order to resolve language labels.

I also created an RDF data set that maps the GBV's slightly deviant use of ISO 3166 country codes to their Wikidata equivalents and provides German and English labels. This data was created semi-automatically, by querying Wikidata labels for the German labels provided by the GBV, and – if successful – adding the English label from the Wikidata item automatically. Remaining items (mostly subdivisions of Germany, Austria and Switzerland, as well as historical territories no longer in existence) were added by hand. This data set is also provided as Appendix C to this thesis.

25This field is used for general notes in PICA+, with GBV-RDA cataloging guidelines specifying that dependent documents (‘unselbständige Werke’) are to be noted here with this exact template (VZG, 2018).

26Note that technically this is not a problem with the data as such, as the encoding conforms to UTF-8 standards, but rather the result of an incomplete implementation of these standards by the font providers.


All external data sets are stored in named graphs, allowing them to be queried where needed while ignoring them for other queries, increasing query speed.
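As an illustration of how such a named graph can be addressed, the following query pattern resolves a language label while leaving the bibliographic data untouched; it is hypothetical, the graph URI is invented, and the label property is an assumption about the Library of Congress dump rather than a confirmed detail.

# Illustrative SPARQL pattern (not taken from the project); the graph URI is
# hypothetical and skos:prefLabel is an assumed label property of the LoC dump.
LANGUAGE_LABEL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?label WHERE {
  GRAPH <http://example.org/graphs/iso639-2> {
    <http://id.loc.gov/vocabulary/iso639-2/lat> skos:prefLabel ?label .
  }
}
"""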

4.3.2. The CERL Thesaurus

The current HPB data already contains links to the CERL Thesaurus (Jahnke, 2007) for places of publication, with links for publishers intended to be added in a future version. The Thesaurus provides variant name forms as well as links between places and persons from the hand-press era. Of particular interest for my project, it also links place names with their standardized GeoNames identifier. GeoNames is a geographical database which contains a number of data points about places (e.g., coordinates, population numbers), but more importantly, could also be useful in linking place names to Wikidata, where GeoNames identifiers are also stored (Unxos GmbH, 2019). This enables us to add geographical features, such as map representations, to the HPB, as well as linking our data store to any information that is stored about those places in Wikidata. This is especially important because Wikidata, in the current linked data ecosystem, serves as a hub that brings together a large number of identifiers from other databases, linking their data sets with one another. Through the links provided in the CERL Thesaurus, the HPB also becomes part of that ecosystem27. At a later point in time, links will also be established between persons in the HPB and their representation in the CERL Thesaurus, making it possible to access this information in the same way. Currently, users can only access this information manually by following the link from the HPB to the thesaurus and looking up people associated with places in the Thesaurus itself.

4.4. Conversion routine

The conversion of the original PICA+ data to RDF/Turtle was implemented in a conversion routine that forms a module within griffon, making it possible to access the internal data model generated as a result of parsing the records28.

The conversion routine accesses the internal representation of the PICA+ format and writes out RDF files in Turtle syntax. A typical simple case of translating a PICA+ field into a BIBFRAME representation, corresponding to the highlighted part of the data model in Fig. 7, looks like this:

try:
    language = value2uri(record.fields.get('010@')[0].subfields.get('a').content.lower(), 'lang')
    out.append('%s bf:language [ rdf:type bf:Language; rdf:value %s ] .' % (iuri, cleanup(language)))
except (TypeError, KeyError, AttributeError):
    log.warning("No language code in original data. Substituting und")
    out.append('%s bf:language [ rdf:type bf:Language; rdf:value %s ] .' % (iuri, "lang:und"))

Figure 7: A single conversion step from the data model.

27Note that we also link with Wikidata through the country codes, as described in the section above. However, these links are much less precise and mostly serve as a backup for cases where no precise location is established in the metadata.
28For the full source code of the conversion routine, see Appendix B.

This code reads the value of PICA+ field 010@, subfield a, and transforms its content (an ISO 639-2 language code) into a URI. It then constructs a Turtle statement that relates the instance (iuri) with this URI through the appropriate BIBFRAME construction, in this case a blank bf:Language node. If the PICA+ representation does not contain a value for field 010@a, the language code is set to und (undefined) instead and written to the Turtle file in the same way.

In many cases, e.g. when establishing titles and contributors, more complicated routines check the values of several PICA+ fields and make decisions about the resulting BIBFRAME representation based on their contents:

try:
    title_a = record.fields.get('021A')[0].subfields.get('a').content.replace("@", "")
except (TypeError, KeyError, AttributeError):
    log.warning("No main title in original data.")
    title_a = None

try:
    title_d = record.fields.get('021A')[0].subfields.get('d').content
except (TypeError, KeyError, AttributeError):
    log.warning("No subtitle in original data.")
    title_d = None

try:
    title_h = record.fields.get('021A')[0].subfields.get('h').content
except (TypeError, KeyError, AttributeError):
    log.warning("No responsibility statement in original data.")
    title_h = None

Here, subfields a, d and h of PICA+ field 021A are checked for their contents, representing main title, subtitle and responsibility statement respectively. The following code then assembles a display string from these ingredients:

if (title_a and not title_d):
    title = title_a
elif (title_a and title_d):
    title = title_a + " : " + title_d
elif (not title_a and title_d):
    log.warning("Using subtitle as title.")
    title = title_d
else:
    title = None

if title_h:
    if title:
        title = title + " / " + title_h
    else:
        title = "/ " + title_h

Both the respective parts and the display string are then written into the Turtle file, the former as objects of BIBFRAME relations (bf:mainTitle, bf:subTitle, bf:responsibilityStatement), the latter as the object of an rdfs:label relation29:

if title_a:
    out.append('%s bf:title [ rdf:type bf:Title; bf:mainTitle "%s" ] .' % (wuri, cleanup(title_a)))
if title_d:
    out.append('%s bf:title [ rdf:type bf:Title; bf:subTitle "%s" ] .' % (wuri, cleanup(title_d)))
if title_h:
    out.append('%s bf:responsibilityStatement "%s" .' % (iuri, cleanup(title_h)))
if title:
    out.append('%s bf:title [ rdf:type bf:Title; rdfs:label "%s" ] .' % (wuri, cleanup(title)))

29Note that this is an abbreviated explanation of the full system, which also considers other sources for title information and deals with the more complex cases of series and part titles.

There are also a number of cases where information is not available in the PICA+ representation, but is obtained from external sources instead:

Coordinates (CERL Thesaurus) Since the PICA+ representation contains a link to the CERL Thesaurus for places of publication, we can read coordinates from the thesaurus' RDF file. The CERL Thesaurus in turn obtains its coordinates from the Bernstein Paper Atlas Project (Atanasiu et al., 2008) or manual entry, and also includes a link to the GeoNames representation of the place (derived from the coordinates), which is copied into the BIBFRAME model.

def cerlthesaurus2coordinates(uri):
    g = rdflib.Graph()
    r = requests.get(uri, headers={'accept': 'application/rdf+xml'})
    try:
        g.parse(data=r.text)
    except:
        return [None, None]
    clat = None
    clong = None
    geonames = None
    a = g.query('select ?s where { [] wgs84_pos:long ?s .}')
    for row in a:
        clong = str(row.s)
    a = g.query('select ?s where { [] wgs84_pos:lat ?s .}')
    for row in a:
        clat = str(row.s)
    a = g.query('select ?s where { [] owl:sameAs ?s .}')
    for row in a:
        identifier = str(row.s)
        if identifier.startswith('http://sws.geonames.org/'):
            geonames = identifier
    if clat and clong and geonames:
        return [[clat, clong], geonames]
    elif clat and clong:
        return [[clat, clong], None]
    elif geonames:
        return [None, geonames]
    else:
        return [None, None]

Coordinates (Wikidata) For holding institutions, the PICA+ representation currently only uses non-standardized strings with the institutions' names. In order to assign coordinates to these places, the current version of the conversion script looks them up in Wikidata. In ideal cases, the coordinates of the holding institution can be established, with a fallback routine trying to establish the coordinates of the containing city otherwise. Since the automatic lookup of Wikidata items based on strings is likely to be error-prone30, later versions of the code will use a CERL database of holding institutions which is currently in preparation.

def uri2coordinates(uri):
    try:
        r = requests.get(uri + '.json')
        entity = uri[len('http://www.wikidata.org/entity/'):]
        lat = r.json()['entities'][entity]['claims']['P625'][0]['mainsnak']['datavalue']['value']['latitude']
        lon = r.json()['entities'][entity]['claims']['P625'][0]['mainsnak']['datavalue']['value']['longitude']
        return [lat, lon]
    except:
        return None

Result A typical result from the conversion routine looks as follows:

30And indeed, many such errors occur in the prototype, ranging from not being able to find a Wikidata item for the Bayerische Staatsbibliothek, to merging the French and Icelandic national libraries, to identifying Wikidata items that are entirely misleading. Possible solutions would require the adoption of more sophisticated technology from the area of Named Entity Recognition and Disambiguation. However, note that these approaches are also still challenged by short text fragments, as they usually occur in metadata, with no disambiguating context (see e.g., Sakor et al., 2019).


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix hpb: <http://141.5.107.58:5000/entities/> .
@prefix iso3166: <http://141.5.107.58:5000/iso3166/> .
@prefix lang: <http://id.loc.gov/vocabulary/iso639-2/> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

hpb:DE-604.VK.BV013858317-work rdf:type bf:Work .
hpb:DE-604.VK.BV013858317-instance rdf:type bf:Instance .
hpb:DE-604.VK.BV013858317-work bf:hasInstance hpb:DE-604.VK.BV013858317-instance .
hpb:DE-604.VK.BV013858317-instance bf:instanceOf hpb:DE-604.VK.BV013858317-work .
hpb:DE-604.VK.BV013858317-instance rdf:type bf:Print .
hpb:DE-604.VK.BV013858317-instance bf:identifiedBy [ rdf:type bf:Local; rdfs:label "DE-604.VK.BV013858317" ] .
hpb:DE-604.VK.BV013858317-instance bf:language [ rdf:type bf:Language; rdf:value lang:lat ] .
hpb:DE-604.VK.BV013858317-instance bf:originDate "1668" .
hpb:DE-604.VK.BV013858317-work bf:originPlace [ rdf:type bf:Place; rdf:value iso3166:XA-DE ] .
hpb:DE-604.VK.BV013858317-work bf:title [ rdf:type bf:Title; bf:mainTitle "Gottes geschick, der Christen Glück" ] .
hpb:DE-604.VK.BV013858317-work bf:title [ rdf:type bf:Title; bf:subTitle "In einer Christlichen Leichpredigt über die Wort S. Pauli Rom. 8. v. 28. Als Die ... Frau Susanna Margaretha, geborne Viescherin, Des ... Herrn, Johann Melchioris Fuchsen, Wohlverordneten Stattschreibers in ... Speyer, gewesene Eheliebste ... zur Erden bestattet worden ; Außgeführet und erklärt in S. Augustini Kirchen" ] .
hpb:DE-604.VK.BV013858317-instance bf:responsibilityStatement "Durch Johann-Petrum Weidtmann, selbiger Kirchen Evangelischen Pastorem" .
hpb:DE-604.VK.BV013858317-work bf:title [ rdf:type bf:Title; rdfs:label "Gottes geschick, der Christen Glück : In einer Christlichen Leichpredigt über die Wort S. Pauli Rom. 8. v. 28. Als Die ... Frau Susanna Margaretha, geborne Viescherin, Des ... Herrn, Johann Melchioris Fuchsen, Wohlverordneten Stattschreibers in ... Speyer, gewesene Eheliebste ... zur Erden bestattet worden ; Außgeführet und erklärt in S. Augustini Kirchen / Durch Johann-Petrum Weidtmann, selbiger Kirchen Evangelischen Pastorem" ] .
hpb:DE-604.VK.BV013858317-work bf:contribution [ rdf:type bf:Contribution; bf:role [ rdf:type bf:Role; rdfs:label "Verfasser, *aut" ]; bf:agent [ rdf:type bf:Agent; rdfs:label "Weidtmann, Johann Peter" ] ] .
hpb:DE-604.VK.BV013858317-work bf:contribution [ rdf:type bf:Contribution; bf:role [ rdf:type bf:Role; rdfs:label "Gefeierte Person, *hnr" ]; bf:agent [ rdf:type bf:Agent; rdfs:label "Fuchs, Susanna Margaretha" ] ] .
hpb:DE-604.VK.BV013858317-instance bf:provisionActivity [ bf:place [ rdf:type bf:Place; rdfs:label "Regenspurg" ] ] .
hpb:DE-604.VK.BV013858317-instance bf:provisionActivity [ bf:place [ rdf:type bf:Place; rdf:value <http://thesaurus.cerl.org/record/cnl00016188>; wgs84_pos:lat "49.02"; wgs84_pos:long "12.1"; owl:sameAs <http://sws.geonames.org/2849483/> ] ] .
hpb:DE-604.VK.BV013858317-instance bf:provisionActivity [ bf:agent [ rdf:type bf:Agent; rdfs:label "Fischer" ] ] .
hpb:DE-604.VK.BV013858317-instance bf:extent [ rdf:type bf:Extent; rdfs:label "[32] Bl." ] .
hpb:DE-604.VK.BV013858317-instance bf:hasItem [ rdf:type bf:Item; bf:heldBy [ rdf:type bf:Agent; rdf:type bf:Organization; rdfs:label "Staatliche Bibliothek Regensburg"; rdf:value <http://www.wikidata.org/entity/Q2324325>; wgs84_pos:lat "49.01833333"; wgs84_pos:long "12.09055556" ] ] .
hpb:DE-604.VK.BV013858317-instance bf:hasItem [ rdf:type bf:Item; bf:heldBy [ rdf:type bf:Agent; rdf:type bf:Organization; rdfs:label "Bayerische Staatsbibliothek München"; rdf:value <http://www.wikidata.org/entity/Q256507>; wgs84_pos:lat "48.14751"; wgs84_pos:long "11.58015" ] ] .

4.5. Hosting the graph

For development purposes, the resulting RDF graph is hosted on a virtual machine provided by the Gesellschaft für wissenschaftliche Datenverarbeitung (GWDG) in Göttingen, using a standard installation of Apache Jena Fuseki (The Apache Software Foundation, 2018). Fuseki provides both a triple store and a SPARQL endpoint ‘out of the box’.
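A client can talk to this endpoint with plain HTTP; the following minimal sketch assumes a local Fuseki installation with a dataset named 'hpb', which is an illustrative configuration rather than the project's actual one.

# Sketch of querying the Fuseki SPARQL endpoint over HTTP; host, port and
# dataset name are assumptions.
import requests

ENDPOINT = 'http://localhost:3030/hpb/sparql'

def run_query(query):
    r = requests.get(ENDPOINT, params={'query': query},
                     headers={'Accept': 'application/sparql-results+json'})
    r.raise_for_status()
    return r.json()['results']['bindings']

# Example: count the converted instances.
# run_query('PREFIX bf: <http://id.loc.gov/ontologies/bibframe/> '
#           'SELECT (COUNT(?i) AS ?n) WHERE { ?i a bf:Instance }')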

4.6. Results

The conversion routine generates syntactically valid Turtle files which can be uploaded into a triple store31. For the first prototype, I converted a set of 15.000 records randomly chosen from the full set of HPB records and uploaded them by hand, using the Fuseki web interface. Further iterations will most likely profit from setting up an automated workflow for converting and uploading files.
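One possible shape of such an automated upload step is sketched below, posting Turtle files to the HTTP data endpoint of a standard Fuseki installation; the dataset name and file locations are illustrative assumptions.

# Sketch of an automated upload of converted Turtle files to Fuseki; dataset
# name and paths are assumptions, not the project's configuration.
import glob
import requests

DATA_ENDPOINT = 'http://localhost:3030/hpb/data'

def upload_turtle_files(pattern='converted/*.ttl'):
    for path in sorted(glob.glob(pattern)):
        with open(path, 'rb') as f:
            r = requests.post(DATA_ENDPOINT, data=f.read(),
                              headers={'Content-Type': 'text/turtle'})
        if r.status_code not in (200, 201, 204):
            print('Upload failed for %s: HTTP %s' % (path, r.status_code))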

The main takeaway from undertaking this conversion to RDF is the importance of metadata quality, which cannot be overstated. The challenge here was not in implementing the conversion itself, but rather in dealing with the border cases of insufficiently standardized data, incorrect data and missing data. There is already a quality control process in place when data is aggregated at the Metadata Group in Göttingen, but as the inspection of the data shows, the conversion to RDF may pose new challenges that may require considering additional measures. Specifically, where string-based data formats can more easily deal with deviations from standardization, allowing for correction

31There is a marginal number of encoding errors that have not yet been addressed in the conversion routine, making some Turtle files fail on upload. Addressing these errors will be part of the next iteration of improving the conversion routine.


measures like fuzzy searches, a data format that wants to uniquely and with certainty map entities to URIs cannot afford the same leniency.

In the same way, missing data points in a record-based format only affect that particular record and its extent; however, in the context of linked data, missing data points can also skew the accuracy of statements about the data set as a whole32, and can interrupt the links between data sets where they depend on that particular data point, for example in linking bibliographic records to a geographical data set, which becomes impossible when a place of publication cannot be identified.

However, we also need to acknowledge that it is impossible to achieve perfection in metadata quality, especially when data is aggregated from external sources, where there is little control over the original metadata creation process and further losses may occur in conversion.

32This is, of course, to a lesser degree also true if traditional record-based databases are made an object of study, but it is to be expected that this will be more and more the case where data is published in open, accessible, machine-readable formats.


5. From SPARQL endpoint to use case: developing and evaluating the web application

In this chapter, I describe the design of the web application. I begin with a sketch of the technical infrastructure in [5.1], followed in [5.2] by general design considerations based on both my review of the literature and my own interviews with researchers. In [5.3], I present the first prototype's main views. I follow up with a discussion of immediate insights from the design process itself in [5.4], before describing how I gathered feedback from both researchers and the CERL executive committee in [5.5–5.6]. Here, I also discuss how this feedback will influence the development of the second prototype, before giving a preview of its current state in [5.7].

5.1. Technical infrastructure

System architecture. A high-level overview of the system architecture is given in Fig. 8: The converted data is accessible to web applications via the SPARQL endpoint provided by Fuseki. Our own web application is on the same level as other potential future web applications, also accessing the data in the same way. Human users will be able to choose to interact with our data through any of the available web applications. However, our web application will serve a central role as an entry point both for non-specialized users (who may want to use the ‘official’ application), and for technical users seeking to understand the data in order to build their own applications.

Technology. The web application communicates with the SPARQL endpoint set up previously and serves the information contained within in human-readable form. As a framework for the prototype web application, I chose Flask (Ronacher, 2019). Flask is a minimalistic web framework for the Python language, allowing for rapid development with relatively little overhead. While other frameworks (e.g. Django) may offer a better long-term perspective in terms of the functionality they provide ‘out of the box’, Flask provided the possibility of quickly developing prototypes to be shared with my participants. Additionally, I was already acquainted with the framework, making it easier to begin work without having to learn yet another technology.

Flask uses the Jinja2 template engine (Ronacher, 2014), i.e. the web application can be developed in standard HTML and CSS, which is then extended with Jinja code in order to dynamically generate content from the Python-based backend. In our case, the Python backend itself serves as an interface to the Fuseki triple server, running SPARQL queries against it and serving the results in a human-readable way.
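A minimal sketch of this pattern is shown below; the route, the record.html template, the endpoint URL and the query are illustrative, not the application's actual code.

# Sketch of a Flask route that runs a SPARQL query and feeds the result into a
# Jinja2 template; endpoint, template and query shape are assumptions.
from flask import Flask, render_template
import requests

app = Flask(__name__)
ENDPOINT = 'http://localhost:3030/hpb/sparql'   # assumed Fuseki query endpoint

@app.route('/record/<identifier>')
def record_view(identifier):
    # NOTE: a real application should escape 'identifier' before interpolation.
    query = """
        PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {
          ?work bf:hasInstance ?instance ;
                bf:title [ rdfs:label ?label ] .
          ?instance bf:identifiedBy [ rdfs:label "%s" ] .
        }""" % identifier
    r = requests.get(ENDPOINT, params={'query': query},
                     headers={'Accept': 'application/sparql-results+json'})
    bindings = r.json()['results']['bindings']
    title = bindings[0]['label']['value'] if bindings else None
    return render_template('record.html', identifier=identifier, title=title)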

While I aimed to keep the external dependencies of the web application minimal, I used three JavaScript libraries that are imported into the application from an external server. For long-term production use, these should be installed locally to guarantee their availability independent of external infrastructure, or ideally replaced with a non-JavaScript server-side solution. The libraries used were Leaflet (for map visualizations), Chart.js (for various kinds of charts) and YASGUI (a graphical interface to the SPARQL endpoint).


Figure 8: System architecture.


5.2. Design considerations

Following the conclusions of my literature review in [2.4], I aimed at developing an application that easily integrates with the existing workflows of non-technical domain experts, that is, researchers who are familiar with the bibliographic data and its interpretation, but not necessarily with the technical details of linked open data implementations. For this purpose, I designed an application that largely followed the layout of the existing HPB interface, as shown in [2.1], maintaining a coherent visual identity to allow users to recognize that they are interacting with a familiar data set, including the use of the same main views (i.e., start page, search interface, list of search results and display of a single record). At the same time, I attempted to give the application a more modern look and feel, updating its main views in a way that is more aligned with current web design practices.

As I discussed in [2.3], I wanted to avoid letting the internal structure of the data dictate its visualization. Rather, I followed the input from researchers in the initial interviews, using forms of visualization that are familiar and intuitive, namely maps and bar charts. The aim here was to support activities researchers told me would be relevant to their work, such as understanding the geographic or historical distribution of items.

The danger of such a ‘conservative’ approach is that it may be more difficult to showcase the advantages of the underlying data conversion. In order to increase usability and minimize learning efforts on the researchers' side, some design decisions may not optimally make use of the potential of linked open data. Additionally, hiding away the complexity of the underlying system may lead to a lack of appreciation – a paradox well known to information science (e.g., Star and Strauss, 1999), media theory (e.g., Kramer, 1998) and library practitioners (e.g., Georgy and Schade, 2012) alike: Where our systems are maximally comfortable and user-friendly, they are often not even recognized as a library-provided service that requires maintenance and further development, a phenomenon visible, e.g., in the seamless access to electronic resources. Both the literature and the interviews support such an approach, as it meets the needs of the users; however, it does raise the challenge of communicating the value added by a new system.

To demonstrate the possibilities of a LOD approach that would be more accessible than a full SPARQL interface, I also designed a demo for a search that brings together the HPB's bibliographic data and information from Wikidata, making possible a search for items held outside their country of production, which would not have been possible with the resources of the HPB alone.
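A hypothetical sketch of what such a query could look like is given below; it is not the demo's actual query. It assumes that the local iso3166: codes are mapped to Wikidata countries via owl:sameAs, that holding institutions carry Wikidata URIs as rdf:value (as in the example record in [4.4]), and that a federated SERVICE clause against the Wikidata endpoint is used.

# Hypothetical query combining HPB data and Wikidata: items held in a country
# other than their country of production. Property choices, the owl:sameAs
# mapping and the federated SERVICE clause are assumptions.
HELD_ABROAD_QUERY = """
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?work ?institution WHERE {
  ?work bf:originPlace [ rdf:value ?countryCode ] ;
        bf:hasInstance ?instance .
  ?countryCode owl:sameAs ?originCountry .           # local code mapped to a Wikidata country
  ?instance bf:hasItem [ bf:heldBy ?holder ] .
  ?holder rdf:value ?institution .                   # Wikidata URI of the holding institution
  SERVICE <https://query.wikidata.org/sparql> {
    ?institution wdt:P17 ?holderCountry .            # P17 = country
  }
  FILTER (?holderCountry != ?originCountry)
}
"""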

5.3. Prototype I: Main views

The first prototype of the web application consists of four main views and a navigation bar:

• Start page: Basic information about the database, quick access to main functions.

• Search interface: A view that implements basic and/or advanced search functionalities.


Figure 9: First prototype, start page.

• List view: A view that lists several items, e.g. as a result of a search

• Record view: A view that shows details on a single item

5.3.1. The start page

The start page (Fig. 9) serves as a first introduction to the database. Professional users of a database will later often bypass it by bookmarking the search interface directly. For this reason, I decided to keep the start page simple in functionality, but use it to provide an attractive first impression for users who may not be well-acquainted with the database yet. The prototype employs a start page with a large ‘hero image’ (selfhtml, 2018) depicting a short passage of text set in movable metal type. Both the picture and the headline pick up the original HPB's orange-blue color scheme, to increase recognizability for the existing user base. Below the image, the start page provides a simple search box and a short introductory text about the contents of the database.


5.3.2. The search interface

The current prototype does not fully implement this view yet. Instead, selecting ‘advanced search’ brings the user to a landing page that discusses the future search interfaces (Fig. 10). It contains a link to a YASGUI-powered graphical user interface for the SPARQL endpoint (Fig. 11). However, without intimate knowledge of the underlying data structure (both the BIBFRAME framework and the actual implementation), it is unlikely that lay users will be able to use this interface without assistance. Nevertheless, in order to demonstrate the possibilities of a full SPARQL search, the landing page also provides a sample query that brings together data from the HPB and Wikidata in order to search for items that are held outside their country of production. This demonstrates the additional benefit of a linked data implementation for technically proficient users. For testing purposes, the landing page also implements a (somewhat restricted) traditional search interface, with free text searches for some fields (where the data contains literals) and selection boxes for others (where URIs are used in the dataset, i.e. for countries of production and languages). The reason for not prioritizing the search interface is the fact that SPARQL is not optimized for full text search. Consequently, implementing a proper search will require additional technical infrastructure, e.g. by adding the compatible Lucene search engine (The Apache Software Foundation, 2019) to our Fuseki server. This was not possible in time for the first prototype.

5.3.3. The list view

The list view provides an overview of a set of search results. In the current prototype it comes in two modes: list mode (Fig. 12), a list of record titles and authors, and map mode (Fig. 13), a world map with markers for places of publication and current holdings. Both modes contain links to the record view for the listed items. They also offer facets, one of which (languages) is currently functional in the prototype. When facets are used, the list of records is filtered for the user. However, the full list of results is still cached on the server and can quickly be retrieved without needing to perform the search again. The list mode can also be sorted, currently by three fields (author, title, year), and manual edits can be performed. These are currently restricted to removing items from the result set, e.g. in order to manually prune down the results of an imprecise search. Map mode can be switched between displaying places of publication, current holdings or both.

A side bar also provides export functions (e.g. for lists of identifiers) and a view for statistics over the result set. The prototype currently implements one such statistic, which displays the number of records per decade in a bar chart (Fig. 14).

5.3.4. The record view

The record view displays information about a single record (Fig. 15). In the prototype, it does not display the full record, but only the subset of it that is currently captured in the data model. For this reason, it currently contains a link back to the record in the


Figure 10: First prototype, advanced search page.

Figure 11: First prototype, SPARQL interface (via YASGUI).


Figure 12: First prototype, list view.

Figure 13: First prototype, map view.


Figure 14: First prototype, list view with statistics.

original HPB to allow for the lookup of additional information that has not yet been captured in RDF.

Where information is only available as a string (e.g. in the case of titles), that string is directly displayed. For URIs, we either have a supplementary string in the data (e.g. for the place of publication, or for holding institutions), or it can be found in one of the additional data sets in the triple store (e.g. for the names corresponding to country codes). The record view then displays the string, but also provides a link to the URI behind one of two symbols: the Wikidata logo for Wikidata URIs, and a general LOD symbol for other URIs. For some selected URIs, it is also possible to search for records that also refer to that URI by clicking on a search icon.

In parallel to the list view, the record view also features a side bar with different options. The first option, Geography, currently displays a map with the place of publication and current places of holdings. As can be seen in Fig. 15, where the information from Wikidata is fine-grained enough, this allows users to see where an item is currently held at the level of the institution's address, rather than just a generalized marker on the city: In the record shown, two copies of the work reside in different libraries in Augsburg, both of which are marked on the map, which is zoomed in on Augsburg.33

33 Note how Fig. 15 also highlights the shortcomings of the current approach to geographical information, as no Wikidata item has been identified for the holdings recorded for the Bayerische Staatsbibliothek.


Figure 15: First prototype, record view.

The second option, Export, is currently a placeholder but is intended to later allow users to download a record in different formats. At the moment, it only contains a link to the record's RDF representation. The third option, Related, is also a placeholder; it is intended to later contain information derived from automatic clustering methods or user-provided information about relationships between records (see also the copy vs. edition discussion in [3.2.4]).

5.4. Intermediate Discussion

5.4.1. Implemented features

As discussed in [3.3.2], where I describe the priorities in implementing desired features, not all of the envisioned functionalities could be implemented in the first iteration of the web application. We can easily use the scenarios developed in [3.3.1] to check which features have been implemented already, and which ones need to be taken into account for further iterations:

Case Study 1 (Mark)

• The user searches for items based on their place of printing and date. ✗

• The user selects their target languages from a provided list. ✓


• The user filters the results by the selected languages. ✓

• The user manually filters the search results further. ✓

• The user exports the results in a familiar file format. ✗

Case Study 2 (Erlinde)

• The user searches for items based on their physical description. ✗

• The user views the items’ place of printing. ✓

• The user visualizes the location of current holdings on a map. ✓

• The user views the links to digitizations. ✗

Case Study 3 (David)

• The user searches for items based on their printers. ✗

– The system automatically resolves the printers’ names to their variants. ✗

• The user visualizes the results on a timeline. ✗ (see footnote 34)

• The user exports the visualization in a familiar file format. ✗

• The user cites a stable link to the results in his publication. ✗

5.4.2. Going from RDF statements to records

The prototype currently has long loading times for searches with a large number of results. This effect highlights the way in which there is no ideal database – while RDF/SPARQL allows for complex searches over various elements of a record, the fact that the record itself is now only present as an aggregation of statements makes retrieving it a more costly operation than in a record-based format like the previous PICA+. The web application, given a record identifier, currently constructs the record by launching a series of SPARQL queries anchored on that identifier, feeding the results into an internal representation of the record which is then used to feed the Jinja2 template and construct a view for the user. This process, as loading times show, is in need of optimization. Potential solutions would be more efficient SPARQL queries for retrieving the needed information, or retrieving the bulk of the information from outside the triple store (e.g. from the original PICA+ database), using the triple store mainly for search purposes and to access additional data that has been created in the conversion process, such as geographic coordinates. However, for testing purposes loading times are sufficiently short.
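
The following sketch illustrates this ‘statements to record’ step in simplified form: all statements anchored on a record URI are collected into a dictionary that a Jinja2 template can render. It collapses the prototype's series of queries into a single SELECT and assumes the default Fuseki endpoint layout.

# Sketch of the 'statements to record' step: gather the triples anchored on a
# record URI and turn them into a dictionary for the Jinja2 template. A single
# SELECT stands in for the series of queries the prototype actually issues.
from collections import defaultdict
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:3030/hpb/sparql"   # assumed Fuseki endpoint

def fetch_record(record_uri):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("SELECT ?p ?o WHERE { <%s> ?p ?o }" % record_uri)
    record = defaultdict(list)
    for row in sparql.query().convert()["results"]["bindings"]:
        record[row["p"]["value"]].append(row["o"]["value"])
    return record

# The resulting dictionary can then be handed to the template, e.g.
# render_template("record.html", record=fetch_record(uri)).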

34 While a first statistical display of records per decade in a bar chart is implemented, this is not sophisticated enough to warrant a ✓ here.


5.4.3. Using the SPARQL server vs. cached representations

One additional tool to speed up the web application would be to rely on caching, that is, storing on the server the results of common searches and often used records. A minimal form of caching is currently in place, as the web application stores the user’s search results and retrieves them from a session saved on the server, rather than running the original SPARQL query every time the user returns to the search results. An extension of this idea could improve speed in certain scenarios, where there is enough overlap between users’ requests, or where elements are updated rarely but used often, such as in the display of countries and languages in the search form. Any benefits would have to be weighed against the increased cost in server space, depending on the size and number of cached representations.
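
A minimal sketch of this kind of server-side caching is shown below; the in-process dictionary and the run_sparql_search() helper are hypothetical stand-ins for the prototype's session storage and query code.

# Minimal caching sketch: search results are kept in a server-side store and
# reused when the user returns to the result list, instead of re-running the
# SPARQL query. The dictionary and run_sparql_search() are hypothetical
# stand-ins for the prototype's session storage and query code.
from flask import Flask, request

app = Flask(__name__)
_result_cache = {}   # maps a normalized query string to a cached result list

def run_sparql_search(query):
    # Hypothetical helper: would send a SPARQL query to the Fuseki endpoint
    # and return a list of result rows; stubbed here for illustration.
    return []

@app.route("/search")
def search():
    query = request.args.get("q", "").strip()
    if query not in _result_cache:            # only hit the triple store on a miss
        _result_cache[query] = run_sparql_search(query)
    results = _result_cache[query]
    return {"count": len(results), "results": results}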

5.4.4. JavaScript and accessibility

Certain functions of the prototype (display of maps and bar charts, SPARQL interface) currently rely on external JavaScript libraries to avoid having to implement their somewhat complex functionalities from scratch. However, in the long term, I would aim to reduce the number of such dependencies for two independent reasons: First, external dependencies make the project’s code inherently more difficult to maintain. Especially when libraries grow rather complex (as is the case for, e.g., Leaflet), they carry an overhead of functionality that is not required in the project itself. It also makes it more difficult to extend them with additional functionalities that are required by the project but not a priority for the library’s maintainers. This interacts with the second reason, namely that the libraries used do not provide all the functionality that had been specified by researchers in the initial interview. Concretely, Leaflet does not allow for exporting its maps as an image file,35 which is something participants explicitly named as a desideratum in [3.2.3]. As an additional consideration, JavaScript-heavy architectures can sometimes cause problems with backward compatibility and accessibility, e.g. through screen readers (see e.g. Tilkov, 2019).

5.5. Feedback I: Researchers

5.5.1. Procedure

After building the first prototype, the initially interviewed researchers were contacted again. I provided them with a link to the development server and asked them to give feedback on the prototype within 26 days,36

35 Plugins that add this functionality are listed on the website (Agafonkin, 2017), but this comes at the cost of increasing, rather than decreasing, dependencies. Additionally, since many of these plugins do not have a community around them as the main library does, it would be necessary to audit them much more carefully for potential security risks in a production environment.

36 This deadline was later extended by another five days to accommodate the participants.


in order to give them plenty of time to experiment with the application. Feedback was requested in writing this time, and was slightly pre-structured by providing the following set of questions:37

Please feel free to explore, and let me know if there are any technical difficulties. As for your feedback, I would be particularly interested in:

(1) your overall visual impression of the interface (2) your opinion on the presentation of search results (3) your opinion on the presentation of single records (4) any thoughts on usability - did you find your way around? Could things be made easier to operate?

5.5.2. Results

All participants replied in writing. In the following, I present their replies sorted into three categories: mentioned positively, mentioned negatively and suggestions. This, in turn, refers to features of the prototype that participants said they did or did not like respectively, and to new features they suggested for future versions. All items below are paraphrases of the original responses, with responses in German (from two participants) also translated into English. If a response was given by multiple participants, the number is indicated in brackets. Where features pointed out as missing were left out of the prototype intentionally (e.g., were given a lower priority and postponed to a later version, see [3.3.2]), they are marked with † in the list below.

Overall visual impression

Mentioned positively: Colours and layout — Easy to use on a large screen — Image on start page very attractive — Improved over old version — Harmonized with other CERL resources — Well organized, orderly (2x)

Mentioned negatively: Too dark and colourless — Search bar too small — Too much information before search interface — † Some options (years, keyword, genre) missing from search

Suggestions: Add image to advanced search — Make interface multilingual — Change ‘Advanced’ to ‘Advanced Search’

Presentation of search results

Mentioned positively: Larger font — Filters — Manual editing — Map visualization (3x)

Mentioned negatively: Lack of result counter (2x)38 — † No substantial improvement in terms of data structure39 — No dates of publication shown (2x) — Birth and death dates of authors superfluous here — † Truncation in search not working

37 The full email to participants, also outlining the current restrictions of the prototype, is available in Appendix A, together with their anonymized responses.

38 A result counter was available in the statistics pane, but clearly not accessible enough.

39 This refers especially to the problems in distinguishing editions and copies, cf. [3.2.4].


— Very few results40 — Not all geographical data is available — † Export function does not provide useful files

Suggestions: † Export of statistics and maps (2x) — Add explanation of map marker colours — Connect markers pertaining to the same record visually — Display place of imprint — † Add sorting by printer and place (2x) — Allow return to search query

Presentation of single records

Mentioned positively: Map visualization (3x) — Search for place of printing via icon — Well-organized presentation, clear (2x) — Link to classic HPB (2x) — ‘Related’ pane

Mentioned negatively: Links to Wikidata not relevant for users — Country names on map are displayed in original language and script — Inconsistent display of bracketing for publishers and place of publication — Records not very detailed — Missing distinction between publisher, printer and bookseller — † No information about format, edition, illustrations, collation, provenance — Function of RDF view unclear

Suggestions: Counter for holdings — Move CID numbers lower in the record (less relevant for users) — Add link to original catalogue of holding institution — Add identifiers from other catalogues (2x)41 — Add links to digitized versions — Explain icons on hover — Add search function for more fields (2x)

Thoughts on usability

Mentioned positively: More integration with other CERL resources

Mentioned negatively: Search for language takes much longer than other searches — ‘Return to search’ redirects to the start page, even if the search originated from the advanced search (2x) — † No thesaurus behind searches (2x)

Suggestions: Add a ‘suggestions’ button for corrections and additions — Add permalinks to entries — Add link to CERL thesaurus for publishers — Add link to authority files and/or biographical database for authors — Implement help

5.5.3. Consequences for future prototypes

Quick fixes There are some items from the feedback that can be fixed very quickly in the second prototype. These are mostly visual changes, where the underlying data structure is already in place, and only the frontend (e.g. the layout of the HTML template, or the CSS style) has to be adapted.

40 Most likely, this stems from a combination of the previous problem with truncation and the small number of records in the prototype.

41 Catalogues mentioned here included the Gesamtkatalog der Wiegendrucke (https://www.gesamtkatalogderwiegendrucke.de/), the Incunabula Short Title Catalogue (ISTC, https://data.cerl.org/istc/_search), Material Evidence in Incunabula (MEI, https://data.cerl.org/mei/_search) and the Verzeichnis der im deutschen Sprachbereich erschienenen Drucke des 16. Jahrhunderts (VD16, https://www.gateway-bayern.de/index_vd16.html).


• Start page

– The size of the search bar on the start page will be increased, and its prominence raised by marking it in HPB orange.

– In the menu, ‘Search’ will be renamed ‘Home’, and ‘Advanced’ will be renamed ‘Search’, now better guiding the user towards the main search interface.

• Search view

– A smaller image will be added to the advanced search page. The new search interface will also employ additional colours on top of the HPB orange and blue.

– The search interface will be presented without preceding text, with a link to additional information at the bottom instead.

• Results view

– The result counter will be moved from the ‘Statistics’ pane to the heading of the search results.

• Record view

– Tooltips will be added to icons on the record view, explaining the respective links to users.

– The CID number will be moved lower in the record.

– A counter for holding information will be added.

Functionality to be added In some other cases, functionality will have to be added that requires a more substantial reworking of the application’s logic.

• Search view

– The traditional advanced search interface will be replaced with a modular version, where users can choose their own fields for search. This makes it easier to add new data fields for search once they are included in the conversion.

– Due to the aforementioned change, search queries will now be saved in an internal format, allowing a link back to them from the results page. Users will be able to see their query in the full search interface even if it originated from the start page, making it easy to further refine a search launched from there. This also solves the problem participants encountered in being redirected to the start page after searching from the advanced interface.

– The modular system will also include a translation from the internal format to SPARQL queries that is constructed dynamically, making it possible for users to export and inspect the SPARQL query generated by the system42 (a sketch of such a translation follows this list).

42 This functionality is mostly aimed at potential developers of applications based on the HPB data who want to understand the structure of the triple store by inspecting the queries generated by its ‘official’ interface.


• Results view

– The data structure for short records will be changed to include dates of publication, so that this information can also be displayed in the results list. Life dates of authors may be removed.

– Potentially, the data structure will also include printer and place to allow sorting by these. One crucial factor here is the issue of generating records with sufficient speed, which currently is slowed down by adding additional fields. See [5.4.2].

– Export functions will be improved to include more useful information, and to include Excel and Word as export formats.

– Export of statistics and maps will be explored (but see the discussion about JavaScript in [5.4.4]).

• Record view

– The conversion script will be extended to capture more fields from the original PICA+ data to extend the record display in the second prototype as well.

– The conversion script will be extended to clean up some data fields where encoding issues or brackets may lead to inconsistent display in the web application.
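
As announced in the ‘Search view’ item above, the following sketch shows one possible shape of the internal-format-to-SPARQL translation behind the modular search. Field names, property paths and the handling of free-text values are illustrative assumptions; a production version would also need to escape user input properly.

# Sketch of a possible internal-format-to-SPARQL translation for the modular
# search. Field names and property paths are illustrative assumptions; user
# input would need proper escaping in a real implementation.

FIELD_PATHS = {                      # internal field name -> property path
    "language": "bf:language",
    "country":  "bf:provisionActivity/bf:place",
    "title":    "bf:title/bf:mainTitle",
}

def to_sparql(conditions):
    """conditions: list of (field, value) pairs; values starting with 'http' are URIs."""
    patterns = []
    for i, (field, value) in enumerate(conditions):
        path = FIELD_PATHS[field]
        if value.startswith("http"):             # entity selected from a list
            patterns.append("?item %s <%s> ." % (path, value))
        else:                                    # free-text literal
            patterns.append("?item %s ?v%d ." % (path, i))
            patterns.append('FILTER (CONTAINS(LCASE(STR(?v%d)), LCASE("%s")))' % (i, value))
    return ("PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>\n"
            "SELECT DISTINCT ?item WHERE {\n  " + "\n  ".join(patterns) + "\n}")

# Example:
# print(to_sparql([("country", "http://www.wikidata.org/entity/Q183"), ("title", "psalter")]))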

Rethinking the architecture Some final issues are on a more conceptual level and may require the application to be developed well beyond its initial specifications.

• Choosing identifiers and links While the participants welcomed the additional links to the CERL Thesaurus and also asked for further links, e.g. to authority files for authors, from their comments it looks like they did not see much value in the current links to Wikidata and the other data sets used in the web application. This reinforces the considerations from [2.3]: Users who are domain experts but not technical experts do not necessarily interact with linked data systems in the way that Berners-Lee initially envisioned – i.e., following links from resource to resource, traversing the graph – but rather in a way that fits into their existing workflows. While the Thesaurus is seen as a useful resource, Wikidata is not. Obviously, this is only partially true – the value of a resource like Wikidata can be demonstrated, for example, through search possibilities like the experimental one in the first prototype (items held outside their location of origin). However, the feedback shows that this search option was also not seen as part of the normal user experience, but rather as part of a separate, overly technical domain. In the second prototype, such searches will be embedded in the normal search interface, hiding their underlying complexity from the user. This demonstrates a general paradox of any kind of technical system: As soon as it becomes integrated into the interface in a comfortable, frictionless way, it also turns invisible, making it hard to demonstrate its value without explicitly taking apart and showing the underlying machinery to users.


The lack of acceptance for links to Wikidata also points towards a second problem: The value of linked data lies not only in its usability at the moment of publication. Rather, in the spirit of large-scale serendipity underlying the LOD cloud, it is useful to weave the net of links as tightly as possible to encourage the development of future applications, especially outside our own area of expertise. Since it is impossible to predict which datasets will be of interest to future users of our bibliographic data, it makes sense to link to large-scale data hubs like Wikidata. However, again, their value does not necessarily translate into functionality for end users, but rather into opportunities for technical experts to develop new applications. For this reason, it may make sense to hide those links from end users in future iterations. At the same time, it may be helpful to identify more data sets whose value is more obviously visible to current users and add links to those.

For linking up with other datasets, it will also be important to provide stable, permanent URIs for resources in our system in the long term. However, this is not yet a consideration for the next prototype, as the system is still subject to change. Assigning permalinks now would create a responsibility for maintaining them, even beyond potential radical changes that might still happen before the system is published.

• Towards a virtual research environment The initial interviews in [3.2.1] named a number of problems for the original HPB, some of which cannot easily be solved by switching to a new database and interface. Rather, they point towards crucial underlying problems with the quality of existing data, as I also discussed in [4.2.1]. One such issue is the copy/edition distinction, which kept coming up in interviews. While in the future automated techniques for clustering and further processing data may be able to solve some of these problems, others still rely on human judgment, and especially on the judgment of domain experts. Paradoxically, the users who come to our system to look for this kind of information are often also the most qualified to provide it themselves. The ability to give direct feedback on missing or incorrect data was one of the participants’ suggestions. However, one might even go further and develop the current system from what is essentially an aggregated library catalogue towards a Virtual Research Environment (VRE; Candela et al., 2013). This would mean providing facilities for researchers to add their own annotations to the existing entities in the database. The linked data architecture is especially fruitful in this respect, as user-provided triples could easily be stored in their own named graph, allowing for them to be, e.g., used or not used for search queries based on user preferences. In some sense, they would simply constitute a separate data set, albeit one with very close links to the HPB core data set.
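
The following sketch indicates how such user annotations could be written to a separate named graph via SPARQL Update; the graph URI, the annotation property and the update endpoint are assumptions for illustration.

# Sketch: storing user-supplied annotations in their own named graph, so that
# queries can include or ignore them on demand. Graph URI, annotation property
# and the update endpoint are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, POST

UPDATE_ENDPOINT = "http://localhost:3030/hpb/update"   # assumed Fuseki update endpoint
ANNOTATION_GRAPH = "http://example.org/graph/user-annotations"

def add_annotation(record_uri, note):
    sparql = SPARQLWrapper(UPDATE_ENDPOINT)
    sparql.setMethod(POST)
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        INSERT DATA {
          GRAPH <%s> { <%s> skos:note "%s" . }
        }
    """ % (ANNOTATION_GRAPH, record_uri, note.replace('"', '\\"')))
    sparql.query()

# A search that should take annotations into account can add a
#   GRAPH <http://example.org/graph/user-annotations> { ... }
# block to its query, while the default search simply omits that graph.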


5.6. Feedback II: CERL Executive Committee

5.6.1. Procedure

During the evaluation phase, I also presented an overview of the project, including static slides of the prototype,43 to the CERL Executive Committee. This was done to ensure that the project not only corresponded with the needs of its users, but also with the requirements of the other main stakeholder, that is, CERL as the project owner. In the discussion following the presentation, the committee raised a number of points that are relevant to the further development of the prototype.

5.6.2. Results

Scalability One important issue is the possibility of scaling up the prototype to handle the entire HPB with its roughly eight million records, rather than just the 15,000 records used in the prototype. Current response times on searches and generation of result lists indicate that the software as implemented would not be able to handle this amount of data in an efficient enough manner to provide a usable interface. However, there are a number of optimizations that can improve response times and make the prototype more efficient: Caching of search results and records, implementation of more efficient SPARQL queries and retrieval of information not exclusively from the triple store but also by interfacing with the original database are all potential strategies for dealing with the issue of scale. The triple store itself should be able to deal with a larger number of triples; however, it may require more hardware resources than are currently available on the test server.

Reliability and transparency of statistical evaluations It is important to note that any statistics displayed in the HPB cannot be read straightforwardly as statistics about the entirety of early modern book production; rather, they are statistics about the contents of the HPB. That is, they are always mediated by both the completeness and reliability of the data aggregated in the HPB, which in turn is a result of both the data ingestion processes and the completeness and reliability of the data originally provided by CERL’s member libraries. For this reason, it was pointed out that it is necessary to make transparent the way in which such statistics are produced. A similar point was made about the results of experimental search functions like the ‘items held outside their country of origin’ sample query provided on the search page (cf. [5.3.2]). Here, the data derived from the two SPARQL queries to our own SPARQL endpoint and Wikidata should be accessible to researchers, so they can understand the way in which the final list of results is derived.

Clustering methods A crucial factor in the reliability of the HPB’s data is also the existence of duplicates in the database, which was already pointed out in the context of the copy vs. edition problem in [3.2.4].

43 Due to the technical infrastructure at the place of presentation, a live demonstration was unfortunately not possible.


The committee reiterated the point already made by researchers that future versions of the HPB will need to implement some clustering methods in order to reduce the number of duplicates and allow for a more reliable count and display of records.

5.6.3. Consequences for future prototypes

• Adding full text search SPARQL can only play to its particular strengths when the underlying data plays by the rules of the semantic web. While the standard supports both URIs and literals, it is clearly geared towards the use of URIs: searches on strings are either limited in expressiveness (e.g., using filters like CONTAINS or STRSTARTS) or can become too slow to perform at scale (e.g., when using regular expressions). This has led to the development of multiple extensions of the standard that add proper full-text search (Buil-Aranda et al., 2013). In order to scale up our own infrastructure and to support more advanced text-based searches, it may be advisable to adopt such an extension. Our triple store, Apache Jena, offers a plugin that adds support for Lucene (The Apache Software Foundation, 2019); a sketch of what such a query could look like follows this list.

• Clustering methods A major addition to future versions of the system will have to be further processing and enriching of the original data. Beyond adding more links to other data sets, the most pertinent challenge here is solving the problem of clustering records that do in fact describe the same edition, as editions form the natural working unit of the target audience, as discussed in [3.2.1]. As stated above, not all problems may be solvable by automated means; however, as bibliographic clustering improves (for current developments in the German library landscape, see e.g., Pfeifer and Polak-Bennemann, 2016; Vorndran, 2018, 2019; Reh et al., 2019; Witzig and Hipler, 2019), more opportunities may arise in this area.

• Documentation of technology For all automated conversion and processing steps, it is crucial to make their workings transparent. The same holds for search queries that are generated through the web interface, and result sets that are generated by querying not only our own data but also that from other SPARQL endpoints. This ensures that researchers can interpret the data correctly and contextualize claims made based on the data contained in the HPB. Strategies for achieving this transparency may include proper documentation of the system (both as part of the web application and in the literature), as well as including information on algorithms used in the interface. It may also be possible to use more pedagogical formats, such as Jupyter notebooks (Project Jupyter, 2019), to explain the technical infrastructure behind the interface to researchers.

Additionally, I would recommend making all parts of the technical infrastructure available as open source software where this is possible, complementing the strategy of opening up the data resources of CERL and conforming with the increasing awareness of the importance of publishing and preserving research software (Katerbow and Feulner, 2018; Deutsche Forschungsgemeinschaft, 2016).
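
To illustrate the full-text search point from the first item in this list, the sketch below contrasts a plain FILTER-based title search with the same search phrased against a Lucene index via the jena-text extension. The text: prefix is the extension's namespace; the indexed title property and its configuration are assumptions about how the data would be set up.

# Sketch: the same title search with a plain SPARQL FILTER (scans all literals)
# and with the Lucene-backed jena-text extension (uses an index). The indexed
# property and index configuration are illustrative assumptions.

FILTER_QUERY = """
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
SELECT ?item ?title WHERE {
  ?item bf:mainTitle ?title .
  FILTER (CONTAINS(LCASE(?title), "psalter"))
}
"""

LUCENE_QUERY = """
PREFIX bf:   <http://id.loc.gov/ontologies/bibframe/>
PREFIX text: <http://jena.apache.org/text#>
SELECT ?item ?title WHERE {
  ?item text:query (bf:mainTitle "psalter") .   # answered from the Lucene index
  ?item bf:mainTitle ?title .
}
"""

# Either query string could be sent to the endpoint as in the earlier sketches;
# the second form requires the jena-text index to be configured on the server.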


5.7. Prototype II: Preview

While the second iteration of the prototype has not been finished at the time of writing this thesis, I have already implemented some of the features noted in the feedback. The following provides a preview of the planned implementation for the next feedback loop.

Fig. 16 shows the revised start page, with the search box made more prominent. Fig. 17–20 show the new search interface, which allows for the user-driven generation of a ‘traditional’ advanced search interface, but where possible employs the selection of entities from a list of entities present in the data rather than free text searches (Fig. 19), playing to the strengths of SPARQL.44 Also shown is the integration of more complex search strategies that perform queries on other LOD databases in the background while hiding the complexity from the user (Fig. 20). In this concrete case, searching for publications published near a river first queries Wikidata for cities along that river and then constructs a SPARQL search for publications that have these cities as their place of publication. This search option stands in for any number of potential complex search patterns enabled by linked data.
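
The two-step logic behind this river search can be sketched as follows; the Wikidata property P206 (‘located in or next to body of water’) is real, while the local property path, the endpoint and the assumption that places of publication are stored as Wikidata URIs are illustrative.

# Sketch of the two-step 'river search': first ask Wikidata for settlements on
# a given river, then restrict the local search to those places. P206 is
# Wikidata's 'located in or next to body of water' property; the local property
# path and endpoint are illustrative assumptions about the prototype.
from SPARQLWrapper import SPARQLWrapper, JSON

WIKIDATA = "https://query.wikidata.org/sparql"
LOCAL = "http://localhost:3030/hpb/sparql"        # assumed Fuseki endpoint

def cities_on_river(river_qid):
    sparql = SPARQLWrapper(WIKIDATA, agent="HPB prototype sketch")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX wd:  <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
        SELECT ?city WHERE {
          ?city wdt:P206 wd:%s ;                 # located on the given river
                wdt:P31/wdt:P279* wd:Q486972 .   # instance of a human settlement
        }
    """ % river_qid)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["city"]["value"] for r in rows]

def publications_near(river_qid):
    values = " ".join("<%s>" % uri for uri in cities_on_river(river_qid))
    sparql = SPARQLWrapper(LOCAL)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
        SELECT ?item WHERE {
          VALUES ?place { %s }
          ?item bf:provisionActivity/bf:place ?place .
        }
    """ % values)
    return sparql.query().convert()["results"]["bindings"]

# Example: publications_near("Q584")   # Q584 = the Rhine on Wikidata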

So far, no major changes have been implemented in the list or map views of search results (Fig. 21–22), but a more prominent result counter and a link to the SPARQL query have been added.

The record view now includes more searchable fields (Fig. 23). In order to allow users to better understand the icons and the underlying links, tooltips have been implemented (Fig. 24).

44 A similar design for search is used in the ISTC interface.


Figure 16: Second prototype, start page.

Figure 17: Second prototype, search interface (blank).


Figure 18: Second prototype, search interface (adding fields).

Figure 19: Second prototype, search interface (selecting entities).


Figure 20: Second prototype, search interface (Wikidata-based river search).

Figure 21: Second prototype, results view (as list).


Figure 22: Second prototype, results view (as map).

Figure 23: Second prototype, record view.


Figure 24: Second prototype, record view (with tooltip).


6. Conclusions

6.1. Insights

Let me begin my conclusions with a personal remark. Working on this thesis has once again convinced me that academic librarians today will not be able to do their job properly without engaging with the technological infrastructure that underlies our services. At the same time, though, I also think that the IT professionals who work on that technological infrastructure will not be able to do their job either without engaging with both their target audience and the librarians that have served the needs of that audience long before the technological infrastructure available today came into being.

Linked data is a prime example of this lesson: It is a technological development that has often been thought to be the future of the internet, but to date it has not yet succeeded in fully realizing its own potential. As I sketched in [2.2–2.3], too often its applications have been geared towards a technically proficient audience. While standards like SPARQL have allowed library developers to leave behind library-specific protocols like Z39.50 and SRU and connect to a larger world of developers, this has not helped in making our data more accessible to end users, be they domain experts or lay users. One major challenge is the fact that our data is both copious and complex: While it is easy for users to interact with a single record – either on paper or in its paper-like translation onto the computer – human beings are not adept at handling large amounts of bibliographical data at once. Libraries have to find ways of making their wealth of data accessible in a form that is guided not by the structure of the data, but by the needs and experiences of their users. As the interviews in [3] show, these may differ from what we may expect: Users prefer consistency and integration with existing workflows over radical innovation, and it can be a challenge to conform to their wishes without underutilizing the potential of new technologies.

The feedback in [5.5] demonstrates that the HPB is moving in the right direction. Participants welcomed the visual update and its integration with other CERL resources. Many of the criticisms could easily be taken into account for the second prototype, or already formed part of the planned functionality. However, the feedback also showed that we still need to further improve our data quality and find ways of adding or correcting missing and wrong data points.

Both the concerns about data quality voiced in the interviews and the subsequent data analysis and conversion described in [4] further demonstrate that improving the quality of our metadata is a constant requirement. This means both validating incoming data and trying to design conversions to be as lossless as possible, but it can also mean a more active role: Having our data linked to other data sources allows us to pull in additional data to enrich or correct our database, and automated technologies like clustering may help in deduplication. From the interviews in [5.5] we can recognize the difficulty of finding linkable data sets that make sense to current users and enrich their experience, while also creating links that allow for large-scale serendipity and provide for the needs of future users, especially those for whom our data set is not the central starting point, but rather one of those additional data sets that enriches the core data set from which their work originates. Data hubs like Wikidata play a central role here, providing a hook for data sets and identifiers. While this thesis, at the time of its inception, started out as a more interface-oriented enterprise, both the theoretical and the empirical challenges quickly showed that data quality will always be the limiting factor in any kind of interface.

One further challenge for linked data is the lack of maturity of the technology stack its current implementations are running on. As both the data conversion and the implementation of the web application have shown, there is much room for improvement and optimization to make applications stable and performant at scale, and the requirements in terms of technical knowledge to perform these optimizations rise steeply. One particular issue for applications like mine is the lack of full text search in SPARQL (without extensions); however, even this may soon change, as the W3C Community Group for the next version of the SPARQL standard has just started its work (Seaborne and Bolleman, 2019). Other developments, such as binary serializations for RDF (Fernández et al., 2013) and more server-friendly interfaces for SPARQL endpoints such as triple pattern fragments (Verborgh et al., 2016), may make more reliable linked data services possible in the future. However, the major factor here is probably institutional support for sustainable, long-term infrastructures to avoid the problems that have plagued linked data so far – the fast disappearance of data sets, end points and software solutions from the internet after funding for a particular research project runs out. Institutional actors like libraries or library consortia like CERL can play an important role in counteracting this trend.

6.2. Next steps

There are a number of requirements for further developing the web application that emerged from this thesis. Not all of these need to be developed at once – rather, I would recommend proceeding in small iterative steps, allowing for the further collection of feedback from all stakeholders during the development process. The following list, consequently, does not imply an order:

Infrastructure. Optimizing the infrastructure to operate at a larger scale may be necessary to make sure that developed technologies do not have to be removed once the system goes into production because they fail to scale up with the requirements of a full HPB conversion with its eight million records (and growing). This may require changes both in the software and the underlying hardware on which the system operates.

Search interface. One issue only partially targeted in the first prototype presented here is that of searching the data. It should be obvious from the discussion in this thesis that expecting researchers to learn SPARQL – or, I would add, any of the SPARQL-derived graphical or otherwise ‘simplified’ notations developed in the literature – cannot be the solution. Rather, search tools should be based on tools that human users are already familiar and comfortable with. Beyond search bars, I think there is much potential in using, e.g., geographical representations not just for the presentation of search results but also as a way of setting geographical search parameters. Other ways of selecting entities for search should similarly be based on the kind of object they represent and the way these kinds of objects are usually represented by human beings. At the same time, not all of our searches will be URI-based, and it will be necessary to also upgrade the infrastructure to allow for more powerful full-text searches.

Data enrichment. Identifying target data sets and data hubs to link our data with, as well as the creation of such data sets (as is happening, e.g., with the CERL thesaurus), should be a priority, as it can instantly increase the value of our data set. Additionally, an exploration of various ways of enriching our data from those additional data sets should be undertaken, together with an exploration of methods for increasing the coherence and consistency of our own data, e.g., by clustering editions and removing duplicates. On top of automated methods, we should also explore the potential of letting our users – who are domain experts – enrich and improve our data manually, providing them with comfortable tools for doing so in the form of a virtual research environment.

Visualizations. Building on the success of the map and bar chart visualization in the first prototype, we should first make more of the data accessible in this way. Where there are currently only fixed visualizations over geographical coordinates and years of publication, ideally these would be further parametrized and made customizable for users, without sacrificing their simplicity of use. Gradually, additional forms of visualization could be introduced, responding to the needs of our user base as they articulate them throughout the design process and in their feedback.

Transparency. All steps undertaken should be thoroughly documented. Software should be made available as open source, and where possible attempts should be made to communicate the underlying principles of algorithms and technologies to our users to allow them to make informed decisions in their research and use our data responsibly. Equally importantly, all steps undertaken should be fed back into the community to steer the project further towards the needs of our users.

6.3. Adjacent issues

Within this thesis, I have not been able to adequately address all the topics that a project like this touches upon. In particular, I believe that further research will be necessary to theoretically underpin many of the issues listed in the previous section: both the clustering of bibliographic data and the design of cognitively sound search interfaces for human beings (as well as the speed at which we can change these tools without overburdening users) remain topics where many of the foundations are still under active discussion in the literature. I also think that there is potential in the study of the formal properties of data sets and their links with one another, to better understand the networks that they currently form and to predict in which way data sets should be linked with one another for optimal results.


Appendices


A. Email to participants and responses

For the evaluation of the first prototype, researchers were asked for their feedback in writing. The email to the participants and their anonymized responses are attached to the print version of this thesis in electronic form, on the CD you can find in the back.

B. Griffon: Parsing PICA+ and converting it to RDF/Turtle

For inspecting the HPB data and converting it from PICA+ to RDF/Turtle, I created a parser in Python. The source code for this PICA+ parser and the conversion routine to RDF/Turtle is attached to the print version of this thesis in electronic form, on the CD you can find in the back.

C. ISO 3166: Mapping country codes to labels and Wikidata items

The HPB uses country codes as documented in VZG (2017a). In order to display them in human-readable form, I semi-automatically created an RDF representation mapping them to Wikidata items and their respective labels in German and English. Where no matching to Wikidata was possible, I added the missing data manually. The resulting Turtle file is attached to the print version of this thesis in electronic form, on the CD you can find in the back.

D. Web application: first prototype

The first prototype is a Flask application. The source code of this application is attached to the print version of this thesis in electronic form, on the CD you can find in the back.


References

Ackerman, L. (2018). Our words matter: acceptability, grammaticality, and ethics of research on singular ‘they’-type pronouns. Pre-Print, PsyArXiv.

Acquisitions & Bibliographic Access Directorate (2016). BIBFRAME Pilot (Phase One–Sept. 8, 2015 – March 31, 2016): Report and Assessment. Technical report, Library of Congress.

Agafonkin, V. (2017). Leaflet plugins. https://leafletjs.com/plugins.html. [Accessed 2019-04-15].

Al-Eryani, S. and S. Rühle (2018). Best practice guide for application profiles and ontologies created in the context of the project Developing interoperable metadata standards for contextualizing heterogeneous objects from collections, exemplified by objects of the provenance von Asch (ASCH) funded by Deutsche Forschungsgemeinschaft (DFG). Technical report, Niedersächsische Staats- und Universitätsbibliothek Göttingen. http://asch.wiki.gwdg.de/images/3/39/BPG_v6.pdf [Accessed 2019-04-08].

Antoniazzi, F. and F. Viola (2018). RDF graph visualization tools: a survey. In Proceedings of the 23rd Conference of Open Innovations Association FRUCT, pp. 25–36. FRUCT.

Atanasiu, V., C. Priol, A. Tournieroux, and E. Ornato (2008). DALEK. Georeferences for the Bernstein Paper Atlas. Technical report, The Bernstein Consortium.

Atemezing, G. A. and R. Troncy (2013). Towards interoperable visualization applications over linked data. In 2nd European Data Forum (EDF), Dublin, Ireland (April 2013).

Baader, F., I. Horrocks, C. Lutz, and U. Sattler (2017). Introduction to Description Logic. Cambridge University Press.

Benedetti, F., S. Bergamaschi, and L. Po (2015a). Lodex: A tool for visual querying linked open data. In CEUR Workshop Proceedings, Volume 1486.

Benedetti, F., S. Bergamaschi, and L. Po (2015b). LODeX evaluation. http://dbgroup.unimo.it/lodex_model/lodexEva. [Accessed 2019-04-08].

Berners-Lee, T. (2009). Linked data. https://www.w3.org/DesignIssues/LinkedData.html. [Accessed 2019-04-08].

Berners-Lee, T., J. Hendler, and O. Lassila (2001). The semantic web. Scientific American 284 (5), 34–43.

Bikakis, N. and T. Sellis (2016). Exploration and visualization in the web of Big Linked Data: A survey of the state of the art. In Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference.


Booth, D. (2012). Well behaved RDF: A straw-man proposal for taming blank nodes. http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf. [Accessed 2019-04-08].

Brickley, D. and R. Guha (2014). RDF Schema 1.1. https://www.w3.org/TR/rdf-schema/. [Accessed 2019-04-23].

Buil-Aranda, C., A. Hogan, J. Umbrich, and P.-Y. Vandenbussche (2013). SPARQL web-querying infrastructure: Ready for action? In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. Noy, C. Welty, and K. Janowicz (Eds.), The Semantic Web – ISWC 2013, pp. 277–293. Springer.

Candela, L., D. Castelli, and P. Pagano (2013). Virtual Research Environments: An overview and a research agenda. Data Science Journal 12, 75–81.

CERL (2005). The substance of our civilization: a development plan 2002–2007. https://www.cerl.org/_media/about/development_plan_2002-2007.pdf. [Accessed 2019-04-08].

CERL (2010). Strategic plan 2010–2015. https://www.cerl.org/_media/about/cerl_strategic_plan_2010_2015.pdf. [Accessed 2019-04-08].

CERL (2018). Consortium of European Research Libraries news. https://www.cerl.org/_media/publications/newsletter_june_2018.pdf. [Accessed 2019-04-08].

CERL (2018). HPB access. https://www.cerl.org/resources/hpb/technical/modes_of_access_to_the_hpb_database.

CERL (2019). List of members. https://www.cerl.org/membership/list_members. [Accessed 2019-04-08].

CERL Board of Directors (2016). CERL strategies 2016-2019. https://www.cerl.org/_media/about/cerl_strategies_2016-19_final.pdf. [Accessed 2019-04-08].

Chen, C.-h., W. Härdle, and A. Unwin (Eds.) (2008). Handbook of Data Visualization. Springer.

Cherny, L. (2017). Data visualization versus UI and data science. https://medium.com/@lynn_72328/data-visualization-versus-ui-and-data-science-d59182d58af4. [Accessed 2019-04-08].

Coyle, K. (2015). FRBR, before and after: a look at our bibliographic models. American Library Association.

Dadzie, A.-S. and E. Pietriga (2017). Visualisation of linked data – reprise. Semantic Web 8 (1), 1–21.

Dadzie, A.-S. and M. Rowe (2011). Approaches to visualising linked data: A survey. Semantic Web 2 (2), 89–124.


Degbelo, A. (2017). Linked data and visualization: two sides of the transparency coin. In Proceedings of the 3rd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics. ACM.

Deutsche Forschungsgemeinschaft (2016). Ausschreibung Nachhaltigkeit von Forschungssoftware. Eine Ausschreibung im Rahmen des Förderprogramms e-Research-Technologien. https://www.dfg.de/download/pdf/foerderung/programme/lis/161026_dfg_ausschreibung_forschungssoftware_de.pdf. [Accessed 2019-04-25].

DINI-AG KIM (2013). Empfehlungen zur RDF-Repräsentation bibliografischer Daten. Technical report, Deutsche Initiative für Netzwerkinformation e.V.

DINI-AG KIM (2018). Empfehlungen zur RDF-Repräsentation bibliografischer Daten. Technical report, Deutsche Initiative für Netzwerkinformation e.V.

Dokulil, J. and J. Katreniakova (2008). Visual exploration of RDF data. In International Conference on Current Trends in Theory and Practice of Computer Science, pp. 672–683. Springer.

Dudas, M., S. Lohmann, V. Svatek, and D. Pavlov (2018). Ontology visualization methods and tools: a survey of the state of the art. The Knowledge Engineering Review 33.

Düring, M. and L. von Keyerlingk (2015). Netzwerkanalyse in den Geschichtswissenschaften. Historische Netzwerkanalyse als Methode für die Erforschung von historischen Prozessen. In R. Schützeichel and S. Jordan (Eds.), Prozesse. Formen, Dynamiken, Erklärungen, pp. 337–350. Springer.

Eason, K. D. (1984). Towards the experimental study of usability. Behaviour & Information Technology 3 (2), 133–143.

El-Sherbini, M. (2018). RDA implementation and the emergence of BIBFRAME. JLIS.it 9 (1).

Eversberg, B. (1994). Was sind und was sollen bibliothekarische Datenformate: Überarbeitete und erweiterte Neuausgabe, Volume 9 of Veröffentlichungen der Universitätsbibliothek Braunschweig. Universitätsbibliothek der TU Braunschweig.

Fernández, J. D., M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias (2013). Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web 19, 22–41.

Fürste, F. M. (2011). Linked Open Library Data. Bibliographische Daten und ihre Zugänglichkeit im Web der Daten. Dinges & Frick GmbH.

Futornick, M. (2018). 2019 LD4 Conference. https://wiki.duraspace.org/display/LD4P2/2019+LD4+Conference. [Accessed 2019-01-11].


Geipel, M. M., C. Böhme, J. Hauser, and A. Haffner (2013). Herausforderung Wissensvernetzung. Impulsgebende Projekte für ein zukünftiges LOD-Konzept der Deutschen Digitalen Bibliothek. In P. Danowski and A. Pohl (Eds.), (Open) Linked Data in Bibliotheken, pp. 168–185. De Gruyter.

Georgy, U. and F. Schade (2012). Marketing für Bibliotheken – Implikationen aus dem Non-Profit und Dienstleistungsmarketing. In U. Georgy and F. Schade (Eds.), Praxishandbuch Bibliotheks- und Informationsmarketing, pp. 7–40. De Gruyter.

Gómez-Romero, J., M. Molina-Solana, A. Oehmichen, and Y. Guo (2018). Visualizing large knowledge graphs: A performance analysis. Future Generation Computer Systems 89, 224–238.

Graziosi, A., A. Di Iorio, F. Poggi, S. Peroni, and L. Bonini (2018). Customising LOD views: a declarative approach. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 2185–2192. ACM.

Halb, W., Y. Raimond, and M. Hausenblas (2008). Building linked data for both humans and machines. In Linked Data on the Web Workshop at the 17th International World Wide Web Conference 2008 (WWW2008), Beijing, China.

Heath, T. and C. Bizer (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology 1 (1), 1–136.

Hellinga, L. (1998). Preface. In A. Matheson, B. Fabian, and L. Balsamo (Eds.), The European Printed Heritage c.1450 - c.1830 Present and Future. Three lectures, Volume 1 of CERL Papers, pp. I–II. Consortium of European Research Libraries.

Helmich, J., T. Potocek, J. Klímek, and M. Necasky (2017). Towards easier visualization of linked data for lay users. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics. ACM.

Hoefler, P., M. Granitzer, E. E. Veas, and C. Seifert (2014). Linked data query wizard: A novel interface for accessing SPARQL endpoints. In LDOW.

Hogan, A., M. Arenas, A. Mallea, and A. Polleres (2014). Everything you always wanted to know about blank nodes. Web Semantics: Science, Services and Agents on the World Wide Web 27, 42–69.

Hwang, W. and G. Salvendy (2010). Number of people required for usability evaluation: the 10+/-2 rule. Communications of the ACM 53 (5), 130–133.

International DOI Foundation (2018). DOI Handbook. https://www.doi.org/hb.html. [Accessed 2019-04-08].

Jahnke, A. (2007). Accessing the record of European printed heritage: the CERL Thesaurus as an international repository of names from the hand-press era. In D. J. Shaw (Ed.), Imprints and owners: recording the cultural geography of Europe. Papers presented on 10 November 2006 at the CERL Seminar hosted by the National Széchényi Library, Budapest, pp. 49–66. Consortium of European Research Libraries.

Katerbow, M. and G. Feulner (2018). Handreichung zum Umgang mit Forschungssoftware. https://doi.org/10.5281/zenodo.1172970. [Accessed 2019-05-07].

Kirk, A. (2016). Data Visualisation. A handbook for data driven design. SAGE.

Krämer, S. (1998). Das Medium als Spur und als Apparat. In Medien, Computer, Realität. Wirklichkeitsvorstellungen und Neue Medien, pp. 83–94. Suhrkamp.

Leskinen, P., G. Miyakita, M. Koho, and E. Hyvönen (2018). Combining faceted search with data-analytic visualizations on top of a SPARQL endpoint. Proceedings of VOILA 2018, Monterey, California.

Library of Congress (2017a). ISO 639-2: Codes for the Representation of Names of Languages - Part 2: Alpha-3 Code for the Names of Languages. http://id.loc.gov/vocabulary/iso639-2.html. [Accessed 2019-04-08].

Library of Congress (2017b). MARC Code List for Relators Scheme. http://id.loc.gov/vocabulary/relators.html. [Accessed 2019-04-08].

MARBI (1996). The MARC 21 formats: Background and principles. Technical report, MARBI American Library Association's ALCTS/LITA/RUSA Machine-Readable Bibliographic Information Committee in conjunction with Network Development and MARC Standards Office Library of Congress.

Matheson, A. (1998). The Consortium of European Research Libraries: A future vision. In A. Matheson, B. Fabian, and L. Balsamo (Eds.), The European Printed Heritage c.1450 - c.1830 Present and Future. Three lectures, Volume 1 of CERL Papers, pp. 1–14. Consortium of European Research Libraries.

McCallum, S. H. (2017). BIBFRAME development. JLIS.it 8 (3), 71–85.

McKenna, L., C. Debruyne, and D. O’Sullivan (2018). Understanding the position of information professionals with regards to linked data: A survey of libraries, archives and museums. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 7–16. ACM.

Miles, A. and S. Bechhofer (2009). SKOS Simple Knowledge Organization System Reference. https://www.w3.org/TR/skos-reference/. [Accessed 2019-04-23].

Mitchell, E. T. (2016). Library linked data: early activity and development. ALA TechSource.

Müller, W. (Ed.) (1992). Fingerprints: Regeln und Beispiele; nach der englisch-französisch-italienischen Ausgabe des Institut de Recherche et d’Histoire des Textes (CNRS) und der National Library of Scotland. Deutsches Bibliotheksinstitut.


Pesquita, C., V. Ivanova, S. Lohmann, and P. Lambrix (2018). A framework to conductand report on empirical user studies in semantic web contexts. In C. Faron Zucker,C. Ghidini, A. Napoli, and Y. Toussaint (Eds.), Knowledge Engineering and Knowl-edge Management, pp. 567–583. Springer International Publishing.

Pfeifer, B. and R. Polak-Bennemann (2016). Zusammenfuhren was zusammengehort–Intellektuelle und automatische Erfassung von Werken nach RDA. o-bib. Das offeneBibliotheksjournal 3 (4), 144–155.

Pohl, A. and P. Danowski (2013). Linked Open Data in der Bibliothekswelt: Grund-lagen und Uberblick. In P. Danowski and A. Pohl (Eds.), (Open) Linked Data inBibliotheken, pp. 1–44. De Gruyter.

Project Jupyter (2019). Jupyter documentation. https://jupyter.org/documentation. [Accessed 2019-04-25].

Reh, U., S. Winkler, T. Kirchhoff, and S. Lohrum (2019). De-Duplikationsverfahren und Einsatzszenarien im Gemeinsamen Verbundindex (GVI). https://wiki.dnb.de/download/attachments/146377939/2019-04-03_KIMWS19_GVI.pdf. [Accessed 2019-04-25].

Richter, M. and M. D. Flückiger (2013). Usability Engineering kompakt: Benutzbare Produkte gezielt entwickeln. Springer Vieweg.

Ronacher, A. (2014). Jinja2. The Python Template Engine. http://jinja.pocoo.org/. [Accessed 2019-04-08].

Ronacher, A. (2019). Flask. A Python Microframework. http://flask.pocoo.org/. [Accessed 2019-04-08].

Sabol, V., G. Tschinkel, E. Veas, P. Hoefler, B. Mutlu, and M. Granitzer (2014). Discovery and visual analysis of linked data for humans. In P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. Knoblock, D. Vrandecic, P. Groth, N. Noy, K. Janowicz, and C. Goble (Eds.), The Semantic Web – ISWC 2014, Cham, pp. 309–324. Springer International Publishing.

Sakor, A., I. O. Mulang', K. Singh, S. Shekarpour, M.-E. Vidal, J. Lehmann, and S. Auer (2019). Old is gold: Linguistic driven approach for entity and relation linking of short text. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Sanderson, R. (2018). Shout it out: LOUD (EuropeanaTech 2018 keynote). https://www.youtube.com/watch?v=r4afi8mGVAY. [Accessed 2019-04-08].

Sarodnick, F. and H. Brau (2016). Methoden der Usability Evaluation. Wissenschaftliche Grundlagen und praktische Anwendung. Hogrefe.

schraefel, m. and D. Karger (2006). The pathetic fallacy of RDF. https://eprints.soton.ac.uk/262911/1/the_pathetic_fallacy_of_rdf-33.html. [Accessed 2019-04-08].

Seaborne, A. and J. Bolleman (2019). SPARQL 1.2 Community Group. https://www.w3.org/community/sparql-12/. [Accessed 2019-04-25].

selfhtml (2018). HTML/Tutorials/Hero-Image-Webseite. https://wiki.selfhtml.org/wiki/HTML/Tutorials/Hero-Image-Webseite. [Accessed 2019-04-08].

Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343. IEEE.

Shneiderman, B., C. Plaisant, M. Cohen, S. Jacobs, N. Elmqvist, and N. Diakopoulos (2016). Designing the user interface: strategies for effective human-computer interaction. Pearson.

Smith-Yoshimura, K. (2016). Analysis of international linked data survey for implementers. D-Lib Magazine 22 (7/8).

Sporny, M., D. Longley, G. Kellogg, M. Lanthaler, and N. Lindström (2014). JSON-LD 1.0. A JSON-based Serialization for Linked Data. https://www.w3.org/TR/json-ld/. [Accessed 2019-04-23].

Star, S. L. and A. Strauss (1999). Layers of silence, arenas of voice: The ecology of visible and invisible work. Computer Supported Cooperative Work 8, 9–30.

Steele, T. D. (2018). What comes next: understanding BIBFRAME. Library Hi Tech.

Stegaeva, M. V. (2016). Cooperative cataloging: History and the current state. Scientific and Technical Information Processing 43 (1), 28–35.

Stein, C. (2014). Linked Open Data – Wie das Web zur Semantik kam. Bibliothek, Forschung und Praxis 38 (3), 1–9.

Suominen, O. I. and N. Hyvönen (2017). From MARC silos to Linked Data silos? o-bib. Das offene Bibliotheksjournal 4 (2), 1–13.

Svensson, L. G. (2013). Are current bibliographic models suitable for integration with the web? Information Standards Quarterly, Winter 25 (4), 6–13.

The Apache Software Foundation (2018). Apache Jena Fuseki. https://jena.apache.org/documentation/fuseki2/. [Accessed 2019-04-08].

The Apache Software Foundation (2019). Apache Lucene. http://lucene.apache.org/. [Accessed 2019-04-08].

The Internet Society (2005). A Universally Unique IDentifier (UUID) URN Namespace. https://tools.ietf.org/html/rfc4122. [Accessed 2019-04-08].

Tilkov, S. (2019). Wider die SPA-Fixierung. Ein Plädoyer für eine klassische Frontend-Architektur. https://www.innoq.com/de/articles/2019/04/wider-die-spa-fixierung/. [Accessed 2019-04-18; originally published in OBJEKTspektrum 02/2019, pp. 28–33].

Unwin, A., C.-h. Chen, and W. K. Härdle (2008). Introduction. In C.-h. Chen, W. Härdle, and A. Unwin (Eds.), Handbook of Data Visualization. Springer.

Unxos GmbH (2019). GeoNames. https://www.geonames.org/. [Accessed 2019-04-08].

van Hooland, S. and R. Verborgh (2014). Linked Data for Libraries, Archives and Museums. Facet Publishing.

Vander Sande, M., R. Verborgh, P. Hochstenbach, and H. Van de Sompel (2018). Toward sustainable publishing and querying of distributed Linked Data archives. Journal of Documentation 74 (1), 195–222.

Verborgh, R., M. Vander Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester, G. Haesendonck, and P. Colpaert (2016). Triple pattern fragments: a low-cost knowledge graph interface for the web. Journal of Web Semantics 37, 184–206.

Versprille, I., M. Lefferts, and C. Dondi (2014). The Consortium of European Research Libraries (CERL): twenty years of promoting Europe's cultural heritage in print and manuscript. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture 2 (1), 30–40.

Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human factors 34 (4), 457–468.

Vorndran, A. (2018). Hervorholen, was in unseren Daten steckt! Mehrwerte durch Analysen großer Bibliotheksdatenbestände. o-bib. Das offene Bibliotheksjournal 5 (4), 166–180.

Vorndran, A. and S. Grund (2019). Hervorholen, was in unseren Daten steckt - Mehrwerte durch Analysen großer Bibliotheksdatenbestände (Culturegraph). https://wiki.dnb.de/download/attachments/146377939/2019-04-03_KIMWS19_Vorndran-Grund_Culturegraph.pdf. [Accessed 2019-04-25].

VZG (2017a). Anhang Ländercodes. https://www.gbv.de/bibliotheken/verbundbibliotheken/02Verbund/01Erschliessung/02Richtlinien/02KatRichtRDA/anhaenge/anhang-laendercodes. [Accessed 2019-04-08].

VZG (2017b). Katalogisierungsrichtlinie für den GBV - RDA. https://www.gbv.de/bibliotheken/verbundbibliotheken/02Verbund/01Erschliessung/02Richtlinien/01KatRicht/02KatRichtRDA/inhalt.shtml. [Accessed 2019-04-08].

VZG (2018). Sonstige Anmerkungen. http://swbtools.bsz-bw.de/cgi-bin/help.pl?cmd=kat&val=4201&regelwerk=RDA&verbund=GBV. [Accessed 2019-04-08].

W3C OWL Working Group (2011). OWL 2 Web Ontology Language. Document Overview (Second Edition). https://www.w3.org/TR/owl2-overview/. [Accessed 2019-04-23].

Welsh, A. (2015). Metadata output and its impact on the researcher. Catalogue and Index 178, 2–6.

Witzig, S. and G. Hipler (2019). Clustern von Daten auf der swissbib Plattform. https://wiki.dnb.de/download/attachments/146377939/2019-04-03_KIMWS19_Witzig-Hipler_swissbib.pdf. [Accessed 2019-04-25].
