The aim of the DARIAH-DE Repository is to provide researchers in the arts, humanities and cultural sciences with a low-threshold tool to store their research data sustainably, describe it with metadata and publish it. In accordance with the FAIR Principles and the Open Access Guidelines of Göttingen University, DARIAH-DE is committed to providing data under open licenses and recommends that all researchers use Creative Commons licenses for this purpose.
DARIAH-DE advocates the scientific reuse of the research data published in the repository in accordance with the research data life cycle. It wants to promote scientific growth in a self-managed system and thus remind users of their own responsibility. The repository and its applications are considered a living system in which users are encouraged to handle data responsibly and confidently.
Collection Development Policy and Data Quality
The DARIAH-DE Repository offers a unique solution, as it enables researchers to upload their research data themselves and publish it without having to clear many hurdles. The DARIAH-DE Repository has a low threshold for its users with respect to both the technical resources and the prior knowledge necessary for describing their data appropriately. Each step of the process is done online via the DARIAH-DE Publikator in a user-friendly GUI. Furthermore, each step of the process is documented in detail in the DARIAH-DE Repository Documentation. In case of technical problems or further questions, the helpdesk connects users with DARIAH-DE experts in less than 48 hours. Articles from the user’s point of view (DHd Blog), FAQs, a User Guide, tutorials and workshops of the DARIAH-DE partners complete the support. To illustrate the skills needed to store and publish data in the repository, every procedure is explained step by step.
The curation of the DARIAH-DE Repository involves a brief check of basic metadata: uploading data involves filling in a form for Simple Dublin Core (DC Simple), which comprises 15 elements. Three fields are mandatory (title, author, license regulations). Otherwise the content is distributed as deposited. The main idea of the repository is that DARIAH-DE provides the tools and counselling, while the users, who have the expert knowledge, describe the content and veracity of the data. Users may use the tools provided by the DARIAH-DE research data federation infrastructure, for example to map the data, improve their findability and make them citable.
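The basic metadata check described above can be pictured with a small Python sketch. It is an illustration only, not the Publikator's actual validation: the mapping of "author" to the Dublin Core element `creator` and of "license regulations" to `rights` is an assumption, and the function name `check_dc_simple` as well as the example record are hypothetical.

```python
# The 15 elements of Simple Dublin Core.
DC_SIMPLE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Assumption: "author" corresponds to dc:creator and
# "license regulations" to dc:rights.
MANDATORY_ELEMENTS = ("title", "creator", "rights")


def check_dc_simple(record):
    """Return a list of problems in a DC Simple record (empty if none).

    `record` maps element names to lists of string values.
    """
    problems = []
    for name in record:
        if name not in DC_SIMPLE_ELEMENTS:
            problems.append(f"unknown element: {name}")
    for name in MANDATORY_ELEMENTS:
        values = record.get(name, [])
        # A mandatory element needs at least one non-blank value.
        if not any(value.strip() for value in values):
            problems.append(f"missing mandatory element: {name}")
    return problems


# Hypothetical example record.
example = {
    "title": ["Letters of a hypothetical scholar"],
    "creator": ["Doe, Jane"],
    "rights": ["CC BY 4.0"],
    "language": ["en"],
}
print(check_dc_simple(example))  # prints []
```

All other elements remain optional in this sketch, which mirrors the idea that content is otherwise distributed as deposited.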
The DARIAH-DE policy for the development of the collection, data access, quality and re-use as well as preservation is strongly influenced by its community-driven approach. The demands of the various research communities of the Arts, Humanities and Cultural Sciences were crucial for the development of the Data Model for the collections stored in the DARIAH-DE Repository. Three different approaches were chosen to ensure that the data model suits the demands of the different research communities:
- Interaction with various researchers provided valuable information. To ensure more systematic communication about scientific collections, a stakeholder committee of researchers who have experience with collections of the Arts, Humanities and Cultural Sciences was established, which ensured that the data model was suitable for all designated communities.
- A detailed analysis of use cases, see Modellierung und Dokumentation von Use-Cases für wissenschaftliche Sammlungen (Modelling and documentation of use cases for research collections) and Dokumentation theorie- und verfahrensgeleiteter Sammlungskonzepte (Documentation of theoretical and process guided concepts of collection) also provided crucial information.
- Cooperation with various research projects was helpful for understanding different practical approaches to managing research data and working with collections in the Arts, Humanities and Cultural Sciences.
Based on the demands and feedback from the community on the one hand and established standards on the other, the DARIAH Collection Description Data Model (DCDDM) was developed. The DCDDM is a data model for collection descriptions that specifies a fixed number of classes and elements to assist institutions and individual scholars in creating descriptions of physical (or analogue) and digital collections that can be read by humans as well as by machines. It is based on the Dublin Core Collections Application Profile (DCCAP). The DCDDM was developed in close consultation with the community and has recently been revised. It is a dynamic model that can be further customized as needed.
In this respect, the DARIAH-DE Repository, as a long-term research data archive, offers safe storage, publication and searching for a wide variety of digital material, e.g. text, images and databases.
Data is made accessible and reusable through the metadata that is available for every single object in the DARIAH-DE Repository. The DARIAH-DE Repository uses the DC Simple metadata schema. We learned that it is crucial that researchers without a strong affinity for computers and IT are still able to cope with the metadata and tools. Furthermore, we learned that the metadata of different projects and depositors is very heterogeneous, so that a single metadata schema with many mandatory fields cannot easily serve everyone. The complexity level should be as low as possible for those users who simply want to import and publish their data without having a high level of metadata expertise. We also want to serve those researchers who have more complex metadata and more experience with digital data. Any researcher can therefore decide which level of complexity best fits their requirements (see minimal and mandatory metadata).
This range of complexity is fully supported by the DARIAH-DE Repository Search, so that metadata searches can ensure the findability of all documents. For reuse of the data, the mandatory metadata fields are sufficient, as the community evaluates the quality of the data.
Additionally, technical metadata is extracted for every object during the publishing process and is then also publicly available. It is stored as an extra file alongside the data and metadata files. You can get an object’s technical metadata from the object’s landing page, for example via https://doi.org/10.20375/0000-000B-C8EF-7, or directly from DH-crud via https://repository.de.dariah.eu/1.0/dhcrud/21.11113/0000-000B-C8EF-7/tech.
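As a rough illustration, the two access paths can be assembled in Python from the URL patterns shown in the example above. The helper names are hypothetical, and the sketch only builds the URLs without contacting the repository. It assumes that other objects follow the same pattern as this one example.

```python
# Base URLs taken from the example in the text above.
DOI_RESOLVER = "https://doi.org"
DHCRUD_BASE = "https://repository.de.dariah.eu/1.0/dhcrud"


def landing_page_url(doi):
    """Build the landing page URL for a DOI such as 10.20375/0000-000B-C8EF-7."""
    return f"{DOI_RESOLVER}/{doi}"


def tech_metadata_url(handle):
    """Build the DH-crud URL of an object's technical metadata.

    `handle` is the identifier used in the DH-crud path,
    e.g. 21.11113/0000-000B-C8EF-7 in the example above.
    """
    return f"{DHCRUD_BASE}/{handle}/tech"


print(landing_page_url("10.20375/0000-000B-C8EF-7"))
print(tech_metadata_url("21.11113/0000-000B-C8EF-7"))
```

Note that in the example the landing page is addressed with the prefix 10.20375 and the DH-crud path with the prefix 21.11113, so the two identifiers are passed separately here rather than derived from each other.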
Within the framework of CLARIAH-DE and the NFDI consortium Text+, the development of the DARIAH-DE collections is being continued and expanded. The work includes in particular measures to increase the interoperability of the collections stored in the DARIAH-DE Repository with other data sets, as well as measures to increase the reuse of the data, for instance the already established connectivity to further tools like Switchboard and WebLicht (see footer “TOOLS – Call Language Resource Switchboard with this resource” at https://doi.org/10.20375/0000-000B-C9D3-4). In this context, the use of the DARIAH-DE Data Modelling Environment (https://dme.de.dariah.eu/dme/registry) is also being considered and evaluated for the integration of heterogeneous data and metadata, for example for full-text search or enhanced metadata mapping.
Further adjustments and developments will be considered by observing and analysing best practice of the designated community and taking into account their feedback.
In order to provide tools for working with the data and metadata of the collections stored in the DARIAH-DE Repository, for modelling and mapping the metadata to other schemes, and to make the data collections findable, the DARIAH-DE Data Federation Architecture (DFA) was developed. With the DARIAH-DE Repository as one of its central components, the DFA facilitates the reuse of the research data and metadata published in the DARIAH-DE Repository:
- The DARIAH-DE Publikator is an easy-to-use tool for conveniently importing research data into the DARIAH-DE Repository via a graphical interface and adding metadata. An extensive user guide, which leads users through the whole process of publishing data, can be found in the documentation of the tool.
- The Data Modeling Environment (DME) is the place where data can be modelled and mappings between data models can be stored, managed on a long-term basis and combined as required. It thus provides conceptual support for researchers in the arts, humanities and social sciences to connect heterogeneous data and thus creates interoperability. Mappings allow automated translations of data from one model into another. Therefore, the DME forms the basis, for example, for the generic search of different collections.
- Every collection will be added to the Repository Collection Registry automatically and its data will be available in the Repository Search immediately.
- The DARIAH-DE Collection Registry serves as a catalogue of collections which originated within research projects or serve as a basis for them. It links data, their data models and the description of a collection for technical reuse by services such as search or analysis tools, and also serves to manage collection descriptions. These can include not only digitally accessible collections but also analogue, protected or offline ones.
- The DARIAH-DE Generic Search provides a front-end for the data stored in the DARIAH-DE Collection Registry. It can be used to search the distributed metadata records. In addition, a search can be saved in a personalized way and then adapted or refined at a later date.
Reuse of research data and metadata is one of the main goals of all services provided by DARIAH-DE. This is reflected by the very definition DARIAH-DE provides for research data, which is considered
“all those sources / materials and results collected, written, described and/or evaluated in the context of a research and research question in the field of human and cultural sciences, and in machine-readable form for the purpose of archiving, citation and for further processing.” (https://de.dariah.eu/en/weiterfuhrende-informationen).
Since the DARIAH-DE Repository is part of the DARIAH-DE Data Federation Architecture (DFA), the data published in the DARIAH-DE Repository can be managed and reused according to different processes of the research data life cycle:
- Planning and creation
- Conservation measures
The concept of the research data life cycle for the DARIAH-DE services was described in the paper Diskussion und Definition eines Research Data LifeCycle für die digitalen Geisteswissenschaften (Discussion and Definition of a Research Data LifeCycle for the Digital Humanities, German only) and can be visualised in the following schema:
The Preservation Policy of the DARIAH-DE Repository is in line with the open access strategy of the University of Göttingen and its research data policy. It represents a clear commitment to open access to research data by promoting the data of the repository’s designated community and making it as widely accessible and usable as possible. Here it clearly follows the mission statement of the repository in supporting the use of publications and data without any access restriction.
The DARIAH-DE Repository explicitly and actively recommends formats that are especially well supported for long-term preservation. The designated community mainly consists of scholars in the humanities, cultural and social sciences and represents all their sub-disciplines with their specific research questions. A variety of disciplines work with XML, TEI, TXT, CSV and several image file formats such as TIFF or PNG. The repository therefore differs significantly from the TextGrid Repository, which focuses on files in XML TEI. The represented research disciplines are all sub-disciplines of the humanities, cultural and social sciences, such as editorial philology, theology, philosophy, ethnology, art history and many more. In addition, the DARIAH-DE Repository is also open to other disciplines.
Due to this interdisciplinarity, the Dublin Core Simple metadata schema is used with a minimal set of mandatory metadata to ensure the reuse and evaluation of all data at a basic level. The repository supports format standards that ensure usability, access to data and its preservation for the designated community (see Data Reuse). As part of ongoing collection development, the repository stays in touch with the needs and the state of the art of the designated community and undertakes necessary steps, including format changes or the addition of new formats (see Data Reuse). Due to its commitment to open access and open science, the repository supports open formats in this context, in the sense of free file formats that can be used by anyone at no monetary cost and whose specifications are visible and maintained by a standards organisation relevant for the designated community.
The following sections provide an overview of the main aspects of the preservation policy.
Aims and Requirements of the Policy
The preservation policy and its implementation aim:
- to operate for the community as a trusted digital repository for data in the humanities and related disciplines with a special focus on digital editions and relevant data for text-based scientific research
- to guarantee long-term preservation and open accessibility of the stored research data
- to keep data long-term searchable and citable
- to ensure the authenticity and integrity of the data and provide reliable data to researchers, and
- to keep the repository standards in compliance with the state of the art of the designated community, including its ethical and legal standards, following applicable law
For this purpose the DARIAH-DE Repository strives to ensure the following requirements by its organisational and technical infrastructure:
- The SUB and GWDG, two well-recognised institutions with the relevant expertise, declare their responsibility for the long-term operation of the repository through the joint founding of the Humanities Data Centre (HDC) as operator of the DARIAH-DE Repository. They undertake to provide all necessary resources (technical, financial, and in terms of the knowledge and expertise of staff members), in addition to public project funding of associated projects and independently of it whenever necessary. See in this context also the founding manifesto of the HDC.
- All phases of the DARIAH-DE Repository’s publication and preservation workflows are based on the Open Archival Information System (see DARIAH-DE and the Open Archival Information System (OAIS)),
- at bitstream preservation level the repository ensures data preservation in unchanged form for every item,
- further recommendations in terms of preferable long-term-preservation-formats are given,
- the data is accompanied by appropriate metadata standards for the professional cataloguing of the data and to enable use and reuse for research purposes,
- appropriate ingest procedures ensure that data are checked and validated according to community standards (such as mandatory metadata fields and generated additional administrative and technical metadata),
- the integrity and authenticity of data are regularly checked through a technical routine,
- the repository has implemented periodic local and distributed backups (located in dedicated computing centres with strict access control), which allow the repository data to be restored from backup and recovered in case of technical failures,
- the infrastructure of the repository is regularly checked and maintained in its functionalities,
- security issues are covered through security and disaster plans including responsible persons and actions to undertake,
- documentation, data, metadata, and all related information are regularly maintained in a form suitable for long-term archival storage,
- all involved entities and staff members agree to regularly observe and evaluate whether changes are necessary due to changing scientific practice or technical developments, and how they are to be implemented (for ongoing evaluations and planned actions, see the section Data Reuse and the wiki page Digital Object Management),
- also on an organisational and strategic level SUB and GWDG ensure that the repository stays closely related to its designated community and ongoing innovative developments through associated projects and engagement in new developments and initiatives at a national and international level.
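The regular integrity check mentioned in the list above is not specified further in this policy. As a hedged illustration of how such a fixity check is commonly done, the following Python sketch compares stored checksums with freshly computed ones. The choice of SHA-256, the manifest structure and the function names are assumptions for illustration and do not describe the repository's actual routine.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest):
    """Return the paths whose current digest differs from the stored one.

    `manifest` maps file paths to the digests recorded at ingest
    (a hypothetical structure). Missing files are reported as failures too.
    """
    failed = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            failed.append(path)
    return failed
```

A scheduled job could record `sha256_of(path)` for each file at ingest and later call `verify_fixity` on the stored manifest. Any path it returns would then be restored from backup.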
Recommendations and List of Preferred Formats
In a long-term perspective not all file formats will ensure long-term usability, access to data and its preservation. Therefore, the DARIAH-DE Repository recommends certain file formats according to the current state of the art and common practice within the designated community.
Furthermore, DARIAH-DE provides guidelines with information about formats suitable for long-term storage and reuse: Empfehlungen für Forschungsdaten, Tools und Metadaten in der DARIAH-DE Infrastruktur (Recommendations for Research Data, Tools and Metadata in the DARIAH-DE Infrastructure, German only).
The DARIAH-DE Repository, as part of the Humanities Data Centre and of DARIAH-DE, supports long-term preservation for the following formats, which are widely used by the designated community and represent the major part of the stored data, as proposed in the nestor Catalogue of Criteria for Trusted Digital Repositories:
“Open, disclosed and frequently used formats are preferred as archive file formats, the assumption being that these will have a longer life, and there are more likely to be techniques and tools for converting or emulating them, given that they are supported by a wide circle of users.”
Those proposed file formats are listed below (see p. 26f.):
- for structured text: XML (http://www.w3.org/XML/) preferably TEI/XML
- for unformatted text (including csv and plain text): ASCII/Unicode
- for raster graphics: TIFF 6.0 (https://www.itu.int/itudoc/itu-t/com16/tiff-fx/docs/tiff6.html)
Further recommendations from the nestor criteria catalogue:
- for formatted text: PDF/A, ISO 19005-1:2005 (http://www.iso.org/iso/catalogue_detail?csnumber=38920)
- for audio formats: WAVE (http://msdn.microsoft.com/en-us/library/ms713498%28VS.85%29.asp)
- for video files: MPEG 4 File Format, ISO/IEC 14496 (https://www.mpeg.org/standards/MPEG-4/)
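For depositors who want to check a set of files against these recommendations before upload, a simple lookup can help. The following Python sketch is an illustration under stated assumptions: the mapping from file extensions to the listed formats is our own and is not part of the policy. An extension alone cannot confirm, for example, that a PDF is actually PDF/A or that a TIFF conforms to TIFF 6.0.

```python
from pathlib import PurePath

# Assumed mapping of common file extensions to the formats listed above.
PREFERRED_FORMATS = {
    ".xml": "structured text (XML, preferably TEI/XML)",
    ".txt": "unformatted text (ASCII/Unicode)",
    ".csv": "unformatted text (ASCII/Unicode)",
    ".tif": "raster graphics (TIFF 6.0)",
    ".tiff": "raster graphics (TIFF 6.0)",
    ".pdf": "formatted text (PDF/A)",
    ".wav": "audio (WAVE)",
    ".mp4": "video (MPEG-4 File Format)",
}


def recommended_category(filename):
    """Return the matching recommended format, or None if there is none."""
    suffix = PurePath(filename).suffix.lower()
    return PREFERRED_FORMATS.get(suffix)


for name in ("edition.XML", "scan_001.tif", "notes.docx"):
    print(name, "->", recommended_category(name))
```

Files for which the lookup returns `None` are not excluded from the repository. They are simply candidates for the consultation on file formats described in this policy.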
At all levels of the publication workflow and lifecycle of the DARIAH-DE Repository (as illustrated and described above in the section Data Reuse) support and consultation are guaranteed and given by staff members of the SUB and DARIAH-DE dealing with the DARIAH-DE Repository via:
- consultation of research projects via the helpdesk, mail communication and personal conversations
- stand-alone workshops or at conferences
- email for support
All cooperating projects and research projects using the DARIAH-DE Repository are entitled to an initial consultation at the outset, to help them start using the repository and to make them aware of issues relevant to publishing data in the repository. Consultation and support are commonly used by our designated community and highly recommended for the following issues:
- data ingest and data publication
- ingest of large amounts of data
- data description with relevant metadata (mandatory and optional)
- data quality and data reuse
- digital data: standards, file formats
- creating, editing and publishing of collections/objects
- use of the repository in general
- discovered bugs or needed adjustments for own research projects
Legal and Regulatory Framework
“Data, collections, or metadata that allow conclusions to be drawn about individual persons may not be imported unless the author obtains explicit confirmation from the persons concerned or their legal representatives that they are in agreement with publication in the DARIAH-DE repository. This confirmation must be presented to DARIAH-DE in writing”.
As data can only be uploaded after the authentication of the user (see DARIAH Authentication and Authorization Infrastructure), misuse can be traced back to the perpetrator. In case of misuse, DARIAH-DE reserves the right to delete the affected data from the repository.
After publication, the data is stored securely in the DARIAH-DE Repository and is publicly accessible. Following the open access policy of DARIAH-DE, Creative Commons licences are recommended to the community.
- All necessary rights and obligations of both parties are stated and confirmed.
- The repository has all necessary rights and permissions to undertake all necessary operations to ensure long-term preservation, accessibility and security of data.
- As far as sensitive data with disclosure risk are concerned, the depositor agrees not to publish those data in the repository without the permission of the data subject, whose rights have to be protected according to applicable law.
- Prohibitions on publishing and disseminating harmful, unlawful or offensive content are formulated, and consequences such as removal of data, exclusion from publishing access to the repository or further legal implications are stated.
- The depositor agrees to follow the community standards of good scientific practice as stated in the recommendations of the German Research Foundation (DFG) and to be subject to possible non-legal and legal implications if not.
Furthermore, most of the data stored in the DARIAH-DE Repository is provided with a licence, which determines the rights of use. If no licence is provided, German copyright applies. Every user who stores data in the DARIAH-DE Repository “own[s] all necessary rights to publish this collection including all data and metadata and to allow re-use by third parties” (see Publikator screenshot).
Important Ethical and Disciplinary Norms
The technical infrastructure of the DARIAH-DE Repository runs on a well-supported operating system. The hardware, software and technologies used are appropriate to serve research, teaching and learning nationally and internationally by providing long-term preservation, further processing, open sharing and dissemination of digital research data according to the ethical and scientific standards of the international research community. The Designated Community of the DARIAH-DE Repository follows good scientific practice as recommended by the German Research Foundation (DFG), as does the University of Göttingen. In practice this means, as highlighted publicly by the University on the respective website (https://www.uni-goettingen.de/en/604506.html) as well as in related documents:
- “The conduct of science rests on basic principles valid in all countries and in all scientific disciplines. The first among these is honesty towards oneself and towards others. Honesty is both an ethical principle and the basis for the rules, the details of which differ by discipline, of professional conduct in science, i.e. of good scientific practice” (Memorandum of the German Research Foundation DFG, 2013:67)
- Cooperation in scientific working groups must allow the findings, made in specialized division of labour, to be communicated, subjected to reciprocal criticism and integrated into a common level of knowledge and experience.
- Experiments, numerical calculations and analyses have to be reproducible, and therefore all important steps must be recorded.
- Primary data as the basis for publications shall be stored as securely as the publication itself.
- “appropriate methods are used and all results are consistently doubted by oneself,
- academic qualification work is actually based on personal contribution,
- preliminary academic work should be adequately considered and correctly cited,
- the authors listed in a publication have actually contributed substantially to the creation of the work,
- one’s own research data can be checked and used by others within the framework of standards customary in the respective field,
- scientists and scholars who teach and instruct meet their responsibility for communicating these principles and ensure adequate supervision.” (https://www.uni-goettingen.de/en/604506.html).
For more details see the relevant publications about Safeguarding Good Scientific Practice of the German Research Foundation as well as the guidelines of the University:
- Memorandum on Safeguarding Good Scientific Practice by the Commission on Professional Self Regulation in Science (German Research Foundation, the English version starts at page 61)
- DFG Leitlinien zur Sicherung guter wissenschaftlicher Praxis (2019)
- Rules Governing the Safeguarding of Good Scientific Practice (2016)
- Research Data Policy (2016)
Personal Sensitive Data with Disclosure Risk
The DARIAH-DE Repository recommends great care in dealing with sensitive data throughout the whole research lifecycle of collecting, handling and publishing data. Disclosure risk is not only an issue for data that directly reveals personal or sensitive information, but also for data that does so indirectly in combination with other data.
Sensitive data can be defined as data that has to be protected against disclosure for ethical or legal reasons. Safeguarding of sensitive data may also be related to personal privacy or proprietary issues. Due to its open access commitment, the DARIAH-DE Repository excludes the publication of sensitive data and data with disclosure risk that needs restricted access or may not be published according to applicable law. The depositor and author has to take care of the legal and ethical criteria related to personal privacy, according to applicable law. Authors and depositors have to be aware of new regulations resulting from the EU data protection directive.
More information and recommendations are available online:
- General Data Protection Regulation (European Law, EUR-Lex)
- Data Protection in the EU (European Commission)
- Law topic data protection (European Commission)
- Legal grounds for processing data (European Commission)
- Legal grounds for processing sensitive data (European Commission)
- Portal for licence information on research data (DARIAH-DE)
- Guide on legal issues for the humanities (DARIAH-DE Working Paper by Paul Klimpel and John H. Weitzman, language: German)
- Data licences for research data in the humanities (DARIAH-DE Working Paper by Beer, N. et al., language: German)
- Legal framework for research data (DataJus Project)