Long Term Archiving of Digital Documents in Physics

IUPAP Workshop at Lyon, 4-6 November 2001
Session 4 The View of End-User Physicists


 

Physics Archiving: Requirements, Perspectives, and some Approaches in Germany

Eberhard R. Hilf, Institute for Science Networking ISN; Oldenburg, Germany

The paper will be out by 15th November 2001.

http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/

Abstract

It is argued that what physics primarily needs for long-term archiving is not the original documents, peer reviewed or not, but the full content of actual program codes and experimental data on the one hand, and on the other the physics content of documents insofar as the authors, their institutions, or the physics community want it archived. This calls for a large effort in reconditioning information by quality filtering, such as multiple levels of vetting, peer reviewing, commenting, condensing, and especially rewriting the findings and results so that the content can be used effectively in actual research and teaching.

 

Content

  1. Keypoints for LTADDP as seen from the author/user
  2. Requirements as seen from the user and the author
  3. Some approaches for long term archiving in Germany
  4. What we do at the Institute for Science Networking

Keypoints for Long Term Archiving in Physics as seen from author/user

  1. Analysis:
    1. Physics is the science of the laws of nature, explored by quantitative experiments and mathematical modelling.
    2. Knowledge needs to be archived in the most applicable, retrievable, complete, understandable, effective, professional, and quality-filtered form, not just as the authors originally handed it in.
    3. Research is served mostly by means of communication: alerting, preprints, recent reviewed articles, professional search engines. Thus there is no apparent need for long-term archiving in physics. But in the heat of communication we do not know which part will finally be picked as worth archiving, so the creator of a document already has to ensure that it meets the standards necessary for LTADDP.
    4. LTADDP is needed mostly for person-related reasons:
      1. history of science
      2. hall of fame
      3. examination theses as the entry into professional life
  2. Actions for LTADDP:
    1. Physics societies urge their members to recondition their past contributions to physics into a condensed, compact, complete, accurate, readable, and re-usable form.
    2. Publishers stimulate authors to write present knowledge in a condensed, …… form.
    3. Libraries teach and enforce the use of metadata.
    4. Research institutions support development of intelligent content search and retrieval.
    5. International agreements are needed to standardize the workflow, the multiple quality-filter steps, and their metadata flow, for the path from "first publish, then evaluate" to the final condensed long-term archiving.
    6. Enforce a copyright policy under which the first publication has to be open, either by the author through a preprint server or by his/her institution and its library.
    7. Ensure that only open, non-proprietary document formats are used for submitted documents.
  3. Some steps in Germany and Europe
    1. Metadata: The IuK initiative (Information and Communication) of the learned societies promotes strict use of metadata and works on practical tools: the MyMetaMaker web form helps the author or library add correct DC (Dublin Core) metadata to a document, encoded in RDF and XML.
    2. OAi (Open Archives Initiative): DINI, the German Initiative for Network Information, appealed to the university libraries to serve as OAi document providers and organized classes in which we taught how to install OAi-compliant data and service providers. Our ISN OAi-compliant document provider was approved on 13 April 2001. Our OAi service provider is online. The best OAi service provider is run by the University Library of Tübingen.
    3. A successful keyword-matching tool from Physics (PACS) to Mathematics (MSC) and back is now online (project: Carmen).
    4. Multilevel quality filters: GAP (German Academic Publishers) is to have a fine grid of quality filters (author, group, institute, university, peer reviewing).
    5. Long Term Archiving for Theses (Dissertations): In Germany a full scheme has been set up for all learned fields: the complete workflow from candidate through faculty and university library to LTA by the National Library (DDB), with a mutually agreed set of rules for acceptable formats (with the original file kept) and metadata (who is allowed to enter which items, and how).
    6. Portals to Physics are set up by the DPG in collaboration with FIZ Karlsruhe, TIB, and ISN. Responsibility lies with the new DPG working group Information.
    7. Responsibility stays with the author and his/her institution: a distributed document provider, PhysDoc, has been operating for the EPS for some years, offering surf-or-search access to about 100,000 physics documents residing on the websites of about 2,000 physics institutions worldwide. It is served by the EPS through an international consortium of technicians, led by ISN.
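
As a rough illustration of the metadata item in the list above, a web form of this kind might emit a Dublin Core record in RDF/XML along the following lines. This is a minimal sketch and not the actual MyMetaMaker output: the function name `dc_record`, the choice of fields, and the field values are assumptions made for illustration; only the RDF and Dublin Core namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to serialize with the usual rdf: and dc: prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dc_record(url, fields):
    """Return an RDF/XML string describing the document at `url`.

    `fields` is a list of (dublin_core_element, value) pairs,
    for example ("creator", "Hilf, Eberhard R.").
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": url}
    )
    for name, value in fields:
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record for this talk; the field selection is an assumption.
record = dc_record(
    "http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/",
    [
        ("title", "Physics Archiving: Requirements, Perspectives, "
                  "and some Approaches in Germany"),
        ("creator", "Hilf, Eberhard R."),
        ("date", "2001-11-05"),
    ],
)
print(record)
```

Because every field is a plain element in a public namespace, any library or service provider can read such a record back without knowing which tool wrote it.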

Some additions:

      1. Physics research and teaching need information exchange that is full, instant, and reusable, with no financial barriers or proprietary gateways.
      2. Physics needs interactive, easy-to-use means of mathematical communication.
      3. Physics needs content search tools: a search for S and one for ∫L dt (two notations for the action) should yield the same documents, independent of which one is written there.
      4. Physics needs very professional short-term archiving with powerful full-content search tools that understand physics content (new intelligent search engines).
      5. Long-term archiving needs secure retrieval without loss of information, even if the creator or his/her publisher is no longer there. That calls for non-proprietary formats and protocols.
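
The third point, that different written forms of the same quantity should retrieve the same documents, can be sketched with a toy index that normalizes each notation to one concept label before indexing and before searching. Everything here is hypothetical: the table `CANONICAL` is hand-written, the document identifiers are invented, and a real engine would need context to resolve ambiguous symbols (S can also denote entropy, for instance).

```python
# Hypothetical table: each written variant maps to one canonical concept label.
CANONICAL = {
    "S": "action",
    "Ldt": "action",
    "∫L dt": "action",
    "action": "action",
    "H": "hamiltonian",
    "hamiltonian": "hamiltonian",
}

# Invented documents, each reduced to the notation it happens to use.
DOCUMENTS = {
    "doc-1": ["S", "H"],
    "doc-2": ["∫L dt"],
    "doc-3": ["hamiltonian"],
}


def build_index(documents):
    """Map each canonical concept to the set of documents mentioning it."""
    index = {}
    for doc_id, terms in documents.items():
        for term in terms:
            concept = CANONICAL.get(term, term)
            index.setdefault(concept, set()).add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way as the documents, then look it up."""
    concept = CANONICAL.get(query, query)
    return sorted(index.get(concept, set()))


INDEX = build_index(DOCUMENTS)
print(search(INDEX, "S"))
print(search(INDEX, "∫L dt"))
```

Both queries return the same list because the normalization is applied on both sides, at indexing time and at query time.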

OAD: Progress in OpenArchive Approach and Status

PhysDoc is by now an OAi-compliant data and service provider. The concept is that the authors' institutions archive the documents themselves, with a connection to the repositories of the physics societies (IoPP) as the long-term archive for the prime refereed papers.

What we do at the Institute for Science Networking

Searching needs to be independent of source and ownership. Three policy lines are emerging:

  1. Distributed queries, sent out by a search engine to all document providers (publishers).

Technically, a wrapper has to be written for each document provider. The provider's output is then parsed for retrieval.

MetaPhys by P. Borrmann has been a good example, collecting from all major physics publishers.

The advantage of this approach is that it does not need any agreement or consent from the document providers.

The disadvantage is that providers change their page layouts and the way they present document information, so the wrappers have to be readjusted from time to time.
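
The wrapper idea can be sketched as follows. This is an illustration under stated assumptions, not the MetaPhys code: the result page `SAMPLE_PAGE`, its `class="hit"` markup, the titles, and the host `provider.example` are all invented, and the sketch parses a stored page instead of sending a live query.

```python
from html.parser import HTMLParser

# Invented result page of a hypothetical provider; real layouts differ
# from provider to provider and change over time.
SAMPLE_PAGE = """
<html><body>
<div class="hit"><a href="/abs/1">Example paper A</a></div>
<div class="hit"><a href="/abs/2">Example paper B</a></div>
<div class="ad"><a href="/promo">Subscribe now</a></div>
</body></html>
"""


class HitParser(HTMLParser):
    """Collect (title, href) pairs from <div class="hit"><a href=...>."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._in_hit = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self._in_hit = attrs.get("class") == "hit"
        elif tag == "a" and self._in_hit:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside a link that belongs to a hit is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.hits.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "div":
            self._in_hit = False


def wrap_provider(page, base_url):
    """Turn one provider's result page into uniform records."""
    parser = HitParser()
    parser.feed(page)
    return [{"title": t, "url": base_url + h} for t, h in parser.hits]


results = wrap_provider(SAMPLE_PAGE, "http://provider.example")
for r in results:
    print(r["title"], r["url"])
```

The sketch also shows the weakness named above: if the provider renames the `hit` class or restructures the page, the wrapper silently returns nothing until it is readjusted.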

  2. The Open Archives Initiative (OAi) concept: OAi defines one single set of rules for the gateway through which document information is exchanged between document providers and service providers.

The advantages are that all document providers share the same non-proprietary exchange protocol and the same defined set of metadata. They consent to any service provider pulling all their sets of metadata, and they consent to share all software developed. Thus the document databases complying with OAi are most easily and fully accessible worldwide to anyone, which is what researchers want, and their documents are the most likely to be read.

The disadvantages are that the shared metadata set is rather basic and that there are only a few OAi-compliant service providers, since these compete through their add-on services and are not willing to give their software away to competitors. We assume that learned societies and university libraries are the most willing to comply; see the good example of the University of Tübingen, D3P.
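
The pull model described here can be sketched from the service provider's side: build a request for a data provider's records, then read the returned metadata. The sketch is written against OAI-PMH 2.0, a later version of the protocol than the one in use at the time of this talk, so request syntax and namespaces may differ from what was deployed then. The base URL `http://archive.example/oai` and the sample response are invented, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces of OAI-PMH 2.0 and of its unqualified Dublin Core format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a service provider sends to pull all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hand-written sample of what a data provider might return (invented data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [c.text for c in rec.findall(".//dc:creator", NS)],
        })
    return records


url = list_records_url("http://archive.example/oai")
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

In contrast to the wrapper approach, the same two functions work for every compliant data provider, but only the basic shared metadata set is available.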

  3. Bilateral negotiations of individual document providers with service providers on the amount of exchanged metadata and links.

The advantages are that individual and thus optimally fitting agreements can be found, that the document provider (publisher) keeps all the rights, and that it can define different agreements for different service providers, e.g. depending on whether they are commercial or not-for-profit.

The key part of any such agreement is that the service provider promises not to hand the sets of metadata transferred to it to any other service provider without the explicit consent of the document provider.

  1. We act as document provider through PhysDoc, distributing and retrieving document information from the multitude of about 2,000 documents and lists of documents, thus about 100,000 documents, with 40,000 entries, all located at the individual physics institution of the author.

We act as service provider via MetaPhys, using servlets.

  2. For OAi we act as document provider, extracting from the distributed PhysDoc system those about 1,000 documents which have correct metadata. We were officially registered as OAi compliant on 13 April 2001.

We also act as an OAi-compliant service provider, PhysDoc-OAD, to be officially registered next month. At PhysDoc-OAD we serve the retrieval of

    1. about 1,000 documents of PhysDoc,
    2. about 50,000 documents of ArXiv (2001, 2000 and some more),
    3. about 90,000 documents of IoPP.

In total, we serve about 120,000 documents at present, and the number is increasing. Of these, we serve the document information of virtually all IoPP journals. Retrieval leads to the IoPP repository, and it then depends on the arrangement between the user and the respective IoPP journal whether the full document is retrieved or not.

Cooperation with Publishers

We praise the smooth, competent, and fruitful cooperation with the Institute of Physics Publishing, UK, which is useful for both sides. PhysDoc II helps increase the readership of documents in the IoPP journal stack. It serves these in the context of related documents from other providers, thereby focusing on competition among providers in quality and add-on services. During the transfer of the 90,000 metadata files, only about 100 were found to have incorrect XML encoding (mostly through misunderstanding of mathematical symbols which have a different meaning in XML). These were corrected at ISN (one week of work) and transferred back to IoPP for future use. The concept of a bilateral agreement on the exchange of metadata stacks allows a professional adaptation to the individual needs of the partners.
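
The kind of encoding error mentioned above can be reproduced in a few lines. The source does not say which symbols were involved; this sketch assumes the usual suspects, the characters `<`, `>` and `&`, which are mathematical symbols in a title but markup in XML. The title text and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def title_element(title, escape_markup):
    """Wrap a title in a <title> element, optionally escaping XML markup."""
    body = escape(title) if escape_markup else title
    return "<title>" + body + "</title>"


# Invented title containing characters that XML reserves for markup.
raw_title = "Bounds for a < b & c > d"

broken = title_element(raw_title, escape_markup=False)
fixed = title_element(raw_title, escape_markup=True)

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(ET.fromstring(fixed).text)
```

Escaping at the point where metadata is written is cheap; finding and repairing the unescaped records afterwards is the manual work described above.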

Keyword mapping

The theory of one field calls for tools to find the corresponding keywords of the other, and vice versa.

Here are some examples produced by our code which satisfied the users best and which you might not have expected.
 
MSC 78A60 Lasers, masers, optical bistability, nonlinear optics <=> PACS 03.30.+p Special relativity

PACS 62.30.+d Mechanical and elastic waves; vibrations <=> MSC 74S15 Boundary element methods

PACS 03.65.-w Quantum mechanics <=> MSC 47-XX Operator theory

MSC 76M10 Finite element methods <=> PACS 44.10.+i Heat conduction
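
Once such pairs exist, they can be served from a simple two-way lookup table. The sketch below only shows how the four example pairs above could be stored and queried in either direction; it says nothing about how the project's code derived them, and the names `build_maps` and `translate` are hypothetical.

```python
# (MSC code, PACS code) pairs copied from the examples listed above.
PAIRS = [
    ("78A60", "03.30.+p"),
    ("74S15", "62.30.+d"),
    ("47-XX", "03.65.-w"),
    ("76M10", "44.10.+i"),
]

# Classification labels as given in the examples.
LABELS = {
    "78A60": "Lasers, masers, optical bistability, nonlinear optics",
    "03.30.+p": "Special relativity",
    "74S15": "Boundary element methods",
    "62.30.+d": "Mechanical and elastic waves; vibrations",
    "47-XX": "Operator theory",
    "03.65.-w": "Quantum mechanics",
    "76M10": "Finite element methods",
    "44.10.+i": "Heat conduction",
}


def build_maps(pairs):
    """Build one lookup table per direction from the list of pairs."""
    msc_to_pacs, pacs_to_msc = {}, {}
    for msc, pacs in pairs:
        msc_to_pacs.setdefault(msc, []).append(pacs)
        pacs_to_msc.setdefault(pacs, []).append(msc)
    return msc_to_pacs, pacs_to_msc


MSC_TO_PACS, PACS_TO_MSC = build_maps(PAIRS)


def translate(code):
    """Return the partner codes of `code`, whichever scheme it belongs to."""
    return MSC_TO_PACS.get(code, []) + PACS_TO_MSC.get(code, [])


for code in ("76M10", "03.65.-w"):
    for partner in translate(code):
        print(f"{code} {LABELS[code]} <=> {partner} {LABELS[partner]}")
```

Lists are used as values so that one code can later be given several partners in the other scheme.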

 

Citation, copyright and link of this document

This document may be copied, distributed, downloaded in any way, even for talks and slides, or quoted, as long as its content is not changed in any way and the source is correctly cited as
E. R. Hilf; Physics Archiving: Requirements, Perspectives, and some Approaches in Germany;
http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/;
at Long Term Archiving of Digital Documents in Physics,
IUPAP Workshop 5.Nov.2001.