Repository Information

TextGrid

URL: TextGrid, TextGrid Beta
Repository framework: TextGrid serves as a virtual research environment for philologists, linguists, musicologists and art historians.
TextGridRep is middleware that implements a repository for research data in the humanities, embedded in a grid infrastructure with search and authorization facilities. At present, the repository is based on the XML database eXist, the RDF database Sesame and Grid storage technologies. The middleware is currently being developed by the TextGrid project.
There are plans to integrate Fedora as underlying repository in addition to using Grid technology.

Repository Functions:
TextGridRep allows for storing any kind of data (texts, images etc.). All resources are stored as information objects which have metadata attached to them. The repository provides upload tools, a service for creating, reading, updating and deleting resources (TG-crud) as well as tools for retrieval (TG-search) and for specifying and enforcing role-based access control (TG-auth*). All these services are available to users of the TextGridLab. The Grid infrastructure supports distribution and replication of the resources. A long-term preservation strategy is being developed in co-operation with other German grid projects and will be implemented. Part of this strategy is the deployment of permanent identifiers (PIDs) in liaison with the European Persistent Identifier Consortium (EPIC).
TextGrid acquired a comprehensive library of XML files for canonical German texts (literature, philosophy, history, sciences, art history, sociology, lexica, music, language and religion) and will make it available via its rich client platform as well as via a web-based interface.

Technologies and Programming Languages Used:
TextGridRep is mainly written in Java, with some parts implemented in PHP. It has a multi-layered architecture with a Web-Services-based interface (provided by TG-crud, TG-auth* and TG-search). Metadata about the resources (including full text indexes) are stored in the XML database eXist for retrieval; metadata about users, permissions, possible operations on resources etc. are stored in an LDAP directory. TextGridRep implements Role Based Access Control using this LDAP directory as backend and supports SAML-based authentication and authorization (Shibboleth). RDF is used for administrative metadata and relations between the resources.
TG-CRUD manages the objects within a Grid infrastructure, implementing an abstraction layer with the Grid Application Toolkit (GAT) framework. The Grid infrastructure itself is based on Globus Toolkit 4.0.8. All technologies used and developed by the project are open source.

Metadata:
A subset of Dublin Core is used together with additional RDF-based administrative metadata. Data and metadata are represented in XML. TextGrid developed a so-called baseline encoding for different genres of resources (prose, verse, dictionary entries etc.), which are essentially subsets of TEI P5. TextGridLab provides a metadata editor for creating metadata. Metadata can also be bulk-imported via an upload tool and automatically extracted via an ingest tool.

Images:
Currently, facsimiles are stored in TIFF format, but other formats like PNG or JPG are also supported in TextGrid. Resources are represented logically via a so-called TextGrid URI. These URIs are matched in the middleware with UUIDs by which the replicated resources are stored and can be accessed in the Grid.
The infrastructure supports MTOM for streaming large files.
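
The matching of one logical URI to several replica UUIDs can be pictured with a small sketch. This is not TextGrid code: the class name, its methods and the URI string below are hypothetical and only illustrate the idea of a logical identifier that resolves to one or more stored copies.

```python
import uuid


class ReplicaRegistry:
    """Illustrative lookup table from a logical URI to replica UUIDs."""

    def __init__(self):
        # logical URI -> list of UUIDs under which copies are stored
        self._replicas = {}

    def register(self, logical_uri):
        """Record a new replica for logical_uri and return its UUID."""
        replica_id = uuid.uuid4()
        self._replicas.setdefault(logical_uri, []).append(replica_id)
        return replica_id

    def resolve(self, logical_uri):
        """Return all replica UUIDs for logical_uri (KeyError if unknown)."""
        return list(self._replicas[logical_uri])


registry = ReplicaRegistry()
# "textgrid:example-1" is a made-up URI used only for this sketch.
first = registry.register("textgrid:example-1")
second = registry.register("textgrid:example-1")
```

Users and tools would only ever see the logical URI; which UUID is actually read is a decision left to the middleware.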

Timeline and Resources:
TextGrid is in its second project phase, funded by the federal government of Germany within the D-Grid initiative until May 2012. The current software is in beta stage and the 1.0 release is planned for January 2011. Currently, over 15 FTEs are funded. Together with the 1.0 release, TextGrid plans to establish a legal entity that will be responsible for sustainability after the current funding phase. Business models under consideration range from commercially offered services to further public funding. Currently, there is a strong commitment to Open Source and Open Access, although TextGrid is aware that a number of resources stored in the repository are subject to more restrictive licences.
Preservation:
One of the goals of the TextGrid project is to provide a long-term preservation service. A corresponding strategy is currently in development in co-operation with other D-Grid projects.

Development Wishlist:
After the 1.0 release new features will be developed. Some of the items on our agenda are: migration to Globus Toolkit 4.2 (and evaluation of GT 5), enhancements of the authorization technology to handle complex licence models, improvement of the authorization within the Grid by using SAML assertions, integration of Fedora Commons, and enhancement of high availability features.
Building a Community:
The development of a user community is an integral part of our project, and therefore our project proposal allocates resources specifically for public relations purposes: both internal and external workshops (for developers, content providers, and other interested parties), conference presentations and publications, press releases, newsletters, flyers and informational materials. Our project consists of 10 partner institutions and we are very open to international cooperation.

e-codices

URL: e-codices

Repository framework: Homecooked

Repository Functions:
Image serving (5 different resolutions, 3 different views)
Searching metadata (manuscript descriptions), 9 different categories
Browsing (5 different categories)
For more information see: this link

Technologies and Programming Languages Used:
- PHP (framework: CodeIgniter)
- MySQL
- XML (TEI-P5 for manuscript descriptions, "homecooked" XML configuration files)
- XSLT
- Lucene (framework: SOLR)
- JavaScript (Prototype - to be replaced by jQuery)
All technologies used by e-codices for the web application are open source

Metadata:
- Manuscript descriptions in XML format (TEI-P5)
- Schema: link
- oXygen XML Editor: link
For more information see: link

Images:
From May 2005 through September 2008 a Canon EOS-1Ds Mark II (16.7 Megapixel) camera mounted on a Graz model camera table was used for manuscript photography. In October 2008, the photography unit upgraded to two Hasselblad H3DII39 (39 Megapixel) cameras, each with its own camera table. The proprietary RAW files are converted into TIFF format files. These image master files are archived in three copies at different locations. The uncompressed TIFF files (Canon: ~47 MB; Hasselblad: ~100 MB) provide all necessary data for color management. For internet use, images have been converted into JPEG files at five separate resolutions (max. 3 MB).
Timeline and Resources:
The project is currently funded by:
- The Andrew W. Mellon Foundation
- The Stavros Niarchos Foundation
- E-lib.ch, the Electronic Library of Switzerland (a project of the State Secretariat for Education and Research)
- Digitization commissions by libraries and collections in Switzerland

The project is currently funded until the end of the year 2010. One of the goals for this year is to prepare a sustainability plan to ensure long-term availability of the repository. Our goal for e-codices is to remain open access for the scholarly community. Currently there are 13 people (total ca. 6 FTE) working for e-codices (3 photographers, 10 scientific collaborators).

The goal is to include all major libraries and collections that hold medieval and early modern manuscripts in Switzerland into e-codices. The long-term goal is to digitize all medieval (ca. 7000) and a selection of early modern manuscripts held in Switzerland and make them available on the internet.
Preservation:
A preservation strategy is still under development.

Development Wishlist:
- Cf. "Project Timeline and Resources"
- We would like to go beyond the scope of existing manuscript descriptions and offer a platform which will support additional research, much like a real library which is always a centre for research on the manuscripts it holds.
Scholars should be able to amend and correct manuscript descriptions directly from within the e-codices website or even create new scholarly descriptions from scratch. Discussion forums and the ability to create transcriptions and link to existing editions are other possibilities to make the website more interactive, so that it would move from a storehouse of digitized manuscripts to an interdisciplinary workspace. We think it would be an intriguing opportunity to offer one platform on which Anglicists, Art Historians, Musicologists and scholars from all manuscript-related disciplines could exchange information.

Building a Community:
e-codices is constantly communicating and exchanging ideas with the major players in the field of virtual manuscript libraries as well as the scholarly community. A "call for collaboration" was published on our website in June 2009, in which scholars were invited to suggest manuscripts for digitization. From among more than 150 proposals submitted, our internal evaluation process allowed us to select 40 manuscripts for digitization. Types of collaboration vary greatly, from writing new scholarly manuscript descriptions to developing tools for online manuscript study, to building connections with other virtual manuscript library websites.
Beyond this "call for collaboration" we are willing to provide our metadata for meta search purposes as well as our expertise for the advancement of community building for virtual manuscript libraries.

Oxford

URL: Many (editor's note: here's a good place to start)
Including:
Mediaeval Libraries of Great Britain (MLGB3)

Cultures of Knowledge (CofK)
Oxford and Cambridge Islamic Manuscript Collection Online (OCIMCO)
Future-Arch

Legacy:
Bodleian Library Western manuscripts to c. 1500
Early Manuscripts at Oxford University

Repository framework:
Fedora + homecooked

Repository Functions:
Preservation of arbitrary content
Overlaid with faceted search, triple store, message queuing toolset
Overlaid with domain specific websites, search-and-browse, machine-readable feeds

Technologies and Programming Languages Used:
Fedora, SOLR, 4-store, Python, PHP

Metadata:
RDF, DC as an LCD format, Mods for book-like materials, W3C geo and time mini-formats, Other standards such as TEI and EAD are stored if available
Tools are domain specific

Images:
"we take what we can get"
Lossless masters are generally TIFF/JPEG2000
Basic delivery format is JPEG (or PNG for limited tonality material)
Advanced delivery uses Luna or Zoomify
Timeline and Resources:
Many projects/many timelines
Preservation: Yes, multiple copies split geographically and technologically. Processes divorced from storage so servers are expendable.

Development Wishlist:
Better interoperability
More developed toolset
Building a Community: Yes, all of them [the resources at Oxford described above].

St Gall and Reichenau Manuscripts

URL: http://stgall.library.ucla.edu/stgallmss/index.html

Repository framework:
home-grown repository:
1. Oracle Database; Database Schema internally designed (2004-2006)
2. Windows servers host images and XML (content) in file system
3. Java programming for most functions
4. TEI/XML Manuscript Description for most metadata and text transcriptions
5. XSLT to transform XML into HTML
6. Increasing amounts of javascript and jquery
7. Zoomify (Flash) for zooming and display of images

Repository Functions:
1. Image serving
2. Zooming with Zoomify
3. Side-by-side coordinated page turning of digital text (left hand frame) and digital image (right hand frame)
4. Limited browse functions
5. No searching

Our more generic "manuscript viewer" offers page turning (no text display) and searching of high-level metadata. See: link

Technologies and Programming Languages Used:
Java, Oracle (commercial), TopLink, Zoomify / Flash (commercial), JavaScript, jQuery, XML, XSLT

Metadata:
TEI Manuscript Description, MODS, METS, Dublin Core, MARC, MARCXML, EAD

Images:
Filenaming conventions:
Each image is an item in the database, and each item has an ARK (Archival Resource Key); we have an ARK minter from the California Digital Library. Each file uses the ARK as a primary part of its filename. An image is renamed with an ARK when it is associated with a database record, and a "-1", "-2", etc. is appended to note whether it is the first, second, etc. image that has been associated with that record.
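
A minimal sketch of this naming convention follows. The ARK shown is invented, and both the way the ARK is made file-system safe (dropping the "ark:/" label and replacing "/" with "-") and the "tif" extension are assumptions for illustration; the source only states that the ARK forms the primary part of the filename and that a sequence suffix is appended.

```python
def ark_filename(ark, sequence, extension="tif"):
    """Build a file name from an ARK plus the image's position on its record."""
    if sequence < 1:
        raise ValueError("sequence numbers start at 1")
    # Assumption: strip the "ark:/" label and replace "/" so the
    # identifier can be used safely as part of a file name.
    prefix = "ark:/"
    if ark.startswith(prefix):
        ark = ark[len(prefix):]
    safe = ark.replace("/", "-")
    # "-1", "-2", ... marks the first, second, ... image on the record.
    return f"{safe}-{sequence}.{extension}"


# Hypothetical ARK, used only for this example.
second_image = ark_filename("ark:/99999/fk4example", 2)
```

With these assumptions, `second_image` is "99999-fk4example-2.tif".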

We use Zoomify to provide image zooming functions. Images are tiled into JPEGs and a Flash object is presented in the Web page.

For manuscripts our imaging standards are 600dpi RGB or equivalent.
Timeline and Resources:
We are completing our planning portion of this project in June 2010 (it is a two project). We will then begin an implementation phase of the project which will last until June 2012.
Resources included a portion of a programmer's time, a portion of my time, a portion of Stephen Davison's time, a postdoctoral fellow, 2 50% FTE graduate student researchers, and a portion of Prof. Pat Geary's time.
Preservation:
Images and XML will be preserved through whatever strategy all digital collections at UCLA are preserved. We are in the process of determining our preservation approach. Currently, all image masters are maintained on a nearline storage device with a staff-only filesystem. Only programmers and our repository system have write access to the system. Incremental tape backups take place nightly, and a full backup takes place every 90 days. Tapes are stored off site.

Development Wishlist:
Annotation tools for annotating both text and image ... at the level of a character in text and at a similar granularity for images. Searching tools. Increased browsing functions.
Building a Community:
I think we are very interested in community development. We are not veteran community developers, but nor are we neophytes. We have some programmer time I think we could dedicate -- depending on objectives. I can participate as a TEI expert. Stephen Davison, head of UCLA DLP, would be very interested in participating re: infrastructure development and shared goals, etc.

Roman de la Rose Digital Library

URL: http://romandelarose.org
Repository framework: Custom.
Repository Functions: The repository is dumb. It just provides preservation storage and is mounted over nfs for tools which use it. Data flows from the repository to be transformed and packaged into a Java web app. The rose site loads almost all data from the web app dynamically as needed. The exception is images which are served out by a separate image server. The image server monitors the repository for changes and makes the images available.

Technologies and Programming Languages Used:
Open: Java, Google Web Toolkit, Lucene, Flash, JavaScript, Java Servlets, Tomcat

Commercial: FSI Server, FSI Viewer

Metadata:
Tools: XML editors, spreadsheet programs, custom
Metadata formats: XML (TEI), spreadsheets (CSV), simple text files, html fragments

Images:
The images are all TIFFs. Almost all of them are 8 bit, a handful are 16 bit. Some follow our guidelines, some do not.

Our guidelines were:
All images should be delivered in uncompressed TIFF 24-bit RGB format (8 bits per color, 3 colors per pixel)
Images should have a resolution of 300 - 600 pixels per inch on both dimensions (i.e., the aspect ratio of the original artifact must be preserved).

The files have to be named according to a specific scheme in order to fit in the framework. The naming scheme has semantic information such as folio. An example would be Douce195.001r which is the recto of the first folio for Douce 195.
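
A sketch of how such a name could be split into its parts is given below. Only the single example above is documented, so the pattern (manuscript identifier, a dot, a zero-padded folio number, then "r" or "v") is an assumption generalized from it, and the function name is hypothetical.

```python
import re

# Assumed pattern, generalized from the one example "Douce195.001r".
IMAGE_NAME = re.compile(
    r"^(?P<manuscript>[A-Za-z]+\d+)\.(?P<folio>\d+)(?P<side>[rv])$"
)


def parse_image_name(name):
    """Split an image name into manuscript, folio number and side."""
    match = IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized image name: {name!r}")
    return {
        "manuscript": match.group("manuscript"),
        "folio": int(match.group("folio")),
        "side": "recto" if match.group("side") == "r" else "verso",
    }


parsed = parse_image_name("Douce195.001r")
```

Here `parsed` holds the manuscript "Douce195", folio 1 and side "recto". Names that do not fit the assumed pattern raise an error instead of being silently accepted.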

Project Timeline and Resources:
The grant is done. The rose digital library is now supported by the library. At the moment I work on maintaining the site and we are hiring a consultant to work on more image tagging.

Preservation Strategy:
The files making up the Rose digital library are stored in a hierarchy in a specified way in our preservation storage. That storage is automatically backed up to tape. The machine is a Sun CIS/2 which uses SAM-QFS.

Development Wishlist:
Build a community site to contribute metadata and otherwise annotate content.
Create a good JavaScript page turner using the GWT.
Improve workflow such that scholars could directly add metadata to the test site and then have it move into the preservation archive and production site.
Building a Community:
(no response)

Parker on the Web

URL: Project Page: http://parkerweb.stanford.edu; General about pages for the project are at: http://parkerweb.stanford.edu/parker/actions/page?forward=about_project

Repository framework: Homegrown

  • Linux and SunOS server environment
  • Apache web server and Tomcat application server
  • J2EE technologies (Java, JSP, Servlets, JDBC)
  • Struts web application MVC framework
  • Lucene for manuscript description and bibliography search
  • Aware JPEG2000 image server for on-the-fly image decoding
  • Javascript, jquery, prototype for page viewing environments
  • Oracle database for differential access
  • TEI for manuscript description metadata
  • FileMaker, EndNote databases for bibliographic metadata
  • Saxon XSLT processor for XML to HTML transformations

Repository Functions:
Major functions include:

  • search and discovery of catalog information for the collection
  • image-viewing (several flavors: page-turner environment, thumbnails, zpr view, static .jpg)
  • search and discovery of a significant bibliography of secondary literature associated with the collection

Technologies and Programming Languages Used:
The Parker on the Web site brings together several commercial, open source, and locally built technologies to create an application for viewing and searching digital images, manuscript descriptions, and related bibliographic references. The two major components of the application are the search engine and image presentation application.

Search and Results Display

  • The Apache Group's open-source Lucene search engine powers the search and browse sections of the site. Lucene is used to create indexes of XML-encoded manuscript descriptions and a large database of related bibliographic references. The indexes are updated as new manuscripts are added to the site and as project bibliographers identify new citations to Parker manuscripts in related bibliography.
  • Brief results summary pages and all supporting pages in the site are rendered using the Struts framework (http://struts.apache.org/), JavaServer Pages, and Java tag libraries. The long manuscript descriptions, which contain transcriptions of full James catalogue entries, are derived from an XSLT transformation of the XML-encoded descriptions using the Saxon XSLT processor.

Image Display and Page Turning

  • Providing users with a fast and versatile means of smoothly viewing, zooming, and panning large, high-resolution images of manuscript pages was a critical challenge for the project. The image presentation component of the site takes advantage of JPEG2000 image file compression and an image server application that dynamically decodes JPEG2000 files for viewing and manipulation in the web browser without the need for plug-in or other third-party software. Scanned TIFF images of manuscript pages are encoded as lossy JPEG2000 files using the Aware SDK. Aware's JPEG2000 image server application decodes images and presents them as JPEG tiles to the web browser dynamically.
  • The interactive page-turn view and the single-page inspect view are flexible, custom-built JavaScript applications that allow users to zoom, pan, and rotate images smoothly in the web browser.

Server Environment

  • The Parker on the Web application is hosted on a dedicated Linux server environment and runs in a single Tomcat instance.

Metadata:
Catalog information is XML marked-up to a modified TEI-P4 schema. Metadata was created using the oXygen XML editor.

Images:
Parker manuscript images were created by Cambridge University Library imaging staff, working in dedicated space at Corpus Christi College:

  • Using state-of-the-art digital capture technology
    • 3 custom-built rigs with cradles
    • 2 Linhof Master Technika 2000 4 x 5 cameras and 2 Anagramm PictureGate Salvadore digital backs (8000 x 9700 pixels, producing an average file size of 222 MB)
    • 1 Hasselblad camera and a PhaseOne P45 digital back (7230 x 5428 pixels, producing an average file size of 120 MB)
    • 120 mm lenses used for most work; 135 mm lens for special items
  • With imaging workflow supported by Capture One software
  • Using a color-managed process, baseline RGB/DLF Benchmark for Faithful Digital Reproductions of Monographs and Serials: color
  • Saving master files in uncompressed TIFF 6.0 format for preservation archiving
  • Produced in an environment that was safe and protective of each manuscript
  • Validated by stringent quality assurance procedures at each stage
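
The quoted file sizes can be roughly reproduced from the pixel dimensions. The sketch below assumes 3 color channels at 8 bits each and ignores TIFF headers and embedded metadata; the source does not state the bit depth or whether "MB" means 10^6 or 2^20 bytes, so this is a plausibility check rather than the project's own calculation.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def uncompressed_rgb_bytes(width, height, bits_per_channel=8):
    """Raw pixel data size of an uncompressed 3-channel image, in bytes."""
    return width * height * 3 * bits_per_channel // 8


# Pixel dimensions as given for the two digital backs.
anagramm = uncompressed_rgb_bytes(8000, 9700)
p45 = uncompressed_rgb_bytes(7230, 5428)

anagramm_mib = anagramm / MIB  # about 222
p45_mb = p45 / 1_000_000       # about 118
```

Under these assumptions the Anagramm images come to about 222 MiB, in line with the quoted average, and the P45 images to about 118 MB (roughly 112 MiB), close to the quoted 120 MB.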

Master image files were delivered to the team at Stanford University Libraries, where they underwent additional quality checks, image processing, and preservation archiving in the Stanford Digital Repository. Sub-masters and derivative images were produced by the Stanford team as follows:

  • Two sub-masters were created for each master image file, both LZW-compressed. Color correction, deskew, and rotation were applied as necessary. One sub-master remained uncropped to include a color bar and scale. The second sub-master received identical rotation, color correction, and deskew, and the color bar was cropped to present the images with minimal visible background.
  • A JPEG2000 derivative was produced for each sub-master. JPEG2000 files were encoded using the Aware codec at a PSNR of 46.
  • The average size for each 5-file set (1 master, 2 sub-masters, 2 derivatives) was 500 MB for the Anagramm-produced images and 400 MB for the P45-produced images.
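
For readers unfamiliar with the PSNR figure above (presumably in decibels), the standard definition is 10 · log10(MAX² / MSE), where MSE is the mean squared error between original and decoded samples and MAX is the largest possible sample value. The sketch below computes it for flat lists of 8-bit samples; it illustrates the metric only and is not the Aware codec's implementation.

```python
import math


def psnr(original, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if not original or len(original) != len(decoded):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)


# Toy example: every sample is off by exactly one level.
example = psnr([100, 100], [101, 99])
```

Higher values mean the decoded image is closer to the original, so encoding to a fixed PSNR target trades file size against a chosen level of fidelity.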

Project Timeline and Resources:

  • Parker on the Web is a multiyear project, which began in 2005 and was largely completed in September 2009 with the launch of Version 1.0 of the site. The project team is currently planning for semi-annual updates, the first of these being the 1.1 release on 27 April 2010. Release notes on this site describe features and fixes new to Version 1.1. The project continues to invite error reports, comments and questions, and encourages all users to communicate these using the contact us form.
  • Version 1.1 introduces enhanced functionality with the provision of "thumbnail" view, enabling display of the special physical features inherent in manuscript objects. These include: flaps, foldouts, spreads, bookmarks, spines and edges. In addition, a significant reshoot campaign was undertaken at the end of the formal project period in late Summer 2009. A portion of these reshoots have been integrated into version 1.1, while the remaining reshoots on hand will be integrated into the next incremental release of the site later in calendar year 2010. Corrections to manuscript descriptions and bibliographic entries are also incorporated in this release.
  • As with Version 1.0, 1.1 still offers two views of Parker on the Web, albeit with a modest improvement. In the first view, available without charge to all interested parties, users will see an image of each page in all the Parker manuscripts, a description of each manuscript, and be able to browse the collection by manuscript title and number. Viewing of images has been enhanced in this version with the specific inclusion of the "basic view" feature, providing users non-zoomable, yet significantly more legible page images in addition to page-turn view. As with 1.0, in Version 1.1 additional browsing, keyword and fielded searching, bibliography, and zoomable image viewing are available only to users covered by institutional licenses.

Preservation Strategy:
Master image files were delivered to the team at Stanford University Libraries, where they underwent additional quality checks, image processing, and preservation archiving in the Stanford Digital Repository. Project metadata was also ingested into the Digital Repository.

Development Wishlist:
This is a "pick list" for the moment that needs validation; many items require further vetting & will be influenced by some discussion & hopefully new direction resulting from the DMSS Tech Meeting at Stanford, 14-15 May 2010.

  • Reimplement search and browse functionality from a pure-Lucene implementation to Solr
  • Migration to Blacklight, including building of requisite common viewing environment
  • Implement facet categories if that is desirable (as part of Blacklight migration)
  • Implement requisite web services to support interoperability with other repositories
  • Additional article linking/clean-up (e.g., JSTOR)
  • Flash viewing alternative for tutorials
  • Pay-per-use access
  • OpenURL support for subscribing institutions
  • Download XML description for subscribed users
  • Add PDF links for digitized secondary sources (SU-scanned, Google-scanned)
  • Add James introduction
  • Add Descriptions of printed books in James catalogue
  • View conservation reports (needs content & specification)
  • Compare two images from the same or different manuscripts in one view
  • Highlight search terms in short & long manuscript description results
  • Link text of transcribed editions to manuscript images, and view image and transcribed text side-by-side (pending specification)
  • Personal annotation of manuscripts or pages
  • Personal saved searches
  • Personal bookmarking of manuscripts and bibliographic references
  • Create groups
  • Shared annotation, saved searches and bookmarks
  • Support for searching controlled value lists for author, names, place
  • Browse by place (with map)

Building a Community:
Parker on the Web will play a role in Stanford's efforts toward community development. We can offer a test version of the site as a sandbox for development efforts, development expertise, and metadata knowledge.

