To answer a query from a colleague in Hungary, we prepared a short text explaining our web archiving project, which started some two years ago. As it turned out to be more informative than we expected, we think it could be useful to publish it here.

The Public Library Cacak and its Digitization Center started to archive our local cultural heritage on the Web in January 2009. By local cultural heritage on the Web we mean quality content sources, e.g. domains, that we have identified as valuable for archiving. So far we have a list of 130 websites that are archived every three months. We use the HTTrack web crawler for this task, because it is a simple tool suited to this kind and size of job. During the archiving process we restrict the crawler's access to the protected parts of the websites, following the rules set by administrators in robots.txt files. If these rules have not been set, or there is no robots.txt file on the web server, we assume that all of the content is open and publicly accessible, and we archive every type of content (generally speaking, pages with text and images; content that sits behind databases is inaccessible to HTTrack). We also configure the crawler to use only a certain amount of bandwidth, because we do not want to overload the archived websites with traffic and thus prevent other users from accessing them during the archiving process. We explain these technical details to the representatives of the websites when we contact them.
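The two crawl rules described above (respect robots.txt, treat a missing file as "everything is open", and stay under a bandwidth cap) can be illustrated with a small Python sketch. This is not how HTTrack implements them internally; the function names, the "HTTrack" user-agent string, and the example URLs are assumptions made only for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt):
    """Build a robots.txt policy from the file's text.

    Pass None (or an empty string) when the server has no robots.txt;
    the resulting policy then allows every URL, as described in the text.
    """
    parser = RobotFileParser()
    parser.parse((robots_txt or "").splitlines())
    return parser


def may_archive(policy, url, user_agent="HTTrack"):
    """Return True if the policy lets this user agent fetch the URL.

    The user-agent name is a placeholder assumption.
    """
    return policy.can_fetch(user_agent, url)


def throttle_pause(bytes_downloaded, elapsed_seconds, max_bytes_per_second):
    """Seconds to wait so the average transfer rate stays under the cap."""
    minimum_duration = bytes_downloaded / max_bytes_per_second
    return max(0.0, minimum_duration - elapsed_seconds)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    policy = build_policy(rules)
    print(may_archive(policy, "http://example.org/index.html"))
    print(may_archive(policy, "http://example.org/private/notes.html"))
    print(may_archive(build_policy(None), "http://example.org/private/notes.html"))
    print(throttle_pause(100_000, 1.0, 25_000))
```

In this sketch a crawler would call `may_archive` before each request and sleep for `throttle_pause(...)` seconds after each download; the cap itself is a value the archive operator would choose.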

As for other principles, most of them were developed to identify the web content we want to archive. That is, certain criteria guide our decision on whether a given piece of web content will become part of the web archive. The first criterion is that the content of the website has to be of cultural or historical significance for the Public Library Cacak, similar to the criteria for building Local History Collections. The second is that commercial presentations are generally excluded. We also established the principle of not archiving personal blogs, forums, or Wikipedia pages, nor any kind of improper content (such as content with sexual or racial connotations).

All of the content is stored locally, and we do not plan to provide access to it in the near future. One of the reasons is that we have to work on copyright issues. It has turned out to be very difficult to trace the authors or publishers of the websites, so fewer than 10% of authors have given us permission to archive and preserve their web content, simply because we did not receive any response from the majority of the people we contacted. We have prepared email letters explaining what we are doing, what our aims are, what web crawling and archiving are, etc., and we use them to contact potential authors and publishers. When we did manage to reach them, in almost 100% of cases we received a positive answer to our request.

Our web archive is rather small: only 130 domains and some 13 GB of locally stored content, amounting to over 255,000 stored files. These files are indexed during the archiving process, which is a good feature of the web crawler we use. The index files could and, hopefully, will be used to build a search engine as part of the future development of the web archive. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the original file names.

For web archiving practice, our Digitization Center mostly relies on the experience of colleagues at the National Library of the Czech Republic, the WebArchiv team, with whom we had the opportunity to work in November 2008. As web archiving is quite a new field in Serbia, we think that we are the only institution in Serbia that has done something in this direction, or at least we have not heard of anybody else. An important issue is that the law in Serbia does not cover this topic, so national institutions are probably waiting for a legal framework for web archiving. We know that the National Library of Serbia plans to archive the national domain, but due to insufficient resources this task is on hold for now.

09.12.2010. 01:47

