National Library of Australia Web Crawl

Michael Still mentions my experience with the NLA crawler and, along with Steven Hanley, speculates about how the NLA is choosing sites to crawl.

Looking at my Apache logs, I can see that there is a reference in their browser info to this webpage about their current crawl, which says:

While the Library and its PANDORA partner institutions have been selectively archiving online publications since 1996, this current and first comprehensive crawl of the Australian web domain was begun in June 2005. For the purpose of this collection, the Australian web domain includes .au domain sites. In addition some sites identified by DNS lookup as having an IP address in located in Australia may be included.

The really interesting thing is that my website satisfies neither criterion, being a .org and hosted in the US. It may be, though, because the little box on this end of my ADSL connection (which redirects everyone to the main site if they forget the www at the start of the URL) does indeed have an Australian IP address.

However, their overview page says:

The purpose of the PANDORA Archive is to collect and provide long-term access to selected online publications and web sites that are about Australia, are by an Australian author on a subject of social, political, cultural, religious, scientific or economic significance and relevance to Australia, or are by an Australian author of recognised authority and make a contribution to international knowledge.

If that’s why I’m in there then I’m really flattered!

Of course, it’s much more likely to just be the fact that I got a link from here… 😉