Today I noticed my site getting a thorough spidering by the user agent “Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)”, apparently coming from an Amazon Web Services IP address, 72.44.62.136 (domU-12-31-37-00-02-76.usma3.compute.amazonaws.com).
The Page-store.com web site is minimal, with just a single page and a robots.txt that forbids all crawlers. It does describe what they are trying to do though, which is to spider everything and then sell some digested form of the gathered information on to new search engines so they don’t have to do the work themselves. In their words:
Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.
If nothing else it scores highly on the buzzword bingo scale.
And I notice they don’t say what user agent they honour in robots.txt.
Might have to ban them at the firewall instead.
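Before banning at the firewall you need the list of addresses the crawler is using. Here is a rough Python sketch that pulls them out of an access log. It assumes the log is in the common "combined" format, and that matching on the substrings "heritrix" and "page-store.com" (taken from the user agent quoted above) is enough to identify this crawler; the function name crawler_ips is made up for the example.

```python
import re

# Combined Log Format (assumed):
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings from the user agent seen in the logs; matching on them is an assumption.
CRAWLER_MARKERS = ("heritrix", "page-store.com")


def crawler_ips(log_lines, markers=CRAWLER_MARKERS):
    """Return the sorted list of client IPs whose user agent contains a marker."""
    found = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        if any(marker in agent for marker in markers):
            found.add(match.group("ip"))
    return sorted(found)
```

Feed it the lines of your access log (an open file object works) and it returns the addresses to hand to whatever firewall you use. Since the crawler runs on rented cloud machines, the addresses may well change over time.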
Tonight (26 July 07) I noticed the exact same spider crawling my site for the first time.
It was following every link and ignoring my robots.txt, so the easiest remedy was to block the address on my firewall. I also sent an e-mail to Amazon, but I don’t expect a reply.
Same story on a site that I manage. It tried to browse a directory protected by sessions and cookies; my site just directed it to a login page.
The curious thing, though, is that it tried to access pages that are not linked from anywhere and don’t have particularly obvious names. I wonder how it knew the pages were there? My logs are not public, but perhaps some of my users have left their tracks somewhere.
When you get one, try this URL:
http://XXX.XXX.XXX.XXX:8080/.
Replace the XXX with the IP of the crawler. Don’t forget the ending DOT.
Maybe you can shut it down. If you figure out how, please post it here.
Our site http://www.kantelpunten.com is currently being visited by this mystery page-store. Unfortunately I cannot change the firewall settings. It does not seem to respect any rules, because it is also following links with rel=”nofollow”. Our site is completely dynamic and there are a lot of those links, which should not be cached at all (it is not looping, but they will be busy for a long while). Currently I am returning them a page telling them not to crawl us (so it will not find any new links). But I’m thinking of writing a script that jumps to other generated pages forever and fills up their sellable database with bogus data.
I assume they are not crawling just for fun. So does that mean that somebody wants to buy our content and make a similar site?
Robots and crawlers for search engines are welcome. But I want them to follow robots.txt and other directives, otherwise they will be banned from my site.
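For anyone else who cannot touch the firewall, banning can also be done inside the application. Below is a minimal sketch for a Python WSGI site; it is not what our site actually runs. It assumes the crawler keeps sending a user agent containing "heritrix" or "page-store.com", and the name block_crawler is just made up for the example.

```python
# Substrings from the user agent reported in this thread; matching on them is an assumption.
BLOCKED_MARKERS = ("heritrix", "page-store.com")


def block_crawler(app, markers=BLOCKED_MARKERS):
    """Wrap a WSGI app so matching user agents get a 403 instead of a page."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(marker in agent for marker in markers):
            body = b"Crawling is not permitted.\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]  # no links in the response, so nothing new to follow
        # Everyone else reaches the real application unchanged.
        return app(environ, start_response)

    return middleware
```

You would wrap your existing application object once, for example application = block_crawler(application). It only works as long as the crawler identifies itself honestly, so it is weaker than a firewall rule.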
There is one other thing I noticed. This crawler remembers cookie information and uses it in its requests.
Does anyone know what they are doing with the crawled data? For example what is the selling price?