jrouvier -> RE: Robot errors (6/8/2006 2:36:55)
Kitka,

Yeah, you caught the second user-agent error. Right after we discovered that we weren't sending a user agent, I ran down the hall and gave the developers a little lesson in internet etiquette. In their rush to enter the user-agent, they misheard me. I wish I owned the domain spider.com. If I did, I could sell it and use the money to buy a house, or at least a nice car. Sadly, the domain is only betaspider.com, without the dot between beta and spider. Off the top of my head I'm not sure how long the incorrect user-agent was being sent, but in any event it was too long :(

We are collecting pages for a search engine. It's like the other general-purpose engines such as Google, Yahoo, MSN, Ask... in that it helps people find websites relevant to what they are looking for. However, it differs in some very cool ways which, of course, I can't tell you about. I can tell you that we aren't harvesting email addresses, aren't going to re-brand and re-publish the content we crawl, and aren't doing any of those evil things.

As for it not honoring robots.txt, I know it's working for many people, but there very well could be a bug. If anyone can send me an example of a site that we are currently crawling despite a valid robots.txt that should keep us out, I'd very much like to hear about it. If there is a bug, it has to be fixed post-haste. It should be noted that we don't request robots.txt before each request for a real page; we only grab it every 1000 documents or every hour, whichever comes first. So if you are expecting instant results, sorry, can't happen.

You make a good point about the email address. I hadn't considered that people wouldn't want to send email, but you are right, we should just give the URL. I'll open a bug on this. If I had to guess, I'd say that sometime next week, when I get the next version drop, we will be sending +http://www.betaspider.com/

Well, I don't expect you to unban the site, at least not yet. I can't prove to you that there is value in having your site indexed right now, but once we launch I'm very sure you will unban the IPs. Might I make one suggestion though? The IPs we use may change at some point in the future, so if you want to be really sure that we never hit your site again, send me your domain name(s) and I will add them to our exclusion list.

-Joe
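
For anyone wondering what the robots.txt refresh rule mentioned above (re-fetch after 1000 documents or one hour, whichever comes first) looks like in practice, here is a rough Python sketch. It is only an illustration and not the crawler's actual code: the class name RobotsCache, the fetch and clock parameters, and the "betaspider" user-agent token are all assumptions made for the example. The parsing itself uses Python's standard urllib.robotparser module.

```python
import time
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one site's robots.txt rules (hypothetical helper, for illustration).

    The rules are re-fetched once max_docs URL checks have been made since the
    last fetch, or once max_age seconds have passed, whichever comes first.
    """

    def __init__(self, fetch, user_agent, max_docs=1000, max_age=3600.0,
                 clock=time.monotonic):
        self.fetch = fetch            # assumed callable returning robots.txt as text
        self.user_agent = user_agent  # token matched against User-agent lines
        self.max_docs = max_docs
        self.max_age = max_age
        self.clock = clock            # injectable so the policy can be tested
        self.parser = None
        self.docs_since_fetch = 0
        self.fetched_at = 0.0

    def _stale(self):
        # Nothing cached yet, document budget used up, or cached copy too old.
        if self.parser is None:
            return True
        if self.docs_since_fetch >= self.max_docs:
            return True
        return self.clock() - self.fetched_at >= self.max_age

    def _refresh(self):
        parser = RobotFileParser()
        parser.parse(self.fetch().splitlines())
        self.parser = parser
        self.docs_since_fetch = 0
        self.fetched_at = self.clock()

    def allowed(self, url):
        """Return True if the cached rules permit fetching url."""
        if self._stale():
            self._refresh()
        self.docs_since_fetch += 1
        return self.parser.can_fetch(self.user_agent, url)
```

With a policy like this, a change to robots.txt is only noticed at the next refresh, which is why a new Disallow rule can take up to an hour (or 1000 documents) to take effect.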