Robot errors

 
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
Robot errors - 5/1/2006 0:25:41   
I've not seen this before.
My site's error log shows hundreds of lines like the ones below, repeated every 2 hours across my entire site of 300+ pages: 1146 in two days.

[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/404.shtml
[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/equine_ulcers.htm/frequent_questions.htm

The IP 209.249.86.4 belongs to www.above.net

Any ideas on what is going on?
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 2:24:20   
Looking at your actual access logs (not error log), what User Agent were they reporting?

For instance, Googlebot reports itself like this:

66.249.66.71 - - [30/Apr/2006:18:00:54 +1000] "GET /contact.htm HTTP/1.1" 200 7254 "-" "GoogleBot/2.1"

and a visitor using Firefox might look something like this:

99.999.99.9 - - [30/Apr/2006:19:47:54 +1000] "GET /contact.htm HTTP/1.1" 200 5993 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"

The User Agent is the last quoted string on each line. If they are really nasty, the place where the UA should be will show as empty, like this: "-", so the log entry might look a bit like this:

99.999.99.9 - - [30/Apr/2006:19:47:54 +1000] "GET /contact.htm HTTP/1.1" 200 5993 "-" "-"

If they haven't stopped, or they come back to do the same thing, it is possible to block them using an .htaccess file - if you are hosted on Unix/Apache that is. I'm not sure how to do it on Windows servers.
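
If the access log is too big to read by eye, something like the rough Python sketch below could pick out the visitors that send no User Agent. It assumes the log is in the standard Apache "combined" format shown above; the pattern, the function names and the two sample lines are only illustrative, not anything your host provides.

```python
import re

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def has_no_user_agent(entry):
    """True when the User Agent field is empty or logged as "-"."""
    return entry["agent"] in ("", "-")


# Illustrative sample lines (placeholder addresses, not real visitors).
sample_lines = [
    '66.249.66.71 - - [30/Apr/2006:18:00:54 +1000] "GET /contact.htm HTTP/1.1" 200 7254 "-" "GoogleBot/2.1"',
    '99.999.99.9 - - [30/Apr/2006:19:47:54 +1000] "GET /contact.htm HTTP/1.1" 200 5993 "-" "-"',
]

# Collect every host that made at least one request without a User Agent.
anonymous_hosts = sorted(
    {
        entry["host"]
        for entry in map(parse_line, sample_lines)
        if entry is not None and has_no_user_agent(entry)
    }
)
print(anonymous_hosts)
```

To run it against a real log, you would read the lines from the log file instead of the sample list. Lines that do not fit the pattern are simply skipped.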

_____________________________

Kitka
**It is impossible to make anything foolproof because fools are so ingenious.**


(in reply to dasher)
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 3:28:33   
Here is the access log for one visit:

209.249.86.4 - - [30/Apr/2006:15:11:30 +1000] "GET /robots.txt HTTP/1.0" 200 168 "-" "-"
209.249.86.4 - - [30/Apr/2006:15:11:50 +1000] "GET /needs_for_stress.htm/tying_up.htm HTTP/1.0" 404 - "-" "-"
209.249.86.4 - - [30/Apr/2006:15:12:01 +1000] "GET /needs_for_stress.htm/shinsoreness.htm HTTP/1.0" 404 - "-" "-"
209.249.86.4 - - [30/Apr/2006:15:12:11 +1000] "GET /needs_for_stress.htm/frequent_questions.htm HTTP/1.0" 404 - "-" "-"

I guess from what you say this is a nasty one?

What is nasty? Should I block this site and why?

Rather than mess around with my .htaccess file, I can deny the site via cPanel. Or do you have the lines to add to the .htaccess file? Which is better?

(in reply to Kitka)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 4:07:14   
quote:

What is nasty? Should I block this site and why?


In your shoes, I would block it.

A good analogy I read somewhere goes like this:

I knock on your front door and ask to come in. You reply, saying "Who are you?" and I say, "I'm not telling you."

Would you let me inside your house? I wouldn't. This is very similar.

The User Agent string says who/what you are and only those who have something to hide will not say.

quote:

Rather than mess around with my .htaccess file, I can deny the site via cPanel. Or do you have the lines to add to the .htaccess file? Which is better?


As you have cPanel, it's probably easiest and best to use that. But don't just block 209.249.86.4. They may well come from a slightly different IP in future, so block the shortened form: 209.249.86.

That will catch all IPs between 209.249.86.0 and 209.249.86.255.
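
If you want to double-check which addresses the shortened form covers, here is a small Python sketch. It assumes the prefix 209.249.86. is equivalent to the network 209.249.86.0/24; the is_blocked helper is a made-up name for illustration only.

```python
import ipaddress

# Assumption: the prefix "209.249.86." covers the same addresses as this /24.
blocked = ipaddress.ip_network("209.249.86.0/24")


def is_blocked(address):
    """True when the address falls inside the blocked range."""
    return ipaddress.ip_address(address) in blocked


# First address, last address and size of the range.
print(blocked[0], blocked[-1], blocked.num_addresses)
print(is_blocked("209.249.86.4"), is_blocked("209.249.87.4"))
```

The first and last addresses it reports are 209.249.86.0 and 209.249.86.255, so a neighbouring range such as 209.249.87.x is not affected.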



(in reply to dasher)
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 7:41:53   
Thanks mate

Do you mean

209.249.86.
or
209.249.86

< Message edited by dasher -- 5/1/2006 7:48:23 >

(in reply to Kitka)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 17:12:01   
The first one:

209.249.86.



(in reply to dasher)
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 18:58:36   
One last thing.

Now getting heaps in my error log.

[Tue May 2 08:11:38 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/403.shtml
[Tue May 2 08:11:38 2006] [error] [client 209.249.86.4] client denied by server configuration: /home/ranvetau/public_html/pro_mix.htm

Will these eventually cease, or will they continue? Is there a way to stop them filling up the error log?

(in reply to Kitka)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 19:49:05   
You can't stop the errors being reported. That is what the error log is for, and it will report all errors, be they 404, 403, 500 etc.

However, you can reduce the errors a bit. You appear not to have a 403.shtml file, so every denied request also logs a second "File does not exist" error when the server looks for that page. In cPanel click on the "Error pages" icon and create one from there. That will also add an important bit of code to your .htaccess file, which will help direct a 403 error to the right page.



(in reply to dasher)
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 22:55:17   
I went to Error pages, selected 403, then selected IP address. This is now in my .htaccess:

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from 209.249.86.

All okay?

(in reply to Kitka)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 5/1/2006 23:17:40   
quote:

All okay?


Looks fine from here.

quote:

[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/404.shtml


This line seems to indicate you don't have a 404.shtml page set up either. You can create one the same way as the 403.

It is a good idea to put user-friendly text on there, because it is often seen by humans rather than bots. Always offer a way to find what they were looking for, like a link to your site map or index page and a search field.



(in reply to dasher)
dasher

 

Posts: 27
Joined: 9/17/2005
From: Australia
Status: offline

 
RE: Robot errors - 5/2/2006 1:38:40   
Thanks Kitka, you've been a great help.

Des

(in reply to Kitka)
jrouvier

 

Posts: 2
Joined: 6/8/2006
Status: offline

 
RE: Robot errors - 6/8/2006 1:01:42   
Greetings,

I am the admin for the spider in question running from 209.249.86.4. There was a bug in the spider that caused it to send out requests without a user-agent. The eng team has been thoroughly admonished for their carelessness, and the spider now correctly identifies itself as "Charlotte/1.0b". It does honor robots.txt and will go away if you tell it to. If for some reason it doesn't, or you have some other issue, I can be reached at charlotte@betaspider.com.

Sorry for any confusion.

-Joe Rouvier

(in reply to dasher)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 6/8/2006 1:40:00   
Hi Joe,

Many thanks for dropping by and clarifying the name of the spider. However, I will not be unbanning the IP any time soon - without a good reason to do so.

Firstly, you say it respects robots.txt, but I've read in another Webmaster forum that Charlotte ignores it. Secondly, there is no information about what your spider is actually collecting data for.

This is the User Agent I have seen on some of our sites:
quote:

Mozilla/5.0 (compatible; Charlotte/1.0b; charlotte@beta.spider.com)


And placing beta.spider.com or www.beta.spider.com in my browser gives an error saying that the URL could not be found.

Most reputable spiders give a URL detailing what their purpose is and how to disallow it via robots.txt e.g. Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Nobody in their right mind is going to email you for information when you could very well be spammers or scammers, in which case they would have neatly presented you with their private email address.

If you can convince me that having your spider crawl my sites is in some way beneficial for those sites, I'll happily unban it. :)



(in reply to jrouvier)
jrouvier

 

Posts: 2
Joined: 6/8/2006
Status: offline

 
RE: Robot errors - 6/8/2006 2:36:55   
Kitka,

Yeah, you caught the 2nd user-agent error. Right after it was discovered that we weren't sending a user agent, I ran down the hall and gave the developers a little lesson in internet etiquette. In their rush to enter the user-agent, they misheard me. I wish I owned the domain spider.com. If I did I could sell it and use the money to buy a house, or at least a nice car. Sadly, the domain is only betaspider.com, without the dot between beta and spider. Off the top of my head I'm not sure how long the incorrect user-agent was being sent for, but in any event it was too long :(

We are collecting pages for a search engine. It's like the other general-purpose engines such as Google, Yahoo, MSN, Ask... in that it helps people find websites relevant to what they are looking for. However, it differs in some very cool ways which, of course, I can't tell you about. I can tell you that we aren't harvesting email addresses, aren't going to re-brand and re-publish the content we crawl, or do any of those evil things.

As for it not honoring robots.txt, I know it's working for many people, but there very well could be a bug. If anyone can send me an example of a site with a valid robots.txt that we are crawling when we shouldn't be, I'd very much like to hear about it. If there is a bug, it has to be fixed post-haste. It should be noted that we don't request robots.txt before every request for a real page; we only grab it every 1000 documents or every hour, whichever comes first. So if you are expecting instant results, sorry, can't happen.

You make a good point about the email address. I hadn't considered that people wouldn't want to send email, but you are right, we should just give the URL. I'll open a bug on this. If I had to guess, I'd say that sometime next week when I get the next version drop we will be sending +http://www.betaspider.com/

Well, I don't expect you to unban the site, at least not yet. I can't prove to you that there is a value in having your site indexed right now, but once we launch I'm very sure you will unban the IPs. Might I make one suggestion though? The IPs we use may change at some point in the future; if you want to be really sure that we never hit your site again, send me your domain name(s) and I will add them to our exclusion list.

-Joe

(in reply to Kitka)
Kitka

 

Posts: 2513
Joined: 1/31/2002
From: Australia
Status: offline

 
RE: Robot errors - 6/8/2006 3:55:07   
quote:

if you want to be really sure that we never hit your site again, send me your domain name(s)


LOL! Good try at getting my email address, but it didn't work. :)

I'll take my chances with robots.txt and other methods thanks. :)
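
For anyone else going the robots.txt route, here is a rough Python sketch of rules that shut out this one spider while leaving everyone else alone, plus a check of how a well-behaved robot should read them. The "Charlotte" token is taken from the name given earlier in the thread, and www.example.com is a placeholder; whether the real spider matches on that token is an assumption, not something confirmed here.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: ban one named robot, allow all others.
ROBOTS_TXT = """\
User-agent: Charlotte
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """True when these rules permit the given robot to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("Charlotte/1.0b", "http://www.example.com/contact.htm"))
print(may_fetch("Googlebot/2.1", "http://www.example.com/contact.htm"))
```

This only shows what the rules say. A robot that ignores robots.txt will carry on regardless, which is why the IP block stays in place as well.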



(in reply to jrouvier)