dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
Robot errors - 5/1/2006 0:25:41
I have not seen this before. My site error log shows hundreds of lines like these, repeated every 2 hours across my entire site of 300+ pages: 1146 in two days.

[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/404.shtml
[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/equine_ulcers.htm/frequent_questions.htm

The IP 209.249.86.4 is www.above.net. Any ideas on what is going on?
Kitka
Posts: 2513 Joined: 1/31/2002 From: Australia Status: offline
RE: Robot errors - 5/1/2006 2:24:20
Looking at your actual access logs (not the error log), what User Agent were they reporting? For instance, Googlebot reports itself like this:

66.249.66.71 - - [30/Apr/2006:18:00:54 +1000] "GET /contact.htm HTTP/1.1" 200 7254 "-" "GoogleBot/2.1"

and a visitor using Firefox might look something like this:

99.999.99.9 - - [30/Apr/2006:19:47:54 +1000] "GET /contact.htm HTTP/1.1" 200 5993 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"

The last quoted string in each entry is the User Agent. If they are really nasty, the place where the UA should be will show as empty, like this: "-", and so you might see the log entry looking a bit like this:

99.999.99.9 - - [30/Apr/2006:19:47:54 +1000] "GET /contact.htm HTTP/1.1" 200 5993 "-" "-"

If they haven't stopped, or they come back to do the same thing, it is possible to block them using an .htaccess file, if you are hosted on Unix/Apache that is. I'm not sure how to do it on Windows servers.
_____________________________
Kitka **It is impossible to make anything foolproof because fools are so ingenious.**
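To make the point about access logs concrete, here is a minimal Python sketch that pulls the User Agent out of a log line and flags requests that sent none. It assumes the log is in Apache's "combined" format, which the examples in this thread appear to follow. The regular expression and the function names `user_agent` and `is_anonymous` are illustrative only and are not part of any tool mentioned here.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def user_agent(line):
    """Return the User Agent field, or None if the line is not in the assumed format."""
    match = COMBINED.match(line.strip())
    return match.group("agent") if match else None


def is_anonymous(line):
    """True when the request carried no User Agent (logged as "-" or empty)."""
    agent = user_agent(line)
    return agent is not None and agent in ("", "-")


# Sample lines adapted from the examples in this thread.
SAMPLES = [
    '66.249.66.71 - - [30/Apr/2006:18:00:54 +1000] '
    '"GET /contact.htm HTTP/1.1" 200 7254 "-" "GoogleBot/2.1"',
    '209.249.86.4 - - [30/Apr/2006:15:11:30 +1000] '
    '"GET /robots.txt HTTP/1.0" 200 168 "-" "-"',
]

for sample in SAMPLES:
    print(user_agent(sample), is_anonymous(sample))
```

Running `is_anonymous` over every line of an access log and counting hits per client IP would give a quick list of visitors that never identify themselves.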
dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
RE: Robot errors - 5/1/2006 3:28:33
Here is the access log of one hit:

209.249.86.4 - - [30/Apr/2006:15:11:30 +1000] "GET /robots.txt HTTP/1.0" 200 168 "-" "-"
209.249.86.4 - - [30/Apr/2006:15:11:50 +1000] "GET /needs_for_stress.htm/tying_up.htm HTTP/1.0" 404 - "-" "-"
209.249.86.4 - - [30/Apr/2006:15:12:01 +1000] "GET /needs_for_stress.htm/shinsoreness.htm HTTP/1.0" 404 - "-" "-"
209.249.86.4 - - [30/Apr/2006:15:12:11 +1000] "GET /needs_for_stress.htm/frequent_questions.htm HTTP/1.0" 404 - "-" "-"

I guess from what you say this is a nasty? What is nasty? Should I block this site and why? Rather than screw around with my .htaccess file I can via cpanel deny the site or do you have the write for the htaccess file. Which is better?
Kitka
Posts: 2513 Joined: 1/31/2002 From: Australia Status: offline
RE: Robot errors - 5/1/2006 4:07:14
quote:
What is nasty? Should I block this site and why?

In your shoes, I would block it. A good analogy I read somewhere goes like this: I knock on your front door and ask to come in. You reply, saying "Who are you?" and I say, "I'm not telling you." Would you let me inside your house? I wouldn't. This is very similar. The User Agent string says who/what you are and only those who have something to hide will not say.

quote:
Rather than screw around with my .htaccess file I can via cpanel deny the site or do you have the write for the htaccess file. Which is better?

As you have cPanel, it's probably easiest and best to employ that. But don't just block 209.249.86.4. They may well come from a slightly different IP in future, so block the shortened form: 209.249.86. That will catch all IPs between 209.249.86.0 and 209.249.86.255.
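As a rough illustration of what the shortened form covers, the Python sketch below treats "209.249.86." as the first three octets of the address, which is the same set of addresses as the CIDR block 209.249.86.0/24. That equivalence is an assumption about how the deny rule is interpreted, and the helper name `is_blocked` is made up for this example.

```python
import ipaddress

# Assumption: the shortened form "209.249.86." means "any address whose first
# three octets are 209.249.86", i.e. the CIDR block 209.249.86.0/24.
BLOCKED = ipaddress.ip_network("209.249.86.0/24")


def is_blocked(addr):
    """True when addr falls inside the blocked range."""
    return ipaddress.ip_address(addr) in BLOCKED


print(BLOCKED.num_addresses)       # 256
print(BLOCKED[0], BLOCKED[-1])     # 209.249.86.0 209.249.86.255
print(is_blocked("209.249.86.4"))  # True
print(is_blocked("209.249.87.4"))  # False
```

Blocking the whole range costs nothing extra and still leaves neighbouring ranges such as 209.249.87.x untouched.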
dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
RE: Robot errors - 5/1/2006 7:41:53
Thanks mate. Do you mean "209.249.86." (with the trailing dot) or "209.249.86" (without it)?
< Message edited by dasher -- 5/1/2006 7:48:23 >
Kitka
Posts: 2513 Joined: 1/31/2002 From: Australia Status: offline
RE: Robot errors - 5/1/2006 17:12:01
The first one: 209.249.86.
dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
RE: Robot errors - 5/1/2006 18:58:36
One last thing. Now getting heaps in my error log.

[Tue May 2 08:11:38 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/403.shtml
[Tue May 2 08:11:38 2006] [error] [client 209.249.86.4] client denied by server configuration: /home/ranvetau/public_html/pro_mix.htm

Will these eventually cease or will they continue? Is there a way to stop filling up the error log with these?
Kitka
Posts: 2513 Joined: 1/31/2002 From: Australia Status: offline
RE: Robot errors - 5/1/2006 19:49:05
You can't stop the errors being reported. That is what the error log is for, and it will report all errors, be they 404, 403, 500 and so on. However, you can reduce the errors a bit. You appear not to have a 403.shtml file, which is causing 404 errors on top of each denied request. In cPanel, click on the "Error pages" icon and create one from there. That will also add an important bit of code to your .htaccess file, which will help direct a 403 error to the right page.
dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
RE: Robot errors - 5/1/2006 22:55:17
Went to Error Pages, selected 403, selected IP address. This is now in .htaccess:

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from 209.249.86.

All okay?
Kitka
Posts: 2513 Joined: 1/31/2002 From: Australia Status: offline
RE: Robot errors - 5/1/2006 23:17:40
quote:
All okay?

Looks fine from here.

quote:
[Sun Apr 30 04:40:45 2006] [error] [client 209.249.86.4] File does not exist: /home/ranvetau/public_html/404.shtml

This line seems to indicate you don't have a 404.shtml page set up either. You can create one the same way as the 403. It is a good idea to put user-friendly text on there, because it is often seen by humans rather than bots. Always offer a way to find what they were looking for, like a link to your site map or index page and a search field.
dasher
Posts: 27 Joined: 9/17/2005 From: Australia Status: offline
RE: Robot errors - 5/2/2006 1:38:40
Thanks Kitka, you've been a great help. Des
jrouvier
Posts: 2 Joined: 6/8/2006 Status: offline
RE: Robot errors - 6/8/2006 1:01:42
Greetings,

I am the admin for the spider in question running from 209.249.86.4. There was a bug in the spider that caused it to send out requests without a user-agent. The eng team has been thoroughly admonished for their carelessness, and the spider now correctly identifies itself as "Charlotte/1.0b". It does honor robots.txt and will go away if you tell it to. If for some reason it doesn't, or you have some other issue, I can be reached at charlotte@betaspider.com. Sorry for any confusion.

-Joe Rouvier
jrouvier
Posts: 2 Joined: 6/8/2006 Status: offline
RE: Robot errors - 6/8/2006 2:36:55
Kitka,

Yeah, you caught the second user-agent error. Right after it was discovered that we weren't sending a user agent, I ran down the hall and gave the developers a little lesson in internet etiquette. In their rush to enter the user-agent, they misheard me. I wish I owned the domain spider.com. If I did, I could sell it and use the money to buy a house, or at least a nice car. Sadly, the domain is only betaspider.com, without the dot between beta and spider. Off the top of my head I'm not sure how long the incorrect user-agent was being sent for; in any event, it was too long :(

We are collecting pages for a search engine. It's like the other general-purpose engines such as Google, Yahoo, MSN, Ask... in that it helps people find websites relevant to what they are looking for. However, it differs in some very cool ways which, of course, I can't tell you about. I can tell you that we aren't harvesting email addresses, aren't going to re-brand and re-publish the content we crawl, or do any of those evil things.

As for it not honoring robots.txt, I know it's working for many people, but there very well could be a bug. If anyone can send me an example site that we are currently crawling that has a valid robots.txt, I'd very much like to hear about it. If there is a bug, it has to be fixed post-haste. It should be noted that we don't request robots.txt before every other request for a real page; we only grab it every 1000 documents or every hour, whichever comes first. So if you are expecting instant results, sorry, can't happen.

You make a good point about the email address. I hadn't considered that people wouldn't want to send email, but you are right, we should just give the URL. I'll open a bug on this. If I had to guess, I'd say that sometime next week, when I get the next version drop, we will be sending +http://www.betaspider.com/

Well, I don't expect you to unban the site, at least not yet. I can't prove to you that there is a value in having your site indexed right now, but once we launch I'm very sure you will unban the IPs. Might I make one suggestion though? The IPs we use may change at some point in the future. If you want to be really sure that we never hit your site again, send me your domain name(s) and I will add them to our exclusion list.

-Joe
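For anyone who would rather turn the spider away with robots.txt than with an IP ban, the Python sketch below checks how a rule set would apply to a crawler that identifies itself as "Charlotte/1.0b". It uses the standard library's `urllib.robotparser`, whose matching rules may differ from the spider's own. The robots.txt contents, the example.com URL and the helper name `allowed` are assumptions made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the Charlotte spider, allow everyone else.
ROBOTS_TXT = """\
User-agent: Charlotte
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """True when the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical page URL, used only to exercise the rules.
PAGE = "http://www.example.com/pro_mix.htm"

print(allowed("Charlotte/1.0b", PAGE))  # False
print(allowed("GoogleBot/2.1", PAGE))   # True
```

Because the spider reportedly re-reads robots.txt only every 1000 documents or every hour, a new rule like this would not take effect immediately.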