OutFront Forums
Robots.txt Wildcards

 
 
griffen

 

Posts: 2
Joined: 11/5/2009
Status: offline

 
Robots.txt Wildcards - 11/5/2009 17:20:26   
I've been working through the robots.txt file and I think I'm none the wiser.

I've been reviewing Google Webmaster Tools and it shows that Google has discovered my PHP header and footer includes.

It's now going through all my folders looking for those particular files: /path/headerFile.inc...

I therefore want to block crawling of any path that contains a specific filename.

Can I do the following? Disallow: *headerFile.inc
womble

 

Posts: 6007
Joined: 3/14/2005
From: Living on the edge
Status: offline

 
RE: Robots.txt Wildcards - 11/7/2009 5:10:48   
Since you don't really want fragments of a page, such as includes, being indexed on their own, the easiest way is simply to block access to the directory containing your include files. For example, the robots.txt file I have for the site I'm currently working on is:

USER-AGENT: *
DISALLOW: /test/
DISALLOW: /scripts/
DISALLOW: /images/
DISALLOW: /styles/
DISALLOW: /includes/
DISALLOW: /cgi-bin/
DISALLOW: /files/


That way I don't have to block individual files, as anything in the includes directory for example is out of bounds. The pages though still get indexed, and all of them have include files for the header and footer, and for the navigation at the least. As far as the search engines are concerned though it's all just code on the page.

If you do need to block just one specific page for some reason, you can of course simply list that page in a Disallow line:

User-Agent: *
Disallow: /private.html


...or you can use the * wildcard to match any sequence of characters. For instance, to block access to any top-level directory whose name begins with 'private', you can use:

User-Agent: *
Disallow: /private*/


If you want to disallow any URL that includes a specific sequence of characters, you can also use the * wildcard:

User-Agent: Googlebot
Disallow: /*private=?


That would bar Googlebot from any URL containing the string private=? (the ? is matched literally; only * acts as a wildcard here), whether it's in a directory name or a filename. It doesn't matter where the string occurs in the URL. It may not work for all search engines though. Google and Yahoo support wildcards in robots.txt files, but I'm not sure whether every crawler does.
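To make the wildcard behaviour above concrete, here is a minimal Python sketch of how a Google-style rule could be matched against a URL path. It is not Googlebot's actual code. It assumes that * matches any run of characters, that a trailing $ pins the end of the URL, and that everything else (including ?) is literal. The function names and the sample paths are made up for illustration.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Google-style robots.txt path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and every other character is literal.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal chunk, then join the chunks with '.*' for each '*'.
    pattern = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    # re.match anchors at the start only, so an unanchored rule is a prefix match.
    return rule_to_regex(rule).match(path) is not None


# Hypothetical paths, checked against the two rules from the post above.
print(is_blocked("/private*/", "/private-files/report.html"))
print(is_blocked("/private*/", "/public/report.html"))
print(is_blocked("/*private=?", "/shop/list.php?private=?x"))
```

Crawlers that don't support wildcards would treat the * as an ordinary character, so a rule like this simply wouldn't match for them.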

_____________________________

~~ "A cruel god ain't no god at all" ~~
~~ Erase hate. Practice love. ~~

(in reply to griffen)
Smith.John

 

Posts: 2
Joined: 11/7/2009
Status: offline

 
RE: Robots.txt Wildcards - 11/7/2009 6:24:58   
Don't list anything in robots.txt except your admin directories.

(in reply to womble)
griffen

 

Posts: 2
Joined: 11/5/2009
Status: offline

 
RE: Robots.txt Wildcards - 11/7/2009 16:27:13   
The problem, though, is that Google's found my header and footer and is now going through every folder looking for headerFile.inc, which includes URL rewrite.

So if I've got a domain:

madeup.com

It will go through any folder

madeup.com/madeup2
madeup.com/madeup3

Looking for any header file

madeup.com/madeup2/headerFile.inc
madeup.com/madeup3/headerFile.inc

I basically want to tell Google to stop looking for headerFile.inc without having to list out every folder.

Is this possible?



< Message edited by Mojo -- 11/10/2009 10:35:52 >

(in reply to Smith.John)
womble

 

Posts: 6007
Joined: 3/14/2005
From: Living on the edge
Status: offline

 
RE: Robots.txt Wildcards - 11/8/2009 5:56:24   
You don't have to list every folder - just use the wildcard, as explained above.

User-Agent: Googlebot
Disallow: /*headerFile


...will bar Googlebot from any URL containing the string headerFile, whether it appears in a directory name at any level or in a filename. It doesn't matter where the string occurs in the URL. If you want to block all bots rather than just Googlebot, replace 'Googlebot' with the * wildcard.
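As a rough check of that rule against the folders griffen listed, the Python sketch below compares /*headerFile with a tighter variant, /*headerFile.inc$. The $ end anchor is a Google extension and is my addition, not something from the posts above. The matcher is a simplified stand-in for a real crawler, and the last two paths are hypothetical.

```python
import re


def blocked(rule: str, path: str) -> bool:
    """Simplified Google-style match: '*' is a wildcard, trailing '$' pins the end."""
    end = rule.endswith("$")
    parts = (rule[:-1] if end else rule).split("*")
    regex = ".*".join(map(re.escape, parts)) + ("$" if end else "")
    return re.match(regex, path) is not None


paths = [
    "/madeup2/headerFile.inc",
    "/madeup3/headerFile.inc",
    "/madeup2/headerFile-notes.html",  # hypothetical page, not from the thread
    "/madeup2/index.php",              # hypothetical page, not from the thread
]

# Show which paths each rule would block.
for rule in ("/*headerFile", "/*headerFile.inc$"):
    print(rule, [p for p in paths if blocked(rule, p)])
```

Under these assumptions the broad rule also catches any other URL that happens to contain headerFile, while the anchored one only catches URLs ending in headerFile.inc.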

For ease of maintenance, though, it's simpler to keep all include files in one directory, and you really shouldn't need multiple copies of the same include file in different locations. The idea of an include file is that shared content lives in one file that each page includes, so a change made in that one place shows up on every page that includes it. The exception is when the contents differ, in which case the files really need different names; otherwise you risk accidentally overwriting one with a different file of the same name.

Where the includes directory sits in the directory hierarchy doesn't matter, because you can reference the file relatively from each page that includes it. If the includes directory is a sub-directory of the directory your page is in, the relative URL will be:

includes/headerFile.inc


If the includes directory is one level up (i.e. it sits in the parent of the directory containing the page), the relative URL will be:

../includes/headerFile.inc


...and so on.
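If the relative paths are hard to picture, this small Python sketch shows the path arithmetic. The /madeup2 directory is borrowed from griffen's example, and resolve is a hypothetical helper. It only illustrates how the two relative forms above resolve; it says nothing about how PHP itself locates include files.

```python
import posixpath


def resolve(page_dir: str, relative: str) -> str:
    """Resolve a relative include path against the directory holding the page."""
    return posixpath.normpath(posixpath.join(page_dir, relative))


# Includes directory inside the page's own directory.
print(resolve("/madeup2", "includes/headerFile.inc"))
# Includes directory one level up, next to the page's directory.
print(resolve("/madeup2", "../includes/headerFile.inc"))
```

Either way there is only one includes directory to list in robots.txt.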

Bots will automatically try to crawl all your directories, but if you have a valid robots.txt file (test it at one of the various robots.txt testing sites online to make sure it's valid) in the right location (i.e. the root of your domain), all well-behaved bots will ignore the directories it tells them to ignore.
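Besides the online testers, you can sanity-check simple directory rules locally. Here is a minimal sketch using Python's standard urllib.robotparser module. The rules are a shortened copy of my file above, and the madeup.com URLs are placeholders taken from griffen's example. I've kept it to plain directory rules because wildcard handling varies between parsers.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the directory rules shown earlier in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs: one inside a blocked directory, one ordinary page.
for url in ("http://madeup.com/includes/headerFile.inc",
            "http://madeup.com/madeup2/index.php"):
    print(url, parser.can_fetch("Googlebot", url))
```

can_fetch returns False for anything under a disallowed directory and True otherwise, which is a quick way to confirm the file says what you meant.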


(in reply to griffen)