Feature #14798

closed

Robots.txt and indexed search

Added by Jody Cleveland about 19 years ago. Updated over 7 years ago.

Status:
Rejected
Priority:
Should have
Assignee:
-
Category:
Indexed Search
Target version:
Start date:
2005-06-06
Due date:
% Done:

0%

Estimated time:
PHP Version:
5.5
Tags:
Complexity:
Sprint Focus:

Description

I have a handful of PDF files that I don't want indexed. To keep
Google and other search engines from indexing them, I list them in a
robots.txt file.

My requested feature is for TYPO3 to honor robots.txt files and
skip indexing the files listed there.

(issue imported from #M1170)

Actions #2

Updated by Jody Cleveland about 19 years ago

I would think anything listed in robots.txt, as long as it is within the site. Something like this one:

http://www.mysite.com/mysitedir/fileadmin/test.pdf

Actions #3

Updated by Michael Stucki almost 19 years ago

I will implement this if you can do some research for me about Robots.txt:

- Is there an RFC?
- Where is the file expected to be found: Only in / or in any directory of the rootline?
- Does it accept regular expressions or only plain strings?
- Any other special formatting?

Actions #4

Updated by Jody Cleveland over 18 years ago

Here's the RFC:

3.3 Formal Syntax

This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives. Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.
robotstxt    = *blankcomment
             | *blankcomment record *( 1*commentblank 1*record )
               *blankcomment
blankcomment = 1*(blank | commentline)
commentblank = *commentline blank *(blankcomment)
blank        = *space CRLF
CRLF         = CR LF
record       = *commentline agentline *(commentline | agentline)
               1*ruleline *(commentline | ruleline)
agentline    = "User-agent:" *space agent  [comment] CRLF
ruleline     = (disallowline | allowline | extension)
disallowline = "Disallow" ":" *space path [comment] CRLF
allowline    = "Allow" ":" *space rpath [comment] CRLF
extension    = token : *space value [comment] CRLF
value        = <any CHAR except CR or LF or "#">
commentline  = comment CRLF
comment      = *blank "#" anychar
space        = 1*(SP | HT)
rpath        = "/" path
agent        = token
anychar      = <any CHAR except CR or LF>
CHAR         = <any US-ASCII character (octets 0 - 127)>
CTL          = <any US-ASCII control character
               (octets 0 - 31) and DEL (127)>
CR           = <US-ASCII CR, carriage return (13)>
LF           = <US-ASCII LF, linefeed (10)>
SP           = <US-ASCII SP, space (32)>
HT           = <US-ASCII HT, horizontal-tab (9)>
The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:
token        = 1*<any CHAR except CTLs or tspecials>
tspecials    = "(" | ")" | "<" | ">" | "@" 
             | "," | ";" | ":" | "\" | <">
             | "/" | "[" | "]" | "?" | "=" 
             | "{" | "}" | SP | HT
The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:
path        = fsegment *( "/" segment )
fsegment    = 1*pchar
segment     = *pchar
pchar       = uchar | ":" | "@" | "&" | "="
uchar       = unreserved | escape
unreserved  = alpha | digit | safe | extra
escape      = "%" hex hex
hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
              "a" | "b" | "c" | "d" | "e" | "f"
alpha       = lowalpha | hialpha
lowalpha    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
              "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
              "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha     = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
              "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
              "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
              "8" | "9"
safe        = "$" | "-" | "_" | "." | "+"
extra       = "!" | "*" | "'" | "(" | ")" | ","

I believe the robots.txt file needs to be in the root of the site:

This section contains an example of how a /robots.txt may be used.

A fictional site may have the following URLs:
http://www.fict.org/
http://www.fict.org/index.html
http://www.fict.org/robots.txt
http://www.fict.org/server.html
http://www.fict.org/services/fast.html
http://www.fict.org/services/slow.html
http://www.fict.org/orgo.gif
http://www.fict.org/org/about.html
http://www.fict.org/org/plans.html
http://www.fict.org/%7Ejim/jim.html
http://www.fict.org/%7Emak/mak.html
The site may in the /robots.txt have specific rules for robots that
send a HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:
# /robots.txt for http://www.fict.org/
# comments to

User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /

The following matrix shows which robots are allowed to access URLs:
                                         unhipbot  webcrawler  other
                                                   & excite
http://www.fict.org/                     No        Yes         No
http://www.fict.org/index.html           No        Yes         No
http://www.fict.org/robots.txt           Yes       Yes         Yes
http://www.fict.org/server.html          No        Yes         Yes
http://www.fict.org/services/fast.html   No        Yes         Yes
http://www.fict.org/services/slow.html   No        Yes         Yes
http://www.fict.org/orgo.gif             No        Yes         No
http://www.fict.org/org/about.html       No        Yes         Yes
http://www.fict.org/org/plans.html       No        Yes         No
http://www.fict.org/%7Ejim/jim.html      No        Yes         No
http://www.fict.org/%7Emak/mak.html      No        Yes         Yes
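
The matrix follows from first-match evaluation: a robot uses the record that names it (otherwise the "*" record), and the first Allow or Disallow line whose path is a prefix of the requested path decides. Below is a minimal Python sketch under that reading of the draft. The names parse_records and is_allowed are illustrative only and not part of TYPO3 or any library, and the rule that /robots.txt itself is always fetchable is taken from the matrix.

```python
from urllib.parse import unquote, urlsplit

ROBOTS_TXT = """\
# /robots.txt for http://www.fict.org/

User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
"""


def parse_records(text):
    """Return a list of (agents, rules); each rule is (allowed, path_prefix)."""
    records, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # an agent line after rule lines starts a new record
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            # Percent-escapes are decoded so /%7Emak and /~mak compare equal.
            rules.append((field == "allow", unquote(value)))
    if agents:
        records.append((agents, rules))
    return records


def is_allowed(records, user_agent, url):
    """Apply first-match evaluation for one robot and one URL."""
    path = unquote(urlsplit(url).path) or "/"
    if path == "/robots.txt":
        return True  # the robots.txt file itself is always fetchable
    ua = user_agent.lower()
    # Prefer a record naming this robot; fall back to the "*" record.
    chosen = next(
        (r for a, r in records if any(n != "*" and n in ua for n in a)), None
    )
    if chosen is None:
        chosen = next((r for a, r in records if "*" in a), [])
    for allowed, prefix in chosen:
        if prefix and path.startswith(prefix):  # empty Disallow matches nothing
            return allowed  # first matching rule wins
    return True  # no rule matched: access is allowed


records = parse_records(ROBOTS_TXT)
print(is_allowed(records, "OtherBot/1.0", "http://www.fict.org/orgo.gif"))
print(is_allowed(records, "OtherBot/1.0", "http://www.fict.org/%7Emak/mak.html"))
```

The two calls correspond to the "other" column for /orgo.gif (No) and /%7Emak/mak.html (Yes). An indexer that honored robots.txt could apply the same check to each file path before indexing it.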

I took all this from this document:
http://www.robotstxt.org/wc/norobots-rfc.html

I hope that helps, and I really appreciate you looking into this. If there's anything else you need, let me know.

Actions #5

Updated by Mathias Schreiber over 9 years ago

  • Description updated (diff)
  • Status changed from New to Accepted
  • Target version changed from 0 to 7.0
  • PHP Version set to 5.5
Actions #6

Updated by Mathias Schreiber over 9 years ago

  • Target version changed from 7.0 to 7.1 (Cleanup)
Actions #7

Updated by Benni Mack about 9 years ago

  • Target version changed from 7.1 (Cleanup) to 7.4 (Backend)
Actions #8

Updated by Susanne Moog almost 9 years ago

  • Target version changed from 7.4 (Backend) to 7.5
Actions #9

Updated by Benni Mack over 8 years ago

  • Target version changed from 7.5 to 8 LTS
Actions #10

Updated by Tymoteusz Motylewski over 7 years ago

  • Status changed from Accepted to Rejected

I would not tie the indexed search indexer to the robots.txt file.
I see many use cases where you don't want Google to index some files/pages, but you do want them to show up in the indexed search results.

In my opinion the problem should be solved in a different way, e.g. by adding a "no index" flag to the FAL record representing the PDF file.
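
A minimal sketch of that alternative, written in Python for brevity. The FileRecord class, its no_index field, and the file paths are hypothetical stand-ins for a FAL record and such a flag. They are not an existing TYPO3 API.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a FAL file record."""
    identifier: str
    no_index: bool = False  # assumed per-file flag, set by an editor


def files_to_index(records):
    """Return the identifiers the internal indexer should process."""
    return [r.identifier for r in records if not r.no_index]


records = [
    FileRecord("/fileadmin/test.pdf", no_index=True),
    FileRecord("/fileadmin/manual.pdf"),
]
print(files_to_index(records))  # prints ['/fileadmin/manual.pdf']
```

The flag is independent of robots.txt, so a file can stay hidden from external crawlers and still appear in the site's own search results, or the other way around.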
