Feature #14798: Robots.txt and indexed search - TYPO3 Core - TYPO3 Forge

Feature #14798

closed

Robots.txt and indexed search

Added by Jody Cleveland about 19 years ago. Updated almost 8 years ago.

Status:

Rejected

Priority:

Should have

Assignee:

Category:

Indexed Search

Target version:

8 LTS

Start date:

2005-06-06

Due date:

% Done:

Estimated time:

PHP Version:

5.5

Tags:

Complexity:

Sprint Focus:

Description

I have a handful of PDF files that I don't want indexed. To keep Google
and other search engines away from them, I list them in a robots.txt file.

My feature request is for TYPO3's indexed search to honor robots.txt and
skip indexing the files listed there.

(issue imported from #M1170)

Updated by Michael Stucki about 19 years ago

Are you talking of internal or external files?

Let's say your site is http://www.mysite.com/mysitedir/ and you have several links, which of them do you think should be checked against Robots.txt?

http://www.mysite.com/mysitedir/fileadmin/test.pdf
http://www.mysite.com/mysitedir/Intro.5.0.html
http://www.mysite.com/anothersitedir/fileadmin/test.pdf
http://www.mysite.com/anothersitedir/Introl.5.0.html
http://www.anothersite.com/mysitedir/fileadmin/test.pdf
http://www.anothersite.com/mysitedir/Intro.5.0.html

Updated by Jody Cleveland about 19 years ago

I would think anything listed in robots, as long as it was within the site. More like this one:

http://www.mysite.com/mysitedir/fileadmin/test.pdf
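
One possible reading of "within the site" is: same scheme and host, and a path below the site's base directory. A minimal Python sketch of that filter, applied to the six example URLs from the previous comment. The helper name `is_within_site` and the choice of `SITE_BASE` are assumptions made for illustration; nothing here is part of Indexed Search.

```python
from urllib.parse import urlsplit

def is_within_site(url: str, site_base: str) -> bool:
    """Return True if url is on the same host and below the base path of site_base."""
    target, base = urlsplit(url), urlsplit(site_base)
    # Make sure the base path ends with "/" so "/mysitedir" does not match "/mysitedir2".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    return (
        target.scheme == base.scheme
        and target.netloc.lower() == base.netloc.lower()
        and target.path.startswith(base_path)
    )

SITE_BASE = "http://www.mysite.com/mysitedir/"  # assumed site root for this example
CANDIDATES = [
    "http://www.mysite.com/mysitedir/fileadmin/test.pdf",
    "http://www.mysite.com/mysitedir/Intro.5.0.html",
    "http://www.mysite.com/anothersitedir/fileadmin/test.pdf",
    "http://www.mysite.com/anothersitedir/Introl.5.0.html",
    "http://www.anothersite.com/mysitedir/fileadmin/test.pdf",
    "http://www.anothersite.com/mysitedir/Intro.5.0.html",
]

# Only URLs below the site's own directory would be checked against robots.txt.
in_site = [u for u in CANDIDATES if is_within_site(u, SITE_BASE)]
```

Under this reading only the two URLs below /mysitedir/ on www.mysite.com qualify; a looser rule (same host only) would also include the /anothersitedir/ links.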

Updated by Michael Stucki about 19 years ago

I will implement this if you can do some research for me about Robots.txt:

- Is there an RFC?
- Where is the file expected to be found: Only in / or in any directory of the rootline?
- Does it accept regular expressions or only plain strings?
- Are there any other special formatting rules?

Updated by Jody Cleveland almost 19 years ago

Here's the RFC:

3.3 Formal Syntax

This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives.  Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.

    robotstxt    = *blankcomment
                 | *blankcomment record *( 1*commentblank 1*record )
                   *blankcomment
    blankcomment = 1*(blank | commentline)
    commentblank = *commentline blank *(blankcomment)
    blank        = *space CRLF
    CRLF         = CR LF
    record       = *commentline agentline *(commentline | agentline)
                   1*ruleline *(commentline | ruleline)
    agentline    = "User-agent:" *space agent  [comment] CRLF
    ruleline     = (disallowline | allowline | extension)
    disallowline = "Disallow" ":" *space path [comment] CRLF
    allowline    = "Allow" ":" *space rpath [comment] CRLF
    extension    = token : *space value [comment] CRLF
    value        = <any CHAR except CR or LF or "#">

    commentline  = comment CRLF
    comment      = *blank "#" anychar
    space        = 1*(SP | HT)
    rpath        = "/" path
    agent        = token
    anychar      = <any CHAR except CR or LF>
    CHAR         = <any US-ASCII character (octets 0 - 127)>
    CTL          = <any US-ASCII control character
                        (octets 0 - 31) and DEL (127)>
    CR           = <US-ASCII CR, carriage return (13)>
    LF           = <US-ASCII LF, linefeed (10)>
    SP           = <US-ASCII SP, space (32)>
    HT           = <US-ASCII HT, horizontal-tab (9)>

The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:

    token        = 1*<any CHAR except CTLs or tspecials>

    tspecials    = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:

    path        = fsegment *( "/" segment )
    fsegment    = 1*pchar
    segment     =  *pchar

    pchar       = uchar | ":" | "@" | "&" | "="
    uchar       = unreserved | escape
    unreserved  = alpha | digit | safe | extra

    escape      = "%" hex hex
    hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                          "a" | "b" | "c" | "d" | "e" | "f"

    alpha       = lowalpha | hialpha

    lowalpha    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    hialpha     = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

    digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                  "8" | "9"

    safe        = "$" | "-" | "_" | "." | "+"
    extra       = "!" | "*" | "'" | "(" | ")" | ","
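
To make the grammar more concrete, here is a rough Python sketch that splits a robots.txt text into records, each consisting of one or more User-agent lines followed by Allow/Disallow rule lines. It is a simplification, not a full implementation of the grammar: comments are stripped at the first "#", extension fields are ignored, and a new record is assumed to start at the first User-agent line that follows a rule line. The names `parse_records` and `SAMPLE` are made up for this example.

```python
from typing import List, Tuple

Rule = Tuple[str, str]                  # ("allow" | "disallow", path)
Record = Tuple[List[str], List[Rule]]   # (agent names, ordered rules)

def parse_records(text: str) -> List[Record]:
    """Split robots.txt text into records of agent lines followed by rule lines."""
    records: List[Record] = []
    agents: List[str] = []
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # an agent line after rule lines starts a new record
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
        # Any other field is an "extension" in the grammar and is ignored here.
    if agents:
        records.append((agents, rules))
    return records

SAMPLE = """\
# sample file
User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html   # keep plans private
Allow: /org/
"""

records = parse_records(SAMPLE)
```

Rule order is preserved within each record, which matters because the draft evaluates rules in the order they appear.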

I believe the robots.txt file needs to be in the root of the site. The draft's example section puts it there:

This section contains an example of how a /robots.txt may be used.

A fictional site may have the following URLs:

    http://www.fict.org/
    http://www.fict.org/index.html
    http://www.fict.org/robots.txt
    http://www.fict.org/server.html
    http://www.fict.org/services/fast.html
    http://www.fict.org/services/slow.html
    http://www.fict.org/orgo.gif
    http://www.fict.org/org/about.html
    http://www.fict.org/org/plans.html
    http://www.fict.org/%7Ejim/jim.html
    http://www.fict.org/%7Emak/mak.html

The site may in the /robots.txt have specific rules for robots that
send a HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:

    # /robots.txt for http://www.fict.org/
    # comments to webmaster@fict.org

    User-agent: unhipbot
    Disallow: /

    User-agent: webcrawler
    User-agent: excite
    Disallow:

    User-agent: *
    Disallow: /org/plans.html
    Allow: /org/
    Allow: /serv
    Allow: /~mak
    Disallow: /

The following matrix shows which robots are allowed to access URLs:

    URL                                      unhipbot   webcrawler & excite   other
    http://www.fict.org/                     No         Yes                   No
    http://www.fict.org/index.html           No         Yes                   No
    http://www.fict.org/robots.txt           Yes        Yes                   Yes
    http://www.fict.org/server.html          No         Yes                   Yes
    http://www.fict.org/services/fast.html   No         Yes                   Yes
    http://www.fict.org/services/slow.html   No         Yes                   Yes
    http://www.fict.org/orgo.gif             No         Yes                   No
    http://www.fict.org/org/about.html       No         Yes                   Yes
    http://www.fict.org/org/plans.html       No         Yes                   No
    http://www.fict.org/%7Ejim/jim.html      No         Yes                   No
    http://www.fict.org/%7Emak/mak.html      No         Yes                   Yes
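
The matrix can be reproduced with a short first-match evaluator. The Python sketch below hard-codes the rules from the example file and assumes the evaluation order described in the same draft: rules are tried in order, the first matching path prefix decides, no match means allowed, and /robots.txt itself is always allowed. The names `RECORDS` and `is_allowed` are invented for this example, and the agent is passed as a bare name token rather than a full User-agent header.

```python
from urllib.parse import unquote, urlsplit

# Rules from the example file: agent name -> ordered (is_allow, path prefix) pairs.
RECORDS = {
    "unhipbot": [(False, "/")],
    "webcrawler": [(False, "")],
    "excite": [(False, "")],
    "*": [
        (False, "/org/plans.html"),
        (True, "/org/"),
        (True, "/serv"),
        (True, "/~mak"),
        (False, "/"),
    ],
}

def is_allowed(agent: str, url: str) -> bool:
    """First-match evaluation of the rules that apply to the given agent name."""
    # Decode %xx escapes so "/%7Emak" and "/~mak" compare equal.
    # Simplification: the draft treats an encoded "/" (%2F) specially; this does not.
    path = unquote(urlsplit(url).path) or "/"
    if path == "/robots.txt":
        return True  # the robots.txt file itself is always fetchable
    rules = RECORDS.get(agent.lower(), RECORDS["*"])
    for allow, prefix in rules:
        # An empty Disallow value matches nothing, so it restricts nothing.
        if prefix and path.startswith(unquote(prefix)):
            return allow
    return True  # no rule matched: access is allowed by default

print(is_allowed("somebot", "http://www.fict.org/org/about.html"))
print(is_allowed("somebot", "http://www.fict.org/orgo.gif"))
```

An indexer that honored robots.txt would run a check like this for each in-site file before indexing it, most likely using the default ("*") record unless it announces its own agent name.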

I took all this from this document:
http://www.robotstxt.org/wc/norobots-rfc.html

I hope that helps, and I really appreciate you looking into this. If there's anything else you need, let me know.

Updated by Mathias Schreiber over 9 years ago

Description updated (diff)
Status changed from New to Accepted
Target version changed from 0 to 7.0
PHP Version set to 5.5

Updated by Mathias Schreiber over 9 years ago

Target version changed from 7.0 to 7.1 (Cleanup)

Updated by Benni Mack about 9 years ago

Target version changed from 7.1 (Cleanup) to 7.4 (Backend)

Updated by Susanne Moog almost 9 years ago

Target version changed from 7.4 (Backend) to 7.5

Updated by Benni Mack almost 9 years ago

Target version changed from 7.5 to 8 LTS

Updated by Tymoteusz Motylewski almost 8 years ago

Status changed from Accepted to Rejected

I would not tie the Indexed Search indexer to the robots.txt file.
I see many use cases where you don't want Google to index some files or pages, but you do want them to show up in the Indexed Search results.

In my opinion the problem should be solved in a different way, e.g. by adding a "no index" flag to the FAL record representing the PDF file.
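
As a rough illustration of the flag-based idea, the Python sketch below filters files by a per-record flag instead of consulting robots.txt. Everything in it is hypothetical: `FileRecord`, its `no_index` field, and `files_to_index` are stand-ins for illustration only and do not reflect the actual FAL schema or any TYPO3 API (TYPO3 itself is written in PHP).

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class FileRecord:
    """Hypothetical stand-in for a file record; not the real FAL schema."""
    identifier: str
    no_index: bool = False  # assumed editor-controlled "no index" flag

def files_to_index(files: Iterable[FileRecord]) -> List[str]:
    """Return identifiers of the files the internal indexer should process."""
    return [f.identifier for f in files if not f.no_index]

files = [
    FileRecord("/fileadmin/test.pdf", no_index=True),
    FileRecord("/fileadmin/manual.pdf"),
]
indexable = files_to_index(files)
```

With such a flag, what external crawlers may fetch (robots.txt) and what the internal search indexes stay independent decisions.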
