Crawler (爬虫) cronjob,自动刷新

Crawler Extension

The official documentation for the crawler extension is here: crawler documentation on typo3.org. The typical use of the crawler is to index pages that can then be found by the indexed_search extension. It is possible to index not only Typo3-pages but also external files like pdf, doc etc. and even external websites. The official documentation for indexed_search is here: indexed_search documentation on typo3.org.

[edit] Basic Crawler + External files setup

There are many people having problems setting up the crawler extension and indexer for external file indexing. This wiki-entry is an effort to possibly help you not have to spend too much time on setting up the crawler. Some of this may seem obvious to you, but it may not be for a newbie.

[edit] Installation of crawler, indexed_search and additional tools

The crawler-extension is used typically together with indexed_search. The crawler can do the indexing that is then used by indexed_search. The indexed_search extension comes with TYPO3, but not installed. So, in the backend click TOOLS >EXT MANAGER and change the Menu drop down to INSTALL EXTENSIONS. You should see the indexed_search extension in under the ‘Frontend Plugins’ section. Click the grey circle w/ + sign icon to install. This should add the indexed_search to the ‘Frontend Plugins’ section in the EXT MANAGER > LOADED EXTENSIONS page.

See indexed search wiki-page (Ext_indexed_search) and the documentation on typo3.org for more info.

The crawler extension must be downloaded from the TYPO3 Extension Repository (TER): crawler. Follow the TYPO3 instructions for installing from the TER.

To index external files you will need a few external tools that convert the various doc types (.pdf, .doc, .ppt, etc.) to text/html so they can be indexed. The necessary tools are:

pdftotext and pdfinfo
unzip
catdoc
xlhtml
ppthtml
unrtf

Be sure to install the proper version of these scripts for the server you are runnig on (linux/solaris/windows/whatever). An issue we had was that we were (unknowingly) trying to install a linux compiled pdftotext on a solaris server. It didn’t work! Fortunately we found that it was already installed previously.

[edit] Enable Indexing

Once installed and configured, the indexed search plugin indexes any pages that are cached, by default. If external files are linked on normal pages, such as links to pdf files, normal indexed search will index them provided you have setup the appropriate external tools. (Crawler is not needed to espressly index external files).

Pages that are not cached, such as pages with non-cached plugins (tt_products, etc.) or set to NO-CACHE are not indexed. If you want to add these types of pages to indexed search, you must use crawler (and here we are!)

To configure the Indexed Search extension, click the ‘Indexed Search Engine’ link in the EXTENSION MANAGER to view the plugins info/config page:

– We can disable the indexed_search’s normal indexing behavior, since it is unnecessary while using crawler. Select the “disableFrontendIndexing” box, and the useCrawlerForExternalFiles box

– Set the path to all of the external tools you installed in the previous step. Note be sure to only enter the PATH, from the root of the server. All my scripts are installed in /util/bin/ for example.

– SEE INDEXED SEARCH FOR MORE INFORMATION ABOUT CONFIG

When finished making changes to the settings, and click UPDATE.
Now switch to TEMPLATE module and choose to your site’s main template (page) and add these lines to the Setup TypoScript:

TS TypoScript:

config.index_enable = 1
config.index_externals = 1

“config.index_enable” enables indexing for the page and subpages unless a sub-template sets this to 0. “config.index_externals” enables indexing of files, like pdf or doc, depending on how you have set things up in previous steps.

Now indexing is enabled.

[edit] configure crawler

To configure crawler, navigate to the PAGE PROPERTIES of the root page. We now need to place some config in the TSConfig section to tell the crawler what to do when it crawls. Heres is an example:

#set up a crawl for users that arent logged in
tx_crawler.crawlerCfg.paramSets.test = 
tx_crawler.crawlerCfg.paramSets.test {
	cHash = 1
	procInstrFilter = tx_indexedsearch_reindex, tx_indexedsearch_crawler
	baseUrl = http://192.168.0.71/cms/
}

#set up a crawl for users who have group id of 1
tx_crawler.crawlerCfg.paramSets.grp1 < tx_crawler.crawlerCfg.paramSets.test
tx_crawler.crawlerCfg.paramSets.grp1{
	userGroups = 1
}

#set up a crawl for users who have group id of 2
tx_crawler.crawlerCfg.paramSets.grp2 < tx_crawler.crawlerCfg.paramSets.test
tx_crawler.crawlerCfg.paramSets.grp2{
	userGroups = 2
}

This sets up crawls for all the pages with no special vars after the ?id=# part of the url. You can do much fancier things, but you must read the documentation.

There are also a few more examples below.

[edit] Setting up the groups

Also, the crawler documentation says you should be able to specify a list of ids in the userGroups var. This is very misleading. It would seem intuitive to be able to specify all the groups in this one line, such as:

#set up a crawl for all users? NOPE. DOESN'T WORK!!!
tx_crawler.crawlerCfg.paramSets.grps < tx_crawler.crawlerCfg.paramSets.test
tx_crawler.crawlerCfg.paramSets.grps{
	userGroups = 1,2,3,4,5,6,7,8
}

One would assume that the crawler would index content for each group enabling any subset of those groups to search the content. This is not the case. One must specify a new tx_crawler.crawlerCfg.paramSets. object for each subset of groups. So if you have a user in groups 1,4,5 and another in groups 3,7,8 you would need to have this in your TSConfig (remember to add 0 and -2 i.e. any login!):

#set up a crawl for users who have group id of 1,4,5
tx_crawler.crawlerCfg.paramSets.grp1 < tx_crawler.crawlerCfg.paramSets.test
tx_crawler.crawlerCfg.paramSets.grp1{
	userGroups = 0,-2,1,4,5
}

#set up a crawl for users who have group id of 3,7,8
tx_crawler.crawlerCfg.paramSets.grp2 < tx_crawler.crawlerCfg.paramSets.test
tx_crawler.crawlerCfg.paramSets.grp2{
	userGroups = 0,-2,3,7,8
}

That should be all you have to do. See the troubleshooting section below in case it doesn’t work.

[edit] Making sure configuration works

Make sure there is a page on the site with a link to an external document. In the backend, click Web->Info from the left navigation pane then change the drop down box to ‘Site Crawler’. Click on the site’s root page if it wasn’t already selected and select ‘infinite’ from the 3rd dropdown box on the page. You should see the whole page hierarchy.

Click the ‘re-indexing’ item in the ‘Processing Instructions’ list box, then click the ‘Crawl URLs’. TYPO3 should generate a sort of fetch list of URLs (the crawler queue) which can be viewed by choosing ‘Crawler Log’ from the left most drop down. At this point, TYPO3 seems to know nothing about the external files contained in the pages.

Now we will test to make sure the external indexing is working. Scroll down in the page hierarchy and find a page which has a link to some external file. Notice the refresh button next to the qid number in each row (when your mouse is over it, you get a ‘Read’ tooltip). Click that refresh button on the row with your page. It will index that document and add any external files to the crawl queue. After your page is done indexing, there should be a new row under the page you indexed for each external file link in your page. On my typo install, the new rows are essentially blank, sporting only a new qid and a set_id of 0.

Now that you have some new rows, click the Read/reload button for the new rows to index them. Once indexed, all you will see is the addition of an ‘OK’ in the status column. The document will now be added to the Tools->Indexing->List:External documents page.

Finally, search for some words in your external documents!

[edit] Troubleshooting

[edit] troubles with external document indexing

I have read quite a few mailing list posts regarding external document indexing, and many of them had different fixes. The major problems were:

Long filenames in windows
Open_basedir doesnt have the tool directory. Placing this in the apache2.conf worked for me:

shell script:

  <Directory "/var/www/cms">
      php_admin_value open_basedir "/usr/bin/:/usr/share/:/var/www/"
  </Directory>

[edit] potential problems with URL-rewriting (realurl or cooluri)

The url’s the crawler visits need to be existent. You can see in the Web->Info module that the url’s that are in the crawler-joblist look like of http://www.servername.de/index.php?id=1000&L=2. It is necessary that the URLs in the queue are valid. If you are using the cooluri-extension (the same should apply to realurl?), instead of http://www.servername.de/index.php?id=1000&L=2 you get URLs like http://www.servername.de/en/economics/. The crawler needs the former variant to be functional as well. This can be achieved by setting

TS TypoScript:

config.redirectOldLinksToNew = 0

[edit] Running the crawler via the command line

So far, we have done some configuration that basically results in a joblist or “queue” (once you have submitted URL’s from the backend as described above) and that should enable indexing of external files. Now how to automate the process of actually executing those jobs? The trick is to set up a cronjob for the crawler. The prerequisite for setting up a cronjob is to be able to run the crawler via command-line.

As with the External file indexing, this can be a bit difficult to get running as well. There is the official documentation on the typo3.org, but it may not be sufficient.

[edit] Adding the user

Command line crawling needs a special TYPO3 backend user. So go into the backend and create a user named _cli_crawler. The user does not have to be in a group, can have access to nothing, and can have any password you like.

[edit] Making the CLI script executable

On applicable platforms, the CLI script must be executable by the backend TYPO3 crontab user (usually user is “_cli_crawler” and does not have to be part of any groups or have special permissions). Don’t forget that you’ll need to make the script executable (chmod +x crawler_cli.phpsh) each time you upgrade the crawler extension (tip: add a second crontab entry that keeps trying to set the correct permissions).

[edit] Getting the path right

The first error I had was an invalid path to the running script. This was due to one line that defines ‘PATH_thisScript’ in the script which I ended up hard coding. So vi the crawler_cli.phpsh script and edit the ‘PATH_thisScript’ definition to something like this:

shell script:

$curf = '/var/www/cms/typo3conf/ext/crawler/cli/crawler_cli.phpsh';
define('PATH_thisScript',$curf);

Note that this ‘PATH_thisScript’ variable MUST be set correctly as it is used in the init.php script, which is included in crawler_cli.phpsh!

Now try to run the script. The script would give errors if it was not run as root on my machine (mkdir permission errors).

BEFORE YOU START TO MODIFY THE SCRIPT – ANOTHER OPINION TO PATH TROUBLE: I had the same path trouble when trying to execute the file directly on the command line by navigating to the directory and entering “./crawler_cli.phpsh”. BUT after looking a little closer I saw that just by entering the full path to run the executable, everything works perfectly… “/var/www/html/typo3conf/ext/crawler/cli/crawler_cli.phpsh”. I don’t think you need to do any mods to the crawler_cli.phpsh file.

[edit] Troubleshooting

Post your problems/fixes here!

Crawler-Extension crawler_2.1.0.t3x comes with another Path-Problem: There’s a fix: http://bugs.typo3.org/view.php?id=9279 – Thanks to Daniel Poetzinger!

on SuSE-Linux you can add a cronjob by putting a shell-script into /etc/cron.hourly. You should not add there a file with crontab-syntax but a simple shell script:)

If you get status error messages in the crawler logs, check if fsockopen works on the server and whether it can access the url you are specifying. In Intranet’s this might be an issue. For the configuration variable “baseUrl” try “http://localhost/” or the IP address of the server instead of the public domain. Also check if the port is working (eg port 80).

[edit] check php.ini

If the crawler script exits without a error message you should check if you’ve included mysql.so in the responsible php.ini file:

 extension=mysql.so

On most systems there are diffrent php.ini files for the apache PHP module and the command line interface PHP! On my Debian Sarge the line extension=mysql.so was missing in /etc/php4/cli/php.ini but included in /etc/php4/apache2/php.ini! If mysql.so is missing crawler_cli.php will fail without any error message establishing the the database connection (crawler_cli.phpsh calls init.php and herein sql_pconnect fails).

There are other reasons that can make php.ini to abort when it is required from our cli script. Please, in case of an unexpected exit check your php.ini file, specially those lines executing an exit command. In my case, system variable $_SERVER[‘REMOTE_ADDR’] was not defined causing php.ini to abort.

[edit] Running the crawler via a cronjob

Now that you can run the crawler via command-line (that is: tell it to execute the jobs you have submitted to the queue that can be seen in the Info module), you can automate the process. When you want all your pages to reindex automatically, then you will have to set up a cronjob. This cronjob will run every minute, check if there are any pages to reindex (or other jobs) and do so.

[edit] Install cronjob via shell

First you have to connect to your server via SSH, so that you have a open shell. You can do this with Terminal (Mac) or Putty (Windows).

The following commands will install the cronjob:

crontab -l will list all installed cronjobs.

crontab -e will open the vi-Editor. Here you can edit the cronjobs.

Now press a to jump into the Edit-Mode.

Now insert a cronjob for your Crawler-Script. Example: * * * * * /homepages/35/u123456/htdocs/typo3page/typo3conf/ext/crawler/cli/crawl.sh

When you’re finished, press escape and then :wq. This will save your cronjob-table and exit the editor.

That’s it! Now the Crawler runs every minute and checks, if there are any pages to reindex. To set up which pages should be indexed, you need to fill the queue with jobs. There are several ways to do that (explained below). One possibility: You could go and set up [Indexing Configurations].

[edit] More Infos

Syntax of Cronjobs (only in german)

vi-Editor or just google…

[edit] Troubleshooting

[edit] if the crawler does not execute the jobs in the queue

Make sure that your Script can be executed. Test to execute it via the command line (Ext_crawler#Running_the_crawler_via_the_command_line).

Is the path to your Script correct?

[edit] How to fill the queue (IndexingConfiguration Method and Alternatives)

There are several ways to fill the queue:

Backend
script (combined with cronjob)
Indexing Configurations

[edit] Filling the queue in the backend

This has been described above. In short: Go to the Info -> Site Crawler Module. Choose “Start Crawling”. needs editing and a screenshot

[edit] build the joblist / queue via command line or cronjob

The command to submit url’s to the queue looks like this:

path/to/your/Typo3/installation/typo3/cli_dispatch.phpsh crawler_im 1049 -d 99 -proc tx_indexedsearch_reindex -n 1000 -o queue

the parameters have the following meaning:

crawler_im: use the tool to build the queue
1049: the page-ID from where the crawler should start to crawl the page tree
-d 99: how many levels of recursion
-proc tx_indexedsearch_reindex: this script is meant to reindex
-n 1000: how many entries per minute
-o queue: build the queue

This is an alternative to use IndexingConfigurations to fill the joblist. Submitting url’s can also be automated via cronjob (see example below) to include new pages.

[edit] Indexing Configurations (not yet written)

Prerequisite: The indexed_search extension installed and running.

needs editing and screenshots

See here http://typo3.org/documentation/document-library/extension-manuals/doc_indexed_search/2.10.0/view/1/4/ in the documentation of indexed_search

Questions: Where to put which of the ndexing Configurations. How to troubleshoot if the are not working?

[edit] Periodic indexing of the website (Page tree)

[edit] Periodic indexing of records (Database Records)

[edit] Indexing External websites (External URL)

Note! External URLs have to reside on sub-URLs to the URL that you specify as the starting point for the indexing configuration. I.e. if your starting point is http://www.somedomain.com/subdirectory1/ then links leading to http://www.somedomain.com/subdirectory2/ won’t get indexed even though they are directly linked to. This also applies to get-parameters. So specifying http://www.somedomain.com/?page=first will invalidate links to all pages that doesn’t start exactly with that specific URL for example http://www.somedomain.com/subdirectory/

[edit] Indexing directories of files (Filepath on server)

[edit] patch for the page tree Indexing Configuration?

problem: The page tree Indexing Configuration allows to crawl a page tree (or sub-tree) only upt to three levels down. A selection “infinitely” or “999” is missing. You could patch indexed_search, see http://bugs.typo3.org/view.php?id=7049

[edit] Clear the joblist in the queue

If you do not like that the joblist (the crawler queue) (the table crawler_queue) grows too large, you could do the following: Put a new script like this

#!/bin/bash
mysql -user=USERNAME -password=PASSWORD -database=DATABASENAME -e TRUNCATE TABLE tx_crawler_queue

somewhere where you can execute it. (replace USERNAME etc with the corresponding values for your Installation of Typo3). Check that you can really execute the script from the command line. (Is the crawler-log in the Info->Site Crawler module empty?). Then configure a cronjob that executes this script from time to time to empty the crawler log. For example: If you create the joblist once a week, you could run this script before you create the joblist.)

WARNING: This is a very simple “solution” to get rid of an ever-growing joblist. This script empties the entire queue, regardless whether a job is finished or still pending. A better solution would be if the crawler could be configured in an way that every job that is done is deleted from the queue, or if jobs that are finished and older than x days are deleted.

If you want to clear the entries older than 2 days try this:

#!/bin/bash
mysql -user=USERNAME -password=PASSWORD -database=DATABASENAME -e "DELETE FROM tx_crawler_queue WHERE UNIX_TIMESTAMP(DATE_SUB(CURDATE(),INTERVAL 2 DAY)) > exec_time AND exec_time > 0;"

[edit] Performance

Some people have reported that the crawler takes round about 15 seconds for crawling a URL. Since this is a bit long some profiling was done. The profiling data shows that the time accumulates while the PHP function fgets(), line 762 in method requestUrl, in file class.tx_crawler_lib.php is executed.

Profiling result of the crawler.

One possible issue could be that in line 761 the PHP function feof() is waiting for a timeout. On line 751 the header Connection: keep-alive was set which was causing the delay. If this line is commented out the processing of the queue works like a charm. We still need to observe if the timeout of 2 seconds which is set for fsockopen() is enough for processing the individual requests.

Profiling result of the crawler without the header keep-alive.

In the latest version from SVN the line with the keep-alive header was removed.

[edit] Implementing a where-clause

While you are hacking the code for performance improvements, you should go on and enter just to more lines in the code in order to get your _WHERE clauses parsed. This will allow you to restrict the selected records as you like, e.g. tt_news = &tx_ttnews[tt_news]=[_TABLE:tt_news;_PID:928;_WHERE: AND crdate < 12345678] Insert

 $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : ;

in line 387 and

 $where.

in line 395 and you got it.

[edit] crawler configurations and cronjobs: examples

Here are a few more examples of crawler configurations (for page tsconfig) and cronjobs.

[edit] example set of crawler configurations

Note on tt_new: If you are using tt_news as an USER_INT object, it is not cached and the configuration below won’t work. Make sure that the page with the single-view uses tt_news as an USER-object. For example, create an extension template with

TS TypoScript:

plugin.tt_news = USER

(in the case of the example below, this extension-template should be on page 18)

TS TypoScript:

# taken from the crawler-docu, p. 8
tx_crawler.crawlerCfg.paramSets {
  language = &L=[|_TABLE:pages_language_overlay;_FIELD:sys_language_uid]
  language.procInstrFilter =tx_indexedsearch_reindex, tx_indexedsearch_crawler
  language.baseUrl = http://www.servername.de/
}
 
#for tt_news from #http://typo3.toaster-schwerin.de/typo3_english/2006_05/msg00355.html
#_PID:7 is the sysfolder with the news records 
#pidsOnly = 18 is the page with the news singeview.
tx_crawler.crawlerCfg.paramSets {
  tt_news = &tx_ttnews[tt_news]=[_TABLE:tt_news;_PID:7]
  tt_news.procInstrFilter = tx_indexedsearch_reindex, tx_cachemgm_recache
  tt_news.cHash = 1
  tt_news.pidsOnly = 18
}
 
# for mininews 
# _PID:1246 is the sysfolder with the mininews records 
# pidsOnly = 1247 is the page with the mininews archive view.
tx_crawler.crawlerCfg.paramSets {
  mininews = &tx_mininews_pi1[showUid]=[_TABLE:tx_mininews_news;_PID:1246]
  mininews.procInstrFilter = tx_indexedsearch_reindex, tx_indexedsearch_crawler
  mininews.cHash = 1
  mininews.pidsOnly = 1247
  mininews.baseUrl = http://www.servername.de/
}

[edit] example set of cronjobs

shell script:

# empty the joblist (Thursday, 23.50 Uhr)
50 23 * * 4 /srv/www/htdocs/Truncate_crawler_queue
# do the jobs on the list (every minute)
* * * * * /srv/www/htdocs/typo3/cli_dispatch.phpsh crawler
# build the joblist (every Friday at midnight)
0 0 * * 5 php /srv/www/htdocs/typo3/cli_dispatch.phpsh crawler_im 1000 -d 99 -proc tx_indexedsearch_reindex -n 1000 -o queue

Index search

Crawler Extension

[edit] Basic Crawler + External files setup

[edit] Installation of crawler, indexed_search and additional tools

[edit] Enable Indexing

[edit] configure crawler

[edit] Setting up the groups

[edit] Making sure configuration works

[edit] Troubleshooting

[edit] troubles with external document indexing

[edit] potential problems with URL-rewriting (realurl or cooluri)

[edit] Running the crawler via the command line

[edit] Adding the user

[edit] Making the CLI script executable

[edit] Getting the path right

[edit] Troubleshooting

[edit] check php.ini

[edit] Running the crawler via a cronjob

[edit] Install cronjob via shell

[edit] More Infos

[edit] Troubleshooting

[edit] if the crawler does not execute the jobs in the queue

[edit] How to fill the queue (IndexingConfiguration Method and Alternatives)

[edit] Filling the queue in the backend

[edit] build the joblist / queue via command line or cronjob

[edit] Indexing Configurations (not yet written)

[edit] Periodic indexing of the website (Page tree)

[edit] Periodic indexing of records (Database Records)

[edit] Indexing External websites (External URL)

[edit] Indexing directories of files (Filepath on server)

[edit] patch for the page tree Indexing Configuration?

[edit] Clear the joblist in the queue

[edit] Performance

[edit] Implementing a where-clause

[edit] crawler configurations and cronjobs: examples

[edit] example set of crawler configurations

[edit] example set of cronjobs

相关链接：

发表评论 取消回复

[edit] Periodic indexing of the website (Page tree)

[edit] Periodic indexing of records (Database Records)

[edit] Indexing External websites (External URL)

[edit] Indexing directories of files (Filepath on server)

发表评论取消回复