<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY iuml "ï">
]>

<book lang="en">

<title>Sphinx 2.0.1-beta reference manual</title>
<subtitle>Free open-source SQL full-text search engine</subtitle>

<bookinfo>
<copyright>
<year>2001-2011</year>
<holder>Andrew Aksyonoff</holder>
</copyright>
<copyright>
<year>2008-2011</year>
<holder>Sphinx Technologies Inc, <ulink
url="http://sphinxsearch.com">http://sphinxsearch.com</ulink></holder>
</copyright>
</bookinfo>


<chapter id="intro"><title>Introduction</title>


<sect1 id="about"><title>About</title>
<para>
Sphinx is a full-text search engine, publicly distributed under GPL version 2.
Commercial licensing (eg. for embedded use) is available upon request.
</para>
<para>
Technically, Sphinx is a standalone software package that provides
fast and relevant full-text search functionality to client applications.
It was specially designed to integrate well with SQL databases storing
the data, and to be easily accessed by scripting languages. However, Sphinx
does not depend on nor require any specific database to function.
</para>
<para>
Applications can access the Sphinx search daemon (searchd) using any of
three different access methods: a) via the native search API (SphinxAPI),
b) via Sphinx's own implementation of the MySQL network protocol (using a small
SQL subset called SphinxQL), or c) via MySQL server with a pluggable
storage engine (SphinxSE).
</para>
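<para>
For illustration, here is a minimal SphinxQL session sketch against a local
searchd listening on the conventional SphinxQL port 9306; the index name
<filename>test1</filename> is an assumption borrowed from the sample configuration:
</para>
<programlisting>
$ mysql -h 127.0.0.1 -P 9306
mysql> SELECT * FROM test1 WHERE MATCH('my query') LIMIT 10;
</programlisting>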
<para>
Official native SphinxAPI implementations for PHP, Perl, Ruby, and Java
are included within the distribution package. The API is very lightweight,
so porting it to a new language is known to take a few hours or days.
Third party API ports and plugins exist for Perl, C#, Haskell,
Ruby-on-Rails, and possibly other languages and frameworks.
</para>
<para>
Starting with version 1.10-beta, Sphinx supports two different indexing
backends: the "disk" index backend, and the "realtime" (RT) index backend.
Disk indexes support online full-text index rebuilds, but online updates
can only be done on non-text (attribute) data. RT indexes additionally
allow for online full-text index updates. Previous versions only
supported disk indexes.
</para>
<para>
Data can be loaded into disk indexes using a so-called data source.
Built-in sources can fetch data directly from MySQL, PostgreSQL, ODBC
compliant databases (MS SQL, Oracle, etc), or a pipe in a custom XML format.
Adding new data source drivers (eg. to natively support other DBMSes)
is designed to be as easy as possible. RT indexes, as of 1.10-beta,
can only be populated using SphinxQL.
</para>
<para>
As for the name, Sphinx is an acronym which is officially decoded
as SQL Phrase Index. Yes, I know about CMU's Sphinx project.
</para>
</sect1>


<sect1 id="features"><title>Sphinx features</title>
<para>
Key Sphinx features are:
<itemizedlist>
<listitem><para>high indexing and searching performance;</para></listitem>
<listitem><para>advanced indexing and querying tools (flexible and feature-rich text tokenizer, querying language, several different ranking modes, etc);</para></listitem>
<listitem><para>advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY etc over text search results);</para></listitem>
<listitem><para>proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;</para></listitem>
<listitem><para>easy integration with SQL and XML data sources, and SphinxAPI, SphinxQL, or SphinxSE search interfaces;</para></listitem>
<listitem><para>easy scaling with distributed searches.</para></listitem>
</itemizedlist>
To expand a bit, Sphinx:
<itemizedlist>
<listitem><para>has high indexing speed (up to 10-15 MB/sec per core on an internal benchmark);</para></listitem>
<listitem><para>has high search speed (up to 150-250 queries/sec per core against 1,000,000 documents, 1.2 GB of data on an internal benchmark);</para></listitem>
<listitem><para>has high scalability (the biggest known cluster indexes over 3,000,000,000 documents, and the busiest one peaks over 50,000,000 queries/day);</para></listitem>
<listitem><para>provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;</para></listitem>
<listitem><para>provides distributed searching capabilities;</para></listitem>
<listitem><para>provides document excerpts (snippets) generation;</para></listitem>
<listitem><para>provides searching from within the application with SphinxAPI or SphinxQL interfaces, and from within MySQL with the pluggable SphinxSE storage engine;</para></listitem>
<listitem><para>supports boolean, phrase, word proximity and other types of queries;</para></listitem>
<listitem><para>supports multiple full-text fields per document (up to 32 by default);</para></listitem>
<listitem><para>supports multiple additional attributes per document (ie. groups, timestamps, etc);</para></listitem>
<listitem><para>supports stopwords;</para></listitem>
<listitem><para>supports morphological word forms dictionaries;</para></listitem>
<listitem><para>supports tokenizing exceptions;</para></listitem>
<listitem><para>supports both single-byte encodings and UTF-8;</para></listitem>
<listitem><para>supports stemming (stemmers for English, Russian and Czech are built-in; stemmers for
French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish, and Hungarian
are available by building the third party <ulink url="http://snowball.tartarus.org/">libstemmer library</ulink>);</para></listitem>
<listitem><para>supports MySQL natively (all types of tables, including MyISAM, InnoDB, NDB, Archive, etc are supported);</para></listitem>
<listitem><para>supports PostgreSQL natively;</para></listitem>
<listitem><para>supports ODBC compliant databases (MS SQL, Oracle, etc) natively;</para></listitem>
<listitem><para>...and has 50+ other features not listed here; refer to the API and configuration manual!</para></listitem>
</itemizedlist>
</para>
</sect1>


<sect1 id="getting"><title>Where to get Sphinx</title>
<para>Sphinx is available through its official Web site at <ulink url="http://sphinxsearch.com/">http://sphinxsearch.com/</ulink>.
</para>
<para>Currently, the Sphinx distribution tarball includes the following software:
<itemizedlist>
<listitem><para><filename>indexer</filename>: a utility which creates fulltext indexes;</para></listitem>
<listitem><para><filename>search</filename>: a simple command-line (CLI) test utility which searches through fulltext indexes;</para></listitem>
<listitem><para><filename>searchd</filename>: a daemon which enables external software (eg. Web applications) to search through fulltext indexes;</para></listitem>
<listitem><para><filename>sphinxapi</filename>: a set of searchd client API libraries for popular Web scripting languages (PHP, Python, Perl, Ruby);</para></listitem>
<listitem><para><filename>spelldump</filename>: a simple command-line tool to extract the items from an <filename>ispell</filename> or <filename>MySpell</filename>
(as bundled with OpenOffice) format dictionary to help customize your index, for use with <link linkend="conf-wordforms">wordforms</link>;</para></listitem>
<listitem><para><filename>indextool</filename>: a utility to dump miscellaneous debug information about the index, added in version 0.9.9-rc2.</para></listitem>
</itemizedlist>
</para>
</sect1>


<sect1 id="license"><title>License</title>
<para>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License,
or (at your option) any later version. See the COPYING file for details.
</para>
<para>
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
</para>
<para>
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
</para>
<para>
Non-GPL licensing (for OEM/ISV embedded use) can also be arranged; please
<ulink url="http://sphinxsearch.com/contacts.html">contact us</ulink> to discuss
commercial licensing possibilities.
</para>
</sect1>


<sect1 id="credits"><title>Credits</title>
<bridgehead>Author</bridgehead>
<para>
Sphinx initial author (and a benevolent dictator ever since):
<itemizedlist>
<listitem><para>Andrew Aksyonoff, <ulink url="http://shodan.ru">http://shodan.ru</ulink></para></listitem>
</itemizedlist>
</para>
<bridgehead>Team</bridgehead>
<para>
Past and present employees of Sphinx Technologies Inc who should be
noted for their work on Sphinx (in alphabetical order):
<itemizedlist>
<listitem><para>Alexander Klimenko</para></listitem>
<listitem><para>Alexey Dvoichenkov</para></listitem>
<listitem><para>Alexey Vinogradov</para></listitem>
<listitem><para>Ilya Kuznetsov</para></listitem>
<listitem><para>Stanislav Klinov</para></listitem>
</itemizedlist>
</para>
<bridgehead>Contributors</bridgehead>
<para>People who contributed to Sphinx and their contributions (in no particular order):
<itemizedlist>
<listitem><para>Robert "coredev" Bengtsson (Sweden), initial version of PostgreSQL data source</para></listitem>
<listitem><para>Len Kranendonk, Perl API</para></listitem>
<listitem><para>Dmytro Shteflyuk, Ruby API</para></listitem>
</itemizedlist>
</para>
<para>
Many other people have contributed ideas, bug reports, fixes, etc.
Thank you!
</para>
</sect1>


<sect1 id="history"><title>History</title>
<para>
Sphinx development was started back in 2001, because I didn't manage
to find an acceptable search solution (for a database-driven Web site)
which would meet my requirements. Actually, each and every important aspect was a problem:
<itemizedlist>
<listitem><para>search quality (ie. good relevance)
<itemizedlist><listitem><para>statistical ranking methods performed rather badly, especially on large collections of small documents (forums, blogs, etc)</para></listitem></itemizedlist>
</para></listitem>
<listitem><para>search speed
<itemizedlist><listitem><para>especially if searching for phrases which contain stopwords, as in "to be or not to be"</para></listitem></itemizedlist>
</para></listitem>
<listitem><para>moderate disk and CPU requirements when indexing
<itemizedlist><listitem><para>important in a shared hosting environment, not to mention the indexing speed</para></listitem></itemizedlist>
</para></listitem>
</itemizedlist>
</para>
<para>
Despite the amount of time passed and numerous improvements made in the
other solutions, there's still no solution which I personally would
be eager to migrate to.
</para>
<para>
Considering that, and a lot of positive feedback received from Sphinx users
during the last years, the obvious decision is to continue developing Sphinx
(and, eventually, to take over the world).
</para>
</sect1>


</chapter>

<chapter id="installation"><title>Installation</title>


<sect1 id="supported-system"><title>Supported systems</title>
<para>
Most modern UNIX systems with a C++ compiler should be able
to compile and run Sphinx without any modifications.
</para>
<para>
Systems on which Sphinx is currently known to run successfully include:
<itemizedlist>
<listitem><para>Linux 2.4.x, 2.6.x (many various distributions)</para></listitem>
<listitem><para>Windows 2000, XP</para></listitem>
<listitem><para>FreeBSD 4.x, 5.x, 6.x, 7.x</para></listitem>
<listitem><para>NetBSD 1.6, 3.0</para></listitem>
<listitem><para>Solaris 9, 11</para></listitem>
<listitem><para>Mac OS X</para></listitem>
</itemizedlist>
</para>
<para>
CPU architectures known to work include X86, X86-64, SPARC64, and ARM.
</para>
<para>
Chances are good that Sphinx will work on other Unix platforms as well;
please report any platforms missing from this list that worked for you!
</para>
</sect1>


<sect1 id="required-tools"><title>Required tools</title>
<para>
On UNIX, you will need the following tools to build
and install Sphinx:
<itemizedlist>
<listitem><para>a working C++ compiler. GNU gcc is known to work.</para></listitem>
<listitem><para>a good make program. GNU make is known to work.</para></listitem>
</itemizedlist>
</para>
<para>
On Windows, you will need Microsoft Visual C/C++ Studio .NET 2003 or 2005.
Other compilers/environments will probably work as well, but for the
time being, you will have to build the makefile (or other environment
specific project files) manually.
</para>
</sect1>


<sect1 id="installing"><title>Installing Sphinx on Linux</title>
<para><orderedlist>
<listitem>
<para>
Extract everything from the distribution tarball (haven't you already?)
and go to the <filename>sphinx</filename> subdirectory. (We are using
version 2.0.1-beta here for the sake of example only; be sure to change this
to the specific version you're using.)
</para>
<para><literallayout><userinput>$ tar xzvf sphinx-2.0.1-beta.tar.gz
$ cd sphinx
</userinput></literallayout></para></listitem>
<listitem>
<para>Run the configuration program:</para>
<para><literallayout><userinput>$ ./configure</userinput></literallayout></para>
<para>
There are a number of options to configure. The complete listing may
be obtained by using the <option>--help</option> switch. The most important ones are:
<itemizedlist>
<listitem><para><option>--prefix</option>, which specifies where to install Sphinx; such as <option>--prefix=/usr/local/sphinx</option> (all of the examples use this prefix)</para></listitem>
<listitem><para><option>--with-mysql</option>, which specifies where to look for MySQL
include and library files, if auto-detection fails;</para></listitem>
<listitem><para><option>--with-pgsql</option>, which specifies where to look for PostgreSQL
include and library files.</para></listitem>
</itemizedlist>
</para></listitem>
<listitem>
<para>Build the binaries:</para>
<para><literallayout><userinput>$ make</userinput></literallayout></para></listitem>
<listitem>
<para>Install the binaries in the directory of your choice
(defaults to <filename>/usr/local/bin/</filename> on *nix systems,
but can be overridden with <option>configure --prefix</option>):</para>
<para><literallayout><userinput>$ make install</userinput></literallayout></para></listitem>
</orderedlist></para>
</sect1>


<sect1 id="installing-windows"><title>Installing Sphinx on Windows</title>
<para>Installing Sphinx on a Windows server is often easier than installing in a Linux environment;
unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads
area on the website.</para>
<orderedlist>
<listitem>
<para>Extract everything from the .zip file you have downloaded -
<filename>sphinx-2.0.1-beta-win32.zip</filename>,
or <filename>sphinx-2.0.1-beta-win32-pgsql.zip</filename> if you need PostgreSQL support as well.
(We are using version 2.0.1-beta here for the sake of example only;
be sure to change this to the specific version you're using.)
You can use Windows Explorer in Windows XP and up to extract the files,
or a freeware package like 7Zip to open the archive.</para>
<para>For the remainder of this guide, we will assume that the folders are unzipped into <filename>C:\Sphinx</filename>,
such that <filename>searchd.exe</filename> can be found in <filename>C:\Sphinx\bin\searchd.exe</filename>. If you decide
to use any different location for the folders or configuration file, please change it accordingly.</para></listitem>
<listitem>
<para>Edit the contents of sphinx.conf.in - specifically entries relating to @CONFDIR@ - to paths suitable for your system.</para></listitem>
<listitem>
<para>Install the <filename>searchd</filename> system as a Windows service:</para>
<para><userinput>C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch</userinput></para></listitem>
<listitem>
<para>The <filename>searchd</filename> service will now be listed in the Services panel
within the Management Console, available from Administrative Tools. It will not have been
started, as you will need to configure it and build your indexes with <filename>indexer</filename>
before starting the service. A guide to do this can be found under
<link linkend="quick-tour">Quick tour</link>.</para>
<para>During the next steps of the install (which involve running indexer pretty much as
you would on Linux) you may find that you get an error relating to libmysql.dll not being found.
If you have MySQL installed, you should find a copy of this library in your Windows directory,
or sometimes in Windows\System32, or failing that in the MySQL core directories. If you
do receive an error, please copy libmysql.dll into the bin directory.</para></listitem>
</orderedlist>
</sect1>


<sect1 id="install-problems"><title>Known installation issues</title>
<para>
If <filename>configure</filename> fails to locate MySQL headers and/or libraries,
try checking for and installing the <filename>mysql-devel</filename> package. On some systems,
it is not installed by default.
</para>
<para>
If <filename>make</filename> fails with a message which looks like
<programlisting>
/bin/sh: g++: command not found
make[1]: *** [libsphinx_a-sphinx.o] Error 127
</programlisting>
try checking for and installing the <filename>gcc-c++</filename> package.
</para>
<para>
If you are getting compile-time errors which look like
<programlisting>
sphinx.cpp:67: error: invalid application of `sizeof' to
incomplete type `Private::SizeError<false>'
</programlisting>
this means that some compile-time type size check failed.
The most probable reason is that the off_t type is less than 64-bit
on your system. As a quick hack, you can edit sphinx.h and replace off_t
with DWORD in a typedef for SphOffset_t, but note that this will prohibit
you from using full-text indexes larger than 2 GB. Even if the hack helps,
please report such issues, providing the exact error message and
compiler/OS details, so I could properly fix them in the next releases.
</para>
<para>
If you keep getting any other error, or the suggestions above
do not seem to help you, please don't hesitate to contact me.
</para>
</sect1>


<sect1 id="quick-tour"><title>Quick Sphinx usage tour</title>
<para>
All the example commands below assume that you installed Sphinx
in <filename>/usr/local/sphinx</filename>, so <filename>searchd</filename> can
be found in <filename>/usr/local/sphinx/bin/searchd</filename>.
</para>
<para>
To use Sphinx, you will need to:
</para>
<orderedlist>
<listitem>
<para>Create a configuration file.</para>
<para>
The default configuration file name is <filename>sphinx.conf</filename>.
All Sphinx programs look for this file in the current working directory
by default.
</para>
<para>
A sample configuration file, <filename>sphinx.conf.dist</filename>, which has
all the options documented, is created by <filename>configure</filename>.
Copy and edit that sample file to make your own configuration (assuming Sphinx is installed into <filename>/usr/local/sphinx/</filename>):
</para>
<para><literallayout><userinput>$ cd /usr/local/sphinx/etc
$ cp sphinx.conf.dist sphinx.conf
$ vi sphinx.conf</userinput></literallayout></para>
<para>
The sample configuration file is set up to index the <filename>documents</filename>
table from MySQL database <filename>test</filename>; so there's an <filename>example.sql</filename>
sample data file to populate that table with a few documents for testing purposes:
</para>
<para><literallayout><userinput>$ mysql -u test < /usr/local/sphinx/etc/example.sql</userinput></literallayout></para></listitem>
<listitem>
<para>Run the indexer to create a full-text index from your data:</para>
<para><literallayout><userinput>$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer --all</userinput></literallayout></para></listitem>
<listitem>
<para>Query your newly created index!</para></listitem>
</orderedlist>
<para>
To query the index from the command line, use the <filename>search</filename> utility:
</para>
<para><literallayout><userinput>$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search test</userinput></literallayout></para>
<para>
To query the index from your PHP scripts, you need to:
</para>
<orderedlist>
<listitem>
<para>Run the search daemon which your script will talk to:</para>
<para><literallayout><userinput>$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/searchd</userinput></literallayout>
</para></listitem>
<listitem>
<para>
Run the attached PHP API test script (to ensure that the daemon
was successfully started and is ready to serve the queries):
</para>
<para><literallayout><userinput>$ cd sphinx/api
$ php test.php test</userinput></literallayout>
</para></listitem>
<listitem>
<para>
Include the API (it's located in <filename>api/sphinxapi.php</filename>)
into your own scripts and use it.
</para></listitem>
</orderedlist>
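<para>
As a minimal sketch of that last step (assuming a local searchd on the default
SphinxAPI port 9312, and an index named <filename>test1</filename> as in the
sample configuration), a script could look like this:
</para>
<programlisting>
<?php
// include the distributed PHP client
require ( "sphinxapi.php" );

$cl = new SphinxClient ();
$cl->SetServer ( "localhost", 9312 ); // host and SphinxAPI port
$res = $cl->Query ( "my first query", "test1" ); // query text, index name

if ( $res===false )
    die ( "query failed: " . $cl->GetLastError() . "\n" );

// print matching document IDs and weights
foreach ( $res["matches"] as $id=>$match )
    print "id=$id, weight={$match['weight']}\n";
</programlisting>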
<para>
Happy searching!
</para>
</sect1>


</chapter>

<chapter id="indexing"><title>Indexing</title>


<sect1 id="sources"><title>Data sources</title>
<para>
The data to be indexed can generally come from very different
sources: SQL databases, plain text files, HTML files, mailboxes,
and so on. From Sphinx's point of view, the data it indexes is a
set of structured <glossterm>documents</glossterm>, each of which has the
same set of <glossterm>fields</glossterm>. This is biased towards SQL, where
each row corresponds to a document, and each column to a field.
</para>
<para>
Depending on what source Sphinx should get the data from,
different code is required to fetch the data and prepare it for indexing.
This code is called a <glossterm>data source driver</glossterm> (or simply
<glossterm>driver</glossterm> or <glossterm>data source</glossterm> for brevity).
</para>
<para>
At the time of this writing, there are drivers for MySQL and
PostgreSQL databases, which can connect to the database using
its native C/C++ API, run queries and fetch the data. There's
also a driver called xmlpipe, which runs a specified command
and reads the data from its <filename>stdout</filename>.
See the <xref linkend="xmlpipe"/> section for the format description.
</para>
<para>
There can be as many sources per index as necessary. They will be
sequentially processed in the very same order which was specified in
the index definition. All the documents coming from those sources
will be merged as if they were coming from a single source.
</para>
</sect1>


<sect1 id="attributes"><title>Attributes</title>
<para>
Attributes are additional values associated with each document
that can be used to perform additional filtering and sorting during search.
</para>
<para>
It is often desired to additionally process full-text search results
based not only on matching document ID and its rank, but on a number
of other per-document values as well. For instance, one might need to
sort news search results by date and then relevance,
or search through products within a specified price range,
or limit a blog search to posts made by selected users,
or group results by month. To do that efficiently, Sphinx allows
you to attach a number of additional <glossterm>attributes</glossterm>
to each document, and store their values in the full-text index.
It's then possible to use stored values to filter, sort,
or group full-text matches.
</para>
<para>Attributes, unlike the fields, are not full-text indexed. They
are stored in the index, but it is not possible to search them as full-text,
and attempting to do so results in an error.</para>
<para>For example, it is impossible to use the extended matching mode expression
<option>@column 1</option> to match documents where column is 1, if column is an
attribute, and this is still true even if the numeric digits are normally indexed.</para>
<para>Attributes can be used for filtering, though, to restrict returned
rows, as well as sorting or <link linkend="clustering">result grouping</link>;
it is entirely possible to sort results purely based on attributes, and ignore the search
relevance tools. Additionally, attributes are returned from the search daemon, while the
indexed text is not.</para>
<para>
A good example for attributes would be a forum posts table. Assume
that only title and content fields need to be full-text searchable -
but that sometimes it is also required to limit search to a certain
author or a sub-forum (ie. search only those rows that have some
specific values of author_id or forum_id columns in the SQL table);
or to sort matches by post_date column; or to group matching posts
by month of the post_date and calculate per-group match counts.
</para>
<para>
This can be achieved by specifying all the mentioned columns
(excluding title and content, which are full-text fields) as
attributes, indexing them, and then using API calls to
set up filtering, sorting, and grouping. Here is an example.
</para>
<bridgehead>Example sphinx.conf part:</bridgehead>
<programlisting>
...
sql_query = SELECT id, title, content, \
    author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
...
</programlisting>
<bridgehead>Example application code (in PHP):</bridgehead>
<programlisting>
// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );

// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );

// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
</programlisting>
<para>
Attributes are named. Attribute names are case insensitive.
Attributes are <emphasis>not</emphasis> full-text indexed; they are stored in the index as is.
Currently supported attribute types are:
<itemizedlist>
<listitem><para>unsigned integers (1-bit to 32-bit wide);</para></listitem>
<listitem><para>UNIX timestamps;</para></listitem>
<listitem><para>floating point values (32-bit, IEEE 754 single precision);</para></listitem>
<listitem><para>string ordinals (specially computed integers);</para></listitem>
<listitem><para><link linkend="conf-sql-attr-string">strings</link> (since 1.10-beta);</para></listitem>
<listitem><para><link linkend="mva">MVA</link>, multi-value attributes (variable-length lists of 32-bit unsigned integers).</para></listitem>
</itemizedlist>
</para>
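<para>
For SQL sources, each of these types maps to a configuration directive;
a brief sketch follows (the column names are illustrative only):
</para>
<programlisting>
sql_attr_uint        = forum_id       # unsigned integer
sql_attr_timestamp   = post_date      # UNIX timestamp
sql_attr_float       = latitude       # 32-bit float
sql_attr_str2ordinal = author_name    # string ordinal
sql_attr_string      = title_copy     # string (1.10-beta and above)
sql_attr_multi       = uint tag from field   # MVA
</programlisting>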
<para>
The complete set of per-document attribute values is sometimes
referred to as <glossterm>docinfo</glossterm>. Docinfos can either be
<itemizedlist>
<listitem><para>stored separately from the main full-text index data ("extern" storage, in <filename>.spa</filename> file), or</para></listitem>
<listitem><para>attached to each occurrence of document ID in full-text index data ("inline" storage, in <filename>.spd</filename> file).</para></listitem>
</itemizedlist>
</para>
<para>
When using extern storage, a copy of the <filename>.spa</filename> file
(with all the attribute values for all the documents) is kept in RAM by
<filename>searchd</filename> at all times. This is for performance reasons;
random disk I/O would be too slow. On the contrary, inline storage does not
require any additional RAM at all, but that comes at the cost of greatly
inflating the index size: remember that it copies <emphasis>all</emphasis>
attribute values <emphasis>every</emphasis> time the document ID
is mentioned, and that is exactly as many times as there are
different keywords in the document. Inline may be the only viable
option if you have only a few attributes and need to work with big
datasets in limited RAM. However, in most cases extern storage
makes both indexing and searching <emphasis>much</emphasis> more efficient.
</para>
<para>
Search-time memory requirements for extern storage are
(1+number_of_attrs)*number_of_docs*4 bytes, ie. 10 million docs with
2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM.
This is <emphasis>PER DAEMON</emphasis>, not per query. <filename>searchd</filename>
will allocate 160 MB on startup, read the data and keep it shared between queries.
The children will <emphasis>NOT</emphasis> allocate any additional
copies of this data.
</para>
</sect1>


<sect1 id="mva"><title>MVA (multi-valued attributes)</title>
<para>
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx.
MVAs make it possible to attach lists of values to every document.
They are useful for article tags, product categories, etc.
Filtering and group-by (but not sorting) on MVA attributes is supported.
</para>
<para>
Currently, MVA list entries are limited to unsigned 32-bit integers.
The list length is not limited; you can have an arbitrary number of values
attached to each document as long as RAM permits (the <filename>.spm</filename> file
that contains the MVA values will be precached in RAM by <filename>searchd</filename>).
The source data can be taken either from a separate query, or from a document field;
see source type in <link linkend="conf-sql-attr-multi">sql_attr_multi</link>.
In the first case the query has to return pairs of document ID and MVA values,
in the second one the field will be parsed for integer values.
There are absolutely no requirements as to incoming data order; the values will be
automatically grouped by document ID (and internally sorted within the same ID)
during indexing anyway.
</para>
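<para>
As a sketch, an MVA that pulls tag IDs from a separate table could be declared
like this (the <code>tags</code> table and its column names are illustrative only):
</para>
<programlisting>
sql_attr_multi = uint tag from query; \
    SELECT docid, tagid FROM tags
</programlisting>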
<para>
When filtering, a document will match the filter on an MVA attribute
if <emphasis>any</emphasis> of the values satisfy the filtering condition.
(Therefore, documents that pass through exclude filters will not
contain any of the forbidden values.)
When grouping by an MVA attribute, a document will contribute to as
many groups as there are different MVA values associated with that document.
For instance, if the collection contains exactly 1 document having a 'tag' MVA
with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with
'@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively.
Also note that grouping by MVA might lead to duplicate documents in the result set:
because each document can participate in many groups, it can be chosen as the best
one in more than one group, leading to duplicate IDs. The PHP API historically
uses an ordered hash on the document ID for the resulting rows; so you'll also need to use
<link linkend="api-func-setarrayresult">SetArrayResult()</link> in order
to employ group-by on MVA with the PHP API.
</para>
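<para>
A minimal PHP sketch of the group-by setup just described (the index name
is illustrative):
</para>
<programlisting>
$cl->SetArrayResult ( true ); // required for group-by on MVA with PHP API
$cl->SetGroupBy ( "tag", SPH_GROUPBY_ATTR );
$res = $cl->Query ( "test query", "posts" );
</programlisting>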
</sect1>


<sect1 id="indexes"><title>Indexes</title>
<para>
To be able to answer full-text search queries fast, Sphinx needs
to build a special data structure optimized for such queries from
your text data. This structure is called an <glossterm>index</glossterm>; and
the process of building an index from text is called <glossterm>indexing</glossterm>.
</para>
<para>
Different index types are well suited for different tasks.
For example, a disk-based, tree-based index would be easy to
update (ie. insert new documents to an existing index), but rather
slow to search. Therefore, the Sphinx architecture allows for different
<glossterm>index types</glossterm> to be implemented easily.
</para>
<para>
The only index type which is implemented in Sphinx at the moment is
designed for maximum indexing and searching speed. This comes at the cost
of updates being really slow; theoretically, it might be slower to
update this type of index than to reindex it from scratch.
However, this can very frequently be worked around with
multiple indexes; see <xref linkend="live-updates"/> for details.
</para>
<para>
It is planned to implement more index types, including a
type which would be updateable in real time.
</para>
<para>
There can be as many indexes per configuration file as necessary.
The <filename>indexer</filename> utility can reindex either all of them
(if the <option>--all</option> option is specified), or a certain explicitly
specified subset. The <filename>searchd</filename> utility will serve all
the specified indexes, and the clients can specify what indexes to
search at run time.
</para>
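<para>
For instance, with indexes named <filename>main</filename> and
<filename>delta</filename> (hypothetical names), either of the following
would work, <option>--rotate</option> being needed only when searchd is
already serving them:
</para>
<para><literallayout><userinput>$ indexer --all --rotate
$ indexer main delta --rotate</userinput></literallayout></para>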
</sect1>


<sect1 id="data-restrictions"><title>Restrictions on the source data</title>
<para>
There are a few different restrictions imposed on the source data
which is going to be indexed by Sphinx, of which the single most
important one is:
</para>
<para><emphasis role="bold">
ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).
</emphasis></para>
<para>
If this requirement is not met, different bad things can happen.
For instance, Sphinx can crash with an internal assertion while indexing;
or produce strange results when searching due to conflicting IDs.
Also, a 1000-pound gorilla might eventually come out of your
display and start throwing barrels at you. You've been warned.
</para>
</sect1>


<sect1 id="charsets"><title>Charsets, case folding, and translation tables</title>
<para>
When indexing some index, Sphinx fetches documents from
the specified sources, splits the text into words, and does
case folding so that "Abc", "ABC" and "abc" would be treated
as the same word (or, to be pedantic, <glossterm>term</glossterm>).
</para>
<para>
To do that properly, Sphinx needs to know
<itemizedlist>
<listitem><para>what encoding the source text is in;</para></listitem>
<listitem><para>what characters are letters and what are not;</para></listitem>
<listitem><para>what letters should be folded to what letters.</para></listitem>
</itemizedlist>
This should be configured on a per-index basis using the
<option><link linkend="conf-charset-type">charset_type</link></option> and
<option><link linkend="conf-charset-table">charset_table</link></option> options.
<option><link linkend="conf-charset-type">charset_type</link></option>
specifies whether the document encoding is single-byte (SBCS) or UTF-8.
<option><link linkend="conf-charset-table">charset_table</link></option>
specifies the table that maps letter characters to their case
folded versions. The characters that are not in the table are considered
to be non-letters and will be treated as word separators when indexing
or searching through this index.
</para>
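<para>
A minimal sketch of such a per-index setup (the table below folds Latin
letters to lowercase and keeps digits and the underscore as valid characters):
</para>
<programlisting>
charset_type  = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z
</programlisting>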
<para>
Note that while the default tables do not include the space character
(ASCII code 0x20, Unicode U+0020) as a letter, it's in fact
<emphasis>perfectly legal</emphasis> to do so. This can be
useful, for instance, for indexing tag clouds, so that space-separated
word sets would index as a <emphasis>single</emphasis> search query term.
</para>
<para>
Default tables currently include English and Russian characters.
Please do submit your tables for other languages!
</para>
</sect1>


<sect1 id="sql"><title>SQL data sources (MySQL, PostgreSQL)</title>
<para>
With all the SQL drivers, indexing generally works as follows.
<itemizedlist>
<listitem><para>a connection to the database is established;</para></listitem>
<listitem><para>the pre-query (see <xref linkend="conf-sql-query-pre"/>) is executed
to perform any necessary initial setup, such as setting per-connection encoding with MySQL;</para></listitem>
<listitem><para>the main query (see <xref linkend="conf-sql-query"/>) is executed and the rows it returns are indexed;</para></listitem>
<listitem><para>the post-query (see <xref linkend="conf-sql-query-post"/>) is executed
to perform any necessary cleanup;</para></listitem>
<listitem><para>the connection to the database is closed;</para></listitem>
<listitem><para>indexer does the sorting phase (to be pedantic, index-type specific post-processing);</para></listitem>
<listitem><para>a connection to the database is established again;</para></listitem>
<listitem><para>the post-index query (see <xref linkend="conf-sql-query-post-index"/>) is executed
to perform any necessary final cleanup;</para></listitem>
<listitem><para>the connection to the database is closed again.</para></listitem>
</itemizedlist>
Most options, such as database user/host/password, are straightforward.
However, there are a few subtle things, which are discussed in more detail here.
</para>
<bridgehead id="ranged-queries">Ranged queries</bridgehead>
<para>
The main query, which needs to fetch all the documents, can impose
a read lock on the whole table and stall the concurrent queries
(eg. INSERTs to a MyISAM table), waste a lot of memory for the result set, etc.
To avoid this, Sphinx supports so-called <glossterm>ranged queries</glossterm>.
With ranged queries, Sphinx first fetches min and max document IDs from
the table, and then substitutes different ID intervals into the main query text
and runs the modified query to fetch another chunk of documents.
Here's an example.
</para>
<example id="ex-ranged-queries"><title>Ranged query usage example</title>
<programlisting>
# in sphinx.conf

sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end
</programlisting>
</example>
<para>
If the table contains document IDs from 1 to, say, 2345, then sql_query would
be run three times:
<orderedlist>
<listitem><para>with <option>$start</option> replaced with 1 and <option>$end</option> replaced with 1000;</para></listitem>
<listitem><para>with <option>$start</option> replaced with 1001 and <option>$end</option> replaced with 2000;</para></listitem>
<listitem><para>with <option>$start</option> replaced with 2001 and <option>$end</option> replaced with 2345.</para></listitem>
</orderedlist>
Obviously, that's not much of a difference for a 2000-row table,
but when it comes to indexing a 10-million-row MyISAM table,
ranged queries might be of some help.
</para>
<bridgehead><option>sql_post</option> vs. <option>sql_post_index</option></bridgehead>
<para>
The difference between the post-query and the post-index query is that the post-query
is run immediately when Sphinx has received all the documents, but further indexing
<emphasis role="bold">may</emphasis> still fail for some other reason. On the contrary,
by the time the post-index query gets executed, it is <emphasis role="bold">guaranteed</emphasis>
that the indexing was successful. The database connection is dropped and re-established
because the sorting phase can be very lengthy and would just time out otherwise.
</para>
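<para>
As a configuration sketch, one plausible use of the two hooks is cleanup plus
progress tracking (the table and column names here are illustrative;
<code>$maxid</code> is expanded by indexer in sql_query_post_index to the
maximum document ID actually fetched):
</para>
<programlisting>
sql_query_post       = DROP TABLE my_tmp_ids
sql_query_post_index = REPLACE INTO counters ( id, val ) \
    VALUES ( 'max_indexed_id', $maxid )
</programlisting>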
</sect1>


<sect1 id="xmlpipe"><title>xmlpipe data source</title>
<para>
The xmlpipe data source was designed to enable users to plug data into
Sphinx without having to implement new data source drivers themselves.
It is limited to 2 fixed fields and 2 fixed attributes, and is deprecated
in favor of <xref linkend="xmlpipe2"/> now. For new streams, use xmlpipe2.
</para>
<para>
To use xmlpipe, configure the data source in your configuration file
as follows:
<programlisting>
source example_xmlpipe_source
{
    type = xmlpipe
    xmlpipe_command = perl /www/mysite.com/bin/sphinxpipe.pl
}
</programlisting>
The <filename>indexer</filename> will run the command specified
in <option><link linkend="conf-xmlpipe-command">xmlpipe_command</link></option>,
and then read, parse and index the data it prints to <filename>stdout</filename>.
More formally, it opens a pipe to the given command and then reads
from that pipe.
</para>
<para>
indexer will expect one or more documents in custom XML format.
Here's an example document stream, consisting of two documents:
<example id="ex-xmlpipe-document"><title>XMLpipe document stream</title>
<programlisting>
<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>

<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>
</programlisting>
</example>
</para>
<para>
The legacy xmlpipe driver uses a built-in parser
which is pretty fast but really strict and does not actually
fully support XML. It requires that all the fields <emphasis>must</emphasis>
be present, formatted <emphasis>exactly</emphasis> as in this example, and
occur <emphasis>exactly</emphasis> in the same order. The only optional
field is <option>timestamp</option>; it defaults to 1.
</para>
</sect1>


<sect1 id="xmlpipe2"><title>xmlpipe2 data source</title>
<para>
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx
in yet another custom XML format. It also allows specifying the schema
(ie. the set of fields and attributes) either in the XML stream itself,
or in the source settings.
</para>
<para>
When indexing an xmlpipe2 source, indexer runs the given command, opens
a pipe to its stdout, and expects a well-formed XML stream. Here's sample
stream data:
<example id="ex-xmlpipe2-document"><title>xmlpipe2 document stream</title>
<programlisting>
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
<content>this is the main content <![CDATA[[and this <cdata> entry
must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be
in <b class="red">randomized</b> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand,
that in-document field order must not matter, sir</content>
<published>1012325467</published>
</sphinx:document>

<!-- ... even more sphinx:document entries here ... -->

<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>

</sphinx:docset>
</programlisting>
</example>
</para>
<para>
Arbitrary fields and attributes are allowed.
They also can occur in the stream in arbitrary order within each document; the order is ignored.
There is a restriction on maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).
</para>
<para>
The schema, ie. the complete fields and attributes list, must be declared
before any document can be parsed. This can be done either in the
configuration file using <option>xmlpipe_field</option> and <option>xmlpipe_attr_XXX</option>
settings, or right in the stream using the <sphinx:schema> element.
<sphinx:schema> is optional. It is only allowed to occur as the very
first sub-element in <sphinx:docset>. If there is no in-stream
schema definition, settings from the configuration file will be used.
Otherwise, stream settings take precedence.
</para>
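<para>
A configuration-file sketch of the same schema as in the stream above would
look roughly like this (the source name and command are illustrative):
</para>
<programlisting>
source example_xmlpipe2_source
{
    type                   = xmlpipe2
    xmlpipe_command        = cat /path/to/documents.xml
    xmlpipe_field          = subject
    xmlpipe_field          = content
    xmlpipe_attr_timestamp = published
    xmlpipe_attr_uint      = author_id
}
</programlisting>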
<para>
Unknown tags (which were declared neither as fields nor as attributes)
will be ignored with a warning. In the example above, <misc> will be ignored.
All embedded tags and their attributes (such as <b> in <subject>
in the example above) will be silently ignored.
</para>
<para>
Support for incoming stream encodings depends on whether <filename>iconv</filename>
is installed on the system. xmlpipe2 is parsed using the <filename>libexpat</filename>
parser that understands US-ASCII, ISO-8859-1, UTF-8 and a few UTF-16 variants
natively. The Sphinx <filename>configure</filename> script will also check
for <filename>libiconv</filename> presence, and utilize it to handle
other encodings. <filename>libexpat</filename> also enforces the
requirement to use the UTF-8 charset on the Sphinx side, because the
parsed data it returns is always in UTF-8.
<!-- TODO: check this vs latin-1 -->
</para>
<para>
XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:
<variablelist>
<varlistentry>
<term>sphinx:docset</term>
<listitem><para>Mandatory top-level element, denotes and contains the xmlpipe2 document set.</para></listitem>
</varlistentry>
<varlistentry>
<term>sphinx:schema</term>
<listitem><para>Optional element, must either occur as the very first child
of sphinx:docset, or never occur at all. Declares the document schema.
Contains field and attribute declarations. If present, overrides
per-source settings from the configuration file.
</para></listitem>
</varlistentry>
<varlistentry>
<term>sphinx:field</term>
<listitem><para>Optional element, child of sphinx:schema. Declares a full-text field.
Known attributes are:
<itemizedlist>
<listitem><para>"name", specifies the XML element name that will be treated as a full-text field in the subsequent documents.</para></listitem>
<listitem><para>"attr", specifies whether to also index this field as a string or word count attribute. Possible values are "string" and "wordcount". Introduced in version 1.10-beta.</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>
<varlistentry>
<term>sphinx:attr</term>
<listitem><para>Optional element, child of sphinx:schema. Declares an attribute.
Known attributes are:
<itemizedlist>
<listitem><para>"name", specifies the element name that should be treated as an attribute in the subsequent documents.</para></listitem>
<listitem><para>"type", specifies the attribute type. Possible values are "int", "timestamp", "str2ordinal", "bool", "float" and "multi".</para></listitem>
<listitem><para>"bits", specifies the bit size for the "int" attribute type. Valid values are 1 to 32.</para></listitem>
<listitem><para>"default", specifies the default value for this attribute that should be used if the attribute's element is not present in the document.</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>
<varlistentry>
<term>sphinx:document</term>
<listitem><para>Mandatory element, must be a child of sphinx:docset.
Contains arbitrary other elements with field and attribute values
to be indexed, as declared either using sphinx:field and sphinx:attr
elements or in the configuration file. The only known attribute
is "id" that must contain the unique integer document ID.
</para></listitem>
</varlistentry>
<varlistentry>
<term>sphinx:killlist</term>
<listitem><para>Optional element, child of sphinx:docset.
Contains a number of "id" elements whose contents are document IDs
to be put into a <link linkend="conf-sql-query-killlist">kill-list</link> for this index.
</para></listitem>
</varlistentry>
</variablelist>
</para>
</sect1>


<sect1 id="live-updates"><title>Live index updates</title>
<para>
There are two major approaches to keeping the full-text index
contents up to date. Note, however, that both of these approaches deal
with the task of <emphasis>full-text data updates</emphasis>, and not
attribute updates. Instant attribute updates are supported since
version 0.9.8. Refer to the <link linkend="api-func-updateatttributes">UpdateAttributes()</link>
API call description for details.
</para>
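<para>
For reference, such an attribute update is a one-liner through the PHP API;
a sketch (the index, attribute, and values are illustrative):
</para>
<programlisting>
// set attribute "forum_id" to 456 for document 123 in index "posts"
$cl->UpdateAttributes ( "posts", array ( "forum_id" ), array ( 123 => array ( 456 ) ) );
</programlisting>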
<para>
First, you can use disk-based indexes, partition them manually,
and only rebuild the smaller partitions (so-called "deltas") frequently.
By minimizing the rebuild size, you can reduce the average indexing lag
to something as low as 30-60 seconds. This approach was the only one
available in versions 0.9.x. On huge collections it actually might be
the most efficient one. Refer to <xref linkend="delta-updates"/>
for details.
</para>
<para>
Second, versions 1.x (starting with 1.10-beta) add support for so-called
real-time indexes (RT indexes for short) that allow on-the-fly updates of the
full-text data. Updates on an RT index can appear in the search results in
1-2 milliseconds, ie. 0.001-0.002 seconds. However, RT indexes are less
efficient for bulk indexing huge amounts of data. Refer to
<xref linkend="rt-indexes"/> for details.
</para>
</sect1>


<sect1 id="delta-updates"><title>Delta index updates</title>
<para>
There's a frequent situation when the total dataset is too big
to be reindexed from scratch often, but the amount of new records
is rather small. Example: a forum with 1,000,000 archived posts,
but only 1,000 new posts per day.
</para>
<para>
In this case, "live" (almost real time) index updates could be
implemented using the so-called "main+delta" scheme.
</para>
<para>
The idea is to set up two sources and two indexes, with one
"main" index for the data which only changes rarely (if ever),
and one "delta" for the new documents. In the example above,
the 1,000,000 archived posts would go to the main index, and the newly
inserted 1,000 posts/day would go to the delta index. The delta index
could then be reindexed very frequently, and the documents can
be made available to search in a matter of minutes.
</para>
<para>
Specifying which documents should go to what index and
reindexing the main index could also be made fully automatic.
One option would be to make a counter table which would track
the ID which splits the documents, and update it
whenever the main index is reindexed.
<example id="ex-live-updates">
<title>Fully automated live updates</title>
<programlisting>
# in MySQL
CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);

# in sphinx.conf
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents \
        WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index main
{
    source = main
    path = /path/to/main
    # ... all the other settings
}

# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
    source = delta
    path = /path/to/delta
}
</programlisting>
</example>
</para>
<para>
Note how we're overriding <code>sql_query_pre</code> in the delta source.
We need to explicitly have that override. Otherwise the <code>REPLACE</code> query
would be run when indexing the delta source too, effectively nullifying it. However,
when we issue the directive in the inherited source for the first time, it removes
<emphasis>all</emphasis> inherited values, so the encoding setup is also lost.
So <code>sql_query_pre</code> in the delta cannot just be empty; we need
to issue the encoding setup query explicitly once again.
</para>
</sect1>
|
|
|
|
|
|
<sect1 id="index-merging"><title>Index merging</title>
|
|
<para>
|
|
Merging two existing indexes can be more efficient that indexing the data
|
|
from scratch, and desired in some cases (such as merging 'main' and 'delta'
|
|
indexes instead of simply reindexing 'main' in 'main+delta' partitioning
|
|
scheme). So <filename>indexer</filename> has an option to do that.
|
|
Merging the indexes is normally faster than reindexing but still
|
|
<emphasis>not</emphasis> instant on huge indexes. Basically,
|
|
it will need to read the contents of both indexes once and write
|
|
the result once. Merging 100 GB and 1 GB index, for example,
|
|
will result in 202 GB of IO (but that's still likely less than
|
|
the indexing from scratch requires).
|
|
</para>
|
|
<para>
|
|
The basic command syntax is as follows:
|
|
<programlisting>
|
|
indexer --merge DSTINDEX SRCINDEX [--rotate]
|
|
</programlisting>
|
|
Only the DSTINDEX index will be affected: the contents of SRCINDEX will be merged into it.
|
|
<option>--rotate</option> switch will be required if DSTINDEX is already being served by <filename>searchd</filename>.
|
|
The initially devised usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX.
|
|
Thus, when merging the attributes, values from SRCINDEX will win if duplicate document IDs are encountered.
|
|
Note, however, that the "old" keywords will <emphasis>not</emphasis> be automatically removed in such cases.
|
|
For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated
|
|
with it in SRCINDEX, document 123 will be found by <emphasis>both</emphasis> keywords after the merge.
|
|
You can supply an explicit condition to remove documents from DSTINDEX to mitigate that;
|
|
the relevant switch is <option>--merge-dst-range</option>:
|
|
<programlisting>
|
|
indexer --merge main delta --merge-dst-range deleted 0 0
|
|
</programlisting>
|
|
This switch lets you apply filters to the destination index along with merging.
|
|
There can be several filters; all of their conditions must be met in order
|
|
to include the document in the resulting mergid index. In the example above,
|
|
the filter passes only those records where 'deleted' is 0, eliminating all
|
|
records that were flagged as deleted (for instance, using
|
|
<link linkend="api-func-updateatttributes">UpdateAttributes()</link> call).
|
|
</para>
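<para>
For illustration, a complete "fold the delta back" pass against a live
<filename>searchd</filename> could look as follows. This is a sketch
assuming the main+delta setup from the previous section; the 'deleted'
attribute and the filter on it are only needed if you actually flag
deleted rows:
<programlisting>
indexer --rotate delta
indexer --merge main delta --rotate --merge-dst-range deleted 0 0
</programlisting>
</para>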
|
|
</sect1>
|
|
|
|
|
|
</chapter>
|
|
<chapter id="rt-indexes"><title>Real-time indexes</title>
|
|
<para>
|
|
Real-time indexes (or RT indexes for brevity) are a new backend
|
|
that lets you insert, update, or delete documents (rows) on the fly.
|
|
RT indexes were added in version 1.10-beta. While querying of RT indexes
|
|
is possible using any of the SphinxAPI, SphinxQL, or SphinxSE, updating
|
|
them is only possible via SphinxQL at the moment. Full SphinxQL
|
|
reference is available in <xref linkend="sphinxql-reference"/>.
|
|
</para>
|
|
|
|
|
|
<sect1 id="rt-overview"><title>RT indexes overview</title>
|
|
<para>
|
|
RT indexes should be declared in <filename>sphinx.conf</filename>,
|
|
just as every other index type. Notable differences from the regular,
|
|
disk-based indexes are that a) data sources are not required and are ignored if specified,
|
|
and b) you should explicitly enumerate all the text fields, not just
|
|
attributes. Here's an example:
|
|
</para>
|
|
<example id="ex-rt-updates">
|
|
<title>RT index declaration</title>
|
|
<programlisting>
|
|
index rt
|
|
{
|
|
type = rt
|
|
path = /usr/local/sphinx/data/rt
|
|
rt_field = title
|
|
rt_field = content
|
|
rt_attr_uint = gid
|
|
}
|
|
</programlisting>
|
|
</example>
|
|
<para>
|
|
RT INDEXES ARE CURRENTLY (AS OF VERSION 1.10-beta) A WORK IN PROGRESS.
|
|
Therefore, they might lack certain features: for instance, prefix/infix
|
|
indexing, MVA attributes, etc are not supported yet. There also might be
|
|
performance and stability issues. However, all the regular indexing features
|
|
and most of the searching features are already in place, our internal
|
|
testing passes, and last but not least a number of production instances
|
|
are already using RT indexes with good results.
|
|
</para>
|
|
<para>
|
|
RT indexes can be accessed using the MySQL protocol. INSERT, REPLACE, DELETE, and
|
|
SELECT statements against an RT index are supported. For instance, this
|
|
is an example session with the sample index above:
|
|
</para>
|
|
<programlisting>
|
|
$ mysql -h 127.0.0.1 -P 9306
|
|
Welcome to the MySQL monitor. Commands end with ; or \g.
|
|
Your MySQL connection id is 1
|
|
Server version: 1.10-dev (r2153)
|
|
|
|
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
|
|
|
|
mysql> INSERT INTO rt VALUES ( 1, 'first record', 'test one', 123 );
|
|
Query OK, 1 row affected (0.05 sec)
|
|
|
|
mysql> INSERT INTO rt VALUES ( 2, 'second record', 'test two', 234 );
|
|
Query OK, 1 row affected (0.00 sec)
|
|
|
|
mysql> SELECT * FROM rt;
|
|
+------+--------+------+
|
|
| id | weight | gid |
|
|
+------+--------+------+
|
|
| 1 | 1 | 123 |
|
|
| 2 | 1 | 234 |
|
|
+------+--------+------+
|
|
2 rows in set (0.02 sec)
|
|
|
|
mysql> SELECT * FROM rt WHERE MATCH('test');
|
|
+------+--------+------+
|
|
| id | weight | gid |
|
|
+------+--------+------+
|
|
| 1 | 1643 | 123 |
|
|
| 2 | 1643 | 234 |
|
|
+------+--------+------+
|
|
2 rows in set (0.01 sec)
|
|
|
|
mysql> SELECT * FROM rt WHERE MATCH('@title test');
|
|
Empty set (0.00 sec)
|
|
</programlisting>
|
|
<para>
|
|
Both partial and batch INSERT syntaxes are supported, ie.
|
|
you can specify a subset of columns, and insert several rows at a time.
|
|
Deletions are also possible using DELETE statement; the only currently
|
|
supported syntax is DELETE FROM <index> WHERE id=<id>.
|
|
REPLACE is also supported, enabling you to implement updates.
|
|
</para>
|
|
<programlisting>
|
|
mysql> INSERT INTO rt ( id, title ) VALUES ( 3, 'third row' ), ( 4, 'fourth entry' );
|
|
Query OK, 2 rows affected (0.01 sec)
|
|
|
|
mysql> SELECT * FROM rt;
|
|
+------+--------+------+
|
|
| id | weight | gid |
|
|
+------+--------+------+
|
|
| 1 | 1 | 123 |
|
|
| 2 | 1 | 234 |
|
|
| 3 | 1 | 0 |
|
|
| 4 | 1 | 0 |
|
|
+------+--------+------+
|
|
4 rows in set (0.00 sec)
|
|
|
|
mysql> DELETE FROM rt WHERE id=2;
|
|
Query OK, 0 rows affected (0.00 sec)
|
|
|
|
mysql> SELECT * FROM rt WHERE MATCH('test');
|
|
+------+--------+------+
|
|
| id | weight | gid |
|
|
+------+--------+------+
|
|
| 1 | 1500 | 123 |
|
|
+------+--------+------+
|
|
1 row in set (0.00 sec)
|
|
|
|
mysql> INSERT INTO rt VALUES ( 1, 'first record on steroids', 'test one', 123 );
|
|
ERROR 1064 (42000): duplicate id '1'
|
|
|
|
mysql> REPLACE INTO rt VALUES ( 1, 'first record on steroids', 'test one', 123 );
|
|
Query OK, 1 row affected (0.01 sec)
|
|
|
|
mysql> SELECT * FROM rt WHERE MATCH('steroids');
|
|
+------+--------+------+
|
|
| id | weight | gid |
|
|
+------+--------+------+
|
|
| 1 | 1500 | 123 |
|
|
+------+--------+------+
|
|
1 row in set (0.01 sec)
|
|
</programlisting>
|
|
<para>
|
|
Data stored in an RT index should survive a clean shutdown. When binary logging
|
|
is enabled, it should also survive crash and/or dirty shutdown, and recover
|
|
on subsequent startup.
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="rt-caveats"><title>Known caveats with RT indexes</title>
|
|
<para>
|
|
As of 1.10-beta, RT indexes are a beta quality feature: while no major,
|
|
showstopper-class issues are known, there still are a few known usage quirks.
|
|
Those quirks are listed in this section.
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><para>Prefix and infix indexing are not supported yet.</para></listitem>
|
|
<listitem><para>MVAs are not supported yet.</para></listitem>
|
|
<listitem><para>Disk chunks optimization routine is not implemented yet.</para></listitem>
|
|
<listitem><para>On initial index creation, attributes are reordered by type,
|
|
in the following order: uint, bigint, float, timestamp, string. So when
|
|
using INSERT without an explicit column names list, specify all uint
|
|
column values first, then bigint, etc.</para></listitem>
|
|
<listitem><para>Default conservative RAM chunk limit (<option>rt_mem_limit</option>)
|
|
of 32M can lead to poor performance on bigger indexes; you should raise it to
|
|
256..1024M if you're planning to index gigabytes.</para></listitem>
|
|
<listitem><para>High DELETE/REPLACE rate can lead to kill-list fragmentation
|
|
and impact searching performance.</para></listitem>
|
|
<listitem><para>No transaction size limits are currently imposed;
|
|
too many concurrent INSERT/REPLACE transactions might therefore
|
|
consume a lot of RAM.</para></listitem>
|
|
<listitem><para>In case of a damaged binlog, recovery will stop on the
|
|
first damaged transaction, even though it's technically possible
|
|
to keep looking further for subsequent undamaged transactions, and
|
|
recover those. This mid-file damage case (due to flaky HDD/CDD/tape?)
|
|
is supposed to be extremely rare, though.</para></listitem>
|
|
<listitem><para>Multiple INSERTs grouped in a single transaction perform
|
|
better than equivalent single-row transactions and are recommended for
|
|
batch loading of data.</para></listitem>
|
|
</itemizedlist>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="rt-internals"><title>RT index internals</title>
|
|
<para>
|
|
An RT index is internally chunked. It keeps a so-called RAM chunk
|
|
that stores all the most recent changes. RAM chunk memory usage
|
|
is rather strictly limited by the per-index
|
|
<link linkend="conf-rt-mem-limit">rt_mem_limit</link> directive.
|
|
Once RAM chunk grows over this limit, a new disk chunk is created
|
|
from its data, and RAM chunk is reset. Thus, while most changes
|
|
on the RT index will be performed in RAM only and complete instantly
|
|
(in milliseconds), those changes that overflow the RAM chunk will
|
|
stall for the duration of disk chunk creation (a few seconds).
|
|
</para>
|
|
<para>
|
|
Disk chunks are, in fact, just regular disk-based indexes.
|
|
But they're a part of an RT index and automatically managed by it,
|
|
so you need not configure nor manage them manually. Because a new
|
|
disk chunk is created every time the RAM chunk overflows the limit, and
|
|
because in-memory chunk format is close to on-disk format, the disk
|
|
chunks will be approximately <option>rt_mem_limit</option> bytes
|
|
in size each.
|
|
</para>
|
|
<para>
|
|
Generally, it is better to set a bigger limit, to minimize both
|
|
the frequency of flushes, and the index fragmentation (number of disk
|
|
chunks). For instance, on a dedicated search server that handles
|
|
a big RT index, it can be advised to set <option>rt_mem_limit</option>
|
|
to 1-2 GB. A global limit on all indexes is also planned, but not yet
|
|
implemented as of 1.10-beta.
|
|
</para>
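<para>
For instance, a configuration sketch for such a dedicated server
(the path is illustrative):
<programlisting>
index rt
{
    type         = rt
    path         = /usr/local/sphinx/data/rt
    rt_field     = title
    rt_field     = content
    rt_attr_uint = gid
    rt_mem_limit = 1024M    # bigger RAM chunk, fewer disk chunks
}
</programlisting>
</para>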
|
|
<para>
|
|
Disk chunk full-text index data can not be actually modified,
|
|
so the full-text field changes (ie. row deletions and updates)
|
|
suppress a previous row version from a disk chunk using a kill-list,
|
|
but do not actually physically purge the data. Therefore, on workloads
|
|
with a high full-text update ratio, the index might eventually get polluted
|
|
by these previous row versions, and searching performance would
|
|
degrade. Physical index purging that would improve the performance
|
|
is planned, but not yet implemented as of 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Data in RAM chunk gets saved to disk on clean daemon shutdown, and
|
|
then loaded back on startup. However, on daemon or server crash,
|
|
updates from RAM chunk might be lost. To prevent that, binary logging
|
|
of transactions can be used; see <xref linkend="rt-binlog"/> for details.
|
|
</para>
|
|
<para>
|
|
Full-text changes in RT index are transactional. They are stored
|
|
in a per-thread accumulator until COMMIT, then applied at once.
|
|
Bigger batches per single COMMIT should result in faster indexing.
|
|
</para>
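<para>
For example, a sketch of a batched commit over SphinxQL, using the
sample 'rt' index declared earlier; all the inserted rows become
searchable atomically on COMMIT:
<programlisting>
BEGIN;
INSERT INTO rt VALUES ( 5, 'fifth record', 'test five', 345 );
INSERT INTO rt VALUES ( 6, 'sixth record', 'test six', 456 );
COMMIT;
</programlisting>
</para>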
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="rt-binlog"><title>Binary logging</title>
|
|
<para>
|
|
Binary logs are essentially a recovery mechanism. With binary logs
|
|
enabled, <filename>searchd</filename> writes every given transaction
|
|
to the binlog file, and uses that for recovery after an unclean shutdown.
|
|
On clean shutdown, RAM chunks are saved to disk, and then all the binlog
|
|
files are unlinked.
|
|
</para>
|
|
<para>
|
|
During normal operation, a new binlog file will be opened every time
|
|
when <option>binlog_max_log_size</option> limit (which defaults to 128M)
|
|
is reached. Older, already closed binlog files are kept until all of the
|
|
transactions stored in them (from all indexes) are flushed as a disk chunk.
|
|
Setting the limit to 0 pretty much prevents binlog from being unlinked
|
|
at all while <filename>searchd</filename> is running; however, it will
|
|
still be unlinked on clean shutdown.
|
|
</para>
|
|
<para>
|
|
There are 3 different binlog flushing strategies, controlled by
|
|
<link linkend="conf-binlog-flush">binlog_flush</link> directive
|
|
which takes the values of 0, 1, or 2. 0 means to flush the log
|
|
to OS and sync it to disk every second; 1 means flush and sync
|
|
every transaction; and 2 (the default mode) means flush every
|
|
transaction but sync every second. Sync is relatively slow because
|
|
it has to perform physical disk writes, so mode 1 is the safest
|
|
(every committed transaction is guaranteed to be written on disk)
|
|
but the slowest. Flushing the log to OS prevents data loss on
|
|
<filename>searchd</filename> crashes but not system crashes.
|
|
Mode 2 is the default.
|
|
</para>
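<para>
For reference, a sketch of the relevant <filename>searchd</filename>
configuration section (the path is illustrative):
<programlisting>
searchd
{
    # ... other searchd settings ...
    binlog_path         = /var/data/binlog  # binlog files location
    binlog_max_log_size = 256M              # reopen a new binlog file at 256M
    binlog_flush        = 2                 # flush per transaction, sync per second
}
</programlisting>
</para>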
|
|
<para>
|
|
On recovery after an unclean shutdown, binlogs are replayed
|
|
and all logged transactions since the last good on-disk state
|
|
are restored. Transactions are checksummed so in case of binlog
|
|
file corruption garbage data will <b>not</b> be replayed; such
|
|
a broken transaction will be detected and, currently, will stop
|
|
replay. Transactions also start with a magic marker and are timestamped,
|
|
so in case of binlog damage in the middle of the file, it's technically
|
|
possible to skip broken transactions and keep replaying from the next
|
|
good one, and/or it's possible to replay transactions until a given
|
|
timestamp (point-in-time recovery), but none of that is implemented yet
|
|
as of 1.10-beta.
|
|
</para>
|
|
<para>
|
|
One unwanted side effect of binlogs is that actively updating
|
|
a small RT index that fully fits into a RAM chunk part will lead
|
|
to an ever-growing binlog that can never be unlinked until clean
|
|
shutdown. Binlogs are essentially append-only deltas against
|
|
the last known good saved state on disk, and unless RAM chunk
|
|
gets saved, they can not be unlinked. An ever-growing binlog
|
|
is not very good for disk use and crash recovery time. Starting
|
|
with 2.0.1-beta you can configure <filename>searchd</filename>
|
|
to perform a periodic RAM chunk flush to fix that problem
|
|
using a <link linkend="conf-rt-flush-period">rt_flush_period</link>
|
|
directive. With periodic flushes enabled, <filename>searchd</filename>
|
|
will keep a separate thread, checking whether RT indexes' RAM
|
|
chunks need to be written back to disk. Once that happens,
|
|
the respective binlogs can be (and are) safely unlinked.
|
|
</para>
|
|
<para>
|
|
Note that <code>rt_flush_period</code> only controls the
|
|
frequency at which the <emphasis>checks</emphasis> happen.
|
|
There are no <emphasis>guarantees</emphasis> that the
|
|
particular RAM chunk will get saved. For instance, it does
|
|
not make sense to regularly re-save a huge RAM chunk that
|
|
only gets a few rows worth of updates. The search daemon
|
|
determines whether to actually perform the flush based on a few
|
|
heuristics.
|
|
</para>
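<para>
For example, to have the daemon check RAM chunks roughly every half
an hour (a sketch; the optimal value is workload-dependent):
<programlisting>
searchd
{
    # ... other searchd settings ...
    rt_flush_period = 1800  # seconds between flush checks
}
</programlisting>
</para>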
|
|
</sect1>
|
|
|
|
|
|
</chapter>
|
|
<chapter id="searching"><title>Searching</title>
|
|
|
|
|
|
<!-- TODO
|
|
<sect1 id="searching-overview"><title>Overview</title>
|
|
</sect1>
|
|
-->
|
|
|
|
|
|
<sect1 id="matching-modes"><title>Matching modes</title>
|
|
<para>
|
|
So-called matching modes are a legacy feature that used to provide
|
|
(very) limited query syntax and ranking support. Currently, they are
|
|
deprecated in favor of <link linkend="extended-syntax">full-text query
|
|
language</link> and so-called <link linkend="weighting">rankers</link>.
|
|
Starting with version 0.9.9-release, it is thus strongly recommended
|
|
to use SPH_MATCH_EXTENDED and proper query syntax rather than any other
|
|
legacy mode. All those other modes are actually internally converted
|
|
to extended syntax anyway. SphinxAPI still defaults to SPH_MATCH_ALL
|
|
but that is for compatibility reasons only.
|
|
</para>
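<para>
With SphinxAPI, switching to the extended mode is a single call.
A PHP sketch, assuming <code>$client</code> is a configured
SphinxClient instance:
<programlisting>
$client->SetMatchMode ( SPH_MATCH_EXTENDED );
</programlisting>
</para>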
|
|
<para>
|
|
There are the following matching modes available:
|
|
<itemizedlist>
|
|
<listitem><para>SPH_MATCH_ALL, matches all query words (default mode);</para></listitem>
|
|
<listitem><para>SPH_MATCH_ANY, matches any of the query words;</para></listitem>
|
|
<listitem><para>SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;</para></listitem>
|
|
<listitem><para>SPH_MATCH_BOOLEAN, matches query as a boolean expression (see <xref linkend="boolean-syntax"/>);</para></listitem>
|
|
<listitem><para>SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language
|
|
(see <xref linkend="extended-syntax"/>);</para></listitem>
|
|
<listitem><para>SPH_MATCH_EXTENDED2, an alias for SPH_MATCH_EXTENDED;</para></listitem>
|
|
<listitem><para>SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below.
|
|
NB: any query terms will be ignored; filters, filter ranges, and grouping
|
|
will still be applied, but no text matching will be performed.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
SPH_MATCH_EXTENDED2 was used during 0.9.8 and 0.9.9 development cycle,
|
|
when the internal matching engine was being rewritten (for the sake of
|
|
additional functionality and better performance). By 0.9.9-release,
|
|
the older version was removed, and SPH_MATCH_EXTENDED and SPH_MATCH_EXTENDED2
|
|
are now just aliases.
|
|
</para>
|
|
<para>
|
|
The SPH_MATCH_FULLSCAN mode will be automatically activated in place of the specified matching mode when the following conditions are met:
|
|
<orderedlist>
|
|
<listitem><para>The query string is empty (ie. its length is zero).</para></listitem>
|
|
<listitem><para><link linkend="conf-docinfo">docinfo</link> storage is set to <code>extern</code>.</para></listitem>
|
|
</orderedlist>
|
|
In full scan mode, all the indexed documents will be considered matching.
|
|
Such queries will still apply filters, sorting, and group by, but will not perform any full-text searching.
|
|
This can be useful to unify full-text and non-full-text searching code, or to offload SQL server
|
|
(there are cases when Sphinx scans will perform better than analogous MySQL queries).
|
|
An example of using the full scan mode might be to find posts in a forum.
|
|
By selecting the forum's user ID via <code>SetFilter()</code> but not actually providing any search text,
|
|
Sphinx will match every document (i.e. every post) where <code>SetFilter()</code> would match -
|
|
in this case providing every post from that user. By default this will be ordered by relevancy,
|
|
followed by Sphinx document ID in ascending order (earliest first).
|
|
</para>
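<para>
Here is a minimal PHP sketch of that forum example; the index name
and the 'user_id' attribute are illustrative:
<programlisting>
$cl = new SphinxClient ();
$cl->SetServer ( "localhost", 9312 );
$cl->SetFilter ( "user_id", array ( 123 ) );
$res = $cl->Query ( "", "forum_posts" ); // empty query activates full scan
</programlisting>
</para>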
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="boolean-syntax"><title>Boolean query syntax</title>
|
|
<para>
|
|
Boolean queries allow the following special operators to be used:
|
|
<itemizedlist>
|
|
<listitem><para>explicit operator AND: <programlisting>hello & world</programlisting></para></listitem>
|
|
<listitem><para>operator OR: <programlisting>hello | world</programlisting></para></listitem>
|
|
<listitem><para>operator NOT:
|
|
<programlisting>
|
|
hello -world
|
|
hello !world
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>grouping: <programlisting>( hello world )</programlisting></para></listitem>
|
|
</itemizedlist>
|
|
Here's an example query which uses all these operators:
|
|
<example id="ex-boolean-query"><title>Boolean query example</title>
|
|
<programlisting>
|
|
( cat -dog ) | ( cat -mouse)
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
<para>
|
|
There always is implicit AND operator, so "hello world" query actually
|
|
means "hello & world".
|
|
</para>
|
|
<para>
|
|
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
|
|
means "looking for ( cat | dog | mouse )" and <emphasis>not</emphasis>
|
|
"(looking for cat) | dog | mouse".
|
|
</para>
|
|
<para>
|
|
Queries like "-dog", which implicitly include all documents from the
|
|
collection, can not be evaluated. This is both for technical and performance
|
|
reasons. Technically, Sphinx does not always keep a list of all IDs.
|
|
Performance-wise, when the collection is huge (ie. 10-100M documents),
|
|
evaluating such queries could take very long.
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="extended-syntax"><title>Extended query syntax</title>
|
|
<para>
|
|
The following special operators and modifiers can be used when using the extended matching mode:
|
|
<itemizedlist>
|
|
<listitem><para>operator OR: <programlisting>hello | world</programlisting></para></listitem>
|
|
<listitem><para>operator NOT:
|
|
<programlisting>
|
|
hello -world
|
|
hello !world
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>field search operator: <programlisting>@title hello @body world</programlisting></para></listitem>
|
|
<listitem><para>field position limit modifier (introduced in version 0.9.9-rc1): <programlisting>@body[50] hello</programlisting></para></listitem>
|
|
<listitem><para>multiple-field search operator: <programlisting>@(title,body) hello world</programlisting></para></listitem>
|
|
<listitem><para>all-field search operator: <programlisting>@* hello</programlisting></para></listitem>
|
|
<listitem><para>phrase search operator: <programlisting>"hello world"</programlisting></para></listitem>
|
|
<listitem><para>proximity search operator: <programlisting>"hello world"~10</programlisting></para></listitem>
|
|
<listitem><para>quorum matching operator: <programlisting>"the world is a wonderful place"/3</programlisting></para></listitem>
|
|
<listitem><para>strict order operator (aka operator "before"): <programlisting>aaa << bbb << ccc</programlisting></para></listitem>
|
|
<listitem><para>exact form modifier (introduced in version 0.9.9-rc1): <programlisting>raining =cats and =dogs</programlisting></para></listitem>
|
|
<listitem><para>field-start and field-end modifier (introduced in version 0.9.9-rc2): <programlisting>^hello world$</programlisting></para></listitem>
|
|
<listitem><para>NEAR, generalized proximity operator (introduced in version 2.0.1-beta): <programlisting>hello NEAR/3 world NEAR/4 "my test"</programlisting></para></listitem>
|
|
<listitem><para>SENTENCE operator (introduced in version 2.0.1-beta): <programlisting>all SENTENCE words SENTENCE "in one sentence"</programlisting></para></listitem>
|
|
<listitem><para>PARAGRAPH operator (introduced in version 2.0.1-beta): <programlisting>"Bill Gates" PARAGRAPH "Steve Jobs"</programlisting></para></listitem>
|
|
<listitem><para>zone limit operator: <programlisting>ZONE:(h3,h4) only in these titles</programlisting></para></listitem>
|
|
</itemizedlist>
|
|
|
|
Here's an example query that uses some of these operators:
|
|
<example id="ex-extended-query"><title>Extended matching mode: query example</title>
|
|
<programlisting>
|
|
"hello world" @title "example program"~5 @body python -(php|perl) @* code
|
|
</programlisting>
|
|
</example>
|
|
The full meaning of this search is:
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Find the words 'hello' and 'world' adjacently in any field in a document;</para></listitem>
|
|
<listitem><para>Additionally, the same document must also contain the words 'example' and 'program'
|
|
in the title field, with up to 5 words between the words in question;
|
|
(E.g. "example PHP program" would be matched however "example script to introduce outside data
|
|
into the correct context for your program" would not because two terms have 10 or more words between them)</para></listitem>
|
|
<listitem><para>Additionally, the same document must contain the word 'python' in the body field, but not contain either 'php' or 'perl';</para></listitem>
|
|
<listitem><para>Additionally, the same document must contain the word 'code' in any field.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
There always is implicit AND operator, so "hello world" means that
|
|
both "hello" and "world" must be present in matching document.
|
|
</para>
|
|
<para>
|
|
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
|
|
means "looking for ( cat | dog | mouse )" and <emphasis>not</emphasis>
|
|
"(looking for cat) | dog | mouse".
|
|
</para>
|
|
<para>
|
|
Field limit operator limits subsequent searching to a given field.
|
|
Normally, query will fail with an error message if given field name does not exist
|
|
in the searched index. However, that can be suppressed by specifying "@@relaxed"
|
|
option at the very beginning of the query:
|
|
<programlisting>
|
|
@@relaxed @nosuchfield my query
|
|
</programlisting>
|
|
This can be helpful when searching through heterogeneous indexes with
|
|
different schemas.
|
|
</para>
|
|
<para>
|
|
Field position limit, introduced in version 0.9.9-rc1, additionally restricts the searching
|
|
to the first N positions within the given field (or fields). For example, "@body[50] hello" will
|
|
<b>not</b> match the documents where the keyword 'hello' occurs at position 51 and below
|
|
in the body.
|
|
</para>
|
|
<para>
|
|
Proximity distance is specified in words, adjusted for word count, and
|
|
applies to all words within quotes. For instance, "cat dog mouse"~5 query
|
|
means that there must be a span of less than 8 words which contains all 3 words,
|
|
ie. "CAT aaa bbb ccc DOG eee fff MOUSE" document will <emphasis>not</emphasis>
|
|
match this query, because this span is exactly 8 words long.
|
|
</para>
|
|
<para>
|
|
Quorum matching operator introduces a kind of fuzzy matching.
|
|
It will only match those documents that pass a given threshold of given words.
|
|
The example above ("the world is a wonderful place"/3) will match all documents
|
|
that have at least 3 of the 6 specified words.
|
|
</para>
|
|
<para>
|
|
Strict order operator (aka operator "before"), introduced in version 0.9.9-rc2,
|
|
will match the document only if its argument keywords occur in the document
|
|
exactly in the query order. For instance, "black << cat" query (without
|
|
quotes) will match the document "black and white cat" but <emphasis>not</emphasis>
|
|
the "that cat was black" document. Order operator has the lowest priority.
|
|
It can be applied both to just keywords and more complex expressions,
|
|
ie. this is a valid query:
|
|
<programlisting>
|
|
(bag of words) << "exact phrase" << red|green|blue
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
Exact form keyword modifier, introduced in version 0.9.9-rc1, will match the document only if the keyword occurred
|
|
in exactly the specified form. The default behaviour is to match the document
|
|
if the stemmed keyword matches. For instance, "runs" query will match both
|
|
the document that contains "runs" <emphasis>and</emphasis> the document that
|
|
contains "running", because both forms stem to just "run" - while "=runs"
|
|
query will only match the first document. Exact form operator requires
|
|
<link linkend="conf-index-exact-words">index_exact_words</link> option to be enabled.
|
|
This is a modifier that affects the keyword and thus can be used within
|
|
operators such as phrase, proximity, and quorum operators.
|
|
</para>
|
|
<para>
|
|
Field-start and field-end keyword modifiers, introduced in version 0.9.9-rc2,
|
|
will make the keyword match only if it occurred at the very start or the very end
|
|
of a fulltext field, respectively. For instance, the query "^hello world$"
|
|
(with quotes and thus combining phrase operator and start/end modifiers)
|
|
will only match documents that contain at least one field that has exactly
|
|
these two keywords.
|
|
</para>
|
|
<para>
|
|
Starting with 0.9.9-rc1, arbitrarily nested brackets and negations are allowed.
|
|
However, the query must be possible to compute without involving an implicit
|
|
list of all documents:
|
|
<programlisting>
|
|
// correct query
|
|
aaa -(bbb -(ccc ddd))
|
|
|
|
// queries that are non-computable
|
|
-aaa
|
|
aaa | -bbb
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
<b>NEAR operator</b>, added in 2.0.1-beta, is a generalized version
|
|
of a proximity operator. The syntax is <code>NEAR/N</code>, it is
|
|
case-sensitive, and no spaces are allowed between the NEAR keyword,
|
|
the slash sign, and the distance value.
|
|
</para>
|
|
<para>
|
|
The original proximity operator only worked on sets of keywords.
|
|
NEAR is more generic and can accept arbitrary subexpressions as
|
|
its two arguments, matching the document when both subexpressions
|
|
are found within N words of each other, no matter in which order.
|
|
NEAR is left associative and has the same (lowest) precedence
|
|
as BEFORE.
|
|
</para>
|
|
<para>
|
|
You should also note how a <code>(one NEAR/7 two NEAR/7 three)</code>
|
|
query using NEAR is not really equivalent to a
|
|
<code>("one two three"~7)</code> one using keyword proximity operator.
|
|
The difference here is that the proximity operator allows for up to
|
|
6 non-matching words between all the 3 matching words, but the version
|
|
with NEAR is less restrictive: it would allow for up to 6 words between
|
|
'one' and 'two' and then for up to 6 more between that two-word
|
|
matching and a 'three' keyword.
|
|
</para>
|
|
<para>
|
|
<b>SENTENCE and PARAGRAPH operators</b>, added in 2.0.1-beta,
|
|
match the document when both of their arguments are within the same
|
|
sentence or the same paragraph of text, respectively. The arguments
|
|
can be either keywords, or phrases, or the instances of the same
|
|
operator. Here are a few examples:
|
|
<programlisting>
|
|
one SENTENCE two
|
|
one SENTENCE "two three"
|
|
one SENTENCE "two three" SENTENCE four
|
|
</programlisting>
|
|
The order of the arguments within the sentence or paragraph
|
|
does not matter. These operators only work on indexes built
|
|
with <link linkend="conf-index-sp">index_sp</link> (sentence
|
|
and paragraph indexing feature) enabled, and revert to a mere
|
|
AND otherwise. Refer to the <code>index_sp</code> directive
|
|
documentation for the notes on what's considered a sentence
|
|
and a paragraph.
|
|
</para>
|
|
<para>
|
|
<b>ZONE limit operator</b>, added in 2.0.1-beta, is quite similar
|
|
to field limit operator, but restricts matching to a given in-field
|
|
zone or a list of zones. Note that the subsequent subexpressions
|
|
are <emphasis>not</emphasis> required to match in a single contiguous
|
|
span of a given zone, and may match in multiple spans.
|
|
For instance, <code>(ZONE:th hello world)</code> query
|
|
<emphasis>will</emphasis> match this example document:
|
|
<programlisting>
|
|
<th>Table 1. Local awareness of Hello Kitty brand.</th>
|
|
.. some table data goes here ..
|
|
<th>Table 2. World-wide brand awareness.</th>
|
|
</programlisting>
|
|
ZONE operator affects the query until the next
|
|
field or ZONE limit operator, or the closing parenthesis.
|
|
It only works on the indexes built with zones support
|
|
(see <xref linkend="conf-index-zones"/>) and will be ignored
|
|
otherwise.
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="weighting"><title>Search results ranking</title>
|
|
<bridgehead>Ranking overview</bridgehead>
|
|
<para>
|
|
Ranking (aka weighting) of the search results can be defined
|
|
as a process of computing a so-called relevance (aka weight)
|
|
for every given matched document with regards to a given query
|
|
that matched it. So relevance is in the end just a number attached
|
|
to every document that estimates how relevant the document is to
|
|
the query. Search results can then be sorted based on this number
|
|
and/or some additional parameters, so that the most sought after
|
|
results would come up higher on the results page.
|
|
</para>
|
|
<para>
|
|
There is no single standard one-size-fits-all way to rank
|
|
any document in any scenario. Moreover, there can not ever be
|
|
such a way, because relevance is <emphasis>subjective</emphasis>.
|
|
As in, what seems relevant to you might not seem relevant to me.
|
|
Hence, in the general case it's not just hard to compute, it's
|
|
theoretically impossible.
|
|
</para>
|
|
<para>
|
|
So ranking in Sphinx is configurable. It has a notion of
|
|
a so-called <b>ranker</b>. A ranker can formally be defined
|
|
as a function that takes document and query as its input and
|
|
produces a relevance value as output. In layman's terms,
|
|
a ranker controls exactly how (using which specific algorithm)
|
|
Sphinx will assign weights to the document.
|
|
</para>
|
|
<para>
|
|
Previously, this ranking function was rigidly bound to the matching mode.
|
|
So in the legacy matching modes (that is, SPH_MATCH_ALL, SPH_MATCH_ANY,
|
|
SPH_MATCH_PHRASE, and SPH_MATCH_BOOLEAN) you can not choose the ranker.
|
|
You can only do that in the SPH_MATCH_EXTENDED mode. (Which is the only
|
|
mode in SphinxQL and the suggested mode in SphinxAPI anyway.) To choose
|
|
a non-default ranker you can either use
|
|
<link linkend="api-func-setrankingmode">SetRankingMode()</link>
|
|
with SphinxAPI, or <link linkend="sphinxql-select">OPTION ranker</link>
|
|
clause in <code>SELECT</code> statement when using SphinxQL.
|
|
</para>
|
|
<para>
|
|
As a sidenote, legacy matching modes are internally implemented via
|
|
the unified syntax anyway. When you use one of those modes, Sphinx just
|
|
internally adjusts the query and sets the associated ranker, then
|
|
executes the query using the very same unified code path.
|
|
</para>
|
|
<bridgehead>Available rankers</bridgehead>
|
|
<para>
|
|
Sphinx ships with a number of built-in rankers suited for different
|
|
purposes. A number of them use two factors, phrase proximity (aka LCS)
|
|
and BM25. Phrase proximity works on the keyword positions, while BM25
|
|
works on the keyword frequencies. Basically, the better the degree of
|
|
the phrase match between the document body and the query, the higher
|
|
is the phrase proximity (it maxes out when the document contains
|
|
the entire query as a verbatim quote). And BM25 is higher when
|
|
the document contains rarer words. We'll save the detailed
|
|
discussion for later.
|
|
</para>
|
|
<para>
|
|
Currently implemented rankers are:
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
SPH_RANK_PROXIMITY_BM25, the default ranking mode that uses and combines
|
|
both phrase proximity and BM25 ranking.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_BM25, statistical ranking mode which uses BM25 ranking only (similar to
|
|
most other full-text engines). This mode is faster but may result in worse quality
|
|
on queries which contain more than 1 keyword.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_NONE, no ranking mode. This mode is obviously the fastest.
|
|
A weight of 1 is assigned to all matches. This is sometimes called boolean
|
|
searching that just matches the documents but does not rank them.
|
|
</para></listitem>
|
|
<listitem><para>SPH_RANK_WORDCOUNT, ranking by the keyword occurrences count.
|
|
This ranker computes the per-field keyword occurrence counts, then multiplies
|
|
them by field weights, and sums the resulting values.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_PROXIMITY, added in version 0.9.9-rc1, returns raw phrase proximity
|
|
value as a result. This mode is internally used to emulate SPH_MATCH_ALL queries.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_MATCHANY, added in version 0.9.9-rc1, returns rank as it was computed
|
|
in SPH_MATCH_ANY mode earlier, and is internally used to emulate SPH_MATCH_ANY queries.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_FIELDMASK, added in version 0.9.9-rc2, returns a 32-bit mask with
|
|
N-th bit corresponding to N-th fulltext field, numbering from 0. The bit will
|
|
only be set when the respective field has any keyword occurrences satisfying
|
|
the query.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_SPH04, added in version 1.10-beta, is generally based on the default
|
|
SPH_RANK_PROXIMITY_BM25 ranker, but additionally boosts the matches when
|
|
they occur in the very beginning or the very end of a text field. Thus,
|
|
if a field equals the exact query, SPH04 should rank it higher than a field
|
|
that contains the exact query but is not equal to it. (For instance, when
|
|
the query is "Hyde Park", a document entitled "Hyde Park" should be ranked
|
|
higher than one entitled "Hyde Park, London" or "The Hyde Park Cafe".)
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_EXPR, added in version 2.0.2-beta, lets you specify the ranking
|
|
formula at run time. It exposes a number of internal text factors and lets
|
|
you define how the final weight should be computed from those factors.
|
|
You can find more details about its syntax and a reference of the available
|
|
factors in a subsection below.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
You should specify the <code>SPH_RANK_</code> prefix and use capital letters only
|
|
when using the <link linkend="api-func-setrankingmode">SetRankingMode()</link>
|
|
call from the SphinxAPI. The API ports expose these as global constants.
|
|
Using SphinxQL syntax, the prefix should be omitted and the ranker name
|
|
is case insensitive. Example:
|
|
<programlisting>
|
|
// SphinxAPI
|
|
$client->SetRankingMode ( SPH_RANK_SPH04 );
|
|
|
|
// SphinxQL
|
|
mysql_query ( "SELECT ... OPTION ranker=sph04" );
|
|
</programlisting>
|
|
</para>
|
|
<bridgehead>Legacy matching modes rankers</bridgehead>
|
|
<para>
|
|
Legacy matching modes automatically select a ranker as follows:
|
|
<itemizedlist>
|
|
<listitem><para>SPH_MATCH_ALL uses SPH_RANK_PROXIMITY ranker;</para></listitem>
|
|
<listitem><para>SPH_MATCH_ANY uses SPH_RANK_MATCHANY ranker;</para></listitem>
|
|
<listitem><para>SPH_MATCH_PHRASE uses SPH_RANK_PROXIMITY ranker;</para></listitem>
|
|
<listitem><para>SPH_MATCH_BOOLEAN uses SPH_RANK_NONE ranker.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<bridgehead>Expression based ranker (SPH_RANK_EXPR)</bridgehead>
|
|
<para>
|
|
Expression ranker, added in version 2.0.2-beta, lets you change the ranking
|
|
formula on the fly, on a per-query basis. For a quick kickoff, this is how you
|
|
emulate PROXIMITY_BM25 ranker using the expression based one:
|
|
<programlisting>
|
|
SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
|
|
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')
|
|
</programlisting>
|
|
The output of this query will not change if you omit the <code>OPTION</code>
|
|
clause, because the default ranker (PROXIMITY_BM25) behaves exactly like
|
|
specified in the ranker formula above. But the expression ranker is somewhat
|
|
more flexible than just that and provides access to many more factors.
|
|
</para>
|
|
<para>
|
|
The ranking formula is an arbitrary arithmetic expression that can use
|
|
constants, document attributes, built-in functions and operators (described
|
|
in <xref linkend="expressions"/>), and also a few ranking-specific things
|
|
that are only accessible in a ranking formula. Namely, those are field
|
|
aggregation functions, field-level, and document-level ranking factors.
|
|
</para>
|
|
<para>
|
|
A <b>document-level factor</b> is a numeric value computed by the ranking
|
|
engine for every matched document with regards to the current query.
|
|
(So it differs from a plain document attribute in that the attribute
|
|
does not depend on the full-text query, while factors might.) Those
|
|
factors can be used anywhere in the ranking expression.
|
|
Currently implemented document-level factors are:
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
<code>bm25</code> (integer), a document-level BM25 estimate (computed without
|
|
keyword occurrence filtering).
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>max_lcs</code> (integer), a query-level maximum possible value that
|
|
the sum(lcs*user_weight) expression can ever take. This can be
|
|
useful for weight boost scaling. For instance, MATCHANY ranker
|
|
formula uses this to guarantee that a full phrase match in any
|
|
field ranks higher than any combination of partial matches
|
|
in all fields.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>field_mask</code> (integer), a document-level 32-bit mask of matched
|
|
fields.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>query_word_count</code> (integer), the number of unique keywords
|
|
in a query, adjusted for a number of excluded keywords. For instance,
|
|
both <code>(one one one one)</code> and <code>(one !two)</code> queries
|
|
should assign a value of 1 to this factor, because there is just one unique
|
|
non-excluded keyword.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>doc_word_count</code> (integer), the number of unique keywords
|
|
matched in the entire document.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
A <b>field-level factor</b> is a numeric value computed by the ranking
|
|
engine for every matched in-document text field with regards to the
|
|
current query. As more than one field can be matched by a query,
|
|
but the final weight needs to be a single integer value, these
|
|
values need to be folded into a single one. To achieve that,
|
|
field-level factors can only be used within a field aggregation
|
|
function; they can <b>not</b> be used anywhere else in the expression.
|
|
For example, you can not use <code>(lcs+bm25)</code> as your
|
|
ranking expression, as <code>lcs</code> takes multiple values (one
|
|
in every matched field). You should use <code>(sum(lcs)+bm25)</code>
|
|
instead, that expression sums <code>lcs</code> over all matching fields,
|
|
and then adds <code>bm25</code> to that per-field sum.
|
|
Currently implemented field-level factors are:
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
<code>lcs</code> (integer), the length of a maximum verbatim match between
|
|
the document and the query, counted in words. LCS stands for Longest Common
|
|
Subsequence (or Subset). Takes a minimum value of 1 when only stray keywords
|
|
were matched in a field, and a maximum value of query keywords count
|
|
when the entire query was matched in a field verbatim (in the exact
|
|
query keywords order). For example, if the query is 'hello world'
|
|
and the field contains these two words quoted from the query (that is,
|
|
adjacent to each other, and exactly in the query order), <code>lcs</code>
|
|
will be 2. For example, if the query is 'hello world program' and
|
|
the field contains 'hello world', <code>lcs</code> will be 2.
|
|
Note that any subset of the query keyword works, not just a subset
|
|
of adjacent keywords. For example, if the query is 'hello world program'
|
|
and the field contains 'hello (test program)', <code>lcs</code> will be 2
|
|
just as well, because both 'hello' and 'program' matched in the same
|
|
respective positions as they were in the query. Finally, if the query
|
|
is 'hello world program' and the field contains 'hello world program',
|
|
<code>lcs</code> will be 3. (Hopefully that is unsurprising at this point.)
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>user_weight</code> (integer), the user specified per-field weight
|
|
(refer to <link linkend="api-func-setfieldweights">SetFieldWeights()</link>
|
|
in SphinxAPI and <link linkend="sphinxql-select">OPTION field_weights</link>
|
|
in SphinxQL respectively). The weights default to 1 if not specified
|
|
explicitly.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>hit_count</code> (integer), the number of keyword occurrences
|
|
that matched in the field. Note that a single keyword may occur multiple
|
|
times. For example, if 'hello' occurs 3 times in a field and 'world'
|
|
occurs 5 times, <code>hit_count</code> will be 8.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>word_count</code> (integer), the number of unique keywords matched
|
|
in the field. For example, if 'hello' and 'world' occur anywhere in a field,
|
|
<code>word_count</code> will be 2, regardless of how many times both
|
|
keywords occur.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>tf_idf</code> (float), the sum of TF*IDF over all the keywords matched in the
|
|
field. IDF is the Inverse Document Frequency, a floating point value
|
|
between 0 and 1 that describes how frequent the keyword is (basically,
|
|
0 for a keyword that occurs in every document indexed, and 1 for a unique
|
|
keyword that occurs in just a single document). TF is the Term Frequency,
|
|
the number of matched keyword occurrences in the field. As a side note,
|
|
<code>tf_idf</code> is actually computed by summing IDF over all matched
|
|
occurences. That's by construction equivalent to summing TF*IDF over
|
|
all matched keywords.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>min_hit_pos</code> (integer), the position of the first matched keyword occurrence,
|
|
counted in words. Indexing begins from position 1.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>min_best_span_pos</code> (integer), the position of the first maximum LCS
|
|
occurrences span. For example, assume that our query was 'hello world
|
|
program' and 'hello world' subphrase was matched twice in the field,
|
|
in positions 13 and 21. Assume that 'hello' and 'world' additionally
|
|
occurred elsewhere in the field, but never next to each other and thus
|
|
never as a subphrase match. In that case, <code>min_best_span_pos</code>
|
|
will be 13. Note how for the single keyword queries
|
|
<code>min_best_span_pos</code> will always equal <code>min_hit_pos</code>.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
<code>exact_hit</code> (boolean), whether a query was an exact match
|
|
of the entire current field. Used in the SPH04 ranker.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
A <b>field aggregation function</b> is a single argument function
|
|
that takes an expression with field-level factors, iterates it over
|
|
all the matched fields, and computes the final results.
|
|
Currently implemented field aggregation functions are:
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
<code>sum</code>, sums the argument expression over all matched
|
|
fields. For instance, <code>sum(1)</code> should return the number
|
|
of matched fields.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
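<para>
For instance, here is a sketch that starts from the default formula
and additionally boosts exact field matches (the specific coefficient
is arbitrary and would need tuning):
<programlisting>
SELECT id, WEIGHT() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum((lcs+2*exact_hit)*user_weight)*1000+bm25')
</programlisting>
</para>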
|
|
<bridgehead>Expressions for the built-in rankers</bridgehead>
|
|
<para>
|
|
Most of the other rankers can actually be emulated with the expression
|
|
based ranker. You just need to pass a proper expression. Such emulation is,
|
|
of course, going to be slower than using the built-in, compiled ranker but
|
|
still might be of interest if you want to fine-tune your ranking formula
|
|
starting with one of the existing ones. Also, the formulas define the
|
|
nitty gritty ranker details in a nicely readable fashion.
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
SPH_RANK_PROXIMITY_BM25 = sum(lcs*user_weight)*1000+bm25
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_BM25 = bm25
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_NONE = 1
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_WORDCOUNT = sum(hit_count*user_weight)
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_PROXIMITY = sum(lcs*user_weight)
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_MATCHANY = sum((word_count+(lcs-1)*max_lcs)*user_weight)
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_FIELDMASK = field_mask
|
|
</para></listitem>
|
|
<listitem><para>
|
|
SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="expressions">
|
|
<title>Expressions, functions, and operators</title>
|
|
<para>
|
|
Sphinx lets you use arbitrary arithmetic expressions both via SphinxQL
|
|
and SphinxAPI, involving attribute values, internal attributes (document ID
|
|
and relevance weight), arithmetic operations, a number of built-in functions,
|
|
and user-defined functions.
|
|
This section documents the supported operators and functions.
|
|
Here's the complete reference list for quick access.
|
|
<itemizedlist>
|
|
<listitem><para><link linkend="expr-ari-ops">Arithmetic operators: +, -, *, /, %, DIV, MOD</link></para></listitem>
|
|
<listitem><para><link linkend="expr-comp-ops">Comparison operators: <, > <=, >=, =, <></link></para></listitem>
|
|
<listitem><para><link linkend="expr-bool-ops">Boolean operators: AND, OR, NOT</link></para></listitem>
|
|
<listitem><para><link linkend="expr-bitwise-ops">Bitwise operators: &, |</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-abs">ABS()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-bigint">BIGINT()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-ceil">CEIL()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-cos">COS()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-crc32">CRC32()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-day">DAY()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-exp">EXP()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-floor">FLOOR()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-geodist">GEODIST()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-idiv">IDIV()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-if">IF()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-in">IN()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-interval">INTERVAL()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-ln">LN()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-log10">LOG10()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-log2">LOG2()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-max">MAX()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-min">MIN()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-month">MONTH()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-now">NOW()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-pow">POW()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-sin">SIN()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-sint">SINT()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-sqrt">SQRT()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-year">YEAR()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-yearmonth">YEARMONTH()</link></para></listitem>
|
|
<listitem><para><link linkend="expr-func-yearmonthday">YEARMONTHDAY()</link></para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
|
|
<sect2 id="operators">
|
|
<title>Operators</title>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term id="expr-ari-ops">Arithmetic operators: +, -, *, /, %, DIV, MOD</term>
|
|
<listitem><para>
|
|
The standard arithmetic operators. Arithmetic calculations involving those
|
|
can be performed in three different modes: (a) using single-precision,
|
|
32-bit IEEE 754 floating point values (the default), (b) using signed 32-bit integers,
|
|
(c) using 64-bit signed integers. The expression parser will automatically switch
|
|
to integer mode if there are no operations that result in a floating point value.
|
|
Otherwise, it will use the default floating point mode. For instance, <code>a+b</code>
|
|
will be computed using 32-bit integers if both arguments are 32-bit integers;
|
|
or using 64-bit integers if both arguments are integers but one of them is
|
|
64-bit; or in floats otherwise. However, <code>a/b</code> or <code>sqrt(a)</code>
|
|
will always be computed in floats, because these operations return a result
|
|
of non-integer type. To avoid that, you can use either the <code>IDIV(a,b)</code>
|
|
or the <code>a DIV b</code> form. Also, <code>a*b</code>
|
|
will not be automatically promoted to 64-bit when the arguments are 32-bit.
|
|
To enforce 64-bit results, you can use BIGINT(). (But note that if there are
|
|
non-integer operations, BIGINT() will simply be ignored.)
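For instance, assuming that 'a' and 'b' are 32-bit integer attributes
(a sketch):
<programlisting>
SELECT id, a*b AS prod32, BIGINT(a)*b AS prod64, IDIV(a,b) AS quot FROM myindex
</programlisting>
Here, prod32 is computed in 32-bit integers and may overflow, prod64
is forced to 64-bit, and quot is an integer division instead of
a floating point one.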
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-comp-ops">Comparison operators: <, > <=, >=, =, <></term>
|
|
<listitem><para>
|
|
Comparison operators (eg. = or <=) return 1.0 when the condition is true and 0.0 otherwise.
|
|
For instance, <code>(a=b)+3</code> will evaluate to 4 when attribute 'a' is equal to attribute 'b', and to 3 when 'a' is not.
|
|
Unlike MySQL, the equality comparisons (ie. = and <> operators) introduce a small equality threshold (1e-6 by default).
|
|
If the difference between compared values is within the threshold, they will be considered equal.
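For instance, assuming a float attribute 'price' (a sketch):
<programlisting>
SELECT id, (price=10.0000001) AS is_ten FROM myindex
</programlisting>
is_ten will evaluate to 1 for documents where 'price' is 10.0, because
the difference falls within the default 1e-6 threshold.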
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-bool-ops">Boolean operators: AND, OR, NOT</term>
|
|
<listitem><para>
|
|
Boolean operators (AND, OR, NOT) were introduced in 0.9.9-rc2 and behave as usual.
|
|
They are left-associative and have the least priority compared to other operators.
|
|
NOT has more priority than AND and OR but nevertheless less than any other operator.
|
|
AND and OR have the same priority so brackets use is recommended to avoid confusion
|
|
in complex expressions.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-bitwise-ops">Bitwise operators: &, |</term>
|
|
<listitem><para>
|
|
These operators perform bitwise AND and OR respectively. The operands
|
|
must be of integer types. Introduced in version 1.10-beta.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="numeric-functions">
|
|
<title>Numeric functions</title>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-abs">ABS()</term>
|
|
<listitem><para>Returns the absolute value of the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-ceil">CEIL()</term>
|
|
<listitem><para>Returns the smallest integer value greater than or equal to the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-cos">COS()</term>
|
|
<listitem><para>Returns the cosine of the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-exp">EXP()</term>
|
|
<listitem><para>Returns the exponent of the argument (e=2.718... to the power of the argument).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-floor">FLOOR()</term>
|
|
<listitem><para>Returns the largest integer value less than or equal to the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-idiv">IDIV()</term>
|
|
<listitem><para>
|
|
Returns the result of an integer division of the first
|
|
argument by the second argument. Both arguments must be
|
|
of an integer type.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-ln">LN()</term>
|
|
<listitem><para>Returns the natural logarithm of the argument (with the base of e=2.718...).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-log10">LOG10()</term>
|
|
<listitem><para>Returns the common logarithm of the argument (with the base of 10).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-log2">LOG2()</term>
|
|
<listitem><para>Returns the binary logarithm of the argument (with the base of 2).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-max">MAX()</term>
|
|
<listitem><para>Returns the bigger of two arguments.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-min">MIN()</term>
|
|
<listitem><para>Returns the smaller of two arguments.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-pow">POW()</term>
|
|
<listitem><para>Returns the first argument raised to the power of the second argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-sin">SIN()</term>
|
|
<listitem><para>Returns the sine of the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-sqrt">SQRT()</term>
|
|
<listitem><para>Returns the square root of the argument.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="date-time-functions">
|
|
<title>Date and time functions</title>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-day">DAY()</term>
|
|
<listitem><para>Returns the integer day of month (in 1..31 range) from a timestamp argument, according to the current timezone. Introduced in version 2.0.1-beta.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-month">MONTH()</term>
|
|
<listitem><para>Returns the integer month (in 1..12 range) from a timestamp argument, according to the current timezone. Introduced in version 2.0.1-beta.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-now">NOW()</term>
|
|
<listitem><para>Returns the current timestamp as an INTEGER. Introduced in version 0.9.9-rc1.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-year">YEAR()</term>
|
|
<listitem><para>Returns the integer year (in 1969..2038 range) from a timestamp argument, according to the current timezone. Introduced in version 2.0.1-beta.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-yearmonth">YEARMONTH()</term>
|
|
<listitem><para>Returns the integer year and month code (in 196912..203801 range) from a timestamp argument, according to the current timezone. Introduced in version 2.0.1-beta.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term id="expr-func-yearmonthday">YEARMONTHDAY()</term>
|
|
<listitem><para>Returns the integer year, month, and date code (in 19691231..20380119 range) from a timestamp argument, according to the current timezone. Introduced in version 2.0.1-beta.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
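<para>
For example, assuming a timestamp attribute named 'published'
(a sketch):
<programlisting>
SELECT id, YEAR(published) AS y, YEARMONTH(published) AS ym FROM myindex
</programlisting>
For a document with 'published' set to January 15, 2011, this returns
y=2011 and ym=201101.
</para>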
</sect2>


<sect2 id="type-conversion-functions">
<title>Type conversion functions</title>
<variablelist>

<varlistentry>
<term id="expr-func-bigint">BIGINT()</term>
<listitem><para>
Forcibly promotes the integer argument to 64-bit type,
and does nothing on a floating point argument. It's intended to help enforce evaluation
of certain expressions (such as <code>a*b</code>) in 64-bit mode even though all the arguments
are 32-bit.
Introduced in version 0.9.9-rc1.
</para></listitem>
</varlistentry>

<varlistentry>
<term id="expr-func-sint">SINT()</term>
<listitem><para>
Forcibly reinterprets its
32-bit unsigned integer argument as signed, and also expands it to 64-bit type
(because the 32-bit type is unsigned). It's easily illustrated by the following
example: 1-2 normally evaluates to 4294967295, but SINT(1-2) evaluates to -1.
Introduced in version 1.10-beta.
</para></listitem>
</varlistentry>

</variablelist>
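<para>
A minimal SphinxQL sketch of both conversions (the <code>test1</code> index with its
<code>group_id</code> and <code>date_added</code> attributes is taken from the example
later in this manual):
<programlisting>
SELECT *, SINT(1-2) AS signed_diff, BIGINT(group_id)*date_added AS wide_product
FROM test1 WHERE MATCH('test');
</programlisting>
</para>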
</sect2>


<sect2 id="comparison-functions">
<title>Comparison functions</title>
<variablelist>

<varlistentry>
<term id="expr-func-if">IF()</term>
<listitem><para>
<code>IF()</code> behavior is slightly different from that of its MySQL counterpart.
It takes 3 arguments, checks whether the 1st argument is equal to 0.0, and returns the 2nd argument if it is not zero, or the 3rd one when it is.
Note that unlike comparison operators, <code>IF()</code> does <b>not</b> use a threshold!
Therefore, it's safe to use comparison results as its 1st argument, but arithmetic operators might produce unexpected results.
For instance, the following two calls will produce <emphasis>different</emphasis> results even though they are logically equivalent:
<programlisting>
IF ( sqrt(3)*sqrt(3)-3&lt;&gt;0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )
</programlisting>
In the first case, the comparison operator &lt;&gt; will return 0.0 (false)
because of a threshold, and <code>IF()</code> will always return 'b' as a result.
In the second one, the same <code>sqrt(3)*sqrt(3)-3</code> expression will be compared
with zero <emphasis>without</emphasis> a threshold by the <code>IF()</code> function itself.
But its value will be slightly different from zero because of limited floating point
calculation precision. Because of that, the comparison with 0.0 done by <code>IF()</code>
will not pass, and the second variant will return 'a' as a result.
</para></listitem>
</varlistentry>

<varlistentry>
<term id="expr-func-in">IN()</term>
<listitem><para>
IN(expr,val1,val2,...), introduced in version 0.9.9-rc1, takes 2 or more arguments, and returns 1 if the 1st argument
(expr) is equal to any of the other arguments (val1..valN), or 0 otherwise.
Currently, all the checked values (but not the expression itself!) are required
to be constant. (It's technically possible to implement arbitrary expressions too,
and that might be implemented in the future.) Constants are pre-sorted and then
binary search is used, so IN() even against a big arbitrary list of constants
will be very quick. Starting with 0.9.9-rc2, the first argument can also be
an MVA attribute. In that case, IN() will return 1 if any of the MVA values
is equal to any of the other arguments. Starting with 2.0.1-beta, IN() also
supports <code>IN(expr,@uservar)</code> syntax to check whether the value
belongs to the list in the given global user variable.
</para></listitem>
</varlistentry>

<varlistentry>
<term id="expr-func-interval">INTERVAL()</term>
<listitem><para>
INTERVAL(expr,point1,point2,point3,...), introduced in version 0.9.9-rc1, takes 2 or more arguments, and returns
the index of the argument that is less than the first argument: it returns
0 if expr&lt;point1, 1 if point1&lt;=expr&lt;point2, and so on.
It is required that point1&lt;point2&lt;...&lt;pointN for this function
to work correctly.
</para></listitem>
</varlistentry>

</variablelist>
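<para>
A short SphinxQL sketch of these functions in a select list, again assuming the
<code>test1</code> index with its <code>group_id</code> and <code>date_added</code>
attributes from the example later in this manual:
<programlisting>
SELECT *,
    IN(group_id,1,2,123) AS in_flag,
    INTERVAL(date_added, NOW()-90*86400, NOW()-86400) AS age_band,
    IF(group_id&lt;&gt;123, 1, 0) AS not_123
FROM test1 WHERE MATCH('test');
</programlisting>
</para>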
</sect2>


<sect2 id="misc-functions">
<title>Miscellaneous functions</title>
<variablelist>

<varlistentry>
<term id="expr-func-crc32">CRC32()</term>
<listitem><para>
Returns the CRC32 value of a string argument. Introduced in version 2.0.1-beta.
</para></listitem>
</varlistentry>

<varlistentry>
<term id="expr-func-geodist">GEODIST()</term>
<listitem><para>
GEODIST(lat1,long1,lat2,long2) function, introduced in version 0.9.9-rc2,
computes geosphere distance between two given points specified by their
coordinates. Note that both latitudes and longitudes must be in radians
and the result will be in meters. You can use an arbitrary expression as any
of the four coordinates. An optimized path will be selected when one pair
of the arguments refers directly to a pair of attributes and the other one
is constant.
</para></listitem>
</varlistentry>

</variablelist>
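<para>
For instance, a "nearest first" query might look as sketched below in SphinxQL;
<code>lat</code> and <code>lon</code> are assumed float attributes storing
coordinates in radians, and the constant pair is the search origin:
<programlisting>
SELECT *, GEODIST(lat, lon, 0.95929, -0.65929) AS dist
FROM test1 WHERE MATCH('test')
ORDER BY dist ASC;
</programlisting>
</para>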
</sect2>

</sect1>


<sect1 id="sorting-modes"><title>Sorting modes</title>
<para>
There are the following result sorting modes available:
<itemizedlist>
<listitem><para>SPH_SORT_RELEVANCE mode, that sorts by relevance in descending order (best matches first);</para></listitem>
<listitem><para>SPH_SORT_ATTR_DESC mode, that sorts by an attribute in descending order (bigger attribute values first);</para></listitem>
<listitem><para>SPH_SORT_ATTR_ASC mode, that sorts by an attribute in ascending order (smaller attribute values first);</para></listitem>
<listitem><para>SPH_SORT_TIME_SEGMENTS mode, that sorts by time segments (last hour/day/week/month) in descending order, and then by relevance in descending order;</para></listitem>
<listitem><para>SPH_SORT_EXTENDED mode, that sorts by an SQL-like combination of columns in ASC/DESC order;</para></listitem>
<listitem><para>SPH_SORT_EXPR mode, that sorts by an arithmetic expression.</para></listitem>
</itemizedlist>
</para>
<para>
SPH_SORT_RELEVANCE ignores any additional parameters and always sorts matches
by relevance rank. All other modes require an additional sorting clause, with the
syntax depending on the specific mode. SPH_SORT_ATTR_ASC, SPH_SORT_ATTR_DESC and
SPH_SORT_TIME_SEGMENTS modes require simply an attribute name.
SPH_SORT_RELEVANCE is equivalent to sorting by "@weight DESC, @id ASC" in extended sorting mode,
SPH_SORT_ATTR_ASC is equivalent to "attribute ASC, @weight DESC, @id ASC",
and SPH_SORT_ATTR_DESC to "attribute DESC, @weight DESC, @id ASC" respectively.
</para>

<bridgehead>SPH_SORT_TIME_SEGMENTS mode</bridgehead>
<para>
In SPH_SORT_TIME_SEGMENTS mode, attribute values are split into so-called
time segments, and then sorted by time segment first, and by relevance second.
</para>
<para>
The segments are calculated according to the <emphasis>current timestamp</emphasis>
at the time when the search is performed, so the results would change over time.
The segments are as follows:
<itemizedlist>
<listitem><para>last hour,</para></listitem>
<listitem><para>last day,</para></listitem>
<listitem><para>last week,</para></listitem>
<listitem><para>last month,</para></listitem>
<listitem><para>last 3 months,</para></listitem>
<listitem><para>everything else.</para></listitem>
</itemizedlist>
These segments are hardcoded, but it is trivial to change them if necessary.
</para>
<para>
This mode was added to support searching through blogs, news headlines, etc.
When using time segments, recent records would be ranked higher because of their segment,
but within the same segment, more relevant records would be ranked higher -
unlike sorting by just the timestamp attribute, which would not take relevance
into account at all.
</para>
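<para>
A one-line PHP sketch of enabling this mode; the <code>published</code> timestamp
attribute is an assumption for this example:
<programlisting>
$cl->SetSortMode ( SPH_SORT_TIME_SEGMENTS, "published" );
</programlisting>
</para>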

<bridgehead id="sort-extended">SPH_SORT_EXTENDED mode</bridgehead>
<para>
In SPH_SORT_EXTENDED mode, you can specify an SQL-like sort expression
with up to 5 attributes (including internal attributes), eg:
<programlisting>
@relevance DESC, price ASC, @id DESC
</programlisting>
</para>
<para>
Both internal attributes (that are computed by the engine on the fly)
and user attributes that were configured for this index are allowed.
Internal attribute names must start with the magic @-symbol; user attribute
names can be used as is. In the example above, <option>@relevance</option>
and <option>@id</option> are internal attributes and <option>price</option> is user-specified.
</para>
<para>
Known internal attributes are:
<itemizedlist>
<listitem><para>@id (match ID)</para></listitem>
<listitem><para>@weight (match weight)</para></listitem>
<listitem><para>@rank (match weight)</para></listitem>
<listitem><para>@relevance (match weight)</para></listitem>
<listitem><para>@random (return results in random order)</para></listitem>
</itemizedlist>
<option>@rank</option> and <option>@relevance</option> are just additional
aliases to <option>@weight</option>.
</para>

<bridgehead id="sort-expr">SPH_SORT_EXPR mode</bridgehead>
<para>
Expression sorting mode lets you sort the matches by an arbitrary arithmetic
expression, involving attribute values, internal attributes (@id and @weight),
arithmetic operations, and a number of built-in functions. Here's an example:
<programlisting>
$cl->SetSortMode ( SPH_SORT_EXPR,
    "@weight + ( user_karma + ln(pageviews) )*0.1" );
</programlisting>
The operators and functions supported in the expressions are discussed
in a separate section, <xref linkend="expressions"/>.
</para>
</sect1>


<sect1 id="clustering"><title>Grouping (clustering) search results </title>
<para>
Sometimes it could be useful to group (or in other terms, cluster)
search results and/or count per-group match counts - for instance,
to draw a nice graph of how many matching blog posts there were
in each month; or to group Web search results by site; or to group
matching forum posts by author; etc.
</para>
<para>
In theory, this could be performed by doing only the full-text search
in Sphinx and then using the found IDs to group on the SQL server side. However,
in practice doing this with a big result set (10K-10M matches) would
typically kill performance.
</para>
<para>
To avoid that, Sphinx offers a so-called grouping mode. It is enabled
with the SetGroupBy() API call, as sketched below. When grouping, all matches are assigned to
different groups based on the group-by value. This value is computed from the
specified attribute using one of the following built-in functions:
<itemizedlist>
<listitem><para>SPH_GROUPBY_DAY, extracts year, month and day in YYYYMMDD format from timestamp;</para></listitem>
<listitem><para>SPH_GROUPBY_WEEK, extracts year and first day of the week number (counting from year start) in YYYYNNN format from timestamp;</para></listitem>
<listitem><para>SPH_GROUPBY_MONTH, extracts month in YYYYMM format from timestamp;</para></listitem>
<listitem><para>SPH_GROUPBY_YEAR, extracts year in YYYY format from timestamp;</para></listitem>
<listitem><para>SPH_GROUPBY_ATTR, uses attribute value itself for grouping.</para></listitem>
</itemizedlist>
</para>
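<para>
For instance, a minimal PHP sketch that groups matches by day, assuming a
<code>published</code> timestamp attribute:
<programlisting>
$cl->SetGroupBy ( "published", SPH_GROUPBY_DAY, "@group desc" );
</programlisting>
</para>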
<para>
The final search result set then contains one best match per group.
Grouping function value and per-group match count are returned along
as "virtual" attributes named
<emphasis role="bold">@group</emphasis> and
<emphasis role="bold">@count</emphasis> respectively.
</para>
<para>
The result set is sorted by a group-by sorting clause, with the syntax similar
to <link linkend="sort-extended"><option>SPH_SORT_EXTENDED</option> sorting clause</link>
syntax. In addition to <option>@id</option> and <option>@weight</option>,
the group-by sorting clause may also include:
<itemizedlist>
<listitem><para>@group (groupby function value),</para></listitem>
<listitem><para>@count (amount of matches in group).</para></listitem>
</itemizedlist>
</para>
<para>
The default mode is to sort by groupby value in descending order,
ie. by <option>"@group desc"</option>.
</para>
<para>
On completion, the <option>total_found</option> result parameter would
contain the total amount of matching groups over the whole index.
</para>
<para>
<emphasis role="bold">WARNING:</emphasis> grouping is done in fixed memory
and thus its results are only approximate; so there might be more groups reported
in <option>total_found</option> than actually present. <option>@count</option> might also
be underestimated. To reduce inaccuracy, one should raise <option>max_matches</option>.
If <option>max_matches</option> allows storing all found groups, results will be 100% correct.
</para>
<para>
For example, if sorting by relevance and grouping by a <code>"published"</code>
attribute with the <code>SPH_GROUPBY_DAY</code> function, then the result set will
contain
<itemizedlist>
<listitem><para>one most relevant match per each day when there were any
matches published,</para></listitem>
<listitem><para>with day number and per-day match count attached,</para></listitem>
<listitem><para>sorted by day number in descending order (ie. recent days first).</para></listitem>
</itemizedlist>
</para>
<para>
Starting with version 0.9.9-rc2, aggregate functions (AVG(), MIN(),
MAX(), SUM()) are supported through the <link linkend="api-func-setselect">SetSelect()</link> API call
when using GROUP BY.
</para>
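<para>
A short PHP sketch combining the two; the <code>price</code> and
<code>vendor_id</code> attributes are assumptions made for this example:
<programlisting>
$cl->SetSelect ( "*, AVG(price) AS avgprice" );
$cl->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR );
</programlisting>
</para>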
</sect1>


<sect1 id="distributed"><title>Distributed searching</title>
<para>
To scale well, Sphinx has distributed searching capabilities.
Distributed searching is useful to improve query latency (ie. search
time) and throughput (ie. max queries/sec) in multi-server, multi-CPU
or multi-core environments. This is essential for applications which
need to search through huge amounts of data (ie. billions of records
and terabytes of text).
</para>
<para>
The key idea is to horizontally partition (HP) searched data
across search nodes and then process it in parallel.
</para>
<para>
Partitioning is done manually. You should
<itemizedlist>
<listitem><para>setup several instances
of Sphinx programs (<filename>indexer</filename> and <filename>searchd</filename>)
on different servers;</para></listitem>
<listitem><para>make the instances index (and search) different parts of data;</para></listitem>
<listitem><para>configure a special distributed index on some of the <filename>searchd</filename>
instances (see the configuration sketch below);</para></listitem>
<listitem><para>and query this index.</para></listitem>
</itemizedlist>
This index only contains references to other
local and remote indexes - so it could not be directly reindexed,
and you should reindex those indexes which it references instead.
</para>
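<para>
For reference, a minimal distributed index definition might look like this in
<filename>sphinx.conf</filename>; the index and host names here are, of course,
just placeholders:
<programlisting>
index dist1
{
    type  = distributed
    local = chunk1
    agent = box2:9312:chunk2
    agent = box3:9312:chunk3
}
</programlisting>
</para>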
<para>
When <filename>searchd</filename> receives a query against a distributed index,
it does the following:
<orderedlist>
<listitem><para>connects to the configured remote agents;</para></listitem>
<listitem><para>issues the query;</para></listitem>
<listitem><para>sequentially searches the configured local indexes (while the remote agents are searching);</para></listitem>
<listitem><para>retrieves the remote agents' search results;</para></listitem>
<listitem><para>merges all the results together, removing the duplicates;</para></listitem>
<listitem><para>sends the merged results to the client.</para></listitem>
</orderedlist>
</para>
<para>
From the application's point of view, there are no differences
between searching through a regular index, or a distributed index at all.
That is, distributed indexes are fully transparent to the application,
and actually there's no way to tell whether the index you queried
was distributed or local. (Even though as of 0.9.9 Sphinx does not
allow combining searches through distributed indexes with anything else,
this constraint will be lifted in the future.)
</para>
<para>
Any <filename>searchd</filename> instance could serve both as a master
(which aggregates the results) and a slave (which only does local searching)
at the same time. This has a number of uses:
<orderedlist>
<listitem><para>every machine in a cluster could serve as a master which
searches the whole cluster, and search requests could be balanced between
masters to achieve a kind of HA (high availability) in case any of the nodes fails;
</para></listitem>
<listitem><para>
if running within a single multi-CPU or multi-core machine, there
would be only 1 searchd instance querying itself as an agent and thus
utilizing all CPUs/cores.
</para></listitem>
</orderedlist>
</para>
<para>
Better HA support is scheduled for implementation; it would allow
specifying which agents mirror each other, doing health checks, keeping track
of alive agents, load-balancing requests, etc.
</para>
</sect1>


<sect1 id="query-log-format"><title><filename>searchd</filename> query log formats</title>
<para>
In version 2.0.1-beta and above two query log formats are supported.
Previous versions only supported a custom plain text format. That format
is still the default one. While it might be more convenient for
manual monitoring and review, it is hard (and sometimes impossible)
to replay for benchmarking purposes, it only logs
<emphasis>search</emphasis> queries but not the other types
of requests, and it does not always contain the complete search query
data. The new <code>sphinxql</code>
format alleviates that. It aims to be complete and automatable,
even though at the cost of brevity and readability.
</para>

<sect2 id="plain-log-format"><title>Plain log format</title>
<para>
By default, <filename>searchd</filename> logs all successfully executed search queries
into a query log file. Here's an example:
<programlisting>
[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj] test
[Fri Jun 29 21:20:34 2007] 0.024 sec [all/0/rel 19886 (0,20) @channel_id] [lj] test
</programlisting>
This log format is as follows:
<programlisting>
[query-date] query-time [match-mode/filters-count/sort-mode
    total-matches (offset,limit) @groupby-attr] [index-name] query
</programlisting>
Match mode can take one of the following values:
<itemizedlist>
<listitem><para>"all" for SPH_MATCH_ALL mode;</para></listitem>
<listitem><para>"any" for SPH_MATCH_ANY mode;</para></listitem>
<listitem><para>"phr" for SPH_MATCH_PHRASE mode;</para></listitem>
<listitem><para>"bool" for SPH_MATCH_BOOLEAN mode;</para></listitem>
<listitem><para>"ext" for SPH_MATCH_EXTENDED mode;</para></listitem>
<listitem><para>"scan" if the full scan mode was used, either by being specified with SPH_MATCH_FULLSCAN, or if the query was empty (as documented under <link linkend="matching-modes">Matching Modes</link>)</para></listitem>
</itemizedlist>
Sort mode can take one of the following values:
<itemizedlist>
<listitem><para>"rel" for SPH_SORT_RELEVANCE mode;</para></listitem>
<listitem><para>"attr-" for SPH_SORT_ATTR_DESC mode;</para></listitem>
<listitem><para>"attr+" for SPH_SORT_ATTR_ASC mode;</para></listitem>
<listitem><para>"tsegs" for SPH_SORT_TIME_SEGMENTS mode;</para></listitem>
<listitem><para>"ext" for SPH_SORT_EXTENDED mode.</para></listitem>
</itemizedlist>
</para>
<para>Additionally, if <filename>searchd</filename> was started with <option>--iostats</option>, there will be a block of data reporting I/O statistics, placed after the index(es) searched are listed.</para>
<para>A query log entry might take the form of:</para>
<programlisting>
[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj]
   [ios=6 kb=111.1 ms=0.5] test
</programlisting>
<para>
This additional block is information regarding I/O operations performed during the search:
the number of file I/O operations carried out, the amount of data in kilobytes read from
the index files, and the time spent on I/O operations (although there is a background processing
component, the bulk of this time is the I/O operation time).
</para>
</sect2>


<sect2 id="sphinxql-log-format"><title>SphinxQL log format</title>
<para>
This is a new log format introduced in 2.0.1-beta, with the goals
being to log everything and then some, and in a format easy to automate
(for instance, automatically replay). The new format can either be enabled
via the <link linkend="conf-query-log-format">query_log_format</link>
directive in the configuration file, or switched back and forth
on the fly with the
<link linkend="sphinxql-set"><code>SET GLOBAL query_log_format=...</code></link>
statement via SphinxQL. In the new format, the example from the previous
section would look as follows. (Wrapped below for readability, but with
just one query per line in the actual log.)
<programlisting>
/* Fri Jun 29 21:17:58.609 2007 conn 2 wall 0.004 found 35254 */
SELECT * FROM lj WHERE MATCH('test') OPTION ranker=proximity;

/* Fri Jun 29 21:20:34.555 2007 conn 3 wall 0.024 found 19886 */
SELECT * FROM lj WHERE MATCH('test') GROUP BY channel_id
OPTION ranker=proximity;
</programlisting>
Note that <b>all</b> requests would be logged in this format,
including those sent via SphinxAPI and SphinxSE, not just those
sent via SphinxQL. Also note that this kind of logging works only with plain log
files and will not work if you use 'syslog' for logging.
</para>
<para>
The features of the SphinxQL log format compared to the default text
one are as follows.
<itemizedlist>
<listitem><para>All request types should be logged. (This is still work in progress.)</para></listitem>
<listitem><para>Full statement data will be logged where possible.</para></listitem>
<listitem><para>Errors and warnings are logged.</para></listitem>
<listitem><para>The log should be automatically replayable via SphinxQL.</para></listitem>
<listitem><para>Additional performance counters (currently, per-agent distributed query times) are logged.</para></listitem>
</itemizedlist>
<!-- FIXME! more examples with ios, kbs, agents etc; comment stuff reference?-->
</para>
<para>
Every request (whether sent via SphinxAPI or SphinxQL)
must result in exactly one log line. All request types, including
INSERT, CALL SNIPPETS, etc will eventually get logged, though as of
the time of this writing, that is a work in progress. Every log line
must be a valid SphinxQL statement that reconstructs the full request,
except if the logged request is too big and needs shortening
for performance reasons. Additional messages, counters, etc can be
logged in the comments section after the request.
</para>
</sect2>
</sect1>


<sect1 id="sphinxql"><title>MySQL protocol support and SphinxQL</title>
<para>
Starting with version 0.9.9-rc2, the Sphinx searchd daemon supports the MySQL binary
network protocol and can be accessed with the regular MySQL API. For instance,
the 'mysql' CLI client program works well. Here's an example of querying
Sphinx using the MySQL client:
<programlisting>
$ mysql -P 9306
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 0.9.9-dev (r1734)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SELECT * FROM test1 WHERE MATCH('test')
    -> ORDER BY group_id ASC OPTION ranker=bm25;
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    4 |   1442 |        2 | 1231721236 |
|    2 |   2421 |      123 | 1231721236 |
|    1 |   2421 |      456 | 1231721236 |
+------+--------+----------+------------+
3 rows in set (0.00 sec)
</programlisting>
</para>
<para>
Note that mysqld was not even running on the test machine. Everything was
handled by searchd itself.
</para>
<para>
The new access method is supported <emphasis>in addition</emphasis>
to native APIs which all still work perfectly well. In fact, both
access methods can be used at the same time. Also, the native API is still
the default access method. MySQL protocol support needs to be additionally
configured. This is a matter of a 1-line config change, adding a new
<link linkend="conf-listen">listener</link> with mysql41 specified
as a protocol:
<programlisting>
listen = localhost:9306:mysql41
</programlisting>
</para>
<para>
Just supporting the protocol and not the SQL syntax would be useless,
so Sphinx now also supports a subset of SQL that we dubbed SphinxQL.
It supports standard querying of all the index types with SELECT,
modifying RT indexes with INSERT, REPLACE, and DELETE, and much more.
The full SphinxQL reference is available in <xref linkend="sphinxql-reference"/>.
</para>
</sect1>


<sect1 id="multi-queries"><title>Multi-queries</title>
<para>
Multi-queries, or query batches, let you send multiple queries to Sphinx
in one go (more formally, one network request).
</para>
<para>
The two API methods that implement the multi-query mechanism are
<link linkend="api-func-addquery">AddQuery()</link> and
<link linkend="api-func-runqueries">RunQueries()</link>.
You can also run multiple queries with SphinxQL, see
<xref linkend="sphinxql-multi-queries"/>.
(In fact, a regular <link linkend="api-func-addquery">Query()</link>
call is internally implemented as a single AddQuery() call immediately
followed by a RunQueries() call.) AddQuery() captures the current state
of all the query settings set by previous API calls, and memorizes
the query. RunQueries() actually sends all the memorized queries,
and returns multiple result sets. There are no restrictions on
the queries at all, except just a sanity check on the number of queries
in a single batch (see <xref linkend="conf-max-batch-queries"/>).
</para>
<para>
Why use multi-queries? Generally, it all boils down to performance.
First, by sending requests to <filename>searchd</filename> in a batch
instead of one by one, you always save a bit by doing fewer network
roundtrips. Second, and somewhat more important, sending queries
in a batch enables <filename>searchd</filename> to perform certain
internal optimizations. As new types of optimizations are being
added over time, it generally makes sense to pack all the queries
into batches where possible, so that simply upgrading Sphinx
to a new version would automatically enable new optimizations.
In the case when there aren't any possible batch optimizations
to apply, queries will be processed one by one internally.
</para>
<para>
Why (or rather when) not use multi-queries? Multi-queries require
all the queries in a batch to be independent, and sometimes they aren't.
That is, sometimes query B is based on query A results, and so can only be
set up after executing query A. For instance, you might want to display
results from a secondary index if and only if there were no results
found in a primary index. Or maybe just specify an offset into the 2nd result set
based on the amount of matches in the 1st result set. In that case,
you will have to use separate queries (or separate batches).
</para>
<para>
As of 0.9.10, there are two major optimizations to be aware of:
common query optimization (available since 0.9.8); and common
subtree optimization (available since 0.9.10).
</para>
<para>
<b>Common query optimization</b> means that <filename>searchd</filename>
will identify all those queries in a batch where only the sorting
and group-by settings differ, and <emphasis>only perform searching once</emphasis>.
For instance, if a batch consists of 3 queries, all of them are for
"ipod nano", but the 1st query requests top-10 results sorted by price,
the 2nd query groups by vendor ID and requests top-5 vendors sorted by
rating, and the 3rd query requests max price, the full-text search for
"ipod nano" will only be performed once, and its results will be
reused to build 3 different result sets.
</para>
<para>
So-called <b>faceted searching</b> is a particularly important case
that benefits from this optimization. Indeed, faceted searching
can be implemented by running a number of queries, one to retrieve
search results themselves, and a few other ones with the same full-text
query but different group-by settings to retrieve all the required
groups of results (top-3 authors, top-5 vendors, etc). And as long
as the full-text query and filtering settings stay the same, common
query optimization will trigger, and greatly improve performance.
</para>
<para>
<b>Common subtree optimization</b> is even more interesting.
It lets <filename>searchd</filename> exploit similarities between
batched full-text queries. It identifies common full-text query parts
(subtrees) in all queries, and caches them between queries. For instance,
look at the following query batch:
<programlisting>
barack obama president
barack obama john mccain
barack obama speech
</programlisting>
There's a common two-word part ("barack obama") that can be computed
only once, then cached and shared across the queries. And common subtree
optimization does just that. Per-query cache size is strictly controlled
by the <link linkend="conf-subtree-docs-cache">subtree_docs_cache</link>
and <link linkend="conf-subtree-hits-cache">subtree_hits_cache</link>
directives (so that caching <emphasis>all</emphasis> sixteen gazillion
documents that match "i am" does not exhaust the RAM and instantly
kill your server).
</para>
<para>
Here's a code sample (in PHP) that fires the same query in 3 different
sorting modes:
<programlisting>
require ( "sphinxapi.php" );
$cl = new SphinxClient ();
$cl->SetMatchMode ( SPH_MATCH_EXTENDED );

$cl->SetSortMode ( SPH_SORT_RELEVANCE );
$cl->AddQuery ( "the", "lj" );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "published desc" );
$cl->AddQuery ( "the", "lj" );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "published asc" );
$cl->AddQuery ( "the", "lj" );
$res = $cl->RunQueries();
</programlisting>
</para>
<para>
How to tell whether the queries in the batch were actually optimized?
If they were, the respective query log will have a "multiplier" field that
specifies how many queries were processed together:
<programlisting>
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
</programlisting>
Note the "x3" field. It means that this query was optimized and
processed in a sub-batch of 3 queries. For reference, this is how
the regular log would look if the queries were not batched:
<programlisting>
[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext/0/ext 747541 (0,20)] [lj] the
</programlisting>
Note how per-query time in the multi-query case was improved by a factor
of 1.5x to 2.3x, depending on the particular sorting mode. In fact, for both
common query and common subtree optimizations, there were reports of 3x and
even bigger improvements, and that's from production instances, not just
synthetic tests.
</para>
</sect1>


<sect1 id="collations"><title>Collations</title>
<para>
Introduced to Sphinx in version 2.0.1-beta to supplement string sorting,
collations essentially affect the string attribute comparisons. They specify
both the character set encoding and the strategy that Sphinx uses to compare
strings when doing ORDER BY or GROUP BY with a string attribute involved.
</para>
<para>
String attributes are stored as is when indexing, and no character set
or language information is attached to them. That's okay as long as Sphinx
only needs to store and return the strings to the calling application verbatim.
But when you ask Sphinx to sort by a string value, that request immediately
becomes quite ambiguous.
</para>
<para>
First, single-byte (ASCII, or ISO-8859-1, or Windows-1251) strings
need to be processed differently than the UTF-8 ones that may encode
every character with a variable number of bytes. So we need to know
what the character set type is in order to interpret the raw bytes as meaningful
characters properly.
</para>
<para>
Second, we additionally need to know the language-specific
string sorting rules. For instance, when sorting according to US rules
in the en_US locale, the accented character 'ï' (small letter i with diaeresis)
should be placed somewhere after 'z'. However, when sorting with French rules
and the fr_FR locale in mind, it should be placed between 'i' and 'j'. And some
other set of rules might choose to ignore accents at all, allowing 'ï'
and 'i' to be mixed arbitrarily.
</para>
<para>
Third, but not least, we might need case-sensitive sorting in some
scenarios and case-insensitive sorting in some others.
</para>
<para>
Collations combine all of the above: the character set, the language rules,
and the case sensitivity. Sphinx currently provides the following four
collations.
<orderedlist>
<listitem><para><option>libc_ci</option></para></listitem>
<listitem><para><option>libc_cs</option></para></listitem>
<listitem><para><option>utf8_general_ci</option></para></listitem>
<listitem><para><option>binary</option></para></listitem>
</orderedlist>
</para>
<para>
The first two collations rely on several standard C library (libc) calls
and can thus support any locale that is installed on your system. They provide
case-insensitive (_ci) and case-sensitive (_cs) comparisons respectively.
By default they will use the C locale, effectively resorting to bytewise
comparisons. To change that, you need to specify a different available
locale using the <link linkend="conf-collation-libc-locale">collation_libc_locale</link>
directive. The list of locales available on your system can usually be obtained
with the <filename>locale</filename> command:
<programlisting>
$ locale -a
C
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
es_ES
fr_FR
POSIX
ru_RU.utf8
ru_UA.utf8
</programlisting>
</para>
<para>
The specific list of the system locales may vary. Consult your OS documentation
to install additional needed locales.
</para>
<para>
The <option>utf8_general_ci</option> and <option>binary</option> collations are
built into Sphinx. The first one is a generic collation for UTF-8 data
(without any so-called language tailoring); it should behave similarly to the
<option>utf8_general_ci</option> collation in MySQL. The second one
is a simple bytewise comparison.
</para>
<para>
Collation can be overridden via SphinxQL on a per-session basis using the
<code>SET collation_connection</code> statement. All subsequent SphinxQL
queries will use this collation. SphinxAPI and SphinxSE queries will use
the server default collation, as specified in the
<link linkend="conf-collation-server">collation_server</link> configuration
directive. Sphinx currently defaults to the <option>libc_ci</option> collation.
</para>
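<para>
For instance, a quick SphinxQL sketch of switching the session collation before
sorting on a string attribute; the <code>title</code> string attribute is an
assumption made for this example:
<programlisting>
SET collation_connection = utf8_general_ci;
SELECT * FROM test1 WHERE MATCH('test') ORDER BY title ASC;
</programlisting>
</para>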
<para>
Collations should affect all string attribute comparisons, including
those within ORDER BY and GROUP BY, so differently ordered or grouped results
can be returned depending on the collation chosen.
</para>
</sect1>


<sect1 id="udf"><title>User-defined functions (UDF)</title>
<para>
Starting with 2.0.1-beta, Sphinx supports User-Defined Functions,
or UDF for short. They can be loaded and unloaded dynamically into
<filename>searchd</filename> without having to restart the daemon,
and used in expressions when searching. UDF features at a glance
are as follows.
<itemizedlist>
<listitem><para>Functions can take integer (both 32-bit and 64-bit), float, string, or MVA arguments.</para></listitem>
<listitem><para>Functions can return integer or float values.</para></listitem>
<listitem><para>Functions can check the argument number, types, and names and raise errors.</para></listitem>
<listitem><para>Only simple functions (that is, non-aggregate ones) are currently supported.</para></listitem>
</itemizedlist>
</para>
<para>
User-defined functions need your OS to support dynamically loadable
libraries (aka shared objects). Most of the modern OSes are eligible,
including Linux, Windows, MacOS, Solaris, BSD and others. (The internal
testing has been done on Linux and Windows.) The UDF libraries must
reside in a directory specified by the
<link linkend="conf-plugin-dir">plugin_dir</link> directive, and the
server must be configured to use <option>workers = threads</option> mode.
Relative paths to the library files are not allowed. Once the library
is successfully built and copied to the trusted location, you can then
dynamically install and deinstall the functions using the
<link linkend="sphinxql-create-function">CREATE FUNCTION</link> and
<link linkend="sphinxql-drop-function">DROP FUNCTION</link> statements
respectively. A single library can contain multiple functions. A library
gets loaded when you first install a function from it, and unloaded
when you deinstall all the functions from that library.
</para>
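<para>
A minimal SphinxQL sketch of that lifecycle, assuming the sample library was
built as <filename>udfexample.so</filename> and placed into the
<option>plugin_dir</option> directory:
<programlisting>
CREATE FUNCTION myfunc RETURNS INT SONAME 'udfexample.so';
SELECT *, myfunc(group_id) AS x FROM test1 WHERE MATCH('test');
DROP FUNCTION myfunc;
</programlisting>
</para>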
<para>
The library functions that will implement a UDF visible to SQL statements
need to follow the C calling convention, and a simple naming convention. The Sphinx
source distribution provides a sample file,
<ulink url="http://code.google.com/p/sphinxsearch/source/browse/trunk/src/udfexample.c">src/udfexample.c</ulink>,
that defines a few simple functions showing how to work with integer,
string, and MVA arguments; you can use that one as a foundation for
your new functions. It includes the UDF interface header file,
<ulink url="http://code.google.com/p/sphinxsearch/source/browse/trunk/src/udfexample.c">src/sphinxudf.h</ulink>,
that defines the required types and structures. The <filename>sphinxudf.h</filename>
header is standalone, that is, it does not require any other parts of the Sphinx
source to compile.
</para>
<para>
Every function that you intend to use in your SELECT statements
requires at least two corresponding C/C++ functions: the initialization
call, and the function call itself. You can also optionally define
the deinitialization call if your function requires any post-query
cleanup. (For instance, if you were allocating any memory in either
the initialization call or the function calls.) Function names
in SQL are case insensitive, C function names are not. They need
to be all lower-case. Mistakes in the function name prevent UDFs
from loading. You also have to pay special attention to the calling
convention used when compiling, the list and the types of arguments,
and the return type of the main function call. Mistakes in either
are likely to crash the server, or result in unexpected results
in the best case. Last but not least, all functions need to be
thread-safe.
</para>
<para>
Let's assume for the sake of example that your UDF name in SphinxQL
will be <code>MYFUNC</code>. The initialization, main, and deinitialization
functions would then need to be named as follows and take the following
arguments:
<programlisting>
/// initialization function
/// called once during query initialization
/// returns 0 on success
/// returns non-zero and fills error_message buffer on failure
int myfunc_init ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
    char * error_message );

/// main call function
/// returns the computed value
/// writes non-zero value into error_flag to indicate errors
RETURN_TYPE myfunc ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
    char * error_flag );

/// optional deinitialization function
/// called once to cleanup once query processing is done
void myfunc_deinit ( SPH_UDF_INIT * init );
</programlisting>
The two mentioned structures, <code>SPH_UDF_INIT</code> and
<code>SPH_UDF_ARGS</code>, are defined in the <filename>src/sphinxudf.h</filename>
interface header and documented there. <code>RETURN_TYPE</code> of the
main function must be one of the following:
<itemizedlist>
<listitem><para><code>int</code> for the functions that return INT.</para></listitem>
<listitem><para><code>sphinx_int64_t</code> for the functions that return BIGINT.</para></listitem>
<listitem><para><code>float</code> for the functions that return FLOAT.</para></listitem>
</itemizedlist>
</para>
<para>
The calling sequence is as follows. <code>myfunc_init()</code> is called
once when initializing the query. It can return a non-zero code to indicate
a failure; in that case the query is not executed, and the error message from
the <code>error_message</code> buffer is returned. Otherwise, <code>myfunc()</code>
is called for every row, and <code>myfunc_deinit()</code> is then called
when the query ends. <code>myfunc()</code> can indicate an error by writing
a non-zero byte value to <code>error_flag</code>; in that case, it will
not be called for subsequent rows, and a default value of 0 will be
substituted. Sphinx might or might not choose to terminate such queries
early, neither behavior is currently guaranteed.
</para>
</sect1>


</chapter>

<chapter id="command-line-tools"><title>Command line tools reference</title>

<para>As mentioned elsewhere, Sphinx is not a single program called 'sphinx',
but a collection of 4 separate programs which collectively form Sphinx. This section
covers these tools and how to use them.</para>


<sect1 id="ref-indexer"><title><filename>indexer</filename> command reference</title>
<para><filename>indexer</filename> is the first of the two principal tools
as part of Sphinx. Invoked from either the command line directly, or as part
of a larger script, <filename>indexer</filename> is solely responsible
for gathering the data that will be searchable.</para>
<para>The calling syntax for <filename>indexer</filename> is as follows:</para>
<programlisting>
indexer [OPTIONS] [indexname1 [indexname2 [...]]]
</programlisting>
<para>Essentially you would list the different possible indexes (that you would later
make available to search) in <filename>sphinx.conf</filename>, so when calling
<filename>indexer</filename>, as a minimum you need to be telling it what index
(or indexes) you want to index.</para>
<para>If <filename>sphinx.conf</filename> contained details on 2 indexes,
<filename>mybigindex</filename> and <filename>mysmallindex</filename>,
you could do the following:</para>
<programlisting>
$ indexer mybigindex
$ indexer mysmallindex mybigindex
</programlisting>
<para>As part of the configuration file, <filename>sphinx.conf</filename>, you specify
one or more indexes for your data. You might call <filename>indexer</filename> to reindex
one of them, ad-hoc, or you can tell it to process all indexes - you are not limited
to calling just one, or all at once, you can always pick some combination
of the available indexes.</para>
<para>The majority of the options for <filename>indexer</filename> are given
in the configuration file, however there are some options you might need to specify
on the command line as well, as they can affect how the indexing operation is performed.
These options are:
<itemizedlist>

<listitem><para><option>--config &lt;file&gt;</option> (<option>-c &lt;file&gt;</option> for short)
tells <filename>indexer</filename> to use the given file as its configuration. Normally,
it will look for <filename>sphinx.conf</filename> in the installation directory
(e.g. <filename>/usr/local/sphinx/etc/sphinx.conf</filename> if installed into
<filename>/usr/local/sphinx</filename>), followed by the current directory you are
in when calling <filename>indexer</filename> from the shell. This is most useful
in shared environments where the binary files are installed somewhere like
<filename>/usr/local/sphinx/</filename> but you want to provide users with
the ability to make their own custom Sphinx set-ups, or if you want to run
multiple instances on a single server. In cases like those you could allow them
to create their own <filename>sphinx.conf</filename> files and pass them to
<filename>indexer</filename> with this option. For example:
<programlisting>
$ indexer --config /home/myuser/sphinx.conf myindex
</programlisting>
</para></listitem>

<listitem><para><option>--all</option> tells <filename>indexer</filename> to update
every index listed in <filename>sphinx.conf</filename>, instead of listing individual indexes.
This would be useful in small configurations, or <filename>cron</filename>-type or maintenance
jobs where the entire index set will get rebuilt each day, or week, or whatever period is best.
Example usage:
<programlisting>
$ indexer --config /home/myuser/sphinx.conf --all
</programlisting>
</para></listitem>

<listitem><para><option>--rotate</option> is used for rotating indexes. Unless you have the situation
where you can take the search function offline without troubling users, you will almost certainly
need to keep search running whilst indexing new documents. <option>--rotate</option> creates
a second index, parallel to the first (in the same place, simply including <filename>.new</filename>
in the filenames). Once complete, <filename>indexer</filename> notifies <filename>searchd</filename>
by sending the <option>SIGHUP</option> signal, and <filename>searchd</filename> will attempt
to rename the indexes (renaming the existing ones to include <filename>.old</filename>
and renaming the <filename>.new</filename> to replace them), and then start serving
from the newer files. Depending on the setting of
<link linkend="conf-seamless-rotate">seamless_rotate</link>, there may be a slight delay
in being able to search the newer indexes. Example usage:
<programlisting>
$ indexer --rotate --all
</programlisting>
</para></listitem>

<listitem><para><option>--quiet</option> tells <filename>indexer</filename> not to output anything,
unless there is an error. Again, this is mostly used for <filename>cron</filename>-type, or other script
jobs where the output is irrelevant or unnecessary, except in the event of some kind of error.
Example usage:
<programlisting>
$ indexer --rotate --all --quiet
</programlisting>
</para></listitem>

<listitem><para><option>--noprogress</option> does not display progress details as they occur;
instead, the final status details (such as documents indexed, speed of indexing and so on)
are only reported at completion of indexing. In instances where the script is not being
run on a console (or 'tty'), this will be on by default. Example usage:
<programlisting>
$ indexer --rotate --all --noprogress
</programlisting>
</para></listitem>

<listitem><para><option>--buildstops &lt;outputfile.txt&gt; &lt;N&gt;</option> reviews
the index source, as if it were indexing the data, and produces a list of the terms
that are being indexed. In other words, it produces a list of all the searchable terms
that are becoming part of the index. Note: it does not update the index in question,
it simply processes the data 'as if' it were indexing, including running queries
defined with <option>sql_query_pre</option> or <option>sql_query_post</option>.
<filename>outputfile.txt</filename> will contain the list of words, one per line,
sorted by frequency with the most frequent first, and <filename>N</filename> specifies
the maximum number of words that will be listed; if N is sufficiently large to encompass
every word in the index, all words will be returned. Such a dictionary list
could be used for client application features around "Did you mean..." functionality,
usually in conjunction with <option>--buildfreqs</option>, below. Example:
<programlisting>
$ indexer myindex --buildstops word_freq.txt 1000
</programlisting>
This would produce a document in the current directory, <filename>word_freq.txt</filename>,
with the 1,000 most common words in 'myindex', ordered by most common first. Note that
the file will pertain to the last index indexed when specified with multiple indexes or
<option>--all</option> (i.e. the last one listed in the configuration file).
</para></listitem>

<listitem><para><option>--buildfreqs</option> works with <option>--buildstops</option>
(and is ignored if <option>--buildstops</option> is not specified).
As <option>--buildstops</option> provides the list of words used within the index,
<option>--buildfreqs</option> adds the quantity present in the index, which would be
useful in establishing whether certain words should be considered stopwords
if they are too prevalent. It will also help with developing "Did you mean..."
features where you can see how much more common a given word is compared to another,
similar one. Example:
<programlisting>
$ indexer myindex --buildstops word_freq.txt 1000 --buildfreqs
</programlisting>
This would produce the <filename>word_freq.txt</filename> as above, however after each word would be the number of times it occurred in the index in question.
</para></listitem>

<listitem><para><option>--merge &lt;dst-index&gt; &lt;src-index&gt;</option> is used
for physically merging indexes together, for example if you have a main+delta scheme,
where the main index rarely changes, but the delta index is rebuilt frequently,
and <option>--merge</option> would be used to combine the two. The operation moves
from right to left - the contents of <filename>src-index</filename> get examined
and physically combined with the contents of <filename>dst-index</filename>
and the result is left in <filename>dst-index</filename>.
In pseudo-code, it might be expressed as: <code>dst-index += src-index</code>
An example:
<programlisting>
$ indexer --merge main delta --rotate
</programlisting>
In the above example, where the main is the master, rarely modified index,
and delta is the less frequently modified one, you might use the above to call
<filename>indexer</filename> to combine the contents of the delta into the
main index and rotate the indexes.
</para></listitem>

<listitem><para><option>--merge-dst-range &lt;attr&gt; &lt;min&gt; &lt;max&gt;</option>
runs the given filter range upon merging. Specifically, as the merge is applied
to the destination index (as part of <option>--merge</option>, and is ignored
if <option>--merge</option> is not specified), <filename>indexer</filename>
will also filter the documents ending up in the destination index, and only
documents that pass the given filter will end up in the final index.
This could be used for example, in an index where there is a 'deleted' attribute,
where 0 means 'not deleted'. Such an index could be merged with:
<programlisting>
$ indexer --merge main delta --merge-dst-range deleted 0 0
</programlisting>
Any documents marked as deleted (value 1) would be removed from the newly-merged
destination index. It can be added several times to the command line,
to add successive filters to the merge, all of which must be met in order
for a document to become part of the final index.
</para></listitem>

<listitem><para><option>--dump-rows &lt;FILE&gt;</option> dumps rows fetched
by SQL source(s) into the specified file, in a MySQL compatible syntax.
Resulting dumps are the exact representation of data as received by
<filename>indexer</filename> and help to reproduce indexing-time issues.
</para></listitem>

<listitem><para><option>--verbose</option> guarantees that every row that
caused problems indexing (duplicate, zero, or missing document ID;
or file field IO issues; etc) will be reported. By default, this option
is off, and problem summaries may be reported instead.
</para></listitem>

<listitem><para><option>--sighup-each</option> is useful when you are
rebuilding many big indexes, and want each one rotated into
<filename>searchd</filename> as soon as possible. With
<option>--sighup-each</option>, <filename>indexer</filename>
will send a SIGHUP signal to searchd after successfully
completing the work on each index. (The default behavior
is to send a single SIGHUP after all the indexes were built.)
</para></listitem>

<listitem><para><option>--print-queries</option> prints out the
SQL queries that <filename>indexer</filename> sends to
the database, along with SQL connection and disconnection
events. That is useful to diagnose and fix problems with
SQL sources.
</para></listitem>

</itemizedlist>
</para>
</sect1>


<sect1 id="ref-searchd"><title><filename>searchd</filename> command reference</title>
<para><filename>searchd</filename> is the second of the two principal tools as part of Sphinx.
<filename>searchd</filename> is the part of the system which actually handles searches;
it functions as a server and is responsible for receiving queries, processing them and
returning a dataset back to the different APIs for client applications.</para>
<para>Unlike <filename>indexer</filename>, <filename>searchd</filename> is not designed
to be run either from a regular script or command-line calling, but instead either
as a daemon to be called from init.d (on Unix/Linux type systems) or to be called
as a service (on Windows-type systems), so not all of the command line options will
always apply, and so will be build-dependent.</para>
<para>Calling <filename>searchd</filename> is simply a case of:</para>
<programlisting>
$ searchd [OPTIONS]
</programlisting>
<para>The options available to <filename>searchd</filename> on all builds are:</para>
<itemizedlist>

<listitem><para><option>--help</option> (<option>-h</option> for short) lists all of the
parameters that can be called in your particular build of <filename>searchd</filename>.
</para></listitem>

<listitem><para><option>--config &lt;file&gt;</option> (<option>-c &lt;file&gt;</option> for short)
tells <filename>searchd</filename> to use the given file as its configuration,
just as with <filename>indexer</filename> above.
</para></listitem>

<listitem><para><option>--stop</option> is used to asynchronously stop <filename>searchd</filename>,
using the details of the PID file as specified in the <filename>sphinx.conf</filename> file,
so you may also need to confirm to <filename>searchd</filename> which configuration
file to use with the <option>--config</option> option. NB, calling <option>--stop</option>
will also make sure any changes applied to the indexes with
<link linkend="api-func-updateatttributes"><code>UpdateAttributes()</code></link>
will be applied to the index files themselves. Example:
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --stop
</programlisting>
</para></listitem>

<listitem><para><option>--stopwait</option> is used to synchronously stop <filename>searchd</filename>.
<option>--stop</option> essentially tells the running instance to exit (by sending it a SIGTERM)
and then immediately returns. <option>--stopwait</option> will also attempt to wait until the
running <filename>searchd</filename> instance actually finishes the shutdown (eg. saves all
the pending attribute changes) and exits. Example:
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --stopwait
</programlisting>
Possible exit codes are as follows:
<itemizedlist>
<listitem><para>0 on success;</para></listitem>
<listitem><para>1 if connection to running searchd daemon failed;</para></listitem>
<listitem><para>2 if daemon reported an error during shutdown;</para></listitem>
<listitem><para>3 if daemon crashed during shutdown.</para></listitem>
</itemizedlist>
</para></listitem>
|
|
|
|
<listitem><para><option>--status</option> command is used to query running
|
|
<filename>searchd</filename> instance status, using the connection details
|
|
from the (optionally) provided configuration file. It will try to connect
|
|
to the running instance using the first configured UNIX socket or TCP port.
|
|
On success, it will query for a number of status and performance counter
|
|
values and print them. You can use <link linkend="api-func-status">Status()</link>
|
|
API call to access the very same counters from your application. Examples:
|
|
<programlisting>
|
|
$ searchd --status
|
|
$ searchd --config /home/myuser/sphinx.conf --status
|
|
</programlisting>
|
|
</para></listitem>
|
|
|
|
<listitem><para><option>--pidfile</option> is used to explicitly state a PID file,
where the process information is stored regarding <filename>searchd</filename>,
used for inter-process communications (for example, <filename>indexer</filename>
will need to know the PID to contact <filename>searchd</filename> for rotating
indexes). Normally, <filename>searchd</filename> would use a PID file if running
in regular mode (i.e. not with <option>--console</option>), but it is possible
that you will be running it in console mode whilst the index is being updated
and rotated, for which a PID file will be needed.
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --pidfile /home/myuser/sphinx.pid
</programlisting>
</para></listitem>

<listitem><para><option>--console</option> is used to force <filename>searchd</filename>
into console mode; typically it will be running as a conventional server application,
and will aim to dump information into the log files (as specified in
<filename>sphinx.conf</filename>). Sometimes though, when debugging issues
in the configuration or the daemon itself, or trying to diagnose hard-to-track-down
problems, it may be easier to force it to dump information directly
to the console/command line from which it is being called. Running in console mode
also means that the process will not be forked (so searches are done in sequence)
and logs will not be written to. (It should be noted that console mode
is not the intended method for running <filename>searchd</filename>.)
You can invoke it as such:
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --console
</programlisting>
</para></listitem>

<listitem><para><option>--logdebug</option> enables additional debug output
in the daemon log. Should only be needed rarely, to assist with debugging
issues that could not be easily reproduced on request.
</para></listitem>

<listitem><para><option>--iostats</option> is used in conjunction with the
logging options (the <option>query_log</option> will need to have been
activated in <filename>sphinx.conf</filename>) to provide more detailed
information on a per-query basis as to the input/output operations
carried out in the course of that query, with a slight performance hit
and of course bigger logs. Further details are available under the
<link linkend="query-log-format">query log format</link> section.
You might start <filename>searchd</filename> thus:
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --iostats
</programlisting>
</para></listitem>

<listitem><para><option>--cpustats</option> is used to provide an actual CPU time
report (in addition to wall time) in both the query log file (for every given
query) and the status report (aggregated). It depends on the clock_gettime() system
call and might therefore be unavailable on certain systems. You might start
<filename>searchd</filename> thus:
<programlisting>
$ searchd --config /home/myuser/sphinx.conf --cpustats
</programlisting>
</para></listitem>

<listitem><para><option>--port portnumber</option> (<option>-p</option> for short)
is used to specify the port that <filename>searchd</filename> should listen on,
usually for debugging purposes. This will usually default to 9312, but sometimes
you need to run it on a different port. Specifying it on the command line
will override anything specified in the configuration file. The valid range
is 0 to 65535, but ports numbered 1024 and below usually require
a privileged account in order to run. An example of usage:
<programlisting>
$ searchd --port 9313
</programlisting>
</para></listitem>

<listitem>
<para><option>--listen ( address ":" port | port | path ) [ ":" protocol ]</option>
(or <option>-l</option> for short) works like <option>--port</option>, but allows
you to specify not only the port, but also an IP address and port pair, or
a Unix-domain socket path, that <filename>searchd</filename> will listen on.
In other words, you can specify either an IP address (or hostname) and port number,
just a port number, or a Unix socket path. If you specify a port number
but not the address, searchd will listen on all network interfaces.
A Unix path is identified by a leading slash. As the last parameter you
can also specify a protocol handler (listener) to be used for
connections on this socket. Supported protocol values are 'sphinx'
(Sphinx 0.9.x API protocol) and 'mysql41' (MySQL protocol used since
version 4.1, up to at least 5.1).</para>
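<para>A few hypothetical invocations (the addresses, ports, and socket
paths here are purely illustrative):</para>
<programlisting>
$ searchd --listen 9312
$ searchd --listen 192.168.0.1:9312
$ searchd --listen /var/run/searchd.sock
$ searchd --listen 9306:mysql41
</programlisting>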
</listitem>

<listitem><para><option>--index <index></option> (or <option>-i
<index></option> for short) forces this instance of
<filename>searchd</filename> only to serve the specified index.
Like <option>--port</option>, above, this is usually for debugging purposes;
more long-term changes would generally be applied to the configuration file
itself. Example usage:
<programlisting>
$ searchd --index myindex
</programlisting>
</para></listitem>

<listitem><para><option>--strip-path</option> strips the path names from
all the file names referenced from the index (stopwords, wordforms,
exceptions, etc). This is useful for picking up indexes built on another
machine with possibly different path layouts.
</para></listitem>

</itemizedlist>

<para>There are some options for <filename>searchd</filename> that are specific
to Windows platforms, concerning its handling as a service; these are only available in Windows binaries.</para>
<para>Note that on Windows searchd will default to <option>--console</option> mode, unless you install it as a service.</para>
<itemizedlist>

<listitem><para><option>--install</option> installs <filename>searchd</filename> as a service
into the Microsoft Management Console (Control Panel / Administrative Tools / Services).
Any other parameters specified on the command line where <option>--install</option>
is specified will also become part of the command line on future starts of the service.
For example, as part of calling <filename>searchd</filename>, you will likely also need
to specify the configuration file with <option>--config</option>, and you would do that
as well as specifying <option>--install</option>. Once called, the usual start/stop
facilities will become available via the management console, so any methods you could
use for starting, stopping and restarting services would also apply to
<filename>searchd</filename>. Example:
<programlisting>
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install
--config C:\Sphinx\sphinx.conf
</programlisting>
If you wanted to have the I/O stats every time you started <filename>searchd</filename>,
you would specify its option on the same line as the <option>--install</option> command thus:
<programlisting>
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install
--config C:\Sphinx\sphinx.conf --iostats
</programlisting>
</para></listitem>

<listitem><para><option>--delete</option> removes the service from the Microsoft Management Console
and other places where services are registered, after it was previously installed with
<option>--install</option>. Note, this does not uninstall the software or delete the indexes.
It means the service will no longer be run by the services system, and will not be started
on the machine's next start. If currently running as a service, the current instance
will not be terminated (until the next reboot, or until <filename>searchd</filename> is called
with <option>--stop</option>). If the service was installed with a custom name
(with <option>--servicename</option>), the same name will need to be specified
with <option>--servicename</option> when calling to uninstall. Example:
<programlisting>
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --delete
</programlisting>
</para></listitem>

<listitem><para><option>--servicename <name></option> applies the given name to
<filename>searchd</filename> when installing or deleting the service, as it would appear
in the Management Console; this will default to searchd, but if being deployed on servers
where multiple administrators may log into the system, or a system with multiple
<filename>searchd</filename> instances, a more descriptive name may be applicable.
Note that unless combined with <option>--install</option> or <option>--delete</option>,
this option does not do anything. Example:
<programlisting>
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install
--config C:\Sphinx\sphinx.conf --servicename SphinxSearch
</programlisting>
</para></listitem>

<listitem><para><option>--ntservice</option> is the option that is passed by the
Management Console to <filename>searchd</filename> to invoke it as a service
on Windows platforms. It would not normally be necessary to call this directly;
this would normally be called by Windows when the service is started,
although if you wanted to call this as a regular service from the command line
(as the complement to <option>--console</option>) you could do so in theory.
</para></listitem>
</itemizedlist>

<para>
Last but not least, like every other daemon, <filename>searchd</filename> supports a number of signals.
<variablelist>
<varlistentry>
<term>SIGTERM</term>
<listitem><para>Initiates a clean shutdown. New queries will not be handled, but queries
that are already started will not be forcibly interrupted.</para></listitem>
</varlistentry>
<varlistentry>
<term>SIGHUP</term>
<listitem><para>Initiates index rotation. Depending on the value of the
<link linkend="conf-seamless-rotate">seamless_rotate</link> setting,
new queries might be shortly stalled; clients will receive temporary
errors.</para></listitem>
</varlistentry>
<varlistentry>
<term>SIGUSR1</term>
<listitem><para>Forces reopen of searchd log and query log files, letting
you implement log file rotation.</para></listitem>
</varlistentry>
</variablelist>
</para>
</sect1>

<sect1 id="ref-search"><title><filename>search</filename> command reference</title>
|
|
<para><filename>search</filename> is one of the helper tools within the
|
|
Sphinx package. Whereas <filename>searchd</filename> is responsible for
|
|
searches in a server-type environment, <filename>search</filename> is
|
|
aimed at testing the index from the command line, and testing the index
|
|
quickly without building a framework to make the connection to the server
|
|
and process its response.</para>
|
|
<para>Note: <filename>search</filename> is not intended to be deployed as
|
|
part of a client application; it is strongly recommended you do not write
|
|
an interface to <filename>search</filename> instead of
|
|
<filename>searchd</filename>, and none of the bundled client APIs support
|
|
this method. (In any event, <filename>search</filename> will reload files
|
|
each time, whereas <filename>searchd</filename> will cache them in memory
|
|
for performance.)</para>
|
|
<para>That said, many types of query that you could build in the APIs
|
|
could also be made with <filename>search</filename>, however for very
|
|
complex searches it may be easier to construct them using a small script
|
|
and the corresponding API. Additionally, some newer features may be
|
|
available in the <filename>searchd</filename> system that have not yet
|
|
been brought into <filename>search</filename>.</para>
|
|
<para>The calling syntax for <filename>search</filename> is as
|
|
follows:</para>
|
|
<programlisting>
|
|
search [OPTIONS] word1 [word2 [word3 [...]]]
|
|
</programlisting>
|
|
<para>When calling <filename>search</filename>, it is not necessary
|
|
to have <filename>searchd</filename> running; simply make sure that
|
|
the account running the <filename>search</filename> program has read
|
|
access to the configuration file and the index files.</para>
|
|
<para>The default behaviour is to apply a search for word1 (AND word2 AND
|
|
word3... as specified) to all fields in all indexes as given in the
|
|
configuration file. If constructing the equivalent in the API, this would
|
|
be the equivalent to passing <option>SPH_MATCH_ALL</option> to
|
|
<code>SetMatchMode</code>, and specifying <option>*</option> as the
|
|
indexes to query as part of <code>Query</code>.</para>
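
<para>For instance, a hypothetical invocation (the configuration path is
illustrative) that searches the sample <code>test1</code> index for documents
matching both given words could be:</para>
<programlisting>
$ search --config /home/myuser/sphinx.conf --index test1 hello world
</programlisting>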

<para>There are many options available to <filename>search</filename>.
Firstly, the general options:
<itemizedlist>
<listitem><para><option>--config <file></option> (<option>-c
<file></option> for short) tells <filename>search</filename> to use
the given file as its configuration, just as with
<filename>indexer</filename> above.</para></listitem>
<listitem><para><option>--index <index></option> (<option>-i
<index></option> for short) tells <filename>search</filename> to
limit searching to the specified index only; normally it would attempt to
search all of the physical indexes listed in
<filename>sphinx.conf</filename>, not any distributed ones.</para></listitem>
<listitem><para><option>--stdin</option> tells <filename>search</filename> to
accept the query from the standard input, rather than the command line.
This can be useful for testing purposes whereby you could feed input via
pipes and from scripts.
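For instance, a hypothetical one-liner (assuming the configuration file
can be found in one of its default locations):
<programlisting>
$ echo "hello world" | search --stdin
</programlisting>
</para></listitem>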

</itemizedlist>
</para>

<para>Options for setting matches:
<itemizedlist>
<listitem><para><option>--any</option> (<option>-a</option> for short) changes
the matching mode to match any of the words as part of the query (word1 OR
word2 OR word3). In the API this would be equivalent to passing
<option>SPH_MATCH_ANY</option> to <code>SetMatchMode</code>.</para></listitem>
<listitem><para><option>--phrase</option> (<option>-p</option> for short)
changes the matching mode to match all of the words as part of the query,
and do so in the phrase given (not including punctuation). In the API this
would be equivalent to passing <option>SPH_MATCH_PHRASE</option> to
<code>SetMatchMode</code>.</para></listitem>
<listitem><para><option>--boolean</option> (<option>-b</option> for short)
changes the matching mode to <link linkend="boolean-syntax">Boolean
matching</link>. Note if using Boolean syntax matching on the command
line, you may need to escape the symbols (with a backslash) to avoid the
shell/command line processor applying them, such as ampersands being
escaped on a Unix/Linux system to avoid it forking to the
<filename>search</filename> process, although this can be resolved by
using <option>--stdin</option>, as described above. In the API this would be
equivalent to passing <option>SPH_MATCH_BOOLEAN</option> to
<code>SetMatchMode</code>.</para></listitem>
<listitem><para><option>--ext</option> (<option>-e</option> for short) changes
the matching mode to <link linkend="extended-syntax">extended
matching</link> which provides various text querying operators.
In the API this would be equivalent to passing
<option>SPH_MATCH_EXTENDED</option> to <code>SetMatchMode</code>.
</para></listitem>
<listitem><para><option>--filter <attr> <v></option> (<option>-f
<attr> <v></option> for short) filters the results such that
only documents where the attribute given (attr) matches the value given
(v) are returned. For example, <option>--filter deleted 0</option> only matches
documents with an attribute called 'deleted' where its value is 0. You can
also add multiple filters on the command line, by specifying
<option>--filter</option> multiple times; however, if you apply a second
filter to an attribute it will override the first defined
filter.
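For instance, a hypothetical run combining two filters (the attribute
names here are purely illustrative):
<programlisting>
$ search --filter deleted 0 --filter group_id 2 hello
</programlisting>
</para></listitem>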

</itemizedlist>
</para>

<para>Options for handling the results:
<itemizedlist>
<listitem><para><option>--limit <count></option> (<option>-l
count</option> for short) limits the total number of matches back to the
number given. If a 'group' is specified, this will be the number of
grouped results. This defaults to 20 results if not specified (as do the
APIs).</para></listitem>
<listitem><para><option>--offset <count></option> (<option>-o
<count></option> for short) offsets the result list by the number of
places set by the count; this would be used for pagination through
results, where if you have 20 results per 'page', the second page would
begin at offset 20, the third page at offset 40, etc.</para></listitem>
<listitem><para><option>--group <attr></option> (<option>-g
<attr></option> for short) specifies that results should be grouped
together based on the attribute specified. Like the GROUP BY clause in
SQL, it will combine all results where the attribute given matches, and
returns a set of results where each returned result is the best from each
group. Unless otherwise specified, this will be the best match on
relevance.</para></listitem>
<listitem><para><option>--groupsort <expr></option> (<option>-gs
<expr></option> for short) instructs that when results are grouped
with <option>--group</option>, the expression given in <expr> shall
determine the order of the groups. Note, this does not specify which is
the best item within the group, only the order in which the groups
themselves shall be returned.</para></listitem>
<listitem><para><option>--sortby <clause></option> (<option>-s
<clause></option> for short) specifies that results should be sorted
in the order listed in <clause>. This allows you to specify the
order you wish results to be presented in, ordering by different columns.
For example, you could say <option>--sortby "@weight DESC entrytime
DESC"</option> to sort entries first by weight (or relevance) and, where
two or more entries have the same weight, to then sort by time, with the
most recent (newest) first. You will usually need to put the items in
quotes (<option>--sortby "@weight DESC"</option>) or use commas
(<option>--sortby @weight,DESC</option>) to avoid the items being treated
separately. Additionally, like the regular sorting modes, if
<option>--group</option> (grouping) is being used, this will state how to
establish the best match within each group.</para></listitem>
<listitem><para><option>--sortexpr expr</option> (<option>-S expr</option> for
short) specifies that the search results should be presented in an order
determined by an arithmetic expression, stated in expr. For example:
<option>--sortexpr "@weight + ( user_karma + ln(pageviews) )*0.1"</option>
(again noting that this will have to be quoted to avoid the shell dealing
with the asterisk). Extended sort mode is discussed in more detail under
the <option>SPH_SORT_EXTENDED</option> entry under the <link
linkend="sorting-modes">Sorting modes</link> section of the
manual.</para></listitem>
<listitem><para><option>--sort=date</option> specifies that the results should
be sorted by descending (i.e. most recent first) date. This requires that
there is an attribute in the index that is set as a timestamp.</para></listitem>
<listitem><para><option>--rsort=date</option> specifies that the results should
be sorted by ascending (i.e. oldest first) date. This requires that there
is an attribute in the index that is set as a timestamp.</para></listitem>
<listitem><para><option>--sort=ts</option> specifies that the results should be
sorted by timestamp in groups; it will return all of the documents whose
timestamp is within the last hour, then sorted within that bracket for
relevance. After, it would return the documents from the last day, sorted
by relevance, then the last week and then the last month. It is discussed
in more detail under the <option>SPH_SORT_TIME_SEGMENTS</option> entry
under the <link linkend="sorting-modes">Sorting modes</link> section of
the manual.</para></listitem>
</itemizedlist>
</para>
<para>Other options:
<itemizedlist>
<listitem><para><option>--noinfo</option> (<option>-q</option> for short)
instructs <filename>search</filename> not to look up data in your SQL
database. Specifically, for debugging with MySQL and
<filename>search</filename>, you can provide it with a query to look up
the full article based on the returned document ID. It is explained in
more detail under the <link
linkend="conf-sql-query-info">sql_query_info</link> directive.</para></listitem>
</itemizedlist>
</para>
</sect1>

<sect1 id="ref-spelldump"><title><filename>spelldump</filename> command reference</title>
|
|
<para><filename>spelldump</filename> is one of the helper tools within the Sphinx package.</para>
|
|
<para>It is used to extract the contents of a dictionary file that uses
|
|
<filename>ispell</filename> or <filename>MySpell</filename> format, which
|
|
can help build word lists for <glossterm>wordforms</glossterm> - all of
|
|
the possible forms are pre-built for you.</para>
|
|
<para>Its general usage is:</para>
|
|
<programlisting>
|
|
spelldump [options] <dictionary> <affix> [result] [locale-name]
|
|
</programlisting>
|
|
<para>The two main parameters are the dictionary's main file and its affix
|
|
file; usually these are named as
|
|
<filename>[language-prefix].dict</filename> and
|
|
<filename>[language-prefix].aff</filename> and will be available with most
|
|
common Linux distributions, as well as various places online.</para>
|
|
<para><option>[result]</option> specifies where the dictionary data should
|
|
be output to, and <option>[locale-name]</option> additionally specifies
|
|
the locale details you wish to use.</para>
|
|
<para>There is an additional option, <option>-c [file]</option>, which
|
|
specifies a file for case conversion details.</para>
|
|
<para>Examples of its usage are:</para>
|
|
<programlisting>
|
|
spelldump en.dict en.aff
|
|
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
|
|
spelldump ru.dict ru.aff ru.txt .1251
|
|
</programlisting>
|
|
<para>The results file will contain a list of all the words in the
|
|
dictionary in alphabetical order, output in the format of a wordforms file,
|
|
which you can use to customise for your specific circumstances. An example
|
|
of the result file:</para>
|
|
<programlisting>
|
|
zone > zone
|
|
zoned > zoned
|
|
zoning > zoning
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="ref-indextool"><title><filename>indextool</filename> command reference</title>
|
|
<para>
|
|
<filename>indextool</filename> is one of the helper tools within
|
|
the Sphinx package, introduced in version 0.9.9-rc2. It is used to
|
|
dump miscellaneous debug information about the physical index.
|
|
(Additional functionality such as index verification is planned
|
|
in the future, hence the indextool name rather than just indexdump.)
|
|
Its general usage is:
|
|
</para>
|
|
<programlisting>
|
|
indextool <command> [options]
|
|
</programlisting>
|
|
<para>
|
|
The only currently available option applies to all commands
|
|
and lets you specify the configuration file:
|
|
<itemizedlist>
|
|
<listitem><para><option>--config <file></option> (<option>-c <file></option> for short)
|
|
overrides the built-in config file names.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
The commands are as follows:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><para><option>--dumpheader FILENAME.sph</option> quickly dumps
the provided index header file without touching any other index files
or even the configuration file. The report provides a breakdown of
all the index settings, in particular the entire attribute and
field list. Prior to 0.9.9-rc2, this command was present in
the CLI search utility.
</para></listitem>

<listitem><para><option>--dumpconfig FILENAME.sph</option> dumps
the index definition from the given index header file in (almost)
compliant <filename>sphinx.conf</filename> file format.
Added in version 2.0.1-beta.
</para></listitem>

<listitem><para><option>--dumpheader INDEXNAME</option> dumps index header
by index name, looking up the header path in the configuration file.
</para></listitem>

<listitem><para><option>--dumpdocids INDEXNAME</option> dumps document IDs
by index name. It takes the data from the attribute (.spa) file and therefore
requires docinfo=extern to work.
</para></listitem>

<listitem><para><option>--dumphitlist INDEXNAME KEYWORD</option> dumps all
the hits (occurrences) of a given keyword in a given index, with the keyword
specified as text.
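For instance, a hypothetical invocation (the configuration path, index
name, and keyword are illustrative):
<programlisting>
$ indextool --config /home/myuser/sphinx.conf --dumphitlist test1 hello
</programlisting>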
</para></listitem>

<listitem><para><option>--dumphitlist INDEXNAME --wordid ID</option> dumps all
the hits (occurrences) of a given keyword in a given index, with the keyword
specified as an internal numeric ID.
</para></listitem>

<listitem><para><option>--htmlstrip INDEXNAME</option> filters stdin using
HTML stripper settings for a given index, and prints the filtering
results to stdout. Note that the settings will be taken from sphinx.conf,
and not the index header.
</para></listitem>

<listitem><para><option>--check INDEXNAME</option> checks the index data
files for consistency errors that might be introduced either by bugs
in <filename>indexer</filename> and/or hardware faults.
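For instance, a hypothetical check of the sample <code>test1</code> index
(the configuration path is illustrative):
<programlisting>
$ indextool --config /home/myuser/sphinx.conf --check test1
</programlisting>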
</para></listitem>

<listitem><para><option>--strip-path</option> strips the path names from
all the file names referenced from the index (stopwords, wordforms,
exceptions, etc). This is useful for checking indexes built on another
machine with possibly different path layouts.
</para></listitem>

</itemizedlist>
</sect1>


</chapter>

<chapter id="sphinxql-reference"><title>SphinxQL reference</title>
|
|
|
|
|
|
<para>
|
|
SphinxQL is our SQL dialect that exposes all of the search daemon
|
|
functionality using a standard SQL syntax with a few Sphinx-specific
|
|
extensions. Everything available via the SphinxAPI is also available
|
|
SphinxQL but not vice versa; for instance, writes into RT indexes
|
|
are only available via SphinxQL. This chapter documents supported
|
|
SphinxQL statements syntax.
|
|
</para>
|
|
|
|
|
|
<sect1 id="sphinxql-select"><title>SELECT syntax</title>
|
|
<programlisting>
|
|
SELECT
|
|
select_expr [, select_expr ...]
|
|
FROM index [, index2 ...]
|
|
[WHERE where_condition]
|
|
[GROUP BY {col_name | expr_alias}]
|
|
[ORDER BY {col_name | expr_alias} {ASC | DESC} [, ...]]
|
|
[WITHIN GROUP ORDER BY {col_name | expr_alias} {ASC | DESC}]
|
|
[LIMIT offset, row_count]
|
|
[OPTION opt_name = opt_value [, ...]]
|
|
</programlisting>
|
|
<para>
|
|
<b>SELECT</b> statement was introduced in version 0.9.9-rc2.
|
|
It's syntax is based upon regular SQL but adds several Sphinx-specific
|
|
extensions and has a few omissions (such as (currently) missing support for JOINs).
|
|
Specifically,
|
|
<itemizedlist>
|
|
<listitem><para>Column list clause. Column names, arbitrary expressions,
and star ('*') are all allowed (ie.
<code>SELECT @id, group_id*123+456 AS expr1 FROM test1</code>
will work). Unlike in regular SQL, all computed expressions must be aliased
with a valid identifier. Starting with version 2.0.1-beta, <code>AS</code>
is optional. Special names such as @id and @weight should currently
be used with a leading at-sign. This at-sign requirement will be lifted in
the future.
</para></listitem>

<listitem><para>FROM clause. FROM clause should contain the list of indexes
to search through. Unlike in regular SQL, comma means enumeration of
full-text indexes as in <link linkend="api-func-query">Query()</link>
API call rather than JOIN.
</para></listitem>

<listitem><para>WHERE clause. This clause will map both to fulltext query
and filters. Comparison operators (=, !=, <, >, <=, >=), IN,
AND, NOT, and BETWEEN are all supported and map directly to filters.
OR is not supported yet but will be in the future. MATCH('query')
is supported and maps to fulltext query. Query will be interpreted
according to <link linkend="extended-syntax">full-text query language rules</link>.
There must be at most one MATCH() in the clause. Starting with version
2.0.1-beta, <code>{col_name | expr_alias} [NOT] IN @uservar</code>
condition syntax is supported. (Refer to <xref linkend="sphinxql-set"/>
for a discussion of global user variables.)
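For example, a hypothetical query combining a full-text match with
an attribute filter against the sample <code>test1</code> index:
<programlisting>
SELECT * FROM test1 WHERE MATCH('hello world') AND group_id=2
</programlisting>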
</para></listitem>

<listitem><para>GROUP BY clause. Currently only supports grouping by a single
column. The column however can be a computed expression:
<programlisting>
SELECT *, group_id*1000+article_type AS gkey FROM example GROUP BY gkey
</programlisting>
Aggregate functions (AVG(), MIN(), MAX(), SUM()) in column list
clause are supported. Arguments to aggregate functions can be either
plain attributes or arbitrary expressions. COUNT(*) is implicitly
supported as using GROUP BY will add @count column to result set.
Explicit support might be added in the future. COUNT(DISTINCT attr)
is supported. Currently there can be at most one COUNT(DISTINCT)
per query and an argument needs to be an attribute. Both current
restrictions on COUNT(DISTINCT) might be lifted in the future.
<programlisting>
SELECT *, AVG(price) AS avgprice, COUNT(DISTINCT storeid)
FROM products
WHERE MATCH('ipod')
GROUP BY vendorid
</programlisting>
Starting with 2.0.1-beta, GROUP BY on a string attribute is supported,
with respect for current collation (see <xref linkend="collations"/>).
</para></listitem>

<listitem><para>WITHIN GROUP ORDER BY clause. This is a Sphinx specific
extension that lets you control how the best row within a group
will be selected. The syntax matches that of the regular ORDER BY
clause:
<programlisting>
SELECT *, INTERVAL(posted,NOW()-7*86400,NOW()-86400) AS timeseg
FROM example WHERE MATCH('my search query')
GROUP BY siteid
WITHIN GROUP ORDER BY @weight DESC
ORDER BY timeseg DESC, @weight DESC
</programlisting>
Starting with 2.0.1-beta, WITHIN GROUP ORDER BY on a string attribute is supported,
with respect for current collation (see <xref linkend="collations"/>).
</para></listitem>

<listitem><para>ORDER BY clause. Unlike in regular SQL, only column names
(not expressions) are allowed and explicit ASC and DESC are required.
The columns however can be computed expressions:
<programlisting>
SELECT *, @weight*10+docboost AS skey FROM example ORDER BY skey
</programlisting>
Starting with 2.0.1-beta, ORDER BY on a string attribute is supported,
with respect for current collation (see <xref linkend="collations"/>).
</para></listitem>

<listitem><para>LIMIT clause. Both LIMIT N and LIMIT M,N forms are supported.
Unlike in regular SQL (but like in Sphinx API), an implicit LIMIT 0,20
is present by default.
</para></listitem>

<listitem><para>OPTION clause. This is a Sphinx specific extension that
lets you control a number of per-query options. The syntax is:
<programlisting>
OPTION <optionname>=<value> [ , ... ]
</programlisting>
Supported options and their respective allowed values are:
<itemizedlist>
<listitem><para>'ranker' - any of 'proximity_bm25', 'bm25', 'none', 'wordcount', 'proximity', 'matchany', or 'fieldmask'</para></listitem>
<listitem><para>'max_matches' - integer (per-query max matches value)</para></listitem>
<listitem><para>'cutoff' - integer (max found matches threshold)</para></listitem>
<listitem><para>'max_query_time' - integer (max search time threshold, msec)</para></listitem>
<listitem><para>'retry_count' - integer (distributed retries count)</para></listitem>
<listitem><para>'retry_delay' - integer (distributed retry delay, msec)</para></listitem>
<listitem><para>'field_weights' - a named integer list (per-field user weights for ranking)</para></listitem>
<listitem><para>'index_weights' - a named integer list (per-index user weights for ranking)</para></listitem>
<listitem><para>'reverse_scan' - 0 or 1, lets you control the order in which full-scan query processes the rows</para></listitem>
</itemizedlist>
Example:
<programlisting>
SELECT * FROM test WHERE MATCH('@title hello @body world')
OPTION ranker=bm25, max_matches=3000,
    field_weights=(title=10, body=3)
</programlisting>
</para></listitem>

</itemizedlist>
</para>
</sect1>

<sect1 id="sphinxql-show-meta"><title>SHOW META syntax</title>
|
|
<programlisting>
|
|
SHOW META
|
|
</programlisting>
|
|
<para><b>SHOW META</b> shows additional meta-information about the latest
|
|
query such as query time and keyword statistics:
|
|
<programlisting>
|
|
mysql> SELECT * FROM test1 WHERE MATCH('test|one|two');
|
|
+------+--------+----------+------------+
|
|
| id | weight | group_id | date_added |
|
|
+------+--------+----------+------------+
|
|
| 1 | 3563 | 456 | 1231721236 |
|
|
| 2 | 2563 | 123 | 1231721236 |
|
|
| 4 | 1480 | 2 | 1231721236 |
|
|
+------+--------+----------+------------+
|
|
3 rows in set (0.01 sec)
|
|
|
|
mysql> SHOW META;
|
|
+---------------+-------+
|
|
| Variable_name | Value |
|
|
+---------------+-------+
|
|
| total | 3 |
|
|
| total_found | 3 |
|
|
| time | 0.005 |
|
|
| keyword[0] | test |
|
|
| docs[0] | 3 |
|
|
| hits[0] | 5 |
|
|
| keyword[1] | one |
|
|
| docs[1] | 1 |
|
|
| hits[1] | 2 |
|
|
| keyword[2] | two |
|
|
| docs[2] | 1 |
|
|
| hits[2] | 2 |
|
|
+---------------+-------+
|
|
12 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-show-warnings"><title>SHOW WARNINGS syntax</title>
|
|
<programlisting>
|
|
SHOW WARNINGS
|
|
</programlisting>
|
|
<para><b>SHOW WARNINGS</b> statement, introduced in version 0.9.9-rc2,
|
|
can be used to retrieve the warning
|
|
produced by the latest query. The error message will be returned along with
|
|
the query itself:
|
|
<programlisting>
|
|
mysql> SELECT * FROM test1 WHERE MATCH('@@title hello') \G
|
|
ERROR 1064 (42000): index test1: syntax error, unexpected TOK_FIELDLIMIT
|
|
near '@title hello'
|
|
|
|
mysql> SELECT * FROM test1 WHERE MATCH('@title -hello') \G
|
|
ERROR 1064 (42000): index test1: query is non-computable (single NOT operator)
|
|
|
|
mysql> SELECT * FROM test1 WHERE MATCH('"test doc"/3') \G
|
|
*************************** 1. row ***************************
|
|
id: 4
|
|
weight: 2500
|
|
group_id: 2
|
|
date_added: 1231721236
|
|
1 row in set, 1 warning (0.00 sec)
|
|
|
|
mysql> SHOW WARNINGS \G
|
|
*************************** 1. row ***************************
|
|
Level: warning
|
|
Code: 1000
|
|
Message: quorum threshold too high (words=2, thresh=3); replacing quorum operator
|
|
with AND operator
|
|
1 row in set (0.00 sec)
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-show-status"><title>SHOW STATUS syntax</title>
|
|
<para><b>SHOW STATUS</b>, introduced in version 0.9.9-rc2,
|
|
displays a number of useful performance counters. IO and CPU
|
|
counters will only be available if searchd was started with --iostats and --cpustats
|
|
switches respectively.
|
|
<programlisting>
|
|
mysql> SHOW STATUS;
|
|
+--------------------+-------+
|
|
| Variable_name | Value |
|
|
+--------------------+-------+
|
|
| uptime | 216 |
|
|
| connections | 3 |
|
|
| maxed_out | 0 |
|
|
| command_search | 0 |
|
|
| command_excerpt | 0 |
|
|
| command_update | 0 |
|
|
| command_keywords | 0 |
|
|
| command_persist | 0 |
|
|
| command_status | 0 |
|
|
| agent_connect | 0 |
|
|
| agent_retry | 0 |
|
|
| queries | 10 |
|
|
| dist_queries | 0 |
|
|
| query_wall | 0.075 |
|
|
| query_cpu | OFF |
|
|
| dist_wall | 0.000 |
|
|
| dist_local | 0.000 |
|
|
| dist_wait | 0.000 |
|
|
| query_reads | OFF |
|
|
| query_readkb | OFF |
|
|
| query_readtime | OFF |
|
|
| avg_query_wall | 0.007 |
|
|
| avg_query_cpu | OFF |
|
|
| avg_dist_wall | 0.000 |
|
|
| avg_dist_local | 0.000 |
|
|
| avg_dist_wait | 0.000 |
|
|
| avg_query_reads | OFF |
|
|
| avg_query_readkb | OFF |
|
|
| avg_query_readtime | OFF |
|
|
+--------------------+-------+
|
|
29 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-insert"><title>INSERT and REPLACE syntax</title>
|
|
<programlisting>
|
|
{INSERT | REPLACE} INTO index [(column, ...)]
|
|
VALUES (value, ...)
|
|
[, (...)]
|
|
</programlisting>
|
|
<para>
|
|
INSERT statement, introduced in version 1.10-beta, is only supported for RT indexes.
|
|
It inserts new rows (documents) into an existing index, with the provided column values.
|
|
</para>
|
|
<para>
|
|
ID column must be present in all cases. Rows with duplicate IDs will <b>not</b>
|
|
be overwritten by INSERT; use REPLACE to do that.
|
|
</para>
|
|
<para>
|
|
<option>index</option> is the name of RT index into which the new row(s)
|
|
should be inserted. The optional column names list lets you only explicitly specify
|
|
values for some of the columns present in the index. All the other columns will be
|
|
filled with their default values (0 for scalar types, empty string for text types).
|
|
</para>
|
|
<para>
|
|
Expressions are not currently supported in INSERT and values should be explicitly
|
|
specified.
|
|
</para>
|
|
<para>
|
|
Multiple rows can be inserted using a single INSERT statement by providing
|
|
several comma-separated, parens-enclosed lists of rows values.
|
|
</para>
|
|
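<para>For example, a hypothetical multi-row insert into the sample
<code>rt</code> index (whose schema is shown in <xref linkend="sphinxql-describe"/>):</para>
<programlisting>
INSERT INTO rt (id, title, content, gid) VALUES
    (1, 'first doc', 'hello world', 123),
    (2, 'second doc', 'hello again', 234);
</programlisting>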
</sect1>


<sect1 id="sphinxql-delete"><title>DELETE syntax</title>
<programlisting>
DELETE FROM index WHERE {id = value | id IN (val1 [, val2 [, ...]])}
</programlisting>
<para>
DELETE statement, introduced in version 1.10-beta, is only supported for RT indexes.
It deletes existing rows (documents) from an existing index based on ID.
</para>
<para>
<option>index</option> is the name of the RT index from which the row should be deleted.
<option>value</option> is the row ID to be deleted. Support for batch
<code>id IN (2,3,5)</code> syntax was added in version 2.0.1-beta.
</para>
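<para>For instance, a hypothetical batch deletion from an RT index called <code>rt</code>:</para>
<programlisting>
DELETE FROM rt WHERE id IN (2,3,5);
</programlisting>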
<para>
Additional types of WHERE conditions (such as conditions on attributes, etc)
are planned, but not supported yet as of 1.10-beta.
</para>
</sect1>


<sect1 id="sphinxql-set"><title>SET syntax</title>
|
|
<programlisting>
|
|
SET [GLOBAL] server_variable_name = value
|
|
SET GLOBAL @user_variable_name = (int_val1 [, int_val2, ...])
|
|
</programlisting>
|
|
<para>
|
|
SET statement, introduced in version 1.10-beta, modifies a server variable value.
|
|
The variable names are case-insensitive. No variable value changes survive
|
|
server restart. There are the following classes of the variables:
|
|
<orderedlist>
|
|
<listitem><para>per-session server variable (1.10-beta and above)</para></listitem>
|
|
<listitem><para>global server variable (2.0.1-beta and above)</para></listitem>
|
|
<listitem><para>global user variable (2.0.1-beta and above)</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<para>
|
|
Global user variables are shared between concurrent sessions. Currently,
|
|
the only supported value type is the list of BIGINTs, and these variables
|
|
can only be used along with IN() for filtering purpose. The intended usage
|
|
scenario is uploading huge lists of values to <filename>searchd</filename>
|
|
(once) and reusing them (many times) later, saving on network overheads.
|
|
Example:
|
|
<programlisting>
|
|
// in session 1
|
|
mysql> SET GLOBAL @myfilter=(2,3,5,7,11,13);
|
|
Query OK, 0 rows affected (0.00 sec)
|
|
|
|
// later in session 2
|
|
mysql> SELECT * FROM test1 WHERE group_id IN @myfilter;
|
|
+------+--------+----------+------------+-----------------+------+
|
|
| id | weight | group_id | date_added | title | tag |
|
|
+------+--------+----------+------------+-----------------+------+
|
|
| 3 | 1 | 2 | 1299338153 | another doc | 15 |
|
|
| 4 | 1 | 2 | 1299338153 | doc number four | 7,40 |
|
|
+------+--------+----------+------------+-----------------+------+
|
|
2 rows in set (0.02 sec)
|
|
</programlisting>
|
|
</para>

<para>
Per-session and global server variables affect certain server settings in the respective scope.
Known per-session server variables are:
<variablelist>
<varlistentry>
<term><code>AUTOCOMMIT = {0 | 1}</code></term>
<listitem><para>
Whether any data modification statement should be implicitly
wrapped by BEGIN and COMMIT.
Introduced in version 1.10-beta.
</para></listitem>
</varlistentry>
<varlistentry>
<term><code>COLLATION_CONNECTION = collation_name</code></term>
<listitem><para>
Selects the collation to be used for ORDER BY or GROUP BY on string
values in the subsequent queries. Refer to <xref linkend="collations"/>
for a list of known collation names.
Introduced in version 2.0.1-beta.
</para></listitem>
</varlistentry>
<varlistentry>
<term><code>CHARACTER_SET_RESULTS = charset_name</code></term>
<listitem><para>
Does nothing; a placeholder to support frameworks, clients, and
connectors that attempt to automatically enforce a charset when
connecting to a Sphinx server.
Introduced in version 2.0.1-beta.
</para></listitem>
</varlistentry>
</variablelist>
</para>
<para>
Known global server variables are:
<variablelist>
<varlistentry>
<term><code>QUERY_LOG_FORMAT = {plain | sphinxql}</code></term>
<listitem><para>
Changes the current log format.
Introduced in version 2.0.1-beta.
</para></listitem>
</varlistentry>
<varlistentry>
<term><code>LOG_LEVEL = {info | debug | debugv | debugvv}</code></term>
<listitem><para>
Changes the current log verbosity level.
Introduced in version 2.0.1-beta.
</para></listitem>
</varlistentry>
</variablelist>
</para>
<para>
Examples:
<programlisting>
mysql> SET autocommit=0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET GLOBAL query_log_format=sphinxql;
Query OK, 0 rows affected (0.00 sec)
</programlisting>
</para>
</sect1>


<sect1 id="sphinxql-commit"><title>BEGIN, COMMIT, and ROLLBACK syntax</title>
|
|
<programlisting>
|
|
START TRANSACTION | BEGIN
|
|
COMMIT
|
|
ROLLBACK
|
|
SET AUTOCOMMIT = {0 | 1}
|
|
</programlisting>
|
|
<para>
|
|
BEGIN, COMMIT, and ROLLBACK statements were introduced in version 1.10-beta.
|
|
BEGIN statement (or its START TRANSACTION alias) forcibly commits pending
|
|
transaction, if any, and begins a new one. COMMIT statement commits the current
|
|
transaction, making all its changes permanent. ROLLBACK statement rolls back the
|
|
current transaction, canceling all its changes. SET AUTOCOMMIT controls the
|
|
autocommit mode in the active session.
|
|
</para>
|
|
<para>
|
|
AUTOCOMMIT is set to 1 by default, meaning that every statement that perfoms
|
|
any changes on any index is implicitly wrapped in BEGIN and COMMIT.
|
|
</para>
|
|
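<para>For example, a hypothetical pair of inserts into an RT index called
<code>rt</code>, wrapped into a single explicit transaction:</para>
<programlisting>
BEGIN;
INSERT INTO rt (id, title, content, gid) VALUES (3, 'third doc', 'hello', 345);
INSERT INTO rt (id, title, content, gid) VALUES (4, 'fourth doc', 'world', 456);
COMMIT;
</programlisting>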
<para>
Transactions are limited to a single RT index, and also limited in size.
They are atomic, consistent, overly isolated, and durable. Overly isolated
means that the changes are not only invisible to the concurrent transactions
but even to the current session itself.
</para>
</sect1>


<sect1 id="sphinxql-call-snippets"><title>CALL SNIPPETS syntax</title>
|
|
<programlisting>
|
|
CALL SNIPPETS(data, index, query[, opt_value AS opt_name[, ...]])
|
|
</programlisting>
|
|
<para>
|
|
CALL SNIPPETS statement, introduced in version 1.10-beta, builds a snippet
|
|
from provided data and query, using specified index settings.
|
|
</para>
|
|
<para>
|
|
<option>data</option> is the source data string to extract a snippet from.
|
|
<option>index</option> is the name of the index from which to take the text
|
|
processing settings. <option>query</option> is the full-text query to build
|
|
snippets for. Additional options are documented in
|
|
<xref linkend="api-func-buildexcerpts"/>. Usage example:
|
|
</para>
|
|
<programlisting>
|
|
CALL SNIPPETS('this is my document text', 'test1', 'hello world',
|
|
5 AS around, 200 AS limit)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-call-keywords"><title>CALL KEYWORDS syntax</title>
|
|
<programlisting>
|
|
CALL KEYWORDS(text, index, [hits])
|
|
</programlisting>
|
|
<para>
|
|
CALL KEYWORDS statement, introduced in version 1.10-beta, splits text
|
|
into particular keywords. It returns tokenized and normalized forms
|
|
of the keywords, and, optionally, keyword statistics.
|
|
</para>
|
|
<para>
|
|
<option>text</option> is the text to break down to keywords.
|
|
<option>index</option> is the name of the index from which to take the text
|
|
processing settings. <option>hits</option> is an optional boolean parameter
|
|
that specifies whether to return document and hit occurrence statistics.
|
|
</para>
|
|
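<para>For example, a hypothetical call with statistics enabled:</para>
<programlisting>
CALL KEYWORDS('hello world', 'test1', 1)
</programlisting>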
</sect1>


<sect1 id="sphinxql-show-tables"><title>SHOW TABLES syntax</title>
<programlisting>
SHOW TABLES
</programlisting>
<para>
SHOW TABLES statement, introduced in version 2.0.1-beta, enumerates
all currently active indexes along with their types. As of 2.0.1-beta,
existing index types are <option>local</option>, <option>distributed</option>,
and <option>rt</option> respectively.
Example:
<programlisting>
mysql> SHOW TABLES;
+-------+-------------+
| Index | Type        |
+-------+-------------+
| dist1 | distributed |
| rt    | rt          |
| test1 | local       |
| test2 | local       |
+-------+-------------+
4 rows in set (0.00 sec)
</programlisting>
</para>
</sect1>


<sect1 id="sphinxql-describe"><title>DESCRIBE syntax</title>
|
|
<programlisting>
|
|
{DESC | DESCRIBE} index
|
|
</programlisting>
|
|
<para>
|
|
DESCRIBE statement, introduced in version 2.0.1-beta, lists
|
|
index columns and their associated types. Columns are document ID,
|
|
full-text fields, and attributes. The order matches that in which
|
|
fields and attributes are expected by INSERT and REPLACE statements.
|
|
As of 2.0.1-beta, column types are <option>field</option>,
|
|
<option>integer</option>, <option>timestamp</option>,
|
|
<option>ordinal</option>, <option>bool</option>,
|
|
<option>float</option>, <option>bigint</option>,
|
|
<option>string</option>, and <option>mva</option>.
|
|
ID column will be typed either <option>integer</option>
|
|
or <option>bigint</option> based on whether the binaries
|
|
were built with 32-bit or 64-bit document ID support.
|
|
Example:
|
|
</para>
|
|
<programlisting>
|
|
mysql> DESC rt;
|
|
+---------+---------+
|
|
| Field | Type |
|
|
+---------+---------+
|
|
| id | integer |
|
|
| title | field |
|
|
| content | field |
|
|
| gid | integer |
|
|
+---------+---------+
|
|
4 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-create-function"><title>CREATE FUNCTION syntax</title>
|
|
<programlisting>
|
|
CREATE FUNCTION udf_name
|
|
RETURNS {INT | BIGINT | FLOAT}
|
|
SONAME 'udf_lib_file'
|
|
</programlisting>
|
|
<para>
|
|
CREATE FUNCTION statement, introduced in version 2.0.1-beta,
|
|
installs a <link linkend="udf">user-defined function (UDF)</link>
|
|
with the given name and type from the given library file.
|
|
The library file must reside in a trusted
|
|
<link linkend="conf-plugin-dir">plugin_dir</link> directory.
|
|
On success, the function is available for use in all subsequent
|
|
queries that the server receives. Example:
|
|
</para>
|
|
<programlisting>
|
|
mysql> CREATE FUNCTION avgmva RETURNS INT SONAME 'udfexample.dll';
|
|
Query OK, 0 rows affected (0.03 sec)
|
|
|
|
mysql> SELECT *, AVGMVA(tag) AS q from test1;
|
|
+------+--------+---------+-----------+
|
|
| id | weight | tag | q |
|
|
+------+--------+---------+-----------+
|
|
| 1 | 1 | 1,3,5,7 | 4.000000 |
|
|
| 2 | 1 | 2,4,6 | 4.000000 |
|
|
| 3 | 1 | 15 | 15.000000 |
|
|
| 4 | 1 | 7,40 | 23.500000 |
|
|
+------+--------+---------+-----------+
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-drop-function"><title>DROP FUNCTION syntax</title>
|
|
<programlisting>
|
|
DROP FUNCTION udf_name
|
|
</programlisting>
|
|
<para>
|
|
DROP FUNCTION statement, introduced in version 2.0.1-beta,
|
|
deinstalls a <link linkend="udf">user-defined function (UDF)</link>
|
|
with the given name. On success, the function is no longer available
|
|
for use in subsequent queries. Pending concurrent queries will not be
|
|
affected and the library unload, if necessary, will be postponed
|
|
until those queries complete. Example:
|
|
</para>
|
|
<programlisting>
|
|
mysql> DROP FUNCTION avgmva;
|
|
Query OK, 0 rows affected (0.00 sec)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-show-variables"><title>SHOW VARIABLES syntax</title>
|
|
<programlisting>
|
|
SHOW VARIABLES
|
|
</programlisting>
|
|
<para>
|
|
Added in version 2.0.1-beta, this is currently a placeholder
|
|
query that does nothing and reports success. That is in order
|
|
to keep compatibility with frameworks and connectors that
|
|
automatically execute this statement.
|
|
</para>
|
|
<programlisting>
|
|
mysql> SHOW VARIABLES;
|
|
Query OK, 0 rows affected (0.00 sec)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-show-collation"><title>SHOW COLLATION syntax</title>
|
|
<programlisting>
|
|
SHOW COLLATION
|
|
</programlisting>
|
|
<para>
|
|
Added in version 2.0.1-beta, this is currently a placeholder
|
|
query that does nothing and reports success. That is in order
|
|
to keep compatibility with frameworks and connectors that
|
|
automatically execute this statement.
|
|
</para>
|
|
<programlisting>
|
|
mysql> SHOW COLLATION;
|
|
Query OK, 0 rows affected (0.00 sec)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-update"><title>UPDATE syntax</title>
|
|
<programlisting>
|
|
UPDATE index SET col1 = newval1 [, ...] WHERE ID = docid
|
|
</programlisting>
|
|
<para>
|
|
UPDATE statement was added in version 2.0.1-beta. It can currently
|
|
update 32-bit integer attributes only. Multiple attributes and values
|
|
can be specified. Both RT and disk indexes are supported.
|
|
Updates on other attribute types are also planned.
|
|
</para>
|
|
<programlisting>
|
|
mysql> UPDATE myindex SET enabled=0 WHERE id=123;
|
|
Query OK, 1 rows affected (0.00 sec)
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-multi-queries">
|
|
<title>Multi-statement queries</title>
|
|
<para>
|
|
Starting version 2.0.1-beta, SphinxQL supports multi-statement
|
|
queries, or batches. Possible inter-statement optimizations described
|
|
in <xref linkend="multi-queries"/> do apply to SphinxQL just as well.
|
|
The batched queries should be separated by a semicolon. Your MySQL
|
|
client library needs to support MySQL multi-query mechanism and
|
|
multiple result set. For instance, mysqli interface in PHP
|
|
and DBI/DBD libraries in Perl are known to work.
|
|
</para>
|
|
<para>
Here's a PHP sample showing how to utilize the mysqli interface
with Sphinx.
<programlisting><![CDATA[
<?php

$link = mysqli_connect ( "127.0.0.1", "root", "", "", 9306 );
if ( mysqli_connect_errno() )
	die ( "connect failed: " . mysqli_connect_error() );

$batch = "SELECT * FROM test1 ORDER BY group_id ASC;";
$batch .= "SELECT * FROM test1 ORDER BY group_id DESC";

if ( !mysqli_multi_query ( $link, $batch ) )
	die ( "query failed" );

do
{
	// fetch and print result set
	if ( $result = mysqli_store_result($link) )
	{
		while ( $row = mysqli_fetch_row($result) )
			printf ( "id=%s\n", $row[0] );
		mysqli_free_result($result);
	}

	// print divider
	if ( mysqli_more_results($link) )
		printf ( "------\n" );

} while ( mysqli_next_result($link) );
]]></programlisting>
Its output with the sample <code>test1</code> index included
with Sphinx is as follows.
<programlisting>
$ php test_multi.php
id=1
id=2
id=3
id=4
------
id=3
id=4
id=1
id=2
</programlisting>
</para>
<para>
The following statements can currently be used in a batch:
SELECT, SHOW WARNINGS, SHOW STATUS, and SHOW META. An arbitrary
sequence of these statements is allowed. The result sets
returned should match those that would be returned if the
batched queries were sent one by one.
</para>
</sect1>


<sect1 id="sphinxql-comment-syntax">
|
|
<title>Comment syntax</title>
|
|
<para>
|
|
Since version 2.0.1-beta, SphinxQL supports C-style comment syntax.
|
|
Everything from an opening <code>/*</code> sequence to a closing
|
|
<code>*/</code> sequence is ignored. Comments can span multiple lines,
|
|
can not nest, and should not get logged. MySQL specific
|
|
<code>/*! ... */</code> comments are also currently ignored.
|
|
(As the comments support was rather added for better compatibility
|
|
with <filename>mysqldump</filename> produced dumps, rather than
|
|
improving generaly query interoperability between Sphinx and MySQL.)
|
|
<programlisting>
|
|
SELECT /*! SQL_CALC_FOUND_ROWS */ col1 FROM table1 WHERE ...
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-reserved-keywords">
|
|
<title>List of SphinxQL reserved keywords</title>
|
|
<para>A complete alphabetical list of keywords that are currently reserved
|
|
in SphinxQL syntax (and therefore can not be used as identifiers).
|
|
<programlisting>
|
|
AND
|
|
AS
|
|
ASC
|
|
AVG
|
|
BEGIN
|
|
BETWEEN
|
|
BY
|
|
CALL
|
|
COLLATION
|
|
COMMIT
|
|
COUNT
|
|
DELETE
|
|
DESC
|
|
DESCRIBE
|
|
DISTINCT
|
|
FALSE
|
|
FROM
|
|
GLOBAL
|
|
GROUP
|
|
ID
|
|
IN
|
|
INSERT
|
|
INTO
|
|
LIMIT
|
|
MATCH
|
|
MAX
|
|
META
|
|
MIN
|
|
NOT
|
|
NULL
|
|
OPTION
|
|
OR
|
|
ORDER
|
|
REPLACE
|
|
ROLLBACK
|
|
SELECT
|
|
SET
|
|
SHOW
|
|
START
|
|
STATUS
|
|
SUM
|
|
TABLES
|
|
TRANSACTION
|
|
TRUE
|
|
UPDATE
|
|
VALUES
|
|
VARIABLES
|
|
WARNINGS
|
|
WEIGHT
|
|
WHERE
|
|
WITHIN
|
|
</programlisting></para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxql-upgrading-magics">
<title>SphinxQL upgrade notes, version 2.0.1-beta</title>
<para>
This section only applies to existing applications that
use SphinxQL versions prior to 2.0.1-beta.
</para>
<para>
In previous versions, SphinxQL just wrapped around SphinxAPI
and inherited its magic columns and column set quirks. Essentially,
SphinxQL queries could return (slightly) different columns,
in a (slightly) different order, than was explicitly requested
in the query. Namely, the <code>weight</code> magic column (which is not
a real column in any index) was added at all times, and the GROUP BY
related <code>@count</code>, <code>@group</code>, and <code>@distinct</code>
magic columns were conditionally added when grouping. Also, the order
of columns (attributes) in the result set was actually taken from the
index rather than the query. (So if you asked for columns C, B, A
in your query but they were in the A, B, C order in the index,
they would have been returned in the A, B, C order.)
</para>
<para>
In version 2.0.1-beta, we fixed that. SphinxQL is now more
SQL compliant (and will be brought into as much compliance
with standard SQL syntax as possible going forward). This is not
yet a breaking change, because <filename>searchd</filename> now supports the
<link linkend="conf-compat-sphinxql-magics"><code>compat_sphinxql_magics</code></link>
directive that flips between the old "compatibility" mode and the new
"compliance" mode. However, the compatibility mode support is going
to be removed in the future, so it's strongly advised to update SphinxQL
applications and switch to the compliance mode.
</para>
<para>
The important changes are as follows:
<itemizedlist>
<listitem><para>
<b>The <code>@ID</code> magic name is deprecated in favor of
<code>ID</code>.</b> Document ID is considered an attribute.
</para></listitem>
<listitem><para>
<b><code>WEIGHT</code> is no longer implicitly returned</b>,
because it is not actually a column (an index attribute),
but rather an internal function computed per each row (a match).
You have to explicitly ask for it, using the <code>WEIGHT()</code>
function. (The requirement to alias the result will be lifted
in the next release.)
<programlisting>
SELECT id, WEIGHT() w FROM myindex WHERE MATCH('test')
</programlisting>
</para></listitem>
<listitem><para>
<b>You can now use quoted reserved keywords as aliases.</b>
The quote character is the backtick ("`", ASCII code 96 decimal,
60 hex). One particularly useful example would be returning the
<code>weight</code> column like the old mode did:
<programlisting>
SELECT id, WEIGHT() `weight` FROM myindex WHERE MATCH('test')
</programlisting>
</para></listitem>
<listitem><para>
The column order is now different and should now match the
one explicitly defined in the query. So if you are accessing
columns based on their position in the result set rather than
the name (for instance, by using <code>mysql_fetch_row()</code>
rather than <code>mysql_fetch_assoc()</code> in PHP),
<b>check and fix the order of columns in your queries.</b>
</para></listitem>
<listitem><para>
<code>SELECT *</code> returns the columns in index order,
as it used to, including the ID column. However,
<b><code>SELECT *</code> does not automatically return WEIGHT().</b>
To update such queries in case you access columns by names,
simply add it to the query:
<programlisting>
SELECT *, WEIGHT() `weight` FROM myindex WHERE MATCH('test')
</programlisting>
Otherwise, i.e., in case you rely on column order, select
ID, weight, and then the other columns:
<programlisting>
SELECT id, *, WEIGHT() `weight` FROM myindex WHERE MATCH('test')
</programlisting>
</para></listitem>
<listitem><para>
<b>Magic <code>@count</code> and <code>@distinct</code>
attributes are no longer implicitly returned</b>. You now
have to explicitly ask for them when using GROUP BY.
(Also note that you currently have to alias them;
that requirement will be lifted in the future.)
<programlisting>
SELECT gid, COUNT(*) q FROM myindex WHERE MATCH('test')
GROUP BY gid ORDER BY q DESC
</programlisting>
</para></listitem>
</itemizedlist>
</para>
</sect1>

</chapter>

<chapter id="api-reference"><title>API reference</title>

<para>
There are a number of native searchd client API implementations
for Sphinx. As of the time of this writing, we officially support our own
PHP, Python, and Java implementations. There also are third party
free, open-source API implementations for Perl, Ruby, and C++.
</para>
<para>
The reference API implementation is in PHP, because (we believe)
Sphinx is more widely used with PHP than with any other language.
This reference documentation is in turn based on the reference PHP API,
and all code samples in this section will be given in PHP.
</para>
<para>
However, all other APIs provide the same methods and implement
the very same network protocol. Therefore the documentation does
apply to them as well. There might be minor differences as to the
method naming conventions or specific data structures used.
But the provided functionality must not differ across languages.
</para>
<sect1 id="api-funcgroup-general"><title>General API functions</title>

<sect2 id="api-func-getlasterror"><title>GetLastError</title>
<para><b>Prototype:</b> function GetLastError()</para>
<para>
Returns the last error message, as a string, in human readable format.
If there were no errors during the previous API call, an empty string is returned.
</para>
<para>
You should call it when any other function (such as <link linkend="api-func-query">Query()</link>)
fails (typically, the failing function returns false). The returned string will
contain the error description.
</para>
<para>
The error message is <emphasis>not</emphasis> reset by this call; so you can safely
call it several times if needed.
</para>
</sect2>

<sect2 id="api-func-getlastwarning"><title>GetLastWarning</title>
<para><b>Prototype:</b> function GetLastWarning ()</para>
<para>
Returns the last warning message, as a string, in human readable format.
If there were no warnings during the previous API call, an empty string is returned.
</para>
<para>
You should call it to verify whether your request
(such as <link linkend="api-func-query">Query()</link>) was completed but with warnings.
For instance, a search query against a distributed index might complete
successfully even if several remote agents timed out. In that case,
a warning message would be produced.
</para>
<para>
The warning message is <emphasis>not</emphasis> reset by this call; so you can safely
call it several times if needed.
</para>
</sect2>

<sect2 id="api-func-setserver"><title>SetServer</title>
<para><b>Prototype:</b> function SetServer ( $host, $port )</para>
<para>
Sets <filename>searchd</filename> host name and TCP port.
All subsequent requests will use the new host and port settings.
Default host and port are 'localhost' and 9312, respectively.
</para>
</sect2>

<sect2 id="api-func-setretries"><title>SetRetries</title>
<para><b>Prototype:</b> function SetRetries ( $count, $delay=0 )</para>
<para>
Sets distributed retry count and delay.
</para>
<para>
On temporary failures <filename>searchd</filename> will attempt up to
<code>$count</code> retries per agent. <code>$delay</code> is the delay
between the retries, in milliseconds. Retries are disabled by default.
Note that this call will <b>not</b> make the API itself retry on
temporary failure; it only tells <filename>searchd</filename> to do so.
Currently, the list of temporary failures includes all kinds of connect()
failures and maxed out (too busy) remote agents.
</para>
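<para>
For illustration, a minimal sketch (assuming a connected client
object <code>$cl</code> and a distributed index named "dist1",
both hypothetical):
<programlisting>
// retry each failing agent up to 3 times, waiting 500 ms between attempts
$cl->SetRetries ( 3, 500 );
$res = $cl->Query ( "test query", "dist1" );
</programlisting>
</para>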
</sect2>

<sect2 id="api-func-setconnecttimeout"><title>SetConnectTimeout</title>
<para><b>Prototype:</b> function SetConnectTimeout ( $timeout )</para>
<para>
Sets the time allowed to spend connecting to the server before giving up.
</para>
<para>Under some circumstances, the server can be delayed in responding, either
due to network delays, or a query backlog. In either instance, this allows
the client application programmer some degree of control over how their
program interacts with <filename>searchd</filename> when it is not available,
and can ensure that the client application does not fail due to exceeding
the script execution limits (especially in PHP).
</para>
<para>In the event of a failure to connect, an appropriate error code should
be returned back to the application in order for application-level error handling
to advise the user.
</para>
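<para>
A brief usage sketch (the 1-second timeout value is purely illustrative):
<programlisting>
// give up connecting after 1 second instead of blocking indefinitely
$cl->SetConnectTimeout ( 1.0 );
$res = $cl->Query ( "test query" );
if ( $res===false )
{
	if ( $cl->IsConnectError() )
		print "connect error: " . $cl->GetLastError() . "\n";
	else
		print "remote error: " . $cl->GetLastError() . "\n";
}
</programlisting>
</para>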
</sect2>

<sect2 id="api-func-setarrayresult"><title>SetArrayResult</title>
<para><b>Prototype:</b> function SetArrayResult ( $arrayresult )</para>
<para>
PHP specific. Controls the matches format in the search results set
(whether matches should be returned as an array or a hash).
</para>
<para>
The <code>$arrayresult</code> argument must be boolean. If <code>$arrayresult</code> is <code>false</code>
(the default mode), matches will be returned in PHP hash format with
document IDs as keys, and other information (weight, attributes)
as values. If <code>$arrayresult</code> is true, matches will be returned
as a plain array with complete per-match information including
document ID.
</para>
<para>
Introduced along with GROUP BY support on MVA attributes.
Group-by-MVA result sets may contain duplicate document IDs.
Thus they need to be returned as plain arrays, because hashes
will only keep one entry per document ID.
</para>
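<para>
A short sketch of the difference (the "test1" index is hypothetical;
the exact per-match key names shown follow the reference PHP API):
<programlisting>
$cl->SetArrayResult ( false ); // default: "matches" is a hash keyed by document ID
$res = $cl->Query ( "test", "test1" );
foreach ( $res["matches"] as $docid=>$match )
	print "$docid: weight=" . $match["weight"] . "\n";

$cl->SetArrayResult ( true ); // "matches" is a plain array; IDs may repeat
$res = $cl->Query ( "test", "test1" );
foreach ( $res["matches"] as $match )
	print $match["id"] . ": weight=" . $match["weight"] . "\n";
</programlisting>
</para>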
</sect2>

<sect2 id="api-func-isconnecterror"><title>IsConnectError</title>
<para><b>Prototype:</b> function IsConnectError ()</para>
<para>
Checks whether the last error was a network error on API side, or a remote error
reported by searchd. Returns true if the last connection attempt to searchd failed on API side,
false otherwise (if the error was remote, or there were no connection attempts at all).
Introduced in version 0.9.9-rc1.
</para>
</sect2>

</sect1>

<sect1 id="api-funcgroup-general-query-settings"><title>General query settings</title>

<sect2 id="api-func-setlimits"><title>SetLimits</title>
<para><b>Prototype:</b> function SetLimits ( $offset, $limit, $max_matches=0, $cutoff=0 )</para>
<para>
Sets the offset into the server-side result set (<code>$offset</code>) and the amount of matches
to return to the client starting from that offset (<code>$limit</code>). Can additionally
control the maximum server-side result set size for the current query (<code>$max_matches</code>)
and the threshold amount of matches to stop searching at (<code>$cutoff</code>).
All parameters must be non-negative integers.
</para>
<para>
The first two parameters to SetLimits() are identical in behavior to the MySQL
LIMIT clause. They instruct <filename>searchd</filename> to return at
most <code>$limit</code> matches starting from match number <code>$offset</code>.
The default offset and limit settings are 0 and 20, that is, to return the
first 20 matches.
</para>
<para>
The <code>max_matches</code> setting controls how many matches <filename>searchd</filename>
will keep in RAM while searching. <b>All</b> matching documents will be normally
processed, ranked, filtered, and sorted even if <code>max_matches</code> is set to 1.
But only the best N documents are stored in memory at any given moment for performance
and RAM usage reasons, and this setting controls that N. Note that there are
<b>two</b> places where the <code>max_matches</code> limit is enforced. The per-query
limit is controlled by this API call, but there also is a per-server limit
controlled by the <code>max_matches</code> setting in the config file. To prevent
RAM usage abuse, the server will not allow setting the per-query limit
higher than the per-server limit.
</para>
<para>
You can't retrieve more than <code>max_matches</code> matches to the client application.
The default limit is set to 1000. Normally, you should not need to go over
this limit. One thousand records is enough to present to the end user.
And if you're thinking about pulling the results to the application
for further sorting or filtering, that would be <b>much</b> more efficient
if performed on the Sphinx side.
</para>
<para>
The <code>$cutoff</code> setting is intended for advanced performance control.
It tells <filename>searchd</filename> to forcibly stop the search query
once <code>$cutoff</code> matches have been found and processed.
</para>
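<para>
For example, to render page 3 of search results at 20 matches
per page (a sketch; the page size and index name are arbitrary):
<programlisting>
$page = 3;
$per_page = 20;

// skip the first 40 matches, return the next 20
$cl->SetLimits ( ($page-1)*$per_page, $per_page );
$res = $cl->Query ( "test query", "myindex" );
</programlisting>
</para>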
</sect2>

<sect2 id="api-func-setmaxquerytime"><title>SetMaxQueryTime</title>
<para><b>Prototype:</b> function SetMaxQueryTime ( $max_query_time )</para>
<para>
Sets maximum search query time, in milliseconds. Parameter must be
a non-negative integer. Default value is 0 which means "do not limit".
</para>
<para>Similar to the <code>$cutoff</code> setting from <link linkend="api-func-setlimits">SetLimits()</link>,
but limits elapsed query time instead of processed matches count. Local search queries
will be stopped once that much time has elapsed. Note that if you're performing
a search which queries several local indexes, this limit applies to each index
separately.
</para>
</sect2>

<sect2 id="api-func-setoverride"><title>SetOverride</title>
<para><b>Prototype:</b> function SetOverride ( $attrname, $attrtype, $values )</para>
<para>
Sets temporary (per-query) per-document attribute value overrides.
Only supports scalar attributes. $values must be a hash that maps document
IDs to overridden attribute values. Introduced in version 0.9.9-rc1.
</para>
<para>
The override feature lets you "temporarily" update attribute values for some documents
within a single query, leaving all other queries unaffected. This might be useful
for personalized data. For example, assume you're implementing a personalized
search function that wants to boost the posts that the user's friends recommend.
Such data is not just dynamic, but also personal; so you can't simply put it
in the index because you don't want everyone's searches affected. Overrides,
on the other hand, are local to a single query and invisible to everyone else.
So you can, say, set up a "friends_weight" value for every document, defaulting to 0,
then temporarily override it with 1 for documents 123, 456 and 789 (recommended by
exactly the friends of the current user), and use that value when ranking.
</para>
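<para>
Continuing the example above, a sketch of such a personalized query
(the "friends_weight" attribute, the document IDs, and the index name
are all hypothetical):
<programlisting>
// pretend friends_weight is 1 for documents 123, 456, and 789,
// for the duration of this query only
$cl->SetOverride ( "friends_weight", SPH_ATTR_INTEGER,
	array ( 123=>1, 456=>1, 789=>1 ) );
$cl->SetSelect ( "*, @weight+friends_weight*100 AS myweight" );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "myweight DESC" );
$res = $cl->Query ( "test query", "myindex" );
</programlisting>
</para>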
</sect2>

<sect2 id="api-func-setselect"><title>SetSelect</title>
<para><b>Prototype:</b> function SetSelect ( $clause )</para>
<para>
Sets the select clause, listing specific attributes to fetch, and <link linkend="sort-expr">expressions</link>
to compute and fetch. Clause syntax mimics SQL. Introduced in version 0.9.9-rc1.</para>
<para>
SetSelect() is very similar to the part of a typical SQL query between SELECT and FROM.
It lets you choose what attributes (columns) to fetch, and also what expressions
over the columns to compute and fetch. A certain difference from SQL is that expressions
<b>must</b> always be aliased to a correct identifier (consisting of letters and digits)
using the 'AS' keyword. SQL also lets you do that but does not require it. Sphinx enforces
aliases so that the computation results can always be returned under a "normal" name
in the result set, used in other clauses, etc.
</para>
<para>
Everything else is basically identical to SQL. Star ('*') is supported.
Functions are supported. An arbitrary number of expressions is supported.
Computed expressions can be used for sorting, filtering, and grouping,
just like regular attributes.
</para>
<para>
Starting with version 0.9.9-rc2, aggregate functions (AVG(), MIN(),
MAX(), SUM()) are supported when using GROUP BY.
</para>
<para>
Expression sorting (<xref linkend="sort-expr"/>) and geodistance functions
(<xref linkend="api-func-setgeoanchor"/>) are now internally implemented using
this computed expressions mechanism, using the magic names '@expr' and '@geodist'
respectively.
</para>
<bridgehead>Example:</bridgehead>
<programlisting>
$cl->SetSelect ( "*, @weight+(user_karma+ln(pageviews))*0.1 AS myweight" );
$cl->SetSelect ( "exp_years, salary_gbp*{$gbp_usd_rate} AS salary_usd,
	IF(age>40,1,0) AS over40" );
$cl->SetSelect ( "*, AVG(price) AS avgprice" );
</programlisting>
</sect2>

</sect1>

<sect1 id="api-funcgroup-fulltext-query-settings"><title>Full-text search query settings</title>

<sect2 id="api-func-setmatchmode"><title>SetMatchMode</title>
<para><b>Prototype:</b> function SetMatchMode ( $mode )</para>
<para>
Sets full-text query matching mode, as described in <xref linkend="matching-modes"/>.
Parameter must be a constant specifying one of the known modes.
</para>
<para>
<b>WARNING:</b> (PHP specific) you <b>must not</b> put the matching mode
constant name in quotes; that syntax specifies a string and is incorrect:
<programlisting>
$cl->SetMatchMode ( "SPH_MATCH_ANY" ); // INCORRECT! will not work as expected
$cl->SetMatchMode ( SPH_MATCH_ANY ); // correct, works OK
</programlisting>
</para>
</sect2>

<sect2 id="api-func-setrankingmode"><title>SetRankingMode</title>
<para><b>Prototype:</b> function SetRankingMode ( $ranker )</para>
<para>
Sets ranking mode (aka ranker). Only available in SPH_MATCH_EXTENDED
matching mode. Parameter must be a constant specifying one of the known
rankers.
</para>
<para>
By default, in the EXTENDED matching mode Sphinx computes two factors
which contribute to the final match weight. The major part is a phrase
proximity value between the document text and the query.
The minor part is the so-called BM25 statistical function, which varies
from 0 to 1 depending on the keyword frequency within the document
(more occurrences yield higher weight) and within the whole index
(rarer keywords yield higher weight).
</para>
<para>
However, in some cases you'd want to compute weight differently -
or maybe avoid computing it at all for performance reasons because
you're sorting the result set by something else anyway. This can be
accomplished by setting the appropriate ranking mode. The list of
the modes is available in <xref linkend="weighting"/>.
</para>
</sect2>

<sect2 id="api-func-setsortmode"><title>SetSortMode</title>
<para><b>Prototype:</b> function SetSortMode ( $mode, $sortby="" )</para>
<para>
Sets the matches sorting mode, as described in <xref linkend="sorting-modes"/>.
Parameter must be a constant specifying one of the known modes.
</para>
<para>
<b>WARNING:</b> (PHP specific) you <b>must not</b> put the sorting mode
constant name in quotes; that syntax specifies a string and is incorrect:
<programlisting>
$cl->SetSortMode ( "SPH_SORT_ATTR_DESC" ); // INCORRECT! will not work as expected
$cl->SetSortMode ( SPH_SORT_ATTR_ASC ); // correct, works OK
</programlisting>
</para>
</sect2>

<sect2 id="api-func-setweights"><title>SetWeights</title>
<para><b>Prototype:</b> function SetWeights ( $weights )</para>
<para>
Binds per-field weights in the order of appearance in the index.
<b>DEPRECATED</b>, use <link linkend="api-func-setfieldweights">SetFieldWeights()</link> instead.
</para>
</sect2>

<sect2 id="api-func-setfieldweights"><title>SetFieldWeights</title>
<para><b>Prototype:</b> function SetFieldWeights ( $weights )</para>
<para>
Binds per-field weights by name. Parameter must be a hash (associative array)
mapping string field names to integer weights.
</para>
<para>
Match ranking can be affected by per-field weights. For instance,
see <xref linkend="weighting"/> for an explanation how phrase proximity
ranking is affected. This call lets you specify what non-default
weights to assign to different full-text fields.
</para>
<para>
The weights must be positive 32-bit integers. The final weight
will be a 32-bit integer too. Default weight value is 1. Unknown
field names will be silently ignored.
</para>
<para>
There is no enforced limit on the maximum weight value at the
moment. However, beware that if you set it too high you can start
hitting 32-bit wraparound issues. For instance, if you set
a weight of 10,000,000 and search in extended mode, then the
maximum possible weight will be equal to 10 million (your weight)
multiplied by 1 thousand (internal BM25 scaling factor, see <xref linkend="weighting"/>)
multiplied by 1 or more (phrase proximity rank). The result is at least 10 billion,
which does not fit in 32 bits and will be wrapped around, producing
unexpected results.
</para>
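<para>
Usage example (the field names are hypothetical; fields missing
from the hash keep the default weight of 1):
<programlisting>
// make title matches weigh 10x more than body matches
$cl->SetFieldWeights ( array ( "title"=>10, "body"=>1 ) );
$res = $cl->Query ( "test query", "myindex" );
</programlisting>
</para>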
</sect2>

<sect2 id="api-func-setindexweights"><title>SetIndexWeights</title>
<para><b>Prototype:</b> function SetIndexWeights ( $weights )</para>
<para>
Sets per-index weights, and enables weighted summing of match weights
across different indexes. Parameter must be a hash (associative array)
mapping string index names to integer weights. Default is an empty array,
which means to disable weighted summing.
</para>
<para>
When a match with the same document ID is found in several different
local indexes, by default Sphinx simply chooses the match from the index
specified last in the query. This is to support searching through
partially overlapping index partitions.
</para>
<para>
However in some cases the indexes are not just partitions, and you
might want to sum the weights across the indexes instead of picking one.
<code>SetIndexWeights()</code> lets you do that. With summing enabled,
the final match weight in the result set will be computed as a sum of the match
weight coming from the given index multiplied by the respective per-index
weight specified in this call. Ie. if the document 123 is found in
index A with the weight of 2, and also in index B with the weight of 3,
and you called <code>SetIndexWeights ( array ( "A"=>100, "B"=>10 ) )</code>,
the final weight returned to the client will be 2*100+3*10 = 230.
</para>
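<para>
A sketch of the example just described (indexes "A" and "B" are
hypothetical):
<programlisting>
// sum weights across indexes A and B instead of picking the last match
$cl->SetIndexWeights ( array ( "A"=>100, "B"=>10 ) );
$res = $cl->Query ( "test query", "A B" );
</programlisting>
</para>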
</sect2>

</sect1>

<sect1 id="api-funcgroup-filtering"><title>Result set filtering settings</title>

<sect2 id="api-func-setidrange"><title>SetIDRange</title>
<para><b>Prototype:</b> function SetIDRange ( $min, $max )</para>
<para>
Sets an accepted range of document IDs. Parameters must be integers.
Defaults are 0 and 0; that combination means to not limit by range.
</para>
<para>
After this call, only those records that have document ID
between <code>$min</code> and <code>$max</code> (including IDs
exactly equal to <code>$min</code> or <code>$max</code>)
will be matched.
</para>
</sect2>

<sect2 id="api-func-setfilter"><title>SetFilter</title>
<para><b>Prototype:</b> function SetFilter ( $attribute, $values, $exclude=false )</para>
<para>
Adds a new integer values set filter.
</para>
<para>
On this call, an additional new filter is added to the existing
list of filters. <code>$attribute</code> must be a string with the
attribute name. <code>$values</code> must be a plain array
containing integer values. <code>$exclude</code> must be a boolean
value; it controls whether to accept the matching documents
(default mode, when <code>$exclude</code> is false) or reject them.
</para>
<para>
Only those documents where the <code>$attribute</code> column value
stored in the index matches any of the values from the <code>$values</code>
array will be matched (or rejected, if <code>$exclude</code> is true).
</para>
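<para>
Usage example (attribute names and values are hypothetical):
<programlisting>
// only match documents whose group_id is 2, 3, or 5
$cl->SetFilter ( "group_id", array ( 2, 3, 5 ) );

// and additionally reject documents whose category is 10
$cl->SetFilter ( "category", array ( 10 ), true );
</programlisting>
</para>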
</sect2>

<sect2 id="api-func-setfilterrange"><title>SetFilterRange</title>
<para><b>Prototype:</b> function SetFilterRange ( $attribute, $min, $max, $exclude=false )</para>
<para>
Adds a new integer range filter.
</para>
<para>
On this call, an additional new filter is added to the existing
list of filters. <code>$attribute</code> must be a string with the
attribute name. <code>$min</code> and <code>$max</code> must be
integers that define the acceptable attribute values range
(including the boundaries). <code>$exclude</code> must be a boolean
value; it controls whether to accept the matching documents
(default mode, when <code>$exclude</code> is false) or reject them.
</para>
<para>
Only those documents where the <code>$attribute</code> column value
stored in the index is between <code>$min</code> and <code>$max</code>
(including values that are exactly equal to <code>$min</code> or <code>$max</code>)
will be matched (or rejected, if <code>$exclude</code> is true).
</para>
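<para>
Usage example (the attribute name is hypothetical):
<programlisting>
// only match documents with 10 to 1000 comments, inclusive
$cl->SetFilterRange ( "comment_count", 10, 1000 );
</programlisting>
</para>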
</sect2>

<sect2 id="api-func-setfilterfloatrange"><title>SetFilterFloatRange</title>
<para><b>Prototype:</b> function SetFilterFloatRange ( $attribute, $min, $max, $exclude=false )</para>
<para>
Adds a new float range filter.
</para>
<para>
On this call, an additional new filter is added to the existing
list of filters. <code>$attribute</code> must be a string with the
attribute name. <code>$min</code> and <code>$max</code> must be
floats that define the acceptable attribute values range
(including the boundaries). <code>$exclude</code> must be a boolean
value; it controls whether to accept the matching documents
(default mode, when <code>$exclude</code> is false) or reject them.
</para>
<para>
Only those documents where the <code>$attribute</code> column value
stored in the index is between <code>$min</code> and <code>$max</code>
(including values that are exactly equal to <code>$min</code> or <code>$max</code>)
will be matched (or rejected, if <code>$exclude</code> is true).
</para>
</sect2>

<sect2 id="api-func-setgeoanchor"><title>SetGeoAnchor</title>
<para><b>Prototype:</b> function SetGeoAnchor ( $attrlat, $attrlong, $lat, $long )</para>
<para>
Sets the anchor point for geosphere distance (geodistance) calculations, and enables them.
</para>
<para>
<code>$attrlat</code> and <code>$attrlong</code> must be strings that contain the names
of latitude and longitude attributes, respectively. <code>$lat</code> and <code>$long</code>
are floats that specify anchor point latitude and longitude, in radians.
</para>
<para>
Once an anchor point is set, you can use the magic <code>"@geodist"</code> attribute
name in your filters and/or sorting expressions. Sphinx will compute the geosphere distance
between the given anchor point and a point specified by the latitude and longitude
attributes from each full-text match, and attach this value to the resulting match.
The latitude and longitude values both in <code>SetGeoAnchor</code> and the index
attribute data are expected to be in radians. The result will be returned in meters,
so a geodistance value of 1000.0 means 1 km. 1 mile is approximately 1609.344 meters.
</para>
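<para>
A sketch showing the radian conversion and a distance filter
(the coordinates and attribute names are hypothetical):
<programlisting>
// anchor at Moscow; the index attributes also store radians
$cl->SetGeoAnchor ( "lat", "lon", deg2rad(55.75), deg2rad(37.62) );

// keep matches within 10 km of the anchor, closest first
$cl->SetFilterFloatRange ( "@geodist", 0.0, 10000.0 );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "@geodist ASC" );
</programlisting>
</para>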
</sect2>

</sect1>

<sect1 id="api-funcgroup-groupby"><title>GROUP BY settings</title>

<sect2 id="api-func-setgroupby"><title>SetGroupBy</title>
<para><b>Prototype:</b> function SetGroupBy ( $attribute, $func, $groupsort="@group desc" )</para>
<para>
Sets the grouping attribute, function, and groups sorting mode; and enables grouping
(as described in <xref linkend="clustering"/>).
</para>
<para>
<code>$attribute</code> is a string that contains the group-by attribute name.
<code>$func</code> is a constant that chooses a function applied to the attribute value in order to compute the group-by key.
<code>$groupsort</code> is a clause that controls how the groups will be sorted. Its syntax is similar
to that described in <xref linkend="sort-extended"/>.
</para>
<para>
The grouping feature is very similar in nature to the GROUP BY clause in SQL.
Results produced by this function call are going to be the same as produced
by the following pseudo code:
<programlisting>
SELECT ... GROUP BY $func($attribute) ORDER BY $groupsort
</programlisting>
Note that it's <code>$groupsort</code> that affects the order of matches
in the final result set. The sorting mode (see <xref linkend="api-func-setsortmode"/>)
affects the ordering of matches <emphasis>within</emphasis> a group, ie.
which match will be selected as the best one from the group.
So you can for instance order the groups by matches count
and select the most relevant match within each group at the same time.
</para>
<para>
Starting with version 0.9.9-rc2, aggregate functions (AVG(), MIN(),
MAX(), SUM()) are supported through the <link linkend="api-func-setselect">SetSelect()</link> API call
when using GROUP BY.
</para>
<para>
Starting with version 2.0.1-beta, grouping on string attributes
is supported, with respect to the current collation.
</para>
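<para>
For example, to group matches by a "category" attribute and order
the groups by their match counts (a sketch; the attribute and index
names are hypothetical):
<programlisting>
// one best match per category; biggest categories first
$cl->SetGroupBy ( "category", SPH_GROUPBY_ATTR, "@count desc" );
$cl->SetSortMode ( SPH_SORT_RELEVANCE );
$res = $cl->Query ( "test query", "products" );
</programlisting>
</para>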
</sect2>

<sect2 id="api-func-setgroupdistinct"><title>SetGroupDistinct</title>
<para><b>Prototype:</b> function SetGroupDistinct ( $attribute )</para>
<para>
Sets the attribute name for per-group distinct values count calculations.
Only available for grouping queries.
</para>
<para>
<code>$attribute</code> is a string that contains the attribute name.
For each group, all values of this attribute will be stored (as RAM limits
permit), then the amount of distinct values will be calculated and returned
to the client. This feature is similar to the <code>COUNT(DISTINCT)</code>
clause in standard SQL; so these Sphinx calls:
<programlisting>
$cl->SetGroupBy ( "category", SPH_GROUPBY_ATTR, "@count desc" );
$cl->SetGroupDistinct ( "vendor" );
</programlisting>
can be expressed using the following SQL clauses:
<programlisting>
SELECT id, weight, all-attributes,
	COUNT(DISTINCT vendor) AS @distinct,
	COUNT(*) AS @count
FROM products
GROUP BY category
ORDER BY @count DESC
</programlisting>
In the sample pseudo code shown just above, the <code>SetGroupDistinct()</code> call
corresponds to the <code>COUNT(DISTINCT vendor)</code> clause only.
<code>GROUP BY</code>, <code>ORDER BY</code>, and <code>COUNT(*)</code>
clauses are all an equivalent of the <code>SetGroupBy()</code> settings. Both queries
will return one matching row for each category. In addition to indexed attributes,
matches will also contain the total per-category matches count, and the count
of distinct vendor IDs within each category.
</para>
</sect2>

</sect1>

<sect1 id="api-funcgroup-querying"><title>Querying</title>

<sect2 id="api-func-query"><title>Query</title>
<para><b>Prototype:</b> function Query ( $query, $index="*", $comment="" )</para>
<para>
Connects to the <filename>searchd</filename> server, runs the given search query
with current settings, obtains and returns the result set.
</para>
<para>
<code>$query</code> is a query string. <code>$index</code> is an index name (or names) string.
Returns false and sets the <code>GetLastError()</code> message on general error.
Returns the search result set on success.
Additionally, the contents of <code>$comment</code> are sent to the query log, marked in square brackets, just before the search terms, which can be very useful for debugging.
Currently, the comment is limited to 128 characters.
</para>
<para>
Default value for <code>$index</code> is <code>"*"</code> that means
to query all local indexes. Characters allowed in index names include
Latin letters (a-z), numbers (0-9), minus sign (-), and underscore (_);
everything else is considered a separator. Therefore, all of the
following sample calls are valid and will search the same
two indexes:
<programlisting>
$cl->Query ( "test query", "main delta" );
$cl->Query ( "test query", "main;delta" );
$cl->Query ( "test query", "main, delta" );
</programlisting>
Index specification order matters. If documents with identical IDs are found
in two or more indexes, weight and attribute values from the very last matching
index will be used for sorting and returning to the client (unless explicitly
overridden with <link linkend="api-func-setindexweights">SetIndexWeights()</link>). Therefore,
in the example above, matches from the "delta" index will always win over
matches from "main".
</para>
<para>
On success, <code>Query()</code> returns a result set that contains
some of the found matches (as requested by <link linkend="api-func-setlimits">SetLimits()</link>)
and additional general per-query statistics. The result set is a hash
(PHP specific; other languages might utilize other structures instead
of a hash) with the following keys and values:
<variablelist>
<varlistentry>
<term>"matches":</term>
<listitem><para>Hash which maps found document IDs to another small hash containing document weight and attribute values
(or an array of the similar small hashes if <link linkend="api-func-setarrayresult">SetArrayResult()</link> was enabled).
</para></listitem>
</varlistentry>
<varlistentry>
<term>"total":</term>
<listitem><para>Total amount of matches retrieved <emphasis>on server</emphasis> (ie. to the server side result set) by this query.
You can retrieve up to this amount of matches from server for this query text with current query settings.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"total_found":</term>
<listitem><para>Total amount of matching documents in index (that were found and processed on server).</para></listitem>
</varlistentry>
<varlistentry>
<term>"words":</term>
<listitem><para>Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").</para></listitem>
</varlistentry>
<varlistentry>
<term>"error":</term>
<listitem><para>Query error message reported by <filename>searchd</filename> (string, human readable). Empty if there were no errors.</para></listitem>
</varlistentry>
<varlistentry>
<term>"warning":</term>
<listitem><para>Query warning message reported by <filename>searchd</filename> (string, human readable). Empty if there were no warnings.</para></listitem>
</varlistentry>
</variablelist>
</para>
<para>
It should be noted that <code>Query()</code> carries out the same actions as
<code>AddQuery()</code> and <code>RunQueries()</code> without the intermediate steps;
it is analogous to a single <code>AddQuery()</code> call, followed by a corresponding
<code>RunQueries()</code>, then returning the first array element of matches
(from the first, and only, query.)
</para>
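<para>
A minimal end-to-end sketch with error and warning handling
(the "myindex" index is hypothetical; the result set keys are
as documented above):
<programlisting>
$res = $cl->Query ( "test query", "myindex" );
if ( $res===false )
	die ( "query failed: " . $cl->GetLastError() );
if ( $cl->GetLastWarning() )
	print "WARNING: " . $cl->GetLastWarning() . "\n";

print "total found: " . $res["total_found"] . "\n";
foreach ( $res["matches"] as $docid=>$match )
	print "$docid, weight=" . $match["weight"] . "\n";
</programlisting>
</para>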
</sect2>

<sect2 id="api-func-addquery"><title>AddQuery</title>
<para><b>Prototype:</b> function AddQuery ( $query, $index="*", $comment="" )</para>
<para>
Adds an additional query with current settings to the multi-query batch.
<code>$query</code> is a query string. <code>$index</code> is an index name (or names) string.
Additionally if provided, the contents of <code>$comment</code> are sent to the query log,
marked in square brackets, just before the search terms, which can be very useful for debugging.
Currently, this is limited to 128 characters.
Returns an index into the results array returned from <link linkend="api-func-runqueries">RunQueries()</link>.
</para>
<para>
Batch queries (or multi-queries) enable <filename>searchd</filename> to perform internal
optimizations if possible. They also reduce network connection overheads and search process
creation overheads in all cases. They do not result in any additional overheads compared
to simple queries. Thus, if you run several different queries from your web page,
you should always consider using multi-queries.
</para>
<para>
For instance, running the same full-text query but with different
sorting or group-by settings will enable <filename>searchd</filename>
to perform the expensive full-text search and ranking operation only once,
but compute multiple group-by results from its output.
</para>
<para>
This can be a big saver when you need to display not just plain
search results but also some per-category counts, such as the amount of
products grouped by vendor. Without multi-query, you would have to run several
queries which perform essentially the same search and retrieve the
same matches, but create result sets differently. With multi-query,
you simply pass all these queries in a single batch and Sphinx
optimizes the redundant full-text search internally.
</para>
<para>
<code>AddQuery()</code> internally saves the full current settings state
along with the query, and you can safely change them afterwards for subsequent
<code>AddQuery()</code> calls. Already added queries will not be affected;
there's actually no way to change them at all. Here's an example:
<programlisting>
$cl->SetSortMode ( SPH_SORT_RELEVANCE );
$cl->AddQuery ( "hello world", "documents" );

$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "price" );
$cl->AddQuery ( "ipod", "products" );

$cl->AddQuery ( "harry potter", "books" );

$results = $cl->RunQueries ();
</programlisting>
With the code above, the 1st query will search for "hello world" in the "documents" index
and sort results by relevance, the 2nd query will search for "ipod" in the "products"
index and sort results by price, and the 3rd query will search for "harry potter"
in the "books" index while still sorting by price. Note that the 2nd <code>SetSortMode()</code> call
does not affect the first query (because it's already added) but affects both other
subsequent queries.
</para>
<para>
Additionally, any filters set up before an <code>AddQuery()</code> will fall through to subsequent
queries. So, if <code>SetFilter()</code> is called before the first query, the same filter
will be in place for the second (and subsequent) queries batched through <code>AddQuery()</code>
unless you call <code>ResetFilters()</code> first. Alternatively, you can add additional filters
as well.</para>
<para>This also holds for grouping and sorting options; no current sorting,
filtering, or grouping settings are affected by this call, so subsequent queries
will reuse the current query settings.
</para>
<para>
<code>AddQuery()</code> returns an index into an array of results
that will be returned from the <code>RunQueries()</code> call. It is simply
a sequentially increasing 0-based integer, ie. the first call will return 0,
the second will return 1, and so on. Just a small helper so you won't have
to track the indexes manually if you need them.
</para>
</sect2>

<sect2 id="api-func-runqueries"><title>RunQueries</title>
<para><b>Prototype:</b> function RunQueries ()</para>
<para>
Connects to searchd, runs a batch of all queries added using <code>AddQuery()</code>,
obtains and returns the result sets. Returns false and sets the <code>GetLastError()</code>
message on general error (such as network I/O failure). Returns a plain array
of result sets on success.
</para>
<para>
Each result set in the returned array is exactly the same as
the result set returned from <link linkend="api-func-query"><code>Query()</code></link>.
</para>
<para>
Note that the batch query request itself almost always succeeds -
unless there's a network error, blocking index rotation in progress,
or another general failure which prevents the whole request from being
processed.
</para>
<para>
However individual queries within the batch might very well fail.
In this case their respective result sets will contain a non-empty <code>"error"</code> message,
but no matches or query statistics. In the extreme case all queries within the batch
could fail. There still will be no general error reported, because the API was able to
successfully connect to <filename>searchd</filename>, submit the batch, and receive
the results - but every result set will have a specific error message.
</para>
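<para>
A sketch of checking batched results individually (continuing the
AddQuery() example above):
<programlisting>
$results = $cl->RunQueries ();
if ( $results===false )
	die ( "request failed: " . $cl->GetLastError() );

foreach ( $results as $i=>$res )
{
	// per-query errors do not fail the whole batch; check each result set
	if ( $res["error"] )
		print "query $i failed: " . $res["error"] . "\n";
	else
		print "query $i returned " . $res["total_found"] . " match(es)\n";
}
</programlisting>
</para>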
</sect2>

<sect2 id="api-func-resetfilters"><title>ResetFilters</title>
<para><b>Prototype:</b> function ResetFilters ()</para>
<para>
Clears all currently set filters.
</para>
<para>
This call is normally only required when using multi-queries. You might want
to set different filters for different queries in the batch. To do that,
you should call <code>ResetFilters()</code> and add new filters using
the respective calls.
</para>
</sect2>

<sect2 id="api-func-resetgroupby"><title>ResetGroupBy</title>
<para><b>Prototype:</b> function ResetGroupBy ()</para>
<para>
Clears all currently set group-by settings, and disables group-by.
</para>
<para>
This call is normally only required when using multi-queries.
You can change individual group-by settings using <code>SetGroupBy()</code>
and <code>SetGroupDistinct()</code> calls, but you can not disable
group-by using those calls. <code>ResetGroupBy()</code>
fully resets previous group-by settings and disables group-by mode
in the current state, so that subsequent <code>AddQuery()</code>
calls can perform non-grouping searches.
</para>
</sect2>

</sect1>

<sect1 id="api-funcgroup-additional-functionality"><title>Additional functionality</title>

<sect2 id="api-func-buildexcerpts"><title>BuildExcerpts</title>
<para><b>Prototype:</b> function BuildExcerpts ( $docs, $index, $words, $opts=array() )</para>
<para>
Excerpts (snippets) builder function. Connects to <filename>searchd</filename>,
asks it to generate excerpts (snippets) from given documents, and returns the results.
</para>
<para>
<code>$docs</code> is a plain array of strings that carry the documents' contents.
<code>$index</code> is an index name string. Different settings (such as charset,
morphology, wordforms) from the given index will be used.
<code>$words</code> is a string that contains the keywords to highlight. They will
be processed with respect to index settings. For instance, if English stemming
is enabled in the index, "shoes" will be highlighted even if the keyword is "shoe".
Starting with version 0.9.9-rc1, keywords can contain wildcards, that work similarly to
<link linkend="conf-enable-star">star-syntax</link> available in queries.
<code>$opts</code> is a hash which contains additional optional highlighting parameters:
<variablelist>
<varlistentry>
<term>"before_match":</term>
<listitem><para>A string to insert before a keyword match. Starting with version 1.10-beta,
a %PASSAGE_ID% macro can be used in this string. The macro is replaced with an incrementing
passage number within a current snippet. Numbering starts at 1 by default but can be
overridden with the "start_passage_id" option. In a multi-document call, %PASSAGE_ID% would
restart at every given document. Default is "&lt;b&gt;".</para></listitem>
</varlistentry>
<varlistentry>
<term>"after_match":</term>
<listitem><para>A string to insert after a keyword match. Starting with version 1.10-beta,
a %PASSAGE_ID% macro can be used in this string. Default is "&lt;/b&gt;".</para></listitem>
</varlistentry>
<varlistentry>
<term>"chunk_separator":</term>
<listitem><para>A string to insert between snippet chunks (passages). Default is " ... ".</para></listitem>
</varlistentry>
<varlistentry>
<term>"limit":</term>
<listitem><para>Maximum snippet size, in symbols (codepoints). Integer, default is 256.</para></listitem>
</varlistentry>
<varlistentry>
<term>"around":</term>
<listitem><para>How many words to pick around each matching keywords block. Integer, default is 5.</para></listitem>
</varlistentry>
<varlistentry>
<term>"exact_phrase":</term>
<listitem><para>Whether to highlight exact query phrase matches only instead of individual keywords. Boolean, default is false.</para></listitem>
</varlistentry>
<varlistentry>
<term>"single_passage":</term>
<listitem><para>Whether to extract the single best passage only. Boolean, default is false.</para></listitem>
</varlistentry>
<varlistentry>
<term>"use_boundaries":</term>
<listitem><para>Whether to additionally break passages by phrase
boundary characters, as configured in index settings with the
<link linkend="conf-phrase-boundary">phrase_boundary</link>
directive. Boolean, default is false.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"weight_order":</term>
<listitem><para>Whether to sort the extracted passages in order of relevance (decreasing weight),
or in order of appearance in the document (increasing position). Boolean, default is false.</para></listitem>
</varlistentry>
<varlistentry>
<term>"query_mode":</term>
<listitem><para>Added in version 1.10-beta. Whether to handle $words as a query in
<link linkend="extended-syntax">extended syntax</link>, or as a bag of words
(default behavior). For instance, in query mode ("one two" | "three four") will
only highlight and include those occurrences of "one two" or "three four" when
the two words from each pair are adjacent to each other. In default mode,
any single occurrence of "one", "two", "three", or "four" would be
highlighted. Boolean, default is false.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"force_all_words":</term>
<listitem><para>Added in version 1.10-beta. Ignores the snippet length limit until it
includes all the keywords. Boolean, default is false.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"limit_passages":</term>
<listitem><para>Added in version 1.10-beta. Limits the maximum number of passages
that can be included into the snippet. Integer, default is 0 (no limit).
</para></listitem>
</varlistentry>
<varlistentry>
<term>"limit_words":</term>
<listitem><para>Added in version 1.10-beta. Limits the maximum number of keywords
that can be included into the snippet. Integer, default is 0 (no limit).
</para></listitem>
</varlistentry>
<varlistentry>
<term>"start_passage_id":</term>
<listitem><para>Added in version 1.10-beta. Specifies the starting value of
the %PASSAGE_ID% macro (that gets detected and expanded in <option>before_match</option>,
<option>after_match</option> strings). Integer, default is 1.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"load_files":</term>
<listitem><para>Added in version 1.10-beta. Whether to handle $docs as data
to extract snippets from (default behavior), or to treat it as file names,
and load data from the specified files on the server side. Starting with
version 2.0.1-beta, up to <link linkend="conf-dist-threads">dist_threads</link>
worker threads per request will be created to parallelize the work
when this flag is enabled. Boolean, default is false.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"html_strip_mode":</term>
<listitem><para>Added in version 1.10-beta. HTML stripping mode setting.
Defaults to "index", which means that index settings will be used.
The other values are "none" and "strip", that forcibly skip or apply
stripping regardless of index settings; and "retain", that retains
HTML markup and protects it from highlighting. The "retain" mode can
only be used when highlighting full documents and thus requires that
no snippet size limits are set. String, allowed values are "none",
"strip", "index", and "retain".
</para></listitem>
</varlistentry>
<varlistentry>
<term>"allow_empty":</term>
<listitem><para>Added in version 1.10-beta. Allows an empty string to be
returned as the highlighting result when a snippet could not be generated
(no keywords match, or no passages fit the limit). By default,
the beginning of the original text would be returned instead of an empty
string. Boolean, default is false.
</para></listitem>
</varlistentry>
<varlistentry>
<term>"passage_boundary":</term>
<listitem><para>Added in version 2.0.1-beta. Ensures that passages do not
cross a sentence, paragraph, or zone boundary (when used with an index
that has the respective indexing settings enabled). String, allowed
values are "sentence", "paragraph", and "zone".
</para></listitem>
</varlistentry>
<varlistentry>
<term>"emit_zones":</term>
<listitem><para>Added in version 2.0.1-beta. Emits an HTML tag with
an enclosing zone name before each passage. Boolean, default is false.
</para></listitem>
</varlistentry>
</variablelist>
</para>
<para>
The snippets extraction algorithm currently favors better passages
(with closer phrase matches), and then passages with keywords not
yet in the snippet. Generally, it will try to highlight the best match
with the query, and it will also try to highlight all the query keywords,
as made possible by the limits. In case the document does not match
the query, the beginning of the document trimmed down according to the
limits will be returned by default. Starting with 1.10-beta, you can
also return an empty snippet instead in that case by setting the
"allow_empty" option to true.
</para>
<para>
Returns false on failure. Returns a plain array of strings with excerpts (snippets) on success.
</para>
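<para>
Usage example (the "test1" index, documents, and option values are
illustrative; only options documented above are used):
<programlisting>
$docs = array
(
	"this is my test text to be highlighted",
	"another test text to be highlighted"
);
$opts = array ( "limit"=>60, "around"=>3, "single_passage"=>true );

$res = $cl->BuildExcerpts ( $docs, "test1", "test text", $opts );
if ( $res===false )
	die ( "excerpts failed: " . $cl->GetLastError() );
foreach ( $res as $snippet )
	print "$snippet\n";
</programlisting>
</para>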
</sect2>

<sect2 id="api-func-updateatttributes"><title>UpdateAttributes</title>
<para><b>Prototype:</b> function UpdateAttributes ( $index, $attrs, $values )</para>
<para>
Instantly updates given attribute values in given documents.
Returns the number of actually updated documents (0 or more) on success, or -1 on failure.
</para>
<para>
<code>$index</code> is a name of the index (or indexes) to be updated.
<code>$attrs</code> is a plain array with string attribute names, listing attributes that are updated.
<code>$values</code> is a hash where key is document ID, and value is a plain array of new attribute values.
</para>
<para>
<code>$index</code> can be either a single index name or a list, like in <code>Query()</code>.
Unlike <code>Query()</code>, wildcard is not allowed and all the indexes
to update must be specified explicitly. The list of indexes can include
distributed index names. Updates on distributed indexes will be pushed
to all agents.
</para>
<para>
The updates only work with <code>docinfo=extern</code> storage strategy.
They are very fast because they're working fully in RAM, but they can also
be made persistent: updates are saved on disk on clean <filename>searchd</filename>
shutdown initiated by SIGTERM signal. With additional restrictions, updates
are also possible on MVA attributes; refer to the <link linkend="conf-mva-updates-pool">mva_updates_pool</link>
directive for details.
</para>
<para>
Usage example:
<programlisting>
$cl->UpdateAttributes ( "test1", array("group_id"), array(1=>array(456)) );
$cl->UpdateAttributes ( "products", array ( "price", "amount_in_stock" ),
	array ( 1001=>array(123,5), 1002=>array(37,11), 1003=>array(25,129) ) );
</programlisting>
The first sample statement will update document 1 in index "test1", setting "group_id" to 456.
The second one will update documents 1001, 1002 and 1003 in index "products". For document 1001,
the new price will be set to 123 and the new amount in stock to 5; for document 1002, the new price
will be 37 and the new amount will be 11; etc.
</para>
</sect2>

<sect2 id="api-func-buildkeywords"><title>BuildKeywords</title>
<para><b>Prototype:</b> function BuildKeywords ( $query, $index, $hits )</para>
<para>
Extracts keywords from query using tokenizer settings for given index, optionally with per-keyword occurrence statistics.
Returns an array of hashes with per-keyword information.
</para>
<para>
<code>$query</code> is a query to extract keywords from.
<code>$index</code> is a name of the index to get tokenizing settings and keyword occurrence statistics from.
<code>$hits</code> is a boolean flag that indicates whether keyword occurrence statistics are required.
</para>
<para>
Usage example:
</para>
<programlisting>
$keywords = $cl->BuildKeywords ( "this.is.my query", "test1", false );
</programlisting>
</sect2>

<sect2 id="api-func-escapestring"><title>EscapeString</title>
<para><b>Prototype:</b> function EscapeString ( $string )</para>
<para>
Escapes characters that are treated as special operators by the query language parser.
Returns an escaped string.
</para>
<para>
<code>$string</code> is a string to escape.
</para>
<para>
This function might seem redundant because it's trivial to implement in any calling
application. However, as the set of special characters might change over time, it makes
sense to have an API call that is guaranteed to escape all such characters at all times.
</para>
<para>
Usage example:
</para>
<programlisting>
$escaped = $cl->EscapeString ( "escaping-sample@query/string" );
</programlisting>
</sect2>

<sect2 id="api-func-status"><title>Status</title>
<para><b>Prototype:</b> function Status ()</para>
<para>
Queries searchd status, and returns an array of status variable name and value pairs.
</para>
<para>
Usage example:
</para>
<programlisting>
$status = $cl->Status ();
foreach ( $status as $row )
	print join ( ": ", $row ) . "\n";
</programlisting>
</sect2>

<sect2 id="api-func-flushattributes"><title>FlushAttributes</title>
<para><b>Prototype:</b> function FlushAttributes ()</para>
<para>
Forces <filename>searchd</filename> to flush pending attribute updates
to disk, and blocks until completion. Returns a non-negative internal
"flush tag" on success. Returns -1 and sets an error message on error.
Introduced in version 1.10-beta.
</para>
<para>
Attribute values updated using the <link linkend="api-func-updateatttributes">UpdateAttributes()</link>
API call are only kept in RAM until a so-called flush (which writes
the current, possibly updated attribute values back to disk). The FlushAttributes()
call lets you enforce a flush. The call will block until <filename>searchd</filename>
finishes writing the data to disk, which might take seconds or even minutes
depending on the total data size (.spa file size). All the currently updated
indexes will be flushed.
</para>
<para>
The flush tag should be treated as an ever-growing magic number that does not
mean anything. It's guaranteed to be non-negative. It is guaranteed to grow over
time, though not necessarily in a sequential fashion; for instance, two calls that
return 10 and then 1000 respectively are a valid situation. If two calls to
FlushAttrs() return the same tag, it means that there were no actual attribute
updates in between them, and therefore the current flushed state remained the same
(for all indexes).
</para>
<para>
Usage example:
</para>
<programlisting>
$status = $cl->FlushAttributes ();
if ( $status&lt;0 )
	print "ERROR: " . $cl->GetLastError();
</programlisting>
</sect2>

</sect1>

<sect1 id="api-funcgroup-pconn"><title>Persistent connections</title>
|
|
<para>
|
|
Persistent connections let you use a single network connection to run
multiple commands that would otherwise require reconnects.
|
|
</para>
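<para>
A minimal PHP sketch of running several commands over one such connection
(hypothetical queries and index name, for illustration only):
</para>
<programlisting>
$cl->Open ();
$res1 = $cl->Query ( "first query", "test1" );
$res2 = $cl->Query ( "second query", "test1" );
$cl->Close ();
</programlisting>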
|
|
|
|
<sect2 id="api-func-open"><title>Open</title>
|
|
<para><b>Prototype:</b> function Open ()</para>
|
|
<para>
|
|
Opens a persistent connection to the server.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="api-func-close"><title>Close</title>
|
|
<para><b>Prototype:</b> function Close ()</para>
|
|
<para>
|
|
Closes a previously opened persistent connection.
|
|
</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|
|
<chapter id="sphinxse"><title>MySQL storage engine (SphinxSE)</title>
|
|
|
|
|
|
<sect1 id="sphinxse-overview"><title>SphinxSE overview</title>
|
|
<para>
|
|
SphinxSE is a MySQL storage engine which can be compiled
into MySQL server 5.x using its pluggable architecture.
It is not available for the MySQL 4.x series. It also requires
MySQL 5.0.22 or higher in the 5.0.x series, or MySQL 5.1.12
or higher in the 5.1.x series.
|
|
</para>
|
|
<para>
|
|
Despite the name, SphinxSE does <emphasis>not</emphasis>
|
|
actually store any data itself. Rather, it is a built-in client
|
|
which allows MySQL server to talk to <filename>searchd</filename>,
|
|
run search queries, and obtain search results. All indexing and
|
|
searching happen outside MySQL.
|
|
</para>
|
|
<para>
|
|
Obvious SphinxSE applications include:
|
|
<itemizedlist>
|
|
<listitem><para>easier porting of MySQL FTS applications to Sphinx;</para></listitem>
|
|
<listitem><para>allowing Sphinx use with programming languages for which native APIs are not available yet;</para></listitem>
|
|
<listitem><para>optimizations when additional Sphinx result set processing on MySQL side is required
|
|
(eg. JOINs with original document tables, additional MySQL-side filtering, etc).</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxse-installing"><title>Installing SphinxSE</title>
|
|
<para>
|
|
You will need to obtain a copy of MySQL sources, prepare those,
and then recompile the MySQL binary.
MySQL sources (mysql-5.x.yy.tar.gz) can be obtained from the
<ulink url="http://dev.mysql.com">dev.mysql.com</ulink> Web site.
|
|
</para>
|
|
<para>
|
|
For some MySQL versions, delta tarballs with already prepared source
versions are available from the Sphinx Web site. After unzipping one of
those over the original sources, MySQL is ready to be configured and
built with Sphinx support.
|
|
</para>
|
|
<para>
|
|
If such a tarball is not available, or does not work for you for any
reason, you will have to prepare the sources manually. You will need the
GNU Autotools framework (autoconf, automake and libtool) installed
to do that.
|
|
</para>
|
|
|
|
|
|
<sect2 id="sphinxse-mysql50"><title>Compiling MySQL 5.0.x with SphinxSE</title>
|
|
<para>
|
|
Skip steps 1-3 if using an already prepared delta tarball.
|
|
</para>
|
|
<orderedlist>
|
|
<listitem><para>copy <filename>sphinx.5.0.yy.diff</filename> patch file
|
|
into MySQL sources directory and run
|
|
<programlisting>
|
|
patch -p1 < sphinx.5.0.yy.diff
|
|
</programlisting>
|
|
If there's no .diff file exactly for the specific version you need
to build, try applying the .diff with the closest version numbers.
It is important that the patch applies with no rejects.
|
|
</para></listitem>
|
|
<listitem><para>in MySQL sources directory, run
|
|
<programlisting>
|
|
sh BUILD/autorun.sh
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>in MySQL sources directory, create a <filename>sql/sphinx</filename>
directory and copy all files from the <filename>mysqlse</filename> directory
in Sphinx sources there. Example:
|
|
<programlisting>
|
|
cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.0.24/sql/sphinx
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
configure MySQL and enable Sphinx engine:
|
|
<programlisting>
|
|
./configure --with-sphinx-storage-engine
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
build and install MySQL:
|
|
<programlisting>
|
|
make
|
|
make install
|
|
</programlisting>
|
|
</para></listitem>
|
|
</orderedlist>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="sphinxse-mysql51"><title>Compiling MySQL 5.1.x with SphinxSE</title>
|
|
<para>
|
|
Skip steps 1-2 if using an already prepared delta tarball.
|
|
</para>
|
|
<orderedlist>
|
|
<listitem><para>in MySQL sources directory, create a <filename>storage/sphinx</filename>
directory and copy all files from the <filename>mysqlse</filename> directory
in Sphinx sources there. Example:
|
|
<programlisting>
|
|
cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.1.14/storage/sphinx
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>in MySQL sources directory, run
|
|
<programlisting>
|
|
sh BUILD/autorun.sh
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
configure MySQL and enable Sphinx engine:
|
|
<programlisting>
|
|
./configure --with-plugins=sphinx
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
build and install MySQL:
|
|
<programlisting>
|
|
make
|
|
make install
|
|
</programlisting>
|
|
</para></listitem>
|
|
</orderedlist>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="sphinxse-checking"><title>Checking SphinxSE installation</title>
|
|
<para>
|
|
To check whether SphinxSE has been successfully compiled
into MySQL, launch the newly built server, run the mysql client and
issue a <code>SHOW ENGINES</code> query. You should see a list
of all available engines. Sphinx should be present, and the "Support"
column should contain "YES":
|
|
</para>
|
|
<programlisting>
|
|
mysql> show engines;
|
|
+------------+----------+-------------------------------------------------------------+
|
|
| Engine | Support | Comment |
|
|
+------------+----------+-------------------------------------------------------------+
|
|
| MyISAM | DEFAULT | Default engine as of MySQL 3.23 with great performance |
|
|
...
|
|
| SPHINX | YES | Sphinx storage engine |
|
|
...
|
|
+------------+----------+-------------------------------------------------------------+
|
|
13 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxse-using"><title>Using SphinxSE</title>
|
|
<para>
|
|
To search via SphinxSE, you would need to create a special ENGINE=SPHINX "search table",
and then SELECT from it with the full-text query put into the WHERE clause for the query column.
|
|
</para>
|
|
<para>
|
|
Let's begin with an example create statement and search query:
|
|
<programlisting>
|
|
CREATE TABLE t1
|
|
(
|
|
id INTEGER UNSIGNED NOT NULL,
|
|
weight INTEGER NOT NULL,
|
|
query VARCHAR(3072) NOT NULL,
|
|
group_id INTEGER,
|
|
INDEX(query)
|
|
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";
|
|
|
|
SELECT * FROM t1 WHERE query='test it;mode=any';
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
The first 3 columns of the search table <emphasis>must</emphasis> be of the following types:
<code>INTEGER UNSIGNED</code> or <code>BIGINT</code> for the 1st column (document id),
<code>INTEGER</code> or <code>BIGINT</code> for the 2nd column (match weight), and
<code>VARCHAR</code> or <code>TEXT</code> for the 3rd column (your query), respectively.
|
|
This mapping is fixed; you can not omit any of these three required columns,
|
|
or move them around, or change types. Also, query column must be indexed;
|
|
all the others must be kept unindexed. Column names are ignored, so you
|
|
can use arbitrary ones.
|
|
</para>
|
|
<para>
|
|
Additional columns must be either <code>INTEGER</code>, <code>TIMESTAMP</code>,
|
|
<code>BIGINT</code>, <code>VARCHAR</code>, or <code>FLOAT</code>.
|
|
They will be bound to attributes provided in Sphinx result set by name, so their
|
|
names must match attribute names specified in <filename>sphinx.conf</filename>.
|
|
If there's no such attribute name in Sphinx search results, column will have
|
|
<code>NULL</code> values.
|
|
</para>
|
|
<para>
|
|
Special "virtual" attributes names can also be bound to SphinxSE columns.
|
|
<code>_sph_</code> needs to be used instead of <code>@</code> for that.
|
|
For instance, to obtain the values of <code>@groupby</code>, <code>@count</code>,
|
|
or <code>@distinct</code> virtual attributes, use <code>_sph_groupby</code>,
|
|
<code>_sph_count</code> or <code>_sph_distinct</code> column names, respectively.
|
|
</para>
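<para>
For instance, a sketch of a search table that also captures per-group
match counts from a group-by query (illustrative table, column, and
index names):
</para>
<programlisting>
CREATE TABLE t2
(
    id          BIGINT NOT NULL,
    weight      INTEGER NOT NULL,
    query       VARCHAR(3072) NOT NULL,
    _sph_count  INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";

SELECT * FROM t2 WHERE query='test;groupby=attr:group_id;';
</programlisting>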
|
|
<para>
|
|
<code>CONNECTION</code> string parameter can be used to specify default
|
|
searchd host, port and indexes for queries issued using this table.
|
|
If no connection string is specified in <code>CREATE TABLE</code>,
|
|
index name "*" (ie. search all indexes) and localhost:9312 are assumed.
|
|
Connection string syntax is as follows:
|
|
<programlisting>
|
|
CONNECTION="sphinx://HOST:PORT/INDEXNAME"
|
|
</programlisting>
|
|
You can change the default connection string later:
|
|
<programlisting>
|
|
ALTER TABLE t1 CONNECTION="sphinx://NEWHOST:NEWPORT/NEWINDEXNAME";
|
|
</programlisting>
|
|
You can also override all these parameters per-query.
|
|
</para>
|
|
<para>
|
|
As seen in the example, both the query text and the search options should be put
into the WHERE clause on the search query column (ie. 3rd column); the options
are separated by semicolons, and option names are separated from values by an equals sign.
|
|
Any number of options can be specified. Available options are:
|
|
<itemizedlist>
|
|
<listitem><para>query - query text;</para></listitem>
|
|
<listitem><para>mode - matching mode. Must be one of "all", "any", "phrase",
|
|
"boolean", or "extended". Default is "all";</para></listitem>
|
|
<listitem><para>sort - match sorting mode. Must be one of "relevance", "attr_desc",
|
|
"attr_asc", "time_segments", or "extended". In all modes besides "relevance"
|
|
attribute name (or sorting clause for "extended") is also required after a colon:
|
|
<programlisting>
|
|
... WHERE query='test;sort=attr_asc:group_id';
|
|
... WHERE query='test;sort=extended:@weight desc, group_id asc';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>offset - offset into result set, default is 0;</para></listitem>
|
|
<listitem><para>limit - amount of matches to retrieve from result set, default is 20;</para></listitem>
|
|
<listitem><para>index - names of the indexes to search:
|
|
<programlisting>
|
|
... WHERE query='test;index=test1;';
|
|
... WHERE query='test;index=test1,test2,test3;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>minid, maxid - min and max document ID to match;</para></listitem>
|
|
<listitem><para>weights - comma-separated list of weights to be assigned to Sphinx full-text fields:
|
|
<programlisting>
|
|
... WHERE query='test;weights=1,2,3;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>filter, !filter - comma-separated attribute name and a set of values to match:
|
|
<programlisting>
|
|
# only include groups 1, 5 and 19
|
|
... WHERE query='test;filter=group_id,1,5,19;';
|
|
|
|
# exclude groups 3 and 11
|
|
... WHERE query='test;!filter=group_id,3,11;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>range, !range - comma-separated attribute name, min and max value to match:
|
|
<programlisting>
|
|
# include groups from 3 to 7, inclusive
|
|
... WHERE query='test;range=group_id,3,7;';
|
|
|
|
# exclude groups from 5 to 25
|
|
... WHERE query='test;!range=group_id,5,25;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>maxmatches - per-query max matches value, as in max_matches parameter to
|
|
<link linkend="api-func-setlimits">SetLimits()</link> API call:
|
|
<programlisting>
|
|
... WHERE query='test;maxmatches=2000;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>cutoff - maximum allowed matches, as in cutoff parameter to
|
|
<link linkend="api-func-setlimits">SetLimits()</link> API call:
|
|
<programlisting>
|
|
... WHERE query='test;cutoff=10000;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>maxquerytime - maximum allowed query time (in milliseconds), as in
|
|
<link linkend="api-func-setmaxquerytime">SetMaxQueryTime()</link> API call:
|
|
<programlisting>
|
|
... WHERE query='test;maxquerytime=1000;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>groupby - group-by function and attribute, corresponding to
|
|
<link linkend="api-func-setgroupby">SetGroupBy()</link> API call:
|
|
<programlisting>
|
|
... WHERE query='test;groupby=day:published_ts;';
|
|
... WHERE query='test;groupby=attr:group_id;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>groupsort - group-by sorting clause:
|
|
<programlisting>
|
|
... WHERE query='test;groupsort=@count desc;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>distinct - an attribute to compute COUNT(DISTINCT) for when doing group-by, as in
|
|
<link linkend="api-func-setgroupdistinct">SetGroupDistinct()</link> API call:
|
|
<programlisting>
|
|
... WHERE query='test;groupby=attr:country_id;distinct=site_id';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>indexweights - comma-separated list of index names and weights
|
|
to use when searching through several indexes:
|
|
<programlisting>
|
|
... WHERE query='test;indexweights=idx_exact,2,idx_stemmed,1;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>comment - a string to mark this query in query log
|
|
(mapping to $comment parameter in <link linkend="api-func-query">Query()</link> API call):
|
|
<programlisting>
|
|
... WHERE query='test;comment=marker001;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>select - a string with expressions to compute
|
|
(mapping to <link linkend="api-func-setselect">SetSelect()</link> API call):
|
|
<programlisting>
|
|
... WHERE query='test;select=2*a+3*b as myexpr;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>host, port - remote <filename>searchd</filename> host name
|
|
and TCP port, respectively:
|
|
<programlisting>
|
|
... WHERE query='test;host=sphinx-test.loc;port=7312;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>ranker - a ranking function to use with "extended" matching mode,
|
|
as in <link linkend="api-func-setrankingmode">SetRankingMode()</link> API call
|
|
(the only mode that supports full query syntax).
|
|
Known values are "proximity_bm25", "bm25", "none", "wordcount", "proximity",
|
|
"matchany", and "fieldmask".
|
|
<programlisting>
|
|
... WHERE query='test;mode=extended;ranker=bm25;';
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>geoanchor - geodistance anchor, as in
|
|
<link linkend="api-func-setgeoanchor">SetGeoAnchor()</link> API call.
|
|
Takes 4 parameters, which are the latitude and longitude attribute names,
|
|
and anchor point coordinates respectively:
|
|
<programlisting>
|
|
... WHERE query='test;geoanchor=latattr,lonattr,0.123,0.456';
|
|
</programlisting>
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
One <emphasis role="bold">very important</emphasis> note that it is
|
|
<emphasis role="bold">much</emphasis> more efficient to allow Sphinx
|
|
to perform sorting, filtering and slicing the result set than to raise
|
|
max matches count and use WHERE, ORDER BY and LIMIT clauses on MySQL
|
|
side. This is for two reasons. First, Sphinx does a number of
|
|
optimizations and performs better than MySQL on these tasks.
|
|
Second, less data would need to be packed by searchd, transferred
|
|
and unpacked by SphinxSE.
|
|
</para>
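<para>
In other words, whenever possible, push filtering, sorting and slicing
into the query string options documented above, rather than doing them
in MySQL. An illustrative sketch using the example table t1 from above:
</para>
<programlisting>
# efficient: Sphinx filters, sorts and slices the result set
SELECT * FROM t1
WHERE query='test;filter=group_id,1;sort=extended:@weight desc;limit=10;';

# inefficient: a large result set is post-processed on MySQL side
SELECT * FROM t1
WHERE query='test;maxmatches=10000;limit=10000;'
    AND group_id=1 ORDER BY weight DESC LIMIT 10;
</programlisting>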
|
|
<para>
|
|
Starting with version 0.9.9-rc1, additional query info besides result set could be
|
|
retrieved with <code>SHOW ENGINE SPHINX STATUS</code> statement:
|
|
<programlisting>
|
|
mysql> SHOW ENGINE SPHINX STATUS;
|
|
+--------+-------+-------------------------------------------------+
|
|
| Type | Name | Status |
|
|
+--------+-------+-------------------------------------------------+
|
|
| SPHINX | stats | total: 25, total found: 25, time: 126, words: 2 |
|
|
| SPHINX | words | sphinx:591:1256 soft:11076:15945 |
|
|
+--------+-------+-------------------------------------------------+
|
|
2 rows in set (0.00 sec)
|
|
</programlisting>
|
|
This information can also be accessed through status variables. Note
|
|
that this method does not require super-user privileges.
|
|
<programlisting>
|
|
mysql> SHOW STATUS LIKE 'sphinx_%';
|
|
+--------------------+----------------------------------+
|
|
| Variable_name | Value |
|
|
+--------------------+----------------------------------+
|
|
| sphinx_total | 25 |
|
|
| sphinx_total_found | 25 |
|
|
| sphinx_time | 126 |
|
|
| sphinx_word_count | 2 |
|
|
| sphinx_words | sphinx:591:1256 soft:11076:15945 |
|
|
+--------------------+----------------------------------+
|
|
5 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
You could perform JOINs on SphinxSE search table and tables using
|
|
other engines. Here's an example with "documents" from example.sql:
|
|
<programlisting>
|
|
mysql> SELECT content, date_added FROM test.documents docs
|
|
-> JOIN t1 ON (docs.id=t1.id)
|
|
-> WHERE query="one document;mode=any";
|
|
+-------------------------------------+---------------------+
|
|
| content                             | date_added          |
|
|
+-------------------------------------+---------------------+
|
|
| this is my test document number two | 2006-06-17 14:04:28 |
|
|
| this is my test document number one | 2006-06-17 14:04:28 |
|
|
+-------------------------------------+---------------------+
|
|
2 rows in set (0.00 sec)
|
|
|
|
mysql> SHOW ENGINE SPHINX STATUS;
|
|
+--------+-------+---------------------------------------------+
|
|
| Type | Name | Status |
|
|
+--------+-------+---------------------------------------------+
|
|
| SPHINX | stats | total: 2, total found: 2, time: 0, words: 2 |
|
|
| SPHINX | words | one:1:2 document:2:2 |
|
|
+--------+-------+---------------------------------------------+
|
|
2 rows in set (0.00 sec)
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="sphinxse-snippets"><title>Building snippets (excerpts) via MySQL</title>
|
|
<para>
|
|
Starting with version 0.9.9-rc2, SphinxSE also includes a UDF function
that lets you create snippets through MySQL. The functionality is fully
similar to the <link linkend="api-func-buildexcerpts">BuildExcerpts()</link>
API call, but accessible through MySQL+SphinxSE.
|
|
</para>
|
|
<para>
|
|
The binary that provides the UDF is named <filename>sphinx.so</filename>
|
|
and should be automatically built and installed to proper location
|
|
along with SphinxSE itself. If it does not get installed automatically
|
|
for some reason, look for <filename>sphinx.so</filename> in the build
|
|
directory and copy it to the plugins directory of your MySQL instance.
|
|
After that, register the UDF using the following statement:
|
|
<programlisting>
|
|
CREATE FUNCTION sphinx_snippets RETURNS STRING SONAME 'sphinx.so';
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
The function name <emphasis>must</emphasis> be sphinx_snippets;
you can not use an arbitrary name. Function arguments are as follows:
|
|
</para>
|
|
<para>
|
|
<b>Prototype:</b> function sphinx_snippets ( document, index, words, [options] );
|
|
</para>
|
|
<para>
|
|
Document and words arguments can be either strings or table columns.
|
|
Options must be specified like this: <code>'value' AS option_name</code>.
|
|
For a list of supported options, refer to
|
|
<link linkend="api-func-buildexcerpts">BuildExcerprts()</link> API call.
|
|
The only UDF-specific additional option is named <code>'sphinx'</code>
|
|
and lets you specify searchd location (host and port).
|
|
</para>
|
|
<para>
|
|
Usage examples:
|
|
<programlisting>
|
|
SELECT sphinx_snippets('hello world doc', 'main', 'world',
|
|
'sphinx://192.168.1.1/' AS sphinx, true AS exact_phrase,
|
|
'[b]' AS before_match, '[/b]' AS after_match)
|
|
FROM documents;
|
|
|
|
SELECT title, sphinx_snippets(text, 'index', 'mysql php') AS text
|
|
FROM sphinx, documents
|
|
WHERE query='mysql php' AND sphinx.id=documents.id;
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
</chapter>
|
|
<chapter id="reporting-bugs"><title>Reporting bugs</title>
|
|
|
|
|
|
<para>
|
|
Unfortunately, Sphinx is not yet 100% bug free (even though I'm working hard
|
|
towards that), so you might occasionally run into some issues.
|
|
</para>
|
|
<para>
|
|
Reporting as much as possible about each bug is very important -
|
|
because to fix it, I need to be able either to reproduce and debug the bug,
|
|
or to deduce what's causing it from the information that you provide.
|
|
So here are some instructions on how to do that.
|
|
</para>
|
|
|
|
|
|
<bridgehead>Build-time issues</bridgehead>
|
|
<para>If Sphinx fails to build for some reason, please do the following:</para>
|
|
<orderedlist>
|
|
<listitem><para>check that headers and libraries for your DBMS are properly installed
|
|
(for instance, check that <filename>mysql-devel</filename> package is present);
|
|
</para></listitem>
|
|
<listitem><para>report Sphinx version and config file (be sure to remove the passwords!),
|
|
MySQL (or PostgreSQL) configuration info, gcc version, OS version and CPU type
|
|
(ie. x86, x86-64, PowerPC, etc):
|
|
<programlisting>
|
|
mysql_config
|
|
gcc --version
|
|
uname -a
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
report the error message which is produced by <filename>configure</filename>
or <filename>gcc</filename> (it should be enough to include the error message
itself only, not the whole build log).
|
|
</para></listitem>
|
|
</orderedlist>
|
|
|
|
|
|
<bridgehead>Run-time issues</bridgehead>
|
|
<para>
|
|
If Sphinx builds and runs, but there are any problems running it,
|
|
please do the following:
|
|
</para>
|
|
<orderedlist>
|
|
<listitem><para>describe the bug (ie. both the expected behavior and actual behavior)
|
|
and all the steps necessary to reproduce it;</para></listitem>
|
|
<listitem><para>include Sphinx version and config file (be sure to remove the passwords!),
|
|
MySQL (or PostgreSQL) version, gcc version, OS version and CPU type (ie. x86, x86-64,
|
|
PowerPC, etc):
|
|
<programlisting>
|
|
mysql --version
|
|
gcc --version
|
|
uname -a
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>build, install and run debug versions of all Sphinx programs (this is
|
|
to enable a lot of additional internal checks, so-called assertions):
|
|
<programlisting>
|
|
make distclean
|
|
./configure --with-debug
|
|
make install
|
|
killall -TERM searchd
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>reindex to check if any assertions are triggered (in this case,
|
|
it's likely that the index is corrupted and causing problems);
|
|
</para></listitem>
|
|
<listitem><para>if the bug does not reproduce with debug versions,
|
|
revert to non-debug and mention it in your report;
|
|
</para></listitem>
|
|
<listitem><para>if the bug could be easily reproduced with a small (1-100 record)
|
|
part of your database, please provide a gzipped dump of that part;
|
|
</para></listitem>
|
|
<listitem><para>if the problem is related to <filename>searchd</filename>, include
|
|
relevant entries from <filename>searchd.log</filename> and
|
|
<filename>query.log</filename> in your bug report;
|
|
</para></listitem>
|
|
<listitem><para>if the problem is related to <filename>searchd</filename>, try
|
|
running it in console mode and check if it dies with an assertion:
|
|
<programlisting>
|
|
./searchd --console
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>if any program dies with an assertion, provide the assertion message.</para></listitem>
|
|
</orderedlist>
|
|
|
|
|
|
<bridgehead>Debugging assertions, crashes and hangups</bridgehead>
|
|
<para>
|
|
If any program dies with an assertion, crashes without an assertion or hangs up,
|
|
you would additionally need to generate a core dump and examine it.
|
|
</para>
|
|
<orderedlist>
|
|
<listitem><para>
|
|
enable core dumps. On most Linux systems, this is done
|
|
using <filename>ulimit</filename>:
|
|
<programlisting>
|
|
ulimit -c 32768
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
run the program and try to reproduce the bug;
|
|
</para></listitem>
|
|
<listitem><para>
|
|
if the program crashes (either with or without an assertion),
|
|
find the core file in the current directory (the program should typically
print out a "Segmentation fault (core dumped)" message);
|
|
</para></listitem>
|
|
<listitem><para>
|
|
if the program hangs, use <filename>kill -SEGV</filename>
|
|
from another console to force it to exit and dump core:
|
|
<programlisting>
|
|
kill -SEGV HANGED-PROCESS-ID
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
use <filename>gdb</filename> to examine the core file
|
|
and obtain a backtrace:
|
|
<programlisting>
|
|
gdb ./CRASHED-PROGRAM-FILE-NAME CORE-DUMP-FILE-NAME
|
|
(gdb) bt
|
|
(gdb) quit
|
|
</programlisting>
|
|
</para></listitem>
|
|
</orderedlist>
|
|
<para>
|
|
Note that HANGED-PROCESS-ID, CRASHED-PROGRAM-FILE-NAME and
|
|
CORE-DUMP-FILE-NAME must all be replaced with specific numbers
|
|
and file names. For example, a debugging session for a hung searchd
|
|
would look like:
|
|
<programlisting>
|
|
# kill -SEGV 12345
|
|
# ls *core*
|
|
core.12345
|
|
# gdb ./searchd core.12345
|
|
(gdb) bt
|
|
...
|
|
(gdb) quit
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
Note that <filename>ulimit</filename> is not server-wide
|
|
and only affects current shell session. This means that you will not
|
|
have to restore any server-wide limits - but if you relogin,
|
|
you will have to set <filename>ulimit</filename> again.
|
|
</para>
|
|
<para>
|
|
Core dumps should be placed in current working directory
|
|
(and Sphinx programs do not change it), so this is where you
|
|
would look for them.
|
|
</para>
|
|
<para>
|
|
Please do not immediately remove the core file because there could
|
|
be additional helpful information which could be retrieved from it.
|
|
You do not need to send me this file (as the debug info there is
|
|
closely tied to your system) but I might need to ask
|
|
you a few additional questions about it.
|
|
</para>
|
|
|
|
|
|
</chapter>
|
|
<chapter id="conf-reference"><title><filename>sphinx.conf</filename> options reference</title>
|
|
|
|
|
|
<sect1 id="confgroup-source"><title>Data source configuration options</title>
|
|
|
|
|
|
<sect2 id="conf-source-type"><title>type</title>
|
|
<para>
|
|
Data source type.
|
|
Mandatory, no default value.
|
|
Known types are <option>mysql</option>, <option>pgsql</option>, <option>mssql</option>,
|
|
<option>xmlpipe</option>, <option>xmlpipe2</option>, and <option>odbc</option>.
|
|
</para>
|
|
<para>
|
|
All other per-source options depend on source type selected by this option.
|
|
Names of the options used for SQL sources (ie. MySQL, PostgreSQL, MS SQL) start with "sql_";
|
|
names of the ones used for xmlpipe and xmlpipe2 start with "xmlpipe_".
|
|
All source types except <option>xmlpipe</option> are conditional; they might or might
|
|
not be supported depending on your build settings, installed client libraries, etc.
|
|
<option>mssql</option> type is currently only available on Windows.
|
|
<option>odbc</option> type is available both on Windows natively and on
|
|
Linux through <ulink url="http://www.unixodbc.org/">UnixODBC library</ulink>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
type = mysql
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-host"><title>sql_host</title>
|
|
<para>
|
|
SQL server host to connect to.
|
|
Mandatory, no default value.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
In the simplest case when Sphinx resides on the same host with your MySQL
|
|
or PostgreSQL installation, you would simply specify "localhost". Note that
|
|
MySQL client library chooses whether to connect over TCP/IP or over UNIX
|
|
socket based on the host name. Specifically "localhost" will force it
|
|
to use UNIX socket (this is the default and generally recommended mode)
|
|
and "127.0.0.1" will force TCP/IP usage. Refer to
|
|
<ulink url="http://dev.mysql.com/doc/refman/5.0/en/mysql-real-connect.html">MySQL manual</ulink>
|
|
for more details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_host = localhost
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-port"><title>sql_port</title>
|
|
<para>
|
|
SQL server IP port to connect to.
|
|
Optional, default is 3306 for <option>mysql</option> source type and 5432 for <option>pgsql</option> type.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Note that it depends on <link linkend="conf-sql-host">sql_host</link> setting whether this value will actually be used.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_port = 3306
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-user"><title>sql_user</title>
|
|
<para>
|
|
SQL user to use when connecting to <link linkend="conf-sql-host">sql_host</link>.
|
|
Mandatory, no default value.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_user = test
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-pass"><title>sql_pass</title>
|
|
<para>
|
|
SQL user password to use when connecting to <link linkend="conf-sql-host">sql_host</link>.
|
|
Mandatory, no default value.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_pass = mysecretpassword
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-db"><title>sql_db</title>
|
|
<para>
|
|
SQL database (in MySQL terms) to use after connecting, and to perform further queries within.
|
|
Mandatory, no default value.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_db = test
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-sock"><title>sql_sock</title>
|
|
<para>
|
|
UNIX socket name to connect to for local SQL servers.
|
|
Optional, default value is empty (use client library default settings).
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
On Linux, it would typically be <filename>/var/lib/mysql/mysql.sock</filename>.
|
|
On FreeBSD, it would typically be <filename>/tmp/mysql.sock</filename>.
|
|
Note that it depends on <link linkend="conf-sql-host">sql_host</link> setting whether this value will actually be used.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_sock = /tmp/mysql.sock
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mysql-connect-flags"><title>mysql_connect_flags</title>
|
|
<para>
|
|
MySQL client connection flags.
|
|
Optional, default value is 0 (do not set any flags).
|
|
Applies to <option>mysql</option> source type only.
|
|
</para>
|
|
<para>
|
|
This option must contain an integer value with the sum of the flags.
|
|
The value will be passed to <ulink url="http://dev.mysql.com/doc/refman/5.0/en/mysql-real-connect.html">mysql_real_connect()</ulink> verbatim.
|
|
The flags are enumerated in mysql_com.h include file.
|
|
Flags that are especially interesting in regard to indexing, with their respective values, are as follows:
|
|
<itemizedlist>
|
|
<listitem><para>CLIENT_COMPRESS = 32; can use compression protocol</para></listitem>
|
|
<listitem><para>CLIENT_SSL = 2048; switch to SSL after handshake</para></listitem>
|
|
<listitem><para>CLIENT_SECURE_CONNECTION = 32768; new 4.1 authentication</para></listitem>
|
|
</itemizedlist>
|
|
For instance, you can specify 2080 (2048+32) to use both compression and SSL,
|
|
or 32768 to use new authentication only. Initially, this option was introduced
|
|
to be able to use compression when the <filename>indexer</filename>
|
|
and <filename>mysqld</filename> are on different hosts. Compression on 1 Gbps
|
|
links is most likely to hurt indexing time though it reduces network traffic,
|
|
both in theory and in practice. However, enabling compression on 100 Mbps links
|
|
may improve indexing time significantly (up to 20-30% of the total indexing time
|
|
improvement was reported). Your mileage may vary.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mysql_connect_flags = 32 # enable compression
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mysql-ssl"><title>mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca</title>
|
|
<para>
|
|
SSL certificate settings to use for connecting to MySQL server.
|
|
Optional, default values are empty strings (do not use SSL).
|
|
Applies to <option>mysql</option> source type only.
|
|
</para>
|
|
<para>
|
|
These directives let you set up secure SSL connection between
|
|
<filename>indexer</filename> and MySQL. The details on creating
|
|
the certificates and setting up MySQL server can be found in
|
|
MySQL documentation.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mysql_ssl_cert = /etc/ssl/client-cert.pem
|
|
mysql_ssl_key = /etc/ssl/client-key.pem
|
|
mysql_ssl_ca = /etc/ssl/cacert.pem
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-odbc-dsn"><title>odbc_dsn</title>
|
|
<para>
|
|
ODBC DSN to connect to.
|
|
Mandatory, no default value.
|
|
Applies to <option>odbc</option> source type only.
|
|
</para>
|
|
<para>
|
|
ODBC DSN (Data Source Name) specifies the credentials (host, user, password, etc)
|
|
to use when connecting to ODBC data source. The format depends on specific ODBC
|
|
driver used.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
odbc_dsn = Driver={Oracle ODBC Driver};Dbq=myDBName;Uid=myUsername;Pwd=myPassword
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query-pre"><title>sql_query_pre</title>
|
|
<para>
|
|
Pre-fetch query, or pre-query.
|
|
Multi-value, optional, default is empty list of queries.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Multi-value means that you can specify several pre-queries.
|
|
They are executed before <link linkend="conf-sql-query">the main fetch query</link>,
|
|
and they will be executed exactly in the order of appearance in the configuration file.
|
|
Pre-query results are ignored.
|
|
</para>
|
|
<para>
|
|
Pre-queries are useful in a lot of ways. They are used to setup encoding,
|
|
mark records that are going to be indexed, update internal counters,
|
|
set various per-connection SQL server options and variables, and so on.
|
|
</para>
|
|
<para>
|
|
Perhaps the most frequent pre-query usage is to specify the encoding
|
|
that the server will use for the rows it returns. It <b>must</b> match
|
|
the encoding that Sphinx expects (as specified by <link linkend="conf-charset-type">charset_type</link>
|
|
and <link linkend="conf-charset-table">charset_table</link> options).
|
|
Two MySQL specific examples of setting the encoding are:
|
|
<programlisting>
|
|
sql_query_pre = SET CHARACTER_SET_RESULTS=cp1251
|
|
sql_query_pre = SET NAMES utf8
|
|
</programlisting>
|
|
Also specific to MySQL sources, it is useful to disable query cache
|
|
(for indexer connection only) in pre-query, because indexing queries
|
|
are not going to be re-run frequently anyway, and there's no sense
|
|
in caching their results. That could be achieved with:
|
|
<programlisting>
|
|
sql_query_pre = SET SESSION query_cache_type=OFF
|
|
</programlisting>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_pre = SET NAMES utf8
|
|
sql_query_pre = SET SESSION query_cache_type=OFF
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query"><title>sql_query</title>
|
|
<para>
|
|
Main document fetch query.
|
|
Mandatory, no default value.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
There can be only one main query.
|
|
This is the query which is used to retrieve documents from SQL server.
|
|
You can specify up to 32 full-text fields (formally, up to SPH_MAX_FIELDS from sphinx.h), and an arbitrary number of attributes.
|
|
All of the columns that are neither document ID (the first one) nor attributes will be full-text indexed.
|
|
</para>
|
|
<para>
|
|
Document ID <emphasis role="bold">MUST</emphasis> be the very first field,
|
|
and it <emphasis role="bold">MUST BE UNIQUE UNSIGNED POSITIVE (NON-ZERO, NON-NEGATIVE) INTEGER NUMBER</emphasis>.
|
|
It can be either 32-bit or 64-bit, depending on how you built Sphinx;
by default it builds with 32-bit ID support, but the <option>--enable-id64</option> option
to <filename>configure</filename> allows building with 64-bit document and word ID support.
|
|
<!-- TODO: add more on zero, negative, duplicate ID handling -->
|
|
</para>
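<para>
For instance, a sketch of building with 64-bit IDs using the
<option>--enable-id64</option> switch mentioned above:
</para>
<programlisting>
./configure --enable-id64
make
make install
</programlisting>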
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query = \
|
|
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, \
|
|
title, content \
|
|
FROM documents
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-joined-field"><title>sql_joined_field</title>
|
|
<para>
|
|
Joined/payload field fetch query.
|
|
Multi-value, optional, default is empty list of queries.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
<option>sql_joined_field</option> lets you use two different features:
|
|
joined fields, and payloads (payload fields). Its syntax is as follows:
|
|
<programlisting>
|
|
sql_joined_field = FIELD-NAME 'from' ( 'query' | 'payload-query' ); \
|
|
QUERY [ ; RANGE-QUERY ]
|
|
</programlisting>
|
|
where
|
|
<itemizedlist>
|
|
<listitem><para>FIELD-NAME is a joined/payload field name;</para></listitem>
|
|
<listitem><para>QUERY is an SQL query that must fetch values to index.</para></listitem>
|
|
<listitem><para>RANGE-QUERY is an optional SQL query that fetches a range
|
|
of values to index. (Added in version 2.0.1-beta.)</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
<b>Joined fields</b> let you avoid JOIN and/or GROUP_CONCAT statements in the main
|
|
document fetch query (sql_query). This can be useful when SQL-side JOIN is slow,
|
|
or needs to be offloaded to the Sphinx side, or simply to emulate MySQL-specific
GROUP_CONCAT functionality in case your database server does not support it.
|
|
</para>
|
|
<para>
|
|
The query must return exactly 2 columns: document ID, and text to append
|
|
to a joined field. Document IDs can be duplicate, but they <b>must</b> be
|
|
in ascending order. All the text rows fetched for a given ID will be
|
|
concatenated together, and the concatenation result will be indexed
|
|
as the entire contents of a joined field. Rows will be concatenated
|
|
in the order returned from the query, and separating whitespace
|
|
will be inserted between them. For instance, if joined field query
|
|
returns the following rows:
|
|
<programlisting>
|
|
( 1, 'red' )
|
|
( 1, 'right' )
|
|
( 1, 'hand' )
|
|
( 2, 'mysql' )
|
|
( 2, 'sphinx' )
|
|
</programlisting>
|
|
then the indexing results would be equivalent to that of adding
|
|
a new text field with a value of 'red right hand' to document 1 and
|
|
'mysql sphinx' to document 2.
|
|
</para>
|
|
<para>
|
|
Joined fields differ from regular text fields only in the way they are
indexed; there are no other differences between them.
|
|
</para>
|
|
<para>
|
|
Starting with 2.0.1-beta, <b>ranged queries</b> can be used when
|
|
a single query is not efficient enough or does not work because of
|
|
the database driver limitations. It works similar to the ranged
|
|
queries in the main indexing loop, see <xref linkend="ranged-queries"/>.
|
|
The range will be queried for and fetched upfront once,
|
|
then multiple queries with different <code>$start</code>
|
|
and <code>$end</code> substitutions will be run to fetch
|
|
the actual data.
|
|
</para>
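<para>
A sketch of a ranged joined field declaration, assuming the plain 'query'
source keyword accepts the optional trailing RANGE-QUERY per the syntax
shown above (illustrative table and column names):
</para>
<programlisting>
sql_joined_field = \
    tagstext from query; \
    SELECT docid, CONCAT('tag',tagid) FROM tags \
    WHERE docid>=$start AND docid<=$end ORDER BY docid ASC; \
    SELECT MIN(docid), MAX(docid) FROM tags
</programlisting>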
|
|
<para>
|
|
<b>Payloads</b> let you create a special field in which, instead of
|
|
keyword positions, so-called user payloads are stored. Payloads are
|
|
custom integer values attached to every keyword. They can then be used
|
|
in search time to affect the ranking.
|
|
</para>
|
|
<para>
|
|
The payload query must return exactly 3 columns: document ID; keyword;
|
|
and integer payload value. Document IDs can be duplicate, but they <b>must</b> be
|
|
in ascending order. Payloads must be unsigned integers within 24-bit range,
|
|
ie. from 0 to 16777215. For reference, payloads are currently internally
|
|
stored as in-field keyword positions, but that is not guaranteed
|
|
and might change in the future.
|
|
</para>
|
|
<para>
|
|
Currently, the only method to account for payloads is to use
|
|
SPH_RANK_PROXIMITY_BM25 ranker. On indexes with payload fields,
|
|
it will automatically switch to a variant that matches keywords
|
|
in those fields, computes a sum of matched payloads multiplied
|
|
by field weights, and adds that sum to the final rank.
|
|
</para>
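<para>
A sketch of a payload field declaration, per the 'payload-query' variant
of the syntax above (illustrative table and column names; the third
column carries the 24-bit integer payload):
</para>
<programlisting>
sql_joined_field = \
    tagpayload from payload-query; \
    SELECT docid, CONCAT('tag',tagid), tagweight FROM tags ORDER BY docid ASC
</programlisting>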
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_joined_field = \
|
|
tagstext from query; \
|
|
SELECT docid, CONCAT('tag',tagid) FROM tags ORDER BY docid ASC
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query-range"><title>sql_query_range</title>
|
|
<para>
|
|
Range query setup.
|
|
Optional, default is empty.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Setting this option enables ranged document fetch queries (see <xref linkend="ranged-queries"/>).
|
|
Ranged queries are useful to avoid notorious MyISAM table locks when indexing
|
|
lots of data. (They also help with other less notorious issues, such as reduced
|
|
performance caused by big result sets, or additional resources consumed by InnoDB
|
|
to serialize big read transactions.)
|
|
</para>
|
|
<para>
|
|
The query specified in this option must fetch min and max document IDs that will be
|
|
used as range boundaries. It must return exactly two integer fields, min ID first
|
|
and max ID second; the field names are ignored.
|
|
</para>
|
|
<para>
|
|
When ranged queries are enabled, <link linkend="conf-sql-query">sql_query</link>
|
|
will be required to contain <option>$start</option> and <option>$end</option> macros
|
|
(because it obviously would be a mistake to index the whole table many times over).
|
|
Note that the intervals specified by <option>$start</option>..<option>$end</option>
|
|
will not overlap, so you should <b>not</b> remove document IDs that are
|
|
exactly equal to <option>$start</option> or <option>$end</option> from your query.
|
|
The example in <xref linkend="ranged-queries"/> illustrates that; note how it
|
|
uses greater-or-equal and less-or-equal comparisons.
|
|
</para>
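<para>
For instance, a sketch of a complete ranged setup, where the main query
carries the required macros with greater-or-equal and less-or-equal
comparisons (illustrative table and column names):
</para>
<programlisting>
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_query = \
    SELECT id, title, content FROM documents \
    WHERE id>=$start AND id<=$end
</programlisting>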
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-range-step"><title>sql_range_step</title>
|
|
<para>
|
|
Range query step.
|
|
Optional, default is 1024.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Only used when <link linkend="ranged-queries">ranged queries</link> are enabled.
|
|
The full document IDs interval fetched by <link linkend="conf-sql-query-range">sql_query_range</link>
|
|
will be walked in steps of this size. For example, if min and max IDs fetched
|
|
are 12 and 3456 respectively, and the step is 1000, indexer will call
|
|
<link linkend="conf-sql-query">sql_query</link> several times with the
|
|
following substitutions:
|
|
<itemizedlist>
|
|
<listitem><para>$start=12, $end=1011</para></listitem>
|
|
<listitem><para>$start=1012, $end=2011</para></listitem>
|
|
<listitem><para>$start=2012, $end=3011</para></listitem>
|
|
<listitem><para>$start=3012, $end=3456</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_range_step = 1000
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query-killlist"><title>sql_query_killlist</title>
|
|
<para>
|
|
Kill-list query.
|
|
Optional, default is empty (no query).
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This query is expected to return a number of 1-column rows, each containing
|
|
just the document ID. The returned document IDs are stored within an index.
|
|
Kill-list for a given index suppresses results from <emphasis>other</emphasis>
|
|
indexes, depending on index order in the query. The intended use is to help
|
|
implement deletions and updates on existing indexes without rebuilding
(or actually even touching) them, and especially to fight the phantom
results problem.
|
|
</para>
|
|
<para>
|
|
Let us dissect an example. Assume we have two indexes, 'main' and 'delta'.
|
|
Assume that documents 2, 3, and 5 were deleted since last reindex of 'main',
|
|
and documents 7 and 11 were updated (ie. their text contents were changed).
|
|
Assume that a keyword 'test' occurred in all these mentioned documents
|
|
when we were indexing 'main'; still occurs in document 7 as we index 'delta';
|
|
but does not occur in document 11 any more. We now reindex delta and then
|
|
search through both these indexes in proper (least to most recent) order:
|
|
<programlisting>
|
|
$res = $cl->Query ( "test", "main delta" );
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
First, we need to properly handle deletions. The result set should not
|
|
contain documents 2, 3, or 5. Second, we also need to avoid phantom results.
|
|
Unless we do something about it, document 11 <emphasis>will</emphasis>
|
|
appear in search results! It will be found in 'main' (but not 'delta').
|
|
And it will make it to the final result set unless something stops it.
|
|
</para>
|
|
<para>
|
|
Kill-list, or K-list for short, is that something. Kill-list attached
|
|
to 'delta' will suppress the specified rows from <b>all</b> the preceding
|
|
indexes, in this case just 'main'. So to get the expected results,
|
|
we should put all the updated <emphasis>and</emphasis> deleted
|
|
document IDs into it.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_killlist = \
|
|
SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
|
|
SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-uint"><title>sql_attr_uint</title>
|
|
<para>
|
|
Unsigned integer <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
The column value should fit into 32-bit unsigned integer range.
|
|
Values outside this range will be accepted but wrapped around.
|
|
For instance, -1 will be wrapped around to 2^32-1 or 4,294,967,295.
|
|
</para>
|
|
<para>
|
|
You can specify bit count for integer attributes by appending
|
|
':BITCOUNT' to attribute name (see example below). Attributes with
|
|
less than default 32-bit size, or bitfields, perform slower.
|
|
But they require less RAM when using <link linkend="conf-docinfo">extern storage</link>:
|
|
such bitfields are packed together in 32-bit chunks in <filename>.spa</filename>
|
|
attribute data file. Bit size settings are ignored if using
|
|
<link linkend="conf-docinfo">inline storage</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_uint = group_id
|
|
sql_attr_uint = forum_id:9 # 9 bits for forum_id
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-bool"><title>sql_attr_bool</title>
|
|
<para>
|
|
Boolean <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Equivalent to <link linkend="conf-sql-attr-uint">sql_attr_uint</link> declaration with a bit count of 1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_bool = is_deleted # will be packed to 1 bit
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-bigint"><title>sql_attr_bigint</title>
|
|
<para>
|
|
64-bit signed integer <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Note that unlike <link linkend="conf-sql-attr-uint">sql_attr_uint</link>,
|
|
these values are <b>signed</b>.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_bigint = my_bigint_id
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-timestamp"><title>sql_attr_timestamp</title>
|
|
<para>
|
|
UNIX timestamp <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Timestamps can store date and time in the range of Jan 01, 1970
|
|
to Jan 19, 2038 with a precision of one second.
|
|
The expected column value should be a timestamp in UNIX format, ie. 32-bit unsigned
|
|
integer number of seconds elapsed since midnight, January 01, 1970, GMT.
|
|
Timestamps are internally stored and handled as integers everywhere.
|
|
But in addition to working with timestamps as integers, it's also legal
|
|
to use them along with different date-based functions, such as time segments
|
|
sorting mode, or day/week/month/year extraction for GROUP BY.
|
|
</para>
|
|
<para>
|
|
Note that DATE or DATETIME column types in MySQL can <b>not</b> be directly
|
|
used as timestamp attributes in Sphinx; you need to explicitly convert such
|
|
columns using UNIX_TIMESTAMP function (if data is in range).
|
|
</para>
|
|
<para>
|
|
Note that timestamps can not represent dates before January 01, 1970,
and UNIX_TIMESTAMP() in MySQL will not return anything expected for such dates.
If you only need to work with dates, not times, consider the TO_DAYS()
function in MySQL instead.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_timestamp = UNIX_TIMESTAMP(added_datetime) AS added_ts
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-str2ordinal"><title>sql_attr_str2ordinal</title>
|
|
<para>
|
|
Ordinal string number <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
This attribute type (so-called ordinal, for brevity) is intended
|
|
to allow sorting by string values, but without storing the strings
|
|
themselves. When indexing ordinals, string values are fetched from
|
|
database, temporarily stored, sorted, and then replaced by their
|
|
respective ordinal numbers in the array of sorted strings.
|
|
So, the ordinal number is an integer such that sorting by it
produces the same result as if lexicographically sorting by the original strings.
|
|
</para>
|
|
<para>
|
|
Earlier versions could consume a lot of RAM for indexing ordinals.
|
|
Starting with revision r1112, ordinal accumulation and sorting
also run in fixed memory (at the cost of using additional temporary
disk space), and honor
|
|
<link linkend="conf-mem-limit">mem_limit</link> settings.
|
|
</para>
|
|
<para>
|
|
Ideally the strings should be sorted differently, depending
|
|
on the encoding and locale. For instance, if the strings are known
|
|
to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1,
|
|
and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R
|
|
value 0xE0 encodes a character that is (noticeably) after
|
|
characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx
|
|
does not support that at the moment and will simply sort
|
|
the strings bytewise.
|
|
</para>
|
|
<para>
|
|
Note that the ordinals are by construction local to each index,
|
|
and it's therefore impossible to merge ordinals while retaining
|
|
the proper order. The processed strings are replaced by their
|
|
sequential number in the index they occurred in, but different
|
|
indexes have different sets of strings. For instance, if 'main' index
|
|
contains strings "aaa", "bbb", "ccc", and so on up to "zzz",
|
|
they'll be assigned numbers 1, 2, 3, and so on up to 26,
|
|
respectively. But then if 'delta' only contains "zzz" the assigned
|
|
number will be 1. And after the merge, the order will be broken.
|
|
Unfortunately, this is impossible to work around without storing
|
|
the original strings (and once Sphinx supports storing the
|
|
original strings, ordinals will not be necessary any more).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_str2ordinal = author_name
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-float"><title>sql_attr_float</title>
|
|
<para>
|
|
Floating point <link linkend="attributes">attribute</link> declaration.
|
|
Multi-value (there might be multiple attributes declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
The values will be stored in single precision, 32-bit IEEE 754 format.
|
|
Represented range is approximately from 1e-38 to 1e+38. The amount
|
|
of decimal digits that can be stored precisely is approximately 7.
|
|
One important usage of the float attributes is storing latitude
|
|
and longitude values (in radians), for further usage in query-time
|
|
geosphere distance calculations.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_float = lat_radians
|
|
sql_attr_float = long_radians
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-multi"><title>sql_attr_multi</title>
|
|
<para>
|
|
<link linkend="mva">Multi-valued attribute</link> (MVA) declaration.
|
|
Multi-value (ie. there may be more than one such attribute declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Plain attributes only allow attaching 1 value per document.
|
|
However, there are cases (such as tags or categories) when it is
|
|
desired to attach multiple values of the same attribute and be able
|
|
to apply filtering or grouping to value lists.
|
|
</para>
|
|
<para>
|
|
The declaration format is as follows (backslashes are for clarity only;
|
|
everything can be declared in a single line as well):
|
|
<programlisting>
|
|
sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
|
|
[;QUERY] \
|
|
[;RANGE-QUERY]
|
|
</programlisting>
|
|
where
|
|
<itemizedlist>
|
|
<listitem><para>ATTR-TYPE is 'uint' or 'timestamp'</para></listitem>
|
|
<listitem><para>SOURCE-TYPE is 'field', 'query', or 'ranged-query'</para></listitem>
|
|
<listitem><para>QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs</para></listitem>
|
|
<listitem><para>RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
|
|
sql_attr_multi = uint tag from ranged-query; \
|
|
SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
|
|
SELECT MIN(id), MAX(id) FROM tags
|
|
</programlisting>
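<para>
For reference, the 'field' source type takes no queries at all. As an
illustrative sketch (the column naming convention here is an assumption),
the values are taken from a column fetched by the main query, with any
non-numeric characters acting as separators:
<programlisting>
sql_query      = SELECT id, title, body, tags FROM documents
sql_attr_multi = uint tags from field
</programlisting>
</para>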
</sect2>

<sect2 id="conf-sql-attr-string"><title>sql_attr_string</title>
|
|
<para>
|
|
String attribute declaration.
|
|
Multi-value (ie. there may be more than one such attribute declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
String attributes can store arbitrary strings attached to every document.
|
|
There's a fixed size limit of 4 MB per value. Also, <filename>searchd</filename>
|
|
will currently cache all the values in RAM, which is an additional implicit limit.
|
|
</para>
|
|
<para>
|
|
As of 1.10-beta, strings can only be used for storage and retrieval.
|
|
They can not participate in expressions, be used for filtering, sorting,
|
|
or grouping (ie. in WHERE, ORDER or GROUP clauses). Note that attributes
|
|
declared using <option>sql_attr_string</option> will <b>not</b> be full-text
|
|
indexed; you can use <link linkend="conf-sql-field-string">sql_field_string</link>
|
|
directive for that.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_string = title # will be stored but will not be indexed
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-attr-str2wordcount"><title>sql_attr_str2wordcount</title>
|
|
<para>
|
|
Word-count attribute declaration.
|
|
Multi-value (ie. there may be more than one such attribute declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Word-count attribute takes a string column, tokenizes it according
|
|
to index settings, and stores the resulting number of tokens in an attribute.
|
|
This number of tokens ("word count") is a normal integer that can be later
|
|
used, for instance, in custom ranking expressions (boost shorter titles,
|
|
help identify exact field matches, etc).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_attr_str2wordcount = title_wc
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-column-buffers"><title>sql_column_buffers</title>
|
|
<para>
|
|
Per-column buffer sizes.
|
|
Optional, default is empty (deduce the sizes automatically).
|
|
Applies to <option>odbc</option>, <option>mssql</option> source types only.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
ODBC and MS SQL drivers sometimes can not return the maximum
|
|
actual column size to be expected. For instance, NVARCHAR(MAX) columns
|
|
always report their length as 2147483647 bytes to
|
|
<filename>indexer</filename> even though the actually used length
|
|
is likely considerably less. However, the receiving buffers still
|
|
need to be allocated upfront, and their sizes have to be determined.
|
|
When the driver does not report the column length at all, Sphinx
|
|
allocates default 1 KB buffers for each non-char column, and 1 MB
|
|
buffers for each char column. Driver-reported column length
|
|
also gets clamped by an upper limie of 8 MB, so in case the
|
|
driver reports (almost) a 2 GB column length, it will be clamped
|
|
and a 8 MB buffer will be allocated instead for that column.
|
|
These hard-coded limits can be overridden using the
|
|
<code>sql_column_buffers</code> directive, either in order
|
|
to save memory on actually shorter columns, or overcome
|
|
the 8 MB limit on actually longer columns. The directive values
|
|
must be a comma-separated lists of selected column names and sizes:
|
|
<programlisting>
|
|
sql_column_buffers = <colname>=<size>[K|M] [, ...]
|
|
</programlisting>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query = SELECT id, mytitle, mycontent FROM documents
|
|
sql_column_buffers = mytitle=64K, mycontent=10M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
|
|
<sect2 id="conf-sql-field-string"><title>sql_field_string</title>
|
|
<para>
|
|
Combined string attribute and full-text field declaration.
|
|
Multi-value (ie. there may be more than one such attribute declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
<link linkend="conf-sql-attr-string">sql_attr_string</link> only stores the column
|
|
value but does not full-text index it. In some cases it might be desired to both full-text
|
|
index the column and store it as attribute. <option>sql_field_string</option> lets you do
|
|
exactly that. Both the field and the attribute will be named the same.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_field_string = title # will be both indexed and stored
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-field-str2wordcount"><title>sql_field_str2wordcount</title>
|
|
<para>
|
|
Combined word-count attribute and full-text field declaration.
|
|
Multi-value (ie. there may be more than one such attribute declared), optional.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para><link linkend="conf-sql-attr-str2wordcount">sql_attr_str2wordcount</link> only stores the column
|
|
word count but does not full-text index it. In some cases it might be desired to both full-text
|
|
index the column and also have the count. <option>sql_field_str2wordcount</option> lets you do
|
|
exactly that. Both the field and the attribute will be named the same.</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_field_str2wordcount = title # will be indexed, and counted/stored
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-file-field"><title>sql_file_field</title>
|
|
<para>
|
|
File based field declaration.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
This directive makes <filename>indexer</filename> interpret field contents
|
|
as a file name, and load and index the referred file. Files larger than
|
|
<link linkend="conf-max-file-field-buffer">max_file_field_buffer</link>
|
|
in size are skipped. Any errors during the file loading (IO errors, missed
|
|
limits, etc) will be reported as indexing warnings and will <b>not</b> early
|
|
terminate the indexing. No content will be indexed for such files.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_file_field = my_file_path # load and index files referred to by my_file_path
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query-post"><title>sql_query_post</title>
|
|
<para>
|
|
Post-fetch query.
|
|
Optional, default value is empty.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
This query is executed immediately after <link linkend="conf-sql-query">sql_query</link>
|
|
completes successfully. When post-fetch query produces errors,
|
|
they are reported as warnings, but indexing is <b>not</b> terminated.
|
|
It's result set is ignored. Note that indexing is <b>not</b> yet completed
|
|
at the point when this query gets executed, and further indexing still may fail.
|
|
Therefore, any permanent updates should not be done from here.
|
|
For instance, updates on helper table that permanently change
|
|
the last successfully indexed ID should not be run from post-fetch
|
|
query; they should be run from <link linkend="conf-sql-query-post-index">post-index query</link> instead.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_post = DROP TABLE my_tmp_table
|
|
</programlisting>
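<para>
A typical pairing is a pre-query that builds a helper table which the
post-fetch query then cleans up. This is an illustrative sketch only
(the table and column names are hypothetical):
<programlisting>
sql_query_pre  = CREATE TEMPORARY TABLE my_tmp_table AS \
	SELECT id, title, body FROM documents WHERE published=1
sql_query      = SELECT id, title, body FROM my_tmp_table
sql_query_post = DROP TABLE my_tmp_table
</programlisting>
</para>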
</sect2>

<sect2 id="conf-sql-query-post-index"><title>sql_query_post_index</title>
|
|
<para>
|
|
Post-index query.
|
|
Optional, default value is empty.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
This query is executed when indexing is fully and succesfully completed.
|
|
If this query produces errors, they are reported as warnings,
|
|
but indexing is <b>not</b> terminated. It's result set is ignored.
|
|
<code>$maxid</code> macro can be used in its text; it will be
|
|
expanded to maximum document ID which was actually fetched
|
|
from the database during indexing. If no documents were indexed,
|
|
$maxid will be expanded to 0.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_post_index = REPLACE INTO counters ( id, val ) \
|
|
VALUES ( 'max_indexed_id', $maxid )
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-ranged-throttle"><title>sql_ranged_throttle</title>
|
|
<para>
|
|
Ranged query throttling period, in milliseconds.
|
|
Optional, default is 0 (no throttling).
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
</para>
|
|
<para>
|
|
Throttling can be useful when indexer imposes too much load on the
|
|
database server. It causes the indexer to sleep for given amount of
|
|
milliseconds once per each ranged query step. This sleep is unconditional,
|
|
and is performed before the fetch query.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_ranged_throttle = 1000 # sleep for 1 sec before each query step
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-sql-query-info"><title>sql_query_info</title>
|
|
<para>
|
|
Document info query.
|
|
Optional, default is empty.
|
|
Applies to <option>mysql</option> source type only.
|
|
</para>
|
|
<para>
|
|
Only used by CLI search to fetch and display document information,
|
|
only works with MySQL at the moment, and only intended for debugging purposes.
|
|
This query fetches the row that will be displayed by CLI search utility
|
|
for each document ID. It is required to contain <code>$id</code> macro
|
|
that expands to the queried document ID.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
sql_query_info = SELECT * FROM documents WHERE id=$id
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-command"><title>xmlpipe_command</title>
|
|
<para>
|
|
Shell command that invokes xmlpipe stream producer.
|
|
Mandatory.
|
|
Applies to <option>xmlpipe</option> and <option>xmlpipe2</option> source types only.
|
|
</para>
|
|
<para>
|
|
Specifies a command that will be executed and which output
|
|
will be parsed for documents. Refer to <xref linkend="xmlpipe"/>
|
|
or <xref linkend="xmlpipe2"/> for specific format description.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_command = cat /home/sphinx/test.xml
|
|
</programlisting>
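<para>
For reference, a minimal xmlpipe2 stream that such a command could emit
looks as follows (field and attribute names are illustrative; see
<xref linkend="xmlpipe2"/> for the complete format):
<programlisting>
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
	<sphinx:field name="subject"/>
	<sphinx:field name="content"/>
	<sphinx:attr name="published" type="timestamp"/>
</sphinx:schema>
<sphinx:document id="1">
	<subject>test subject</subject>
	<content>this is the main text to index</content>
	<published>1234567890</published>
</sphinx:document>
</sphinx:docset>
</programlisting>
</para>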
</sect2>

<sect2 id="conf-xmlpipe-field"><title>xmlpipe_field</title>
|
|
<para>
|
|
xmlpipe field declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only. Refer to <xref linkend="xmlpipe2"/>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_field = subject
|
|
xmlpipe_field = content
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-field-string"><title>xmlpipe_field_string</title>
|
|
<para>
|
|
xmlpipe field and string attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only. Refer to <xref linkend="xmlpipe2"/>.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Makes the specified XML element indexed as both a full-text field and a string attribute.
|
|
Equivalent to <![CDATA[<sphinx:field name="field" attr="string"/>]]> declaration within the XML file.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_field_string = subject
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-field-wordcount"><title>xmlpipe_field_wordcount</title>
|
|
<para>
|
|
xmlpipe field and word count attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only. Refer to <xref linkend="xmlpipe2"/>.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Makes the specified XML element indexed as both a full-text field and a word count attribute.
|
|
Equivalent to <![CDATA[<sphinx:field name="field" attr="wordcount"/>]]> declaration within the XML file.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_field_wordcount = subject
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-uint"><title>xmlpipe_attr_uint</title>
|
|
<para>
|
|
xmlpipe integer attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Syntax fully matches that of <link linkend="conf-sql-attr-uint">sql_attr_uint</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_uint = author
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-bool"><title>xmlpipe_attr_bool</title>
|
|
<para>
|
|
xmlpipe boolean attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Syntax fully matches that of <link linkend="conf-sql-attr-bool">sql_attr_bool</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_bool = is_deleted # will be packed to 1 bit
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-timestamp"><title>xmlpipe_attr_timestamp</title>
|
|
<para>
|
|
xmlpipe UNIX timestamp attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Syntax fully matches that of <link linkend="conf-sql-attr-timestamp">sql_attr_timestamp</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_timestamp = published
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-str2ordinal"><title>xmlpipe_attr_str2ordinal</title>
|
|
<para>
|
|
xmlpipe string ordinal attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Syntax fully matches that of <link linkend="conf-sql-attr-str2ordinal">sql_attr_str2ordinal</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_str2ordinal = author_sort
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-float"><title>xmlpipe_attr_float</title>
|
|
<para>
|
|
xmlpipe floating point attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Syntax fully matches that of <link linkend="conf-sql-attr-float">sql_attr_float</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_float = lat_radians
|
|
xmlpipe_attr_float = long_radians
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-attr-multi"><title>xmlpipe_attr_multi</title>
|
|
<para>
|
|
xmlpipe MVA attribute declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
</para>
|
|
<para>
|
|
This setting declares an MVA attribute tag in xmlpipe2 stream.
|
|
The contents of the specified tag will be parsed and a list of integers
|
|
that will constitute the MVA will be extracted, similar to how
|
|
<link linkend="conf-sql-attr-multi">sql_attr_multi</link> parses
|
|
SQL column contents when 'field' MVA source type is specified.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_multi = taglist
|
|
</programlisting>
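<para>
In the incoming stream, the matching document element then simply carries
the values as text, with non-numeric characters acting as separators.
A sketch (values are illustrative):
<programlisting>
<taglist>15, 23, 42</taglist>
</programlisting>
</para>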
</sect2>

<sect2 id="conf-xmlpipe-attr-string"><title>xmlpipe_attr_string</title>
|
|
<para>
|
|
xmlpipe string declaration.
|
|
Multi-value, optional.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
This setting declares a string attribute tag in xmlpipe2 stream.
|
|
The contents of the specified tag will be parsed and stored as a string value.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_attr_string = subject
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-xmlpipe-fixup-utf8"><title>xmlpipe_fixup_utf8</title>
|
|
<para>
|
|
Perform Sphinx-side UTF-8 validation and filtering to prevent XML parser from choking on non-UTF-8 documents.
|
|
Optional, default is 0.
|
|
Applies to <option>xmlpipe2</option> source type only.
|
|
</para>
|
|
<para>
|
|
Under certain occasions it might be hard or even impossible to guarantee
|
|
that the incoming XMLpipe2 document bodies are in perfectly valid and
|
|
conforming UTF-8 encoding. For instance, documents with national
|
|
single-byte encodings could sneak into the stream. libexpat XML parser
|
|
is fragile, meaning that it will stop processing in such cases.
|
|
UTF8 fixup feature lets you avoid that. When fixup is enabled,
|
|
Sphinx will preprocess the incoming stream before passing it to the
|
|
XML parser and replace invalid UTF-8 sequences with spaces.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
xmlpipe_fixup_utf8 = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mssql-winauth"><title>mssql_winauth</title>
|
|
<para>
|
|
MS SQL Windows authentication flag.
|
|
Boolean, optional, default value is 0 (false).
|
|
Applies to <option>mssql</option> source type only.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
Whether to use currently logged in Windows account credentials for
|
|
authentication when connecting to MS SQL Server. Note that when running
|
|
<filename>searchd</filename> as a service, account user can differ
|
|
from the account you used to install the service.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mssql_winauth = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mssql-unicode"><title>mssql_unicode</title>
|
|
<para>
|
|
MS SQL encoding type flag.
|
|
Boolean, optional, default value is 0 (false).
|
|
Applies to <option>mssql</option> source type only.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
Whether to ask for Unicode or single-byte data when querying MS SQL Server.
|
|
This flag <b>must</b> be in sync with <link linkend="conf-charset-type">charset_type</link> directive;
|
|
that is, to index Unicode data, you must set both <option>charset_type</option> in the index
|
|
(to 'utf-8') and <option>mssql_unicode</option> in the source (to 1).
|
|
For reference, MS SQL will actually return data in UCS-2 encoding instead of UTF-8,
|
|
but Sphinx will automatically handle that.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mssql_unicode = 1
|
|
</programlisting>
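<para>
To illustrate the synchronization requirement, the two directives would be
paired across the source and index sections like this (the source and index
names are illustrative):
<programlisting>
source mssql_src
{
	type          = mssql
	mssql_unicode = 1
	...
}

index mssql_idx
{
	source       = mssql_src
	charset_type = utf-8
	...
}
</programlisting>
</para>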
</sect2>

<sect2 id="conf-unpack-zlib"><title>unpack_zlib</title>
|
|
<para>
|
|
Columns to unpack using zlib (aka deflate, aka gunzip).
|
|
Multi-value, optional, default value is empty list of columns.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
Columns specified using this directive will be unpacked by <filename>indexer</filename>
|
|
using standard zlib algorithm (called deflate and also implemented by <filename>gunzip</filename>).
|
|
When indexing on a different box than the database, this lets you offload the database, and save on network traffic.
|
|
The feature is only available if zlib and zlib-devel were both available during build time.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
unpack_zlib = col1
|
|
unpack_zlib = col2
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-unpack-mysqlcompress"><title>unpack_mysqlcompress</title>
|
|
<para>
|
|
Columns to unpack using MySQL UNCOMPRESS() algorithm.
|
|
Multi-value, optional, default value is empty list of columns.
|
|
Applies to SQL source types (<option>mysql</option>, <option>pgsql</option>, <option>mssql</option>) only.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
Columns specified using this directive will be unpacked by <filename>indexer</filename>
|
|
using modified zlib algorithm used by MySQL COMPRESS() and UNCOMPRESS() functions.
|
|
When indexing on a different box than the database, this lets you offload the database, and save on network traffic.
|
|
The feature is only available if zlib and zlib-devel were both available during build time.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
unpack_mysqlcompress = body_compressed
|
|
unpack_mysqlcompress = description_compressed
|
|
</programlisting>
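<para>
For example, if document bodies are stored compressed with MySQL's COMPRESS(),
the fetch query selects the compressed columns as-is and <filename>indexer</filename>
unpacks them (an illustrative sketch; the table and column names are hypothetical):
<programlisting>
sql_query            = SELECT id, title, body_compressed FROM documents
unpack_mysqlcompress = body_compressed
</programlisting>
</para>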
</sect2>

<sect2 id="conf-unpack-mysqlcompress-maxsize"><title>unpack_mysqlcompress_maxsize</title>
|
|
<para>
|
|
Buffer size for UNCOMPRESS()ed data.
|
|
Optional, default value is 16M.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
When using <link linkend="conf-unpack-mysqlcompress">unpack_mysqlcompress</link>,
|
|
due to implementation intrincacies it is not possible to deduce the required buffer size
|
|
from the compressed data. So the buffer must be preallocated in advance, and unpacked
|
|
data can not go over the buffer size. This option lets you control the buffer size,
|
|
both to limit <filename>indexer</filename> memory use, and to enable unpacking
|
|
of really long data fields if necessary.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
unpack_mysqlcompress_maxsize = 1M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
</sect1>

<sect1 id="confgroup-index"><title>Index configuration options</title>

<sect2 id="conf-index-type"><title>type</title>
|
|
<para>
|
|
Index type.
|
|
Known values are 'plain', 'distributed', and 'rt'.
|
|
Optional, default is 'plain' (plain local index).
|
|
</para>
|
|
<para>
|
|
Sphinx supports several different types of indexes.
|
|
Versions 0.9.x supported two index types: plain local indexes
|
|
that are stored and processed on the local machine; and distributed indexes,
|
|
that involve not only local searching but querying remote <filename>searchd</filename>
|
|
instances over the network as well (see <xref linkend="distributed"/>).
|
|
Version 1.10-beta also adds support
|
|
for so-called real-time indexes (or RT indexes for short) that
|
|
are also stored and processed locally, but additionally allow
|
|
for on-the-fly updates of the full-text index (see <xref linkend="rt-indexes"/>).
|
|
Note that <emphasis>attributes</emphasis> can be updated on-the-fly using
|
|
either plain local indexes or RT ones.
|
|
</para>
|
|
<para>
|
|
Index type setting lets you choose the needed type.
|
|
By default, plain local index type will be assumed.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
type = distributed
|
|
</programlisting>
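<para>
A distributed index additionally needs 'local' and/or 'agent' directives
to enumerate its parts; a minimal sketch (the host, port, and index names
are illustrative):
<programlisting>
index dist1
{
	type  = distributed
	local = chunk1
	agent = box2:9312:chunk2
}
</programlisting>
</para>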
</sect2>

<sect2 id="conf-source"><title>source</title>
|
|
<para>
|
|
Adds document source to local index.
|
|
Multi-value, mandatory.
|
|
</para>
|
|
<para>
|
|
Specifies document source to get documents from when the current
|
|
index is indexed. There must be at least one source. There may be multiple
|
|
sources, without any restrictions on the source types: ie. you can pull
|
|
part of the data from MySQL server, part from PostgreSQL, part from
|
|
the filesystem using xmlpipe2 wrapper.
|
|
</para>
|
|
<para>
|
|
However, there are some restrictions on the source data. First,
|
|
document IDs must be globally unique across all sources. If that
|
|
condition is not met, you might get unexpected search results.
|
|
Second, source schemas must be the same in order to be stored
|
|
within the same index.
|
|
</para>
|
|
<para>
|
|
No source ID is stored automatically. Therefore, in order to be able
|
|
to tell what source the matched document came from, you will need to
|
|
store some additional information yourself. Two typical approaches
|
|
include:
|
|
<orderedlist>
|
|
<listitem><para>mangling document ID and encoding source ID in it:
|
|
<programlisting>
|
|
source src1
|
|
{
|
|
sql_query = SELECT id*10+1, ... FROM table1
|
|
...
|
|
}
|
|
|
|
source src2
|
|
{
|
|
sql_query = SELECT id*10+2, ... FROM table2
|
|
...
|
|
}
|
|
</programlisting>
|
|
</para></listitem>
|
|
<listitem><para>
|
|
storing source ID simply as an attribute:
|
|
<programlisting>
|
|
source src1
|
|
{
|
|
sql_query = SELECT id, 1 AS source_id FROM table1
|
|
sql_attr_uint = source_id
|
|
...
|
|
}
|
|
|
|
source src2
|
|
{
|
|
sql_query = SELECT id, 2 AS source_id FROM table2
|
|
sql_attr_uint = source_id
|
|
...
|
|
}
|
|
</programlisting>
|
|
</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
source = srcpart1
|
|
source = srcpart2
|
|
source = srcpart3
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-path"><title>path</title>
|
|
<para>
|
|
Index files path and file name (without extension).
|
|
Mandatory.
|
|
</para>
|
|
<para>
|
|
Path specifies both directory and file name, but without extension.
|
|
<filename>indexer</filename> will append different extensions
|
|
to this path when generating final names for both permanent and
|
|
temporary index files. Permanent data files have several different
|
|
extensions starting with '.sp'; temporary files' extensions
|
|
start with '.tmp'. It's safe to remove <filename>.tmp*</filename>
|
|
files is if indexer fails to remove them automatically.
|
|
</para>
|
|
<para>
|
|
For reference, different index files store the following data:
|
|
<itemizedlist>
|
|
<listitem><para><filename>.spa</filename> stores document attributes (used in <link linkend="conf-docinfo">extern docinfo</link> storage mode only);</para></listitem>
|
|
<listitem><para><filename>.spd</filename> stores matching document ID lists for each word ID;</para></listitem>
|
|
<listitem><para><filename>.sph</filename> stores index header information;</para></listitem>
|
|
<listitem><para><filename>.spi</filename> stores word lists (word IDs and pointers to <filename>.spd</filename> file);</para></listitem>
|
|
<listitem><para><filename>.spk</filename> stores kill-lists;</para></listitem>
|
|
<listitem><para><filename>.spm</filename> stores MVA data;</para></listitem>
|
|
<listitem><para><filename>.spp</filename> stores hit (aka posting, aka word occurence) lists for each word ID;</para></listitem>
|
|
<listitem><para><filename>.sps</filename> stores string attribute data.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
path = /var/data/test1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-docinfo"><title>docinfo</title>
|
|
<para>
|
|
Document attribute values (docinfo) storage mode.
|
|
Optional, default is 'extern'.
|
|
Known values are 'none', 'extern' and 'inline'.
|
|
</para>
|
|
<para>
|
|
Docinfo storage mode defines how exactly docinfo will be
|
|
physically stored on disk and RAM. "none" means that there will be
|
|
no docinfo at all (ie. no attributes). Normally you need not to set
|
|
"none" explicitly because Sphinx will automatically select "none"
|
|
when there are no attributes configured. "inline" means that the
|
|
docinfo will be stored in the <filename>.spd</filename> file,
|
|
along with the document ID lists. "extern" means that the docinfo
|
|
will be stored separately (externally) from document ID lists,
|
|
in a special <filename>.spa</filename> file.
|
|
</para>
|
|
<para>
|
|
Basically, externally stored docinfo must be kept in RAM when querying.
|
|
for performance reasons. So in some cases "inline" might be the only option.
|
|
However, such cases are infrequent, and docinfo defaults to "extern".
|
|
Refer to <xref linkend="attributes"/> for in-depth discussion
|
|
and RAM usage estimates.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
docinfo = inline
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mlock"><title>mlock</title>
|
|
<para>
|
|
Memory locking for cached data.
|
|
Optional, default is 0 (do not call mlock()).
|
|
</para>
|
|
<para>
|
|
For search performance, <filename>searchd</filename> preloads
|
|
a copy of <filename>.spa</filename> and <filename>.spi</filename>
|
|
files in RAM, and keeps that copy in RAM at all times. But if there
|
|
are no searches on the index for some time, there are no accesses
|
|
to that cached copy, and OS might decide to swap it out to disk.
|
|
First queries to such "cooled down" index will cause swap-in
|
|
and their latency will suffer.
|
|
</para>
|
|
<para>
|
|
Setting mlock option to 1 makes Sphinx lock physical RAM used
|
|
for that cached data using mlock(2) system call, and that prevents
|
|
swapping (see man 2 mlock for details). mlock(2) is a privileged call,
|
|
so it will require <filename>searchd</filename> to be either run
|
|
from root account, or be granted enough privileges otherwise.
|
|
If mlock() fails, a warning is emitted, but index continues
|
|
working.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mlock = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-morphology"><title>morphology</title>
|
|
<para>
|
|
A list of morphology preprocessors to apply.
|
|
Optional, default is empty (do not apply any preprocessor).
|
|
</para>
|
|
<para>
|
|
Morphology preprocessors can be applied to the words being
|
|
indexed to replace different forms of the same word with the base,
|
|
normalized form. For instance, English stemmer will normalize
|
|
both "dogs" and "dog" to "dog", making search results for
|
|
both searches the same.
|
|
</para>
|
|
<para>
|
|
Built-in preprocessors include English stemmer, Russian stemmer
|
|
(that supports UTF-8 and Windows-1251 encodings), Soundex,
|
|
and Metaphone. The latter two replace the words with special
|
|
phonetic codes that are equal is words are phonetically close.
|
|
Additional stemmers provided by <ulink url="http://snowball.tartarus.org/">Snowball</ulink>
|
|
project <ulink url="http://snowball.tartarus.org/dist/libstemmer_c.tgz">libstemmer</ulink> library
|
|
can be enabled at compile time using <option>--with-libstemmer</option> <filename>configure</filename> option.
|
|
Built-in English and Russian stemmers should be faster than their
|
|
libstemmer counterparts, but can produce slightly different results,
|
|
because they are based on an older version. Metaphone implementation
|
|
is based on Double Metaphone algorithm and indexes the primary code.
|
|
</para>
|
|
<para>
|
|
Built-in values that are available for use in <option>morphology</option>
|
|
option are as follows:
|
|
<itemizedlist>
|
|
<listitem><para>none - do not perform any morphology processing;</para></listitem>
|
|
<listitem><para>stem_en - apply Porter's English stemmer;</para></listitem>
|
|
<listitem><para>stem_ru - apply Porter's Russian stemmer;</para></listitem>
|
|
<listitem><para>stem_enru - apply Porter's English and Russian stemmers;</para></listitem>
|
|
<listitem><para>stem_cz - apply Czech stemmer;</para></listitem>
|
|
<listitem><para>soundex - replace keywords with their SOUNDEX code;</para></listitem>
|
|
<listitem><para>metaphone - replace keywords with their METAPHONE code.</para></listitem>
|
|
</itemizedlist>
|
|
Additional values provided by libstemmer are in 'libstemmer_XXX' format,
|
|
where XXX is libstemmer algorithm codename (refer to
|
|
<filename>libstemmer_c/libstemmer/modules.txt</filename> for a complete list).
|
|
</para>
|
|
<para>
|
|
Several stemmers can be specified (comma-separated). They will be applied
|
|
to incoming words in the order they are listed, and the processing will stop
|
|
once one of the stemmers actually modifies the word.
|
|
Also when <link linkend="conf-wordforms">wordforms</link> feature is enabled
|
|
the word will be looked up in word forms dictionary first, and if there is
|
|
a matching entry in the dictionary, stemmers will not be applied at all.
|
|
Or in other words, <link linkend="conf-wordforms">wordforms</link> can be
|
|
used to implement stemming exceptions.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
morphology = stem_en, libstemmer_sv
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-dict"><title>dict</title>
|
|
<para>
|
|
The keywords dictionary type.
|
|
Known values are 'crc' and 'keywords'.
|
|
Optional, default is 'crc'.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
The default dictionary type in Sphinx, and the only one available
|
|
until version 2.0.1-beta, is a so-called CRC dictionary which never
|
|
stores the original keyword text in the index. Instead, keywords are
|
|
replaced with their control sum value (either CRC32 or FNV64, depending
|
|
whether Sphinx was built with <option>--enable-id64</option>) both
|
|
when searching and indexing, and that value is used internally
|
|
in the index.
|
|
</para>
|
|
<para>
|
|
That approach has two drawbacks. First, in CRC32 case there is
|
|
a chance of control sum collision between several pairs of different
|
|
keywords, growing quadratically with the number of unique keywords
|
|
in the index. (FNV64 case is unaffected in practice, as a chance
|
|
of a single FNV64 collision in a dictionary of 1 billion entries
|
|
is approximately 1:16, or 6.25 percent. And most dictionaries
|
|
will be much more compact that a billion keywords, as a typical
|
|
spoken human language has in the region of 1 to 10 million word
|
|
forms.) Second, and more importantly, substring searches are not
|
|
directly possible with control sums. Sphinx alleviated that by
|
|
pre-indexing all the possible substrings as separate keywords
|
|
(see <xref linkend="conf-min-prefix-len"/>, <xref linkend="conf-min-infix-len"/>
|
|
directives). That actually has an added benefit of matching
|
|
substrings in the quickest way possible. But at the same time
|
|
pre-indexing all substrings grows the index size a lot (factors
|
|
of 3-10x and even more would not be unusual) and impacts the
|
|
indexing time respectively, rendering substring searches
|
|
on big indexes rather impractical.
|
|
</para>
|
|
<para>
|
|
Keywords dictionary, introduced in 2.0.1-beta, fixes both these
|
|
drawbacks. It stores the keywords in the index and performs
|
|
search-time wildcard expansion. For example, a search for a
|
|
'test*' prefix could internally expand to 'test|tests|testing'
|
|
query based on the dictionary contents. That expansion is fully
|
|
transparent to the application, except that the separate
|
|
per-keyword statistics for all the actually matched keywords
|
|
would now also be reported.
|
|
</para>
|
|
<para>
|
|
Indexing with keywords dictionary should be 1.1x to 1.3x slower
|
|
compared to regular, non-substring indexing - but times faster
|
|
compared to substring indexing (either prefix or infix). Index size
|
|
should only be slightly bigger that than of the regular non-substring
|
|
index, with a 1..10% percent total difference
|
|
Regular keyword searching time must be very close or identical across
|
|
all three discussed index kinds (CRC non-substring, CRC substring,
|
|
keywords). Substring searching time can vary greatly depending
|
|
on how many actual keywords match the given substring (in other
|
|
words, into how many keywords does the search term expand).
|
|
The maximum number of keywords matched is restricted by the
|
|
<link linkend="conf-expansion-limit">expansion_limit</link>
|
|
directive.
|
|
</para>
|
|
<para>
|
|
Essentially, keywords and CRC dictionaries represent the two
|
|
different trade-off substring searching decisions. You can choose
|
|
to either sacrifice indexing time and index size in favor of
|
|
top-speed worst-case searches (CRC dictionary), or only slightly
|
|
impact indexing time but sacrifice worst-case searching time when
|
|
the prefix expands into very many keywords (keywords dictionary).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
dict = keywords
|
|
</programlisting>
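<para>
For instance, a prefix-searchable index built around the keywords
dictionary might combine the following directives (an illustrative
sketch only; see the respective directive descriptions for details):
<programlisting>
index products
{
	source         = products
	path           = /var/data/products
	dict           = keywords
	min_prefix_len = 3
	enable_star    = 1
}
</programlisting>
</para>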
</sect2>

<sect2 id="conf-index-sp"><title>index_sp</title>
|
|
<para>
|
|
Whether to detect and index sentence and paragraph boundaries.
|
|
Optional, default is 0 (do not detect and index).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
This directive enables sentence and paragraph boundary indexing.
|
|
It's required for the SENTENCE and PARAGRAPH operators to work.
|
|
Sentence boundary detection is based on plain text analysis, so you
|
|
only need to set <code>index_sp = 1</code> to enable it. Paragraph
|
|
detection is however based on HTML markup, and happens in the
|
|
<link linkend="conf-html-strip">HTML stripper</link>.
|
|
So to index paragraph locations you also need to enable the stripper
|
|
by specifying <code>html_strip = 1</code>. Both types of boundaries
|
|
are detected based on a few built-in rules enumerated just below.
|
|
</para>
|
|
<para>
|
|
Sentence boundary detection rules are as follows.
|
|
<itemizedlist>
|
|
<listitem><para>Question and excalamation signs (? and !) are always a sentence boundary.</para></listitem>
|
|
<listitem><para>Trailing dot (.) is a sentence boundary, except:
|
|
<itemizedlist>
|
|
<listitem><para>When followed by a letter. That's considered a part of an abbreviation (as in "S.T.A.L.K.E.R" or "Goldman Sachs S.p.A.").</para></listitem>
|
|
<listitem><para>When followed by a comma. That's considered an abbreviation followed by a comma (as in "Telecom Italia S.p.A., founded in 1994").</para></listitem>
|
|
<listitem><para>When followed by a space and a small letter. That's considered an abbreviation within a sentence (as in "News Corp. announced in Februrary").</para></listitem>
|
|
<listitem><para>When preceded by a space and a capital letter, and followed by a space. That's considered a middle initial (as in "John D. Doe").</para></listitem>
|
|
</itemizedlist>
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
Paragraph boundaries are inserted at every block-level HTML tag.
|
|
Namely, those are (as taken from HTML 4 standard) ADDRESS, BLOCKQUOTE,
|
|
CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P,
|
|
PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.
|
|
</para>
|
|
<para>
|
|
Both sentences and paragraphs increment the keyword position counter by 1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
index_sp = 1
|
|
</programlisting>
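<para>
Once indexed, the boundaries can be used from the extended query syntax;
a sketch (the index name is illustrative):
<programlisting>
// match "cats" and "dogs" within the same sentence
$cl->Query ( "cats SENTENCE dogs", "test1" );
</programlisting>
</para>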
</sect2>

<sect2 id="conf-index-zones"><title>index_zones</title>
|
|
<para>
|
|
A list of in-field HTML/XML zones to index.
|
|
Optional, default is empty (do not index zones).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Zones can be formally defined as follows. Everything between
|
|
an opening and a matching closing tag is called a span, and
|
|
the aggregate of all spans corresponding sharing the same
|
|
tag name is called a zone. For instance, everything between
|
|
the occurrences of <H1> and </H1> in the document
|
|
field belongs to H1 zone.
|
|
</para>
|
|
<para>
|
|
Zone indexing, enabled by <code>index_zones</code> directive,
|
|
is an optional extension of the HTML stripper. So it will also
|
|
require that the <link linkend="conf-html-strip">stripper</link>
|
|
is enabled (with <code>html_strip = 1</code>). The value of the
|
|
<code>index_zones</code> should be a comma-separated list of
|
|
those tag names and wildcards (ending with a star) that should
|
|
be indexed as zones.
|
|
</para>
|
|
<para>
|
|
Zones can nest and overlap arbitrarily. The only requirement
|
|
is that every opening tag has a matching tag. You can also have
|
|
an arbitrary number of both zones (as in unique zone names,
|
|
such as H1) and spans (all the occurrences of those H1 tags)
|
|
in a document.
|
|
Once indexed, zones can then be used for matching with
|
|
the ZONE operator, see <xref linkend="extended-syntax"/>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
index_zones = h*, th, title
|
|
</programlisting>
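<para>
Indexed zones can then be addressed with the ZONE operator in the
extended query syntax; a sketch (the index name is illustrative):
<programlisting>
// match "in these titles" within h3 or h4 zones only
$cl->Query ( "ZONE:(h3,h4) in these titles", "test1" );
</programlisting>
</para>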
</sect2>

<sect2 id="conf-min-stemming-len"><title>min_stemming_len</title>
|
|
<para>
|
|
Minimum word length at which to enable stemming.
|
|
Optional, default is 1 (stem everything).
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
Stemmers are not perfect, and might sometimes produce undesired results.
|
|
For instance, running "gps" keyword through Porter stemmer for English
|
|
results in "gp", which is not really the intent. <option>min_stemming_len</option>
|
|
feature lets you suppress stemming based on the source word length,
|
|
ie. to avoid stemming too short words. Keywords that are shorter than
|
|
the given threshold will not be stemmed. Note that keywords that are
|
|
exactly as long as specified <b>will</b> be stemmed. So in order to avoid
|
|
stemming 3-character keywords, you should specify 4 for the value.
|
|
For more finely grained control, refer to <link linkend="conf-wordforms">wordforms</link> feature.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
min_stemming_len = 4
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-stopwords"><title>stopwords</title>
|
|
<para>
|
|
Stopword files list (space separated).
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
Stopwords are the words that will not be indexed. Typically you'd
|
|
put most frequent words in the stopwords list because they do not add
|
|
much value to search results but consume a lot of resources to process.
|
|
</para>
|
|
<para>
|
|
You can specify several file names, separated by spaces. All the files
|
|
will be loaded. Stopwords file format is simple plain text. The encoding
|
|
must match index encoding specified in <link linkend="conf-charset-type">charset_type</link>.
|
|
File data will be tokenized with respect to <link linkend="conf-charset-table">charset_table</link>
|
|
settings, so you can use the same separators as in the indexed data.
|
|
The <link linkend="conf-morphology">stemmers</link> will also be
|
|
applied when parsing stopwords file.
|
|
</para>
|
|
<para>
|
|
While stopwords are not indexed, they still do affect the keyword positions.
|
|
For instance, assume that "the" is a stopword, that document 1 contains the line
|
|
"in office", and that document 2 contains "in the office". Searching for "in office"
|
|
as for exact phrase will only return the first document, as expected, even though
|
|
"the" in the second one is stopped.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
stopwords = /usr/local/sphinx/data/stopwords.txt
|
|
stopwords = stopwords-ru.txt stopwords-en.txt
|
|
</programlisting>
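<para>
The file itself is just a plain-text list of words; since its data
is tokenized like regular indexed text, line breaks and spaces both
work as separators. Illustrative contents:
<programlisting>
the
a
into
with
</programlisting>
</para>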
</sect2>

<sect2 id="conf-wordforms"><title>wordforms</title>
|
|
<para>
|
|
Word forms dictionary.
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
Word forms are applied after tokenizing the incoming text
|
|
by <link linkend="conf-charset-table">charset_table</link> rules.
|
|
They essentialy let you replace one word with another. Normally,
|
|
that would be used to bring different word forms to a single
|
|
normal form (eg. to normalize all the variants such as "walks",
|
|
"walked", "walking" to the normal form "walk"). It can also be used
|
|
to implement stemming exceptions, because stemming is not applied
|
|
to words found in the forms list.
|
|
</para>
|
|
<para>
|
|
Dictionaries are used to normalize incoming words both during indexing
|
|
and searching. Therefore, to pick up changes in wordforms file
|
|
it's required to reindex and restart <filename>searchd</filename>.
|
|
</para>
|
|
<para>
|
|
Word forms support in Sphinx is designed to support big dictionaries well.
|
|
They moderately affect indexing speed: for instance, a dictionary with 1 million
|
|
entries slows down indexing about 1.5 times. Searching speed is not affected at all.
|
|
Additional RAM impact is roughly equal to the dictionary file size,
|
|
and dictionaries are shared across indexes: ie. if the very same 50 MB wordforms
|
|
file is specified for 10 different indexes, additional <filename>searchd</filename>
|
|
RAM usage will be about 50 MB.
|
|
</para>
|
|
<para>
|
|
Dictionary file should be in a simple plain text format. Each line
|
|
should contain source and destination word forms, in exactly the same
|
|
encoding as specified in <link linkend="conf-charset-type">charset_type</link>,
|
|
separated by "greater" sign. Rules from the
|
|
<link linkend="conf-charset-table">charset_table</link> will be
|
|
applied when the file is loaded. So basically it's as case sensitive
|
|
as your other full-text indexed data, ie. typically case insensitive.
|
|
Here's the file contents sample:
|
|
<programlisting>
|
|
walks > walk
|
|
walked > walk
|
|
walking > walk
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
There is a bundled <filename>spelldump</filename> utility that
|
|
helps you create a dictionary file in the format Sphinx can read
|
|
from source <filename>.dict</filename> and <filename>.aff</filename>
|
|
dictionary files in <filename>ispell</filename> or <filename>MySpell</filename>
|
|
format (as bundled with OpenOffice).
|
|
</para>
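<para>
A hypothetical conversion run could look like this (the file names are
illustrative; run <filename>spelldump</filename> without arguments
for the exact usage):
<programlisting>
spelldump en_US.dict en_US.aff wordforms.txt
</programlisting>
</para>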
<para>
Starting with version 0.9.9-rc1, you can map several source words
to a single destination word. Because the work happens on tokens,
not the source text, differences in whitespace and markup are ignored.
<programlisting>
core 2 duo > c2d
e6600 > c2d
core 2duo > c2d
</programlisting>
</para>
<para>
Notice however that the <emphasis>destination</emphasis> wordforms
are still always interpreted as a <emphasis>single</emphasis> keyword!
Having a mapping like "St John > Saint John" will result in <b>not</b>
matching "St John" when searching for "Saint" or "John", because the
destination keyword will be "Saint John" with a space character in it
(and it's barely possible to input a destination keyword with a space).
</para>
<bridgehead>Example:</bridgehead>
<programlisting>
wordforms = /usr/local/sphinx/data/wordforms.txt
</programlisting>
</sect2>

<sect2 id="conf-exceptions"><title>exceptions</title>
|
|
<para>
|
|
Tokenizing exceptions file.
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
Exceptions allow to map one or more tokens (including tokens with
|
|
characters that would normally be excluded) to a single keyword.
|
|
They are similar to <link linkend="conf-wordforms">wordforms</link>
|
|
in that they also perform mapping, but have a number of important
|
|
differences.
|
|
</para>
|
|
<para>
|
|
Short summary of the differences is as follows:
|
|
<itemizedlist>
|
|
<listitem><para>exceptions are case sensitive, wordforms are not;</para></listitem>
|
|
<listitem><para>exceptions allow to detect sequences of tokens, wordforms work with single words only;</para></listitem>
|
|
<listitem><para>exceptions can use special characters that are <b>not</b> in charset_table, wordforms fully obey charset_table;</para></listitem>
|
|
<listitem><para>exceptions can underperform on huge dictionaries, wordforms handle millions of entries well.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
The expected file format is also plain text, with one line per exception,
|
|
and the line format is as follows:
|
|
<programlisting>
|
|
map-from-tokens => map-to-token
|
|
</programlisting>
|
|
Example file:
|
|
<programlisting>
|
|
AT & T => AT&T
|
|
AT&T => AT&T
|
|
Standarten Fuehrer => standartenfuhrer
|
|
Standarten Fuhrer => standartenfuhrer
|
|
MS Windows => ms windows
|
|
Microsoft Windows => ms windows
|
|
C++ => cplusplus
|
|
c++ => cplusplus
|
|
C plus plus => cplusplus
|
|
</programlisting>
|
|
All tokens here are case sensitive: they will <b>not</b> be processed by
|
|
<link linkend="conf-charset-table">charset_table</link> rules. Thus, with
|
|
the example exceptions file above, "At&t" text will be tokenized as two
|
|
keywords "at" and "t", because of lowercase letters. On the other hand,
|
|
"AT&T" will match exactly and produce single "AT&T" keyword.
|
|
</para>
|
|
<para>
|
|
Note that this map-to keyword is a) always interpereted
|
|
as a <emphasis>single</emphasis> word, and b) is both case and space
|
|
sensitive! In our sample, "ms windows" query will <emphasis>not</emphasis>
|
|
match the document with "MS Windows" text. The query will be interpreted
|
|
as a query for two keywords, "ms" and "windows". And what "MS Windows"
|
|
gets mapped to is a <emphasis>single</emphasis> keyword "ms windows",
|
|
with a space in the middle. On the other hand, "standartenfuhrer"
|
|
will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer"
|
|
contents (capitalized exactly like this), or any capitalization variant
|
|
of the keyword itself, eg. "staNdarTenfUhreR". (It won't catch
|
|
"standarten fuhrer", however: this text does not match any of the
|
|
listed exceptions because of case sensitivity, and gets indexed
|
|
as two separate keywords.)
|
|
</para>
|
|
<para>
|
|
Whitespace in the map-from tokens list matters, but its amount does not.
|
|
Any amount of the whitespace in the map-form list will match any other amount
|
|
of whitespace in the indexed document or query. For instance, "AT & T"
|
|
map-from token will match "AT    &  T" text,
|
|
whatever the amount of space in both map-from part and the indexed text.
|
|
Such text will therefore be indexed as a special "AT&T" keyword,
|
|
thanks to the very first entry from the sample.
|
|
</para>
|
|
<para>
|
|
Exceptions also allow to capture special characters (that are exceptions
|
|
from general <link linkend="conf-charset-table">charset_table</link> rules;
|
|
hence the name). Assume that you generally do not want to treat '+'
|
|
as a valid character, but still want to be able search for some exceptions
|
|
from this rule such as 'C++'. The sample above will do just that, totally
|
|
independent of what characters are in the table and what are not.
|
|
</para>
|
|
<para>
|
|
Exceptions are applied to raw incoming document and query data
|
|
during indexing and searching respectively. Therefore, to pick up
|
|
changes in the file it's required to reindex and restart
|
|
<filename>searchd</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
exceptions = /usr/local/sphinx/data/exceptions.txt
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-min-word-len"><title>min_word_len</title>
|
|
<para>
|
|
Minimum indexed word length.
|
|
Optional, default is 1 (index everything).
|
|
</para>
|
|
<para>
|
|
Only those words that are not shorter than this minimum will be indexed.
|
|
For instance, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
min_word_len = 4
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-charset-type"><title>charset_type</title>
|
|
<para>
|
|
Character set encoding type.
|
|
Optional, default is 'sbcs'.
|
|
Known values are 'sbcs' and 'utf-8'.
|
|
</para>
|
|
<para>
|
|
Different encodings have different methods for mapping their internal
|
|
characters codes into specific byte sequences. Two most common methods
|
|
in use today are single-byte encoding and UTF-8. Their corresponding
|
|
charset_type values are 'sbcs' (stands for Single Byte Character Set)
|
|
and 'utf-8'. The selected encoding type will be used everywhere where
|
|
the index is used: when indexing the data, when parsing the query
|
|
against this index, when generating snippets, etc.
|
|
</para>
|
|
<para>
|
|
Note that while 'utf-8' implies that the decoded values must be treated
|
|
as Unicode codepoint numbers, there's a family of 'sbcs' encodings that
|
|
may in turn treat different byte values differently, and that should be
|
|
properly reflected in your <link linkend="conf-charset-table">charset_table</link> settings.
|
|
For example, the same byte value of 224 (0xE0 hex) maps to different Russian letters
|
|
depending on whether koi-8r or windows-1251 encoding is used.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
charset_type = utf-8
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-charset-table"><title>charset_table</title>
|
|
<para>
|
|
Accepted characters table, with case folding rules.
|
|
Optional, default value depends on <link linkend="conf-charset-type">charset_type</link> value.
|
|
</para>
|
|
<para>
|
|
charset_table is the main workhorse of Sphinx tokenizing process,
|
|
ie. the process of extracting keywords from document text or query txet.
|
|
It controls what characters are accepted as valid and what are not,
|
|
and how the accepted characters should be transformed (eg. should
|
|
the case be removed or not).
|
|
</para>
|
|
<para>
|
|
You can think of charset_table as of a big table that has a mapping
|
|
for each and every of 100K+ characters in Unicode (or as of a small
|
|
256-character table if you're using SBCS). By default, every character
|
|
maps to 0, which means that it does not occur within keywords and
|
|
should be treated as a separator. Once mentioned in the table,
|
|
character is mapped to some other character (most frequently,
|
|
either to itself or to a lowercase letter), and is treated
|
|
as a valid keyword part.
|
|
</para>
|
|
<para>
|
|
The expected value format is a commas-separated list of mappings.
|
|
Two simplest mappings simply declare a character as valid, and map
|
|
a single character to another single character, respectively.
|
|
But specifying the whole table in such form would result
|
|
in bloated and barely manageable specifications. So there are
|
|
several syntax shortcuts that let you map ranges of characters
|
|
at once. The complete list is as follows:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>A->a</term>
|
|
<listitem><para>Single char mapping, declares source char 'A' as allowed
|
|
to occur within keywords and maps it to destination char 'a'
|
|
(but does <emphasis>not</emphasis> declare 'a' as allowed).
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>A..Z->a..z</term>
|
|
<listitem><para>Range mapping, declares all chars in source range
|
|
as allowed and maps them to the destination range. Does <emphasis>not</emphasis>
|
|
declare destination range as allowed. Also checks ranges' lengths
|
|
(the lengths must be equal).
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>a</term>
|
|
<listitem><para>Stray char mapping, declares a character as allowed
|
|
and maps it to itself. Equivalent to a->a single char mapping.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>a..z</term>
|
|
<listitem><para>Stray range mapping, declares all characters in range
|
|
as allowed and maps them to themselves. Equivalent to
|
|
a..z->a..z range mapping.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>A..Z/2</term>
|
|
<listitem><para>Checkerboard range map. Maps every pair of chars
|
|
to the second char. More formally, declares odd characters
|
|
in range as allowed and maps them to the even ones; also
|
|
declares even characters as allowed and maps them to themselves.
|
|
For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D,
|
|
..., Y->Z, Z->Z. This mapping shortcut is helpful for
|
|
a number of Unicode blocks where uppercase and lowercase
|
|
letters go in such interleaved order instead of contiguous
|
|
chunks.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Control characters with codes from 0 to 31 are always treated as separators.
|
|
Characters with codes 32 to 127, ie. 7-bit ASCII characters, can be used
|
|
in the mappings as is. To avoid configuration file encoding issues,
|
|
8-bit ASCII characters and Unicode characters must be specified in U+xxx form,
|
|
where 'xxx' is hexadecimal codepoint number. This form can also be used
|
|
for 7-bit ASCII characters to encode special ones: eg. use U+20 to
|
|
encode space, U+2E to encode dot, U+2C to encode comma.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
# 'sbcs' defaults for English and Russian
|
|
charset_table = 0..9, A..Z->a..z, _, a..z, \
|
|
U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
|
|
|
|
# 'utf-8' defaults for English and Russian
|
|
charset_table = 0..9, A..Z->a..z, _, a..z, \
|
|
U+410..U+42F->U+430..U+44F, U+430..U+44F
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-ignore-chars"><title>ignore_chars</title>
|
|
<para>
|
|
Ignored characters list.
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
Useful in the cases when some characters, such as soft hyphenation mark (U+00AD),
|
|
should be not just treated as separators but rather fully ignored.
|
|
For example, if '-' is simply not in the charset_table,
|
|
"abc-def" text will be indexed as "abc" and "def" keywords.
|
|
On the contrary, if '-' is added to ignore_chars list, the same
|
|
text will be indexed as a single "abcdef" keyword.
|
|
</para>
|
|
<para>
|
|
The syntax is the same as for <link linkend="conf-charset-table">charset_table</link>,
|
|
but it's only allowed to declare characters, and not allowed to map them. Also,
|
|
the ignored characters must not be present in charset_table.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
ignore_chars = U+AD
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-min-prefix-len"><title>min_prefix_len</title>
|
|
<para>
|
|
Minimum word prefix length to index.
|
|
Optional, default is 0 (do not index prefixes).
|
|
</para>
|
|
<para>
|
|
Prefix indexing lets you implement wildcard searching with 'wordstart*' wildcards
|
|
(refer to <link linkend="conf-enable-star">enable_star</link> option for details on wildcard syntax).
|
|
When minimum prefix length is set to a positive number, indexer will index
|
|
all the possible keyword prefixes (ie. word beginnings) in addition to the keywords
|
|
themselves. Too short prefixes (below the minimum allowed length) will not
|
|
be indexed.
|
|
</para>
|
|
<para>
|
|
For instance, indexing a keyword "example" with min_prefix_len=3
|
|
will result in indexing "exa", "exam", "examp", "exampl" prefixes along
|
|
with the word itself. Searches against such index for "exam" will match
|
|
documents that contain the word "example", even if they do not contain "exam"
on its own. However, indexing prefixes will make the index grow significantly
|
|
(because of many more indexed keywords), and will degrade both indexing
|
|
and searching times.
|
|
</para>
|
|
<para>
|
|
There's no automatic way to rank perfect word matches higher
|
|
in a prefix index, but there are a number of tricks to achieve that.
First, you can set up two indexes, one with prefix indexing and one
|
|
without it, search through both, and use <link linkend="api-func-setindexweights">SetIndexWeights()</link>
|
|
call to combine weights. Second, you can enable star-syntax and rewrite
|
|
your extended-mode queries:
|
|
<programlisting>
|
|
# in sphinx.conf
|
|
enable_star = 1
|
|
|
|
// in query
|
|
$cl->Query ( "( keyword | keyword* ) other keywords" );
|
|
</programlisting>
|
|
</para>
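<para>
For reference, here is a minimal SphinxAPI sketch of the first trick.
The index names 'plain_index' and 'prefix_index' are hypothetical, and
assumed to be built from the same data without and with prefix indexing,
respectively:
<programlisting>
// search both indexes, weighting the exact-word index 10x higher,
// so perfect word matches rank above prefix-only matches
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$cl->SetIndexWeights ( array ( "plain_index" => 10, "prefix_index" => 1 ) );
$res = $cl->Query ( "example", "plain_index prefix_index" );
</programlisting>
</para>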
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
min_prefix_len = 3
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-min-infix-len"><title>min_infix_len</title>
|
|
<para>
|
|
Minimum infix length to index.
|
|
Optional, default is 0 (do not index infixes).
|
|
</para>
|
|
<para>
|
|
Infix indexing lets you implement wildcard searching with 'start*', '*end', and '*middle*' wildcards
|
|
(refer to <link linkend="conf-enable-star">enable_star</link> option for details on wildcard syntax).
|
|
When minimum infix length is set to a positive number, indexer will index all the possible keyword infixes
|
|
(ie. substrings) in addition to the keywords themselves. Too short infixes
|
|
(below the minimum allowed length) will not be indexed. For instance,
|
|
indexing a keyword "test" with min_infix_len=2 will result in indexing
|
|
"te", "es", "st", "tes", "est" infixes along with the word itself.
|
|
Searches against such index for "es" will match documents that contain
the word "test", even if they do not contain "es" on its own. However,
|
|
indexing infixes will make the index grow significantly (because of
|
|
many more indexed keywords), and will degrade both indexing and
|
|
searching times.</para>
|
|
<para>
|
|
There's no automatic way to rank perfect word matches higher
|
|
in an infix index, but the same tricks as with <link linkend="conf-min-prefix-len">prefix indexes</link>
|
|
can be applied.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
min_infix_len = 3
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-prefix-fields"><title>prefix_fields</title>
|
|
<para>
|
|
The list of full-text fields to limit prefix indexing to.
|
|
Optional, default is empty (index all fields in prefix mode).
|
|
</para>
|
|
<para>
|
|
Because prefix indexing impacts both indexing and searching performance,
|
|
it might be desired to limit it to specific full-text fields only:
|
|
for instance, to provide prefix searching through URLs, but not through
|
|
page contents. prefix_fields specifies what fields will be prefix-indexed;
|
|
all other fields will be indexed in normal mode. The value format is a
|
|
comma-separated list of field names.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
prefix_fields = url, domain
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-infix-fields"><title>infix_fields</title>
|
|
<para>
|
|
The list of full-text fields to limit infix indexing to.
|
|
Optional, default is empty (index all fields in infix mode).
|
|
</para>
|
|
<para>
|
|
Similar to <link linkend="conf-prefix-fields">prefix_fields</link>,
|
|
but lets you limit infix-indexing to given fields.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
infix_fields = url, domain
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-enable-star"><title>enable_star</title>
|
|
<para>
|
|
Enables star-syntax (or wildcard syntax) when searching through prefix/infix indexes.
|
|
Optional, default is 0 (do not use wildcard syntax), for compatibility with 0.9.7.
|
|
Known values are 0 and 1.
|
|
</para>
|
|
<para>
|
|
This feature enables "star-syntax", or wildcard syntax, when searching
|
|
through indexes which were created with prefix or infix indexing enabled.
|
|
It only affects searching; so it can be changed without reindexing
|
|
by simply restarting <filename>searchd</filename>.
|
|
</para>
|
|
<para>
|
|
The default value is 0, which means to disable star-syntax
|
|
and treat all keywords as prefixes or infixes respectively,
|
|
depending on indexing-time <link linkend="conf-min-prefix-len">min_prefix_len</link>
|
|
and <link linkend="conf-min-infix-len">min_infix_len settings</link>.
|
|
The value of 1 means that star ('*') can be used at the start
|
|
and/or the end of the keyword. The star will match zero or more characters.
|
|
</para>
|
|
<para>
|
|
For example, assume that the index was built with infixes and
|
|
that enable_star is 1. Searching should work as follows:
|
|
<orderedlist>
|
|
<listitem><para>"abcdef" query will match only those documents that contain the exact "abcdef" word in them.</para></listitem>
|
|
<listitem><para>"abc*" query will match those documents that contain
|
|
any words starting with "abc" (including the documents which
|
|
contain the exact "abc" word only);</para></listitem>
|
|
<listitem><para>"*cde*" query will match those documents that contain
|
|
any words which have "cde" characters in any part of the word
|
|
(including the documents which contain the exact "cde" word only).</para></listitem>
|
|
<listitem><para>"*def" query will match those documents that contain
|
|
any words ending with "def" (including the documents that
|
|
contain the exact "def" word only).</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
enable_star = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-ngram-len"><title>ngram_len</title>
|
|
<para>
|
|
N-gram lengths for N-gram indexing.
|
|
Optional, default is 0 (disable n-gram indexing).
|
|
Known values are 0 and 1 (other lengths to be implemented).
|
|
</para>
|
|
<para>
|
|
N-grams provide basic CJK (Chinese, Japanese, Korean) support for
|
|
unsegmented texts. The issue with CJK searching is that there may be no
|
|
clear separators between the words. Ideally, the texts would be filtered
|
|
through a special program called segmenter that would insert separators
|
|
in proper locations. However, segmenters are slow and error prone,
|
|
and it's common to index contiguous groups of N characters, or n-grams,
|
|
instead.
|
|
</para>
|
|
<para>
|
|
When this feature is enabled, streams of CJK characters are indexed
|
|
as N-grams. For example, if incoming text is "ABCDEF" (where A to F represent
|
|
some CJK characters) and length is 1, it will be indexed as if
|
|
it was "A B C D E F". (With length equal to 2, it would produce "AB BC CD DE EF";
|
|
but only 1 is supported at the moment.) Only those characters that are
|
|
listed in <link linkend="conf-ngram-chars">ngram_chars</link> table
|
|
will be split this way; other ones will not be affected.
|
|
</para>
|
|
<para>
|
|
Note that if the search query is segmented, ie. there are separators between
individual words, then wrapping the words in quotes and using extended mode
will result in proper matches being found even if the text was <b>not</b>
|
|
segmented. For instance, assume that the original query is BC DEF.
|
|
After wrapping in quotes on the application side, it should look
|
|
like "BC" "DEF" (<emphasis>with</emphasis> quotes). This query
|
|
will be passed to Sphinx and internally split into 1-grams too,
|
|
resulting in "B C" "D E F" query, still with
|
|
quotes that are the phrase matching operator. And it will match
|
|
the text even though there were no separators in the text.
|
|
</para>
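<para>
For instance, a minimal application-side sketch of that quoting
(extended match mode; the index name 'cjk_index' is hypothetical):
<programlisting>
// wrap every pre-segmented word in quotes before passing it to Sphinx
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$res = $cl->Query ( '"BC" "DEF"', "cjk_index" );
</programlisting>
</para>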
|
|
<para>
|
|
Even if the search query is not segmented, Sphinx should still produce
|
|
good results, thanks to phrase based ranking: it will pull closer phrase
|
|
matches (which in case of N-gram CJK words can mean closer multi-character
|
|
word matches) to the top.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
ngram_len = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-ngram-chars"><title>ngram_chars</title>
|
|
<para>
|
|
N-gram characters list.
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
To be used in conjunction with <link linkend="conf-ngram-len">ngram_len</link>,
|
|
this list defines characters, sequences of which are subject to N-gram extraction.
|
|
Words comprised of other characters will not be affected by N-gram indexing
|
|
feature. The value format is identical to <link linkend="conf-charset-table">charset_table</link>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
ngram_chars = U+3000..U+2FA1F
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-phrase-boundary"><title>phrase_boundary</title>
|
|
<para>
|
|
Phrase boundary characters list.
|
|
Optional, default is empty.
|
|
</para>
|
|
<para>
|
|
This list controls what characters will be treated as phrase boundaries,
|
|
in order to adjust word positions and enable phrase-level search
|
|
emulation through proximity search. The syntax is similar
|
|
to <link linkend="conf-charset-table">charset_table</link>.
|
|
Mappings are not allowed and the boundary characters must not
|
|
overlap with anything else.
|
|
</para>
|
|
<para>
|
|
On a phrase boundary, an additional word position increment (specified by
<link linkend="conf-phrase-boundary-step">phrase_boundary_step</link>)
will be added to the current word position. This enables phrase-level
|
|
searching through proximity queries: words in different phrases
|
|
will be guaranteed to be more than phrase_boundary_step distance
|
|
away from each other; so proximity search within that distance
|
|
will be equivalent to phrase-level search.
|
|
</para>
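<para>
For instance, the following sketch emulates sentence-level matching.
With the settings below, a proximity query such as "black cat"~99 should
effectively only match documents where both keywords occur within one
sentence, because any two words from different sentences are guaranteed
to be at least 100 positions apart:
<programlisting>
# in sphinx.conf
phrase_boundary      = ., ?, !
phrase_boundary_step = 100

// in query (proximity distance 99 is below phrase_boundary_step)
$cl->Query ( '"black cat"~99' );
</programlisting>
</para>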
|
|
<para>
|
|
Phrase boundary condition will be raised if and only if such character
|
|
is followed by a separator; this is to avoid abbreviations such as
|
|
S.T.A.L.K.E.R or URLs being treated as several phrases.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-phrase-boundary-step"><title>phrase_boundary_step</title>
|
|
<para>
|
|
Phrase boundary word position increment.
|
|
Optional, default is 0.
|
|
</para>
|
|
<para>
|
|
On a phrase boundary, the current word position will be additionally incremented
|
|
by this number. See <link linkend="conf-phrase-boundary">phrase_boundary</link> for details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
phrase_boundary_step = 100
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-html-strip"><title>html_strip</title>
|
|
<para>
|
|
Whether to strip HTML markup from incoming full-text data.
|
|
Optional, default is 0.
|
|
Known values are 0 (disable stripping) and 1 (enable stripping).
|
|
</para>
|
|
<para>
|
|
Both HTML tags and entities are considered markup and get processed.
|
|
</para>
|
|
<para>HTML tags are removed, while their contents (i.e., everything between
|
|
<P> and </P>) are left intact by default. You can choose
|
|
to keep and index attributes of the tags (e.g., HREF attribute in
|
|
an A tag, or ALT in an IMG one). Several well-known inline tags are
|
|
completely removed, all other tags are treated as block level and
|
|
replaced with whitespace. For example, 'te<B>st</B>'
|
|
text will be indexed as a single keyword 'test', however,
|
|
'te<P>st</P>' will be indexed as two keywords
|
|
'te' and 'st'. Known inline tags are as follows: A, B, I, S, U, BASEFONT,
|
|
BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.
|
|
</para>
|
|
<para>
|
|
HTML entities get decoded and replaced with corresponding UTF-8
|
|
characters. Stripper supports both numeric forms (such as &#239;)
|
|
and text forms (such as &oacute; or &nbsp;). All entities
|
|
as specified by HTML4 standard are supported.
|
|
</para>
|
|
<para>
|
|
Stripping does not work with <option>xmlpipe</option> source type
|
|
(it's suggested to upgrade to xmlpipe2 anyway). It should work with
|
|
properly formed HTML and XHTML, but, just as most browsers, may produce
|
|
unexpected results on malformed input (such as HTML with stray <'s
|
|
or unclosed >'s).
|
|
</para>
|
|
<para>
|
|
Only the tags themselves, and also HTML comments, are stripped.
|
|
To strip the contents of the tags too (eg. to strip embedded scripts),
|
|
see <link linkend="conf-html-remove-elements">html_remove_elements</link> option.
|
|
There are no restrictions on tag names; ie. everything
|
|
that looks like a valid tag start, or end, or a comment
|
|
will be stripped.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
html_strip = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-html-index-attrs"><title>html_index_attrs</title>
|
|
<para>
|
|
A list of markup attributes to index when stripping HTML.
|
|
Optional, default is empty (do not index markup attributes).
|
|
</para>
|
|
<para>
|
|
Specifies HTML markup attributes whose contents should be retained and indexed
|
|
even though other HTML markup is stripped. The format is per-tag enumeration of
|
|
indexable attributes, as shown in the example below.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
html_index_attrs = img=alt,title; a=title;
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-html-remove-elements"><title>html_remove_elements</title>
|
|
<para>
|
|
A list of HTML elements for which to strip contents along with the elements themselves.
|
|
Optional, default is empty string (do not strip contents of any elements).
|
|
</para>
|
|
<para>
|
|
This feature lets you strip element contents, ie. everything that
|
|
is between the opening and the closing tags. It is useful to remove
|
|
embedded scripts, CSS, etc. Short tag form for empty elements
|
|
(ie. <br />) is properly supported; ie. the text that
|
|
follows such tag will <b>not</b> be removed.
|
|
</para>
|
|
<para>
|
|
The value is a comma-separated list of element (tag) names whose
|
|
contents should be removed. Tag names are case insensitive.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
html_remove_elements = style, script
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-local"><title>local</title>
|
|
<para>
|
|
Local index declaration in the <link linkend="distributed">distributed index</link>.
|
|
Multi-value, optional, default is empty.
|
|
</para>
|
|
<para>
|
|
This setting is used to declare local indexes that will be searched when
|
|
the given distributed index is searched. All local indexes will be searched
|
|
<b>sequentially</b>, utilizing only 1 CPU or core; to parallelize processing,
|
|
you can configure <filename>searchd</filename> to query itself (refer to
|
|
<xref linkend="conf-agent"/> for the details). There might be several local
|
|
indexes declared per each distributed index. Any local index can be mentioned
|
|
several times in other distributed indexes.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
local = chunk1
|
|
local = chunk2
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-agent"><title>agent</title>
|
|
<para>
|
|
Remote agent declaration in the <link linkend="distributed">distributed index</link>.
|
|
Multi-value, optional, default is empty.
|
|
</para>
|
|
<para>
|
|
This setting is used to declare remote agents that will be searched
|
|
when the given distributed index is searched. The agents can be thought of
|
|
as network pointers that specify host, port, and index names. In the basic
|
|
case, agents would correspond to remote physical machines. Formally, though,
that is not always the case: you can point several agents to the
|
|
same remote machine; or you can even point agents to the very same
|
|
single instance of <filename>searchd</filename> (in order to utilize
|
|
many CPUs or cores).
|
|
</para>
|
|
<para>
|
|
The value format is as follows:
|
|
<programlisting>
|
|
agent = specification:remote-indexes-list
|
|
specification = hostname ":" port | path
|
|
</programlisting>
|
|
Where 'hostname' is remote host name; 'port' is remote TCP port; 'path'
|
|
is Unix-domain socket path and 'remote-indexes-list' is a
|
|
comma-separated list of remote index names.
|
|
</para>
|
|
<para>
|
|
All agents will be searched in parallel. However, all indexes
|
|
specified for a given agent will be searched sequentially
|
|
in this agent. This lets you fine-tune the configuration
|
|
to the hardware. For instance, if two remote indexes are stored
|
|
on the same physical HDD, it's better to configure one agent
|
|
with several sequentially searched indexes to avoid HDD head thrashing.
|
|
If they are stored on different HDDs, having two agents will be
|
|
advantageous, because the work will be fully parallelized.
|
|
The same applies to CPUs; though CPU performance impact caused
|
|
by two processes stepping on each other is somewhat smaller
|
|
and can frequently be ignored altogether.
|
|
</para>
|
|
<para>
|
|
On machines with many CPUs and/or HDDs, agents can be pointed
|
|
to the same machine to utilize all of the hardware in parallel
|
|
and reduce query latency. There is no need to setup several
|
|
<filename>searchd</filename> instances for that; it's legal
|
|
to configure the instance to contact itself. Here's an example
|
|
setup, intended for a 4-CPU machine, that will use up to
|
|
4 CPUs in parallel to process each query:
|
|
<programlisting>
|
|
index dist
|
|
{
|
|
type = distributed
|
|
local = chunk1
|
|
agent = localhost:9312:chunk2
|
|
agent = localhost:9312:chunk3
|
|
agent = localhost:9312:chunk4
|
|
}
|
|
</programlisting>
|
|
Note how one of the chunks is searched locally and the same instance
|
|
of searchd queries itself to launch searches through three other ones
|
|
in parallel.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
agent = localhost:9312:chunk2 # contact itself
|
|
agent = /var/run/searchd.s:chunk2
|
|
agent = searchbox2:9312:chunk3,chunk4 # search remote indexes
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-agent-blackhole"><title>agent_blackhole</title>
|
|
<para>
|
|
Remote blackhole agent declaration in the <link linkend="distributed">distributed index</link>.
|
|
Multi-value, optional, default is empty.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
<option>agent_blackhole</option> lets you fire-and-forget queries
|
|
to remote agents. That is useful for debugging (or just testing)
|
|
production clusters: you can setup a separate debugging/testing searchd
|
|
instance, and forward the requests to this instance from your production
|
|
master (aggregator) instance without interfering with production work.
|
|
Master searchd will attempt to connect and query blackhole agent
|
|
normally, but it will neither wait nor process any responses.
|
|
Also, all network errors on blackhole agents will be ignored.
|
|
The value format is completely identical to regular
|
|
<link linkend="conf-agent">agent</link> directive.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
agent_blackhole = testbox:9312:testindex1,testindex2
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-agent-connect-timeout"><title>agent_connect_timeout</title>
|
|
<para>
|
|
Remote agent connection timeout, in milliseconds.
|
|
Optional, default is 1000 (ie. 1 second).
|
|
</para>
|
|
<para>
|
|
When connecting to remote agents, <filename>searchd</filename>
|
|
will wait at most this much time for the connect() call to complete
successfully. If the timeout is reached but connect() does not complete,
|
|
and <link linkend="api-func-setretries">retries</link> are enabled,
|
|
retry will be initiated.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
agent_connect_timeout = 300
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-agent-query-timeout"><title>agent_query_timeout</title>
|
|
<para>
|
|
Remote agent query timeout, in milliseconds.
|
|
Optional, default is 3000 (ie. 3 seconds).
|
|
</para>
|
|
<para>
|
|
After connection, <filename>searchd</filename> will wait at most this
|
|
much time for remote queries to complete. This timeout is fully separate
|
|
from connection timeout; so the maximum possible delay caused by
|
|
a remote agent equals the sum of <code>agent_connect_timeout</code> and
|
|
<code>agent_query_timeout</code>. Queries will <b>not</b> be retried
|
|
if this timeout is reached; a warning will be produced instead.
|
|
</para>
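<para>
For instance, with the default settings (agent_connect_timeout = 1000 and
agent_query_timeout = 3000), a single unresponsive agent can delay the
distributed query by up to 1000 + 3000 = 4000 ms, ie. 4 seconds.
</para>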
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
agent_query_timeout = 10000 # our query can be long, allow up to 10 sec
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-preopen"><title>preopen</title>
|
|
<para>
|
|
Whether to pre-open all index files, or open them per each query.
|
|
Optional, default is 0 (do not preopen).
|
|
</para>
|
|
<para>
|
|
This option tells <filename>searchd</filename> that it should pre-open
|
|
all index files on startup (or rotation) and keep them open while it runs.
|
|
Currently, the default mode is <b>not</b> to pre-open the files (this may
|
|
change in the future). Preopened indexes take a few (currently 2) file
|
|
descriptors per index. However, they save on per-query <code>open()</code> calls;
|
|
and also they are invulnerable to subtle race conditions that may happen during
|
|
index rotation under high load. On the other hand, when serving many indexes
|
|
(100s to 1000s), it still might be desirable to open them on a per-query basis
|
|
in order to save file descriptors.
|
|
</para>
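<para>
For instance, serving 1000 preopened indexes at 2 descriptors each would
permanently consume about 2000 file descriptors just for the index files.
</para>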
|
|
<para>
|
|
This directive does not affect <filename>indexer</filename> in any way,
|
|
it only affects <filename>searchd</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
preopen = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-ondisk-dict"><title>ondisk_dict</title>
|
|
<para>
|
|
Whether to keep the dictionary file (.spi) for this index on disk, or precache it in RAM.
|
|
Optional, default is 0 (precache in RAM).
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
The dictionary (.spi) can be kept either in RAM or on disk. The default
|
|
is to fully cache it in RAM. That improves performance, but might cause
|
|
too much RAM pressure, especially if prefixes or infixes were used.
|
|
Enabling <option>ondisk_dict</option> results in 1 additional disk IO
|
|
per keyword per query, but reduces memory footprint.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>indexer</filename> in any way,
|
|
it only affects <filename>searchd</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
ondisk_dict = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-inplace-enable"><title>inplace_enable</title>
|
|
<para>
|
|
Whether to enable in-place index inversion.
|
|
Optional, default is 0 (use separate temporary files).
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
<option>inplace_enable</option> greatly reduces indexing disk footprint,
|
|
at a cost of slightly slower indexing (it uses around 2x less disk,
|
|
but yields around 90-95% of the original performance).
|
|
</para>
|
|
<para>
|
|
Indexing involves two major phases. The first phase collects,
|
|
processes, and partially sorts documents by keyword, and writes
|
|
the intermediate result to temporary files (.tmp*). The second
|
|
phase fully sorts the documents, and creates the final index
|
|
files. Thus, rebuilding a production index on the fly involves
|
|
around 3x peak disk footprint: 1st copy for the intermediate
|
|
temporary files, 2nd copy for newly constructed copy, and 3rd copy
|
|
for the old index that will be serving production queries in the meantime.
|
|
(Intermediate data is comparable in size to the final index.)
|
|
That might be too much disk footprint for big data collections,
|
|
and <option>inplace_enable</option> helps reduce it.
|
|
When enabled, it reuses the temporary files, outputs the
|
|
final data back to them, and renames them on completion.
|
|
However, this might require additional temporary data chunk
|
|
relocation, which is where the performance impact comes from.
|
|
</para>
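<para>
For example, if the final index is about 10 GB, a regular on-the-fly rebuild
can peak at roughly 30 GB of disk (temporary files, the newly built copy, and
the old production copy), while a rebuild with <option>inplace_enable</option>
should peak at roughly 20 GB.
</para>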
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
inplace_enable = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-inplace-hit-gap"><title>inplace_hit_gap</title>
|
|
<para>
|
|
<link linkend="conf-inplace-enable">In-place inversion</link> fine-tuning option.
|
|
Controls preallocated hitlist gap size.
|
|
Optional, default is 0.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
inplace_hit_gap = 1M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-inplace-docinfo-gap"><title>inplace_docinfo_gap</title>
|
|
<para>
|
|
<link linkend="conf-inplace-enable">In-place inversion</link> fine-tuning option.
|
|
Controls preallocated docinfo gap size.
|
|
Optional, default is 0.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
inplace_docinfo_gap = 1M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-inplace-reloc-factor"><title>inplace_reloc_factor</title>
|
|
<para>
|
|
<link linkend="conf-inplace-enable">In-place inversion</link> fine-tuning option.
|
|
Controls relocation buffer size within indexing memory arena.
|
|
Optional, default is 0.1.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
inplace_reloc_factor = 0.1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-inplace-write-factor"><title>inplace_write_factor</title>
|
|
<para>
|
|
<link linkend="conf-inplace-enable">In-place inversion</link> fine-tuning option.
|
|
Controls in-place write buffer size within indexing memory arena.
|
|
Optional, default is 0.1.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
inplace_write_factor = 0.1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-index-exact-words"><title>index_exact_words</title>
|
|
<para>
|
|
Whether to index the original keywords along with the stemmed/remapped versions.
|
|
Optional, default is 0 (do not index).
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
When enabled, <option>index_exact_words</option> forces <filename>indexer</filename>
|
|
to put the raw keywords in the index along with the stemmed versions. That, in turn,
|
|
enables <link linkend="extended-syntax">exact form operator</link> in the query language to work.
|
|
This impacts the index size and the indexing time. However, searching performance
|
|
is not impacted at all.
|
|
</para>
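<para>
For instance, a quick sketch of how the exact form operator can then be used
(English stemming via the morphology directive is assumed to be enabled):
<programlisting>
# in sphinx.conf
morphology        = stem_en
index_exact_words = 1

// in query: '=' matches the exact, unstemmed form only;
// plain 'running' would also match 'run', 'runs', etc.
$cl->Query ( '=running' );
</programlisting>
</para>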
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
index_exact_words = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-overshort-step"><title>overshort_step</title>
|
|
<para>
|
|
Position increment on overshort (less than <link linkend="conf-min-word-len">min_word_len</link>) keywords.
|
|
Optional, allowed values are 0 and 1, default is 1.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
overshort_step = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-stopword-step"><title>stopword_step</title>
|
|
<para>
|
|
Position increment on <link linkend="conf-stopwords">stopwords</link>.
|
|
Optional, allowed values are 0 and 1, default is 1.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
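<para>
For instance, assume that "in" is a stopword. With stopword_step = 1,
the text "Microsoft in office" is indexed with "microsoft" at position 1
and "office" at position 3, so the phrase query "microsoft office" does
not match it. With stopword_step = 0, the same text yields positions 1
and 2 respectively, and the phrase query matches.
</para>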
|
|
<para>
|
|
This directive does not affect <filename>searchd</filename> in any way,
|
|
it only affects <filename>indexer</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
stopword_step = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-hitless-words"><title>hitless_words</title>
|
|
<para>
|
|
Hitless words list.
|
|
Optional, allowed values are 'all', or a list file name.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
By default, Sphinx full-text index stores not only a list of matching
|
|
documents for every given keyword, but also a list of its in-document positions
|
|
(aka hitlist). Hitlists enable phrase, proximity, strict order and other
|
|
advanced types of searching, as well as phrase proximity ranking. However,
|
|
hitlists for specific frequent keywords (that can not be stopped for
|
|
some reason despite being frequent) can get huge and thus slow to process
|
|
while querying. Also, in some cases we might only care about boolean
|
|
keyword matching, and never need position-based searching operators
|
|
(such as phrase matching) nor phrase ranking.
|
|
</para>
|
|
<para>
|
|
<option>hitless_words</option> lets you create indexes that either
|
|
do not have positional information (hitlists) at all, or skip it for
|
|
specific keywords.
|
|
</para>
|
|
<para>
|
|
Hitless index will generally use less space than the respective
|
|
regular index (about 1.5x can be expected). Both indexing and searching
|
|
should be faster, at a cost of missing positional query and ranking support.
|
|
When searching, positional queries (eg. phrase queries) will be automatically
|
|
converted to respective non-positional (document-level) or combined queries.
|
|
For instance, if keywords "hello" and "world" are hitless, "hello world"
|
|
phrase query will be converted to (hello & world) bag-of-words query,
|
|
matching all documents that mention either of the keywords but not necessarily
|
|
the exact phrase. And if, in addition, keywords "simon" and "says" are not
|
|
hitless, "simon says hello world" will be converted to ("simon says" &
|
|
hello & world) query, matching all documents that contain "hello" and
|
|
"world" anywhere in the document, and also "simon says" as an exact phrase.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
hitless_words = all
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-expand-keywords"><title>expand_keywords</title>
|
|
<para>
|
|
Expand keywords with exact forms and/or stars when possible.
|
|
Optional, default is 0 (do not expand keywords).
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Queries against indexes with <option>expand_keywords</option> feature
|
|
enabled are internally expanded as follows. If the index was built with
|
|
prefix or infix indexing enabled, every keyword gets internally replaced
|
|
with a disjunction of keyword itself and a respective prefix or infix
|
|
(keyword with stars). If the index was built with both stemming and
|
|
<link linkend="conf-index-exact-words">index_exact_words</link> enabled,
|
|
exact form is also added. Here's an example that shows how internal
|
|
expansion works when all of the above (infixes, stemming, and exact
|
|
words) are combined:
|
|
<programlisting>
|
|
running -> ( running | *running* | =running )
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
Expanded queries naturally take longer to complete, but can possibly
|
|
improve the search quality, as the documents with exact form matches
|
|
should be ranked generally higher than documents with stemmed or infix matches.
|
|
</para>
|
|
<para>
|
|
Note that the existing query syntax does not allow emulating this
|
|
kind of expansion, because internal expansion works on keyword level and
|
|
expands keywords within phrase or quorum operators too (which is not
|
|
possible through the query syntax).
|
|
</para>
|
|
<para>
|
|
This directive does not affect <filename>indexer</filename> in any way,
|
|
it only affects <filename>searchd</filename>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
expand_keywords = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-blend-chars"><title>blend_chars</title>
|
|
<para>
|
|
Blended characters list.
|
|
Optional, default is empty.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Blended characters are indexed both as separators and valid characters.
|
|
For instance, assume that & is configured as blended and AT&T
|
|
occurs in an indexed document. Three different keywords will get indexed,
|
|
namely "at&t", treating blended characters as valid, plus "at" and "t",
|
|
treating them as separators.
|
|
</para>
|
|
<para>
|
|
Positions for tokens obtained by replacing blended characters with whitespace
|
|
are assigned as usual, so regular keywords will be indexed just as if there was
|
|
no <option>blend_chars</option> specified at all. An additional token that
|
|
mixes blended and non-blended characters will be put at the starting position.
|
|
For instance, if the text "AT&T company" occurs at the very
beginning of the text field, "at" will be given position 1, "t" position 2,
"company" position 3, and "AT&T" will also be given position 1 ("blending"
with the opening regular keyword). Thus, querying for either AT&T or just
AT will match that document, and querying for "AT T" as a phrase will also match it.
Last but not least, a phrase query for "AT&T company" will <emphasis>also</emphasis>
match it, despite the position gap between "AT&T" and "company".
|
|
</para>
|
|
<para>
|
|
Blended characters can overlap with special characters used in query
|
|
syntax (think of T-Mobile or @twitter). Where possible, query parser will
|
|
automatically handle blended character as blended. For instance, "hello @twitter"
|
|
within quotes (a phrase operator) would handle @-sign as blended, because
|
|
@-syntax for field operator is not allowed within phrases. Otherwise,
|
|
the character would be handled as an operator. So you might want to
|
|
escape the keywords.
|
|
</para>
|
|
<para>
|
|
Starting with version 2.0.1-beta, blended characters can be remapped,
|
|
so that multiple different blended characters could be normalized into
|
|
just one base form. This is useful when indexing multiple alternative
|
|
Unicode codepoints with equivalent glyphs.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
blend_chars = +, &, U+23
|
|
blend_chars = +, &->+ # 2.0.1 and above
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-blend-mode"><title>blend_mode</title>
|
|
<para>
|
|
Blended tokens indexing mode.
|
|
Optional, default is <option>trim_none</option>.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
By default, tokens that mix blended and non-blended characters
|
|
get indexed in their entirety. For instance, when both the at-sign and
the exclamation mark are in <option>blend_chars</option>, "@dude!" will
result in two tokens being indexed: "@dude!" (with all the blended characters)
|
|
and "dude" (without any). Therefore "@dude" query will <emphasis>not</emphasis>
|
|
match it.
|
|
</para>
|
|
<para>
|
|
<option>blend_mode</option> directive adds flexibility to this indexing
|
|
behavior. It takes a comma-separated list of options.
|
|
<programlisting>
|
|
blend_mode = option [, option [, ...]]
|
|
option = trim_none | trim_head | trim_tail | trim_both | skip_pure
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
Options specify token indexing variants. If multiple options are
|
|
specified, multiple variants of the same token will be indexed.
|
|
Regular keywords (resulting from that token by replacing blended
|
|
with whitespace) are always indexed.
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>trim_none</term>
|
|
<listitem><para>Index the entire token.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>trim_head</term>
|
|
<listitem><para>Trim heading blended characters, and index the resulting token.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>trim_tail</term>
|
|
<listitem><para>Trim trailing blended characters, and index the resulting token.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>trim_both</term>
|
|
<listitem><para>Trim both heading and trailing blended characters, and index the resulting token.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>skip_pure</term>
|
|
<listitem><para>Do not index the token if it's purely blended, that is, consists of blended characters only.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
Returning to the "@dude!" example above, setting <option>blend_mode = trim_head,
|
|
trim_tail</option> will result in two tokens being indexed, "@dude" and "dude!".
|
|
In this particular example, <option>trim_both</option> would have no effect,
|
|
because trimming both blended characters results in "dude" which is already
|
|
indexed as a regular keyword. Indexing "@U.S.A." with <option>trim_both</option>
|
|
(and assuming that dot is blended too) would result in "U.S.A" being indexed.
|
|
Last but not least, <option>skip_pure</option> enables you to fully ignore
|
|
sequences of blended characters only. For example, "one @@@ two" would be
|
|
indexed exactly as "one two", and match that as a phrase. That is not the case
|
|
by default because a fully blended token gets indexed and offsets the second
|
|
keyword position.
|
|
</para>
|
|
<para>
|
|
Default behavior is to index the entire token, equivalent to
|
|
<option>blend_mode = trim_none</option>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
blend_mode = trim_tail, skip_pure
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-mem-limit"><title>rt_mem_limit</title>
|
|
<para>
|
|
RAM chunk size limit.
|
|
Optional, default is empty.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
RT index keeps some data in memory (so-called RAM chunk) and
|
|
also maintains a number of on-disk indexes (so-called disk chunks).
|
|
This directive lets you control the RAM chunk size. Once there's
|
|
too much data to keep in RAM, RT index will flush it to disk,
|
|
activate a newly created disk chunk, and reset the RAM chunk.
|
|
</para>
|
|
<para>
|
|
The limit is pretty strict; RT index should never allocate more
|
|
memory than it's limited to. The memory is not preallocated either,
|
|
hence, specifying a 512 MB limit and only inserting 3 MB of data
|
|
should result in allocating 3 MB, not 512 MB.
|
|
</para>
|
|
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_mem_limit = 512M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-field"><title>rt_field</title>
|
|
<para>
|
|
Full-text field declaration.
|
|
Multi-value, mandatory.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Full-text fields to be indexed are declared using <option>rt_field</option>
|
|
directive. The names must be unique. The order is preserved, so field values
|
|
in INSERT statements without an explicit list of inserted columns will have to be
|
|
in the same order as configured.
|
|
</para>
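<para>
For illustration, here is a minimal sketch of an RT index declaration and
a matching SphinxQL statement (the index name and path are hypothetical):
<programlisting>
# in sphinx.conf
index rt_minimal
{
    type         = rt
    path         = /var/data/rt_minimal
    rt_field     = title
    rt_field     = content
    rt_attr_uint = gid
}

// via SphinxQL; without an explicit column list, values must follow
// the configured order: id first, then the fields and attributes as declared
INSERT INTO rt_minimal VALUES ( 1, 'hello', 'world', 123 );
</programlisting>
</para>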
|
|
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_field = author
|
|
rt_field = title
|
|
rt_field = content
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-attr-uint"><title>rt_attr_uint</title>
|
|
<para>
|
|
Unsigned integer attribute declaration.
|
|
Multi-value (an arbitrary number of attributes is allowed), optional.
|
|
Declares an unsigned 32-bit attribute.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_attr_uint = gid
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-attr-bigint"><title>rt_attr_bigint</title>
|
|
<para>
|
|
BIGINT attribute declaration.
|
|
Multi-value (an arbitrary number of attributes is allowed), optional.
|
|
Declares a signed 64-bit attribute.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_attr_bigint = guid
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-attr-float"><title>rt_attr_float</title>
|
|
<para>
|
|
Floating point attribute declaration.
|
|
Multi-value (an arbitrary number of attributes is allowed), optional.
|
|
Declares a single precision, 32-bit IEEE 754 format float attribute.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_attr_float = gpa
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-attr-timestamp"><title>rt_attr_timestamp</title>
|
|
<para>
|
|
Timestamp attribute declaration.
|
|
Multi-value (an arbitrary number of attributes is allowed), optional.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_attr_timestamp = date_added
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-attr-string"><title>rt_attr_string</title>
|
|
<para>
|
|
String attribute declaration.
|
|
Multi-value (an arbitrary number of attributes is allowed), optional.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_attr_string = author
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
</sect1>
|
|
<sect1 id="confgroup-indexer"><title><filename>indexer</filename> program configuration options</title>
|
|
|
|
|
|
<sect2 id="conf-mem-limit"><title>mem_limit</title>
|
|
<para>
|
|
Indexing RAM usage limit.
|
|
Optional, default is 32M.
|
|
</para>
|
|
<para>
|
|
Enforced memory usage limit that the <filename>indexer</filename>
|
|
will not go above. Can be specified in bytes, or kilobytes
|
|
(using K postfix), or megabytes (using M postfix); see the example.
|
|
This limit will be automatically raised if set to an extremely low value
|
|
causing I/O buffers to be less than 8 KB; the exact lower bound
|
|
for that depends on the indexed data size. If the buffers are
|
|
less than 256 KB, a warning will be produced.
|
|
</para>
|
|
<para>
|
|
Maximum possible limit is 2047M. Too low values can hurt
|
|
indexing speed, but 256M to 1024M should be enough for most
|
|
if not all datasets. Setting this value too high can cause
|
|
SQL server timeouts. During the document collection phase,
|
|
there will be periods when the memory buffer is partially
|
|
sorted and no communication with the database is performed;
|
|
and the database server can timeout. You can resolve that
|
|
either by raising timeouts on SQL server side or by lowering
|
|
<code>mem_limit</code>.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mem_limit = 256M
|
|
# mem_limit = 262144K # same, but in KB
|
|
# mem_limit = 268435456 # same, but in bytes
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-iops"><title>max_iops</title>
|
|
<para>
|
|
Maximum I/O operations per second, for I/O throttling.
|
|
Optional, default is 0 (unlimited).
|
|
</para>
|
|
<para>
|
|
I/O throttling related option.
|
|
It limits maximum count of I/O operations (reads or writes) per any given second.
|
|
A value of 0 means that no limit is imposed.
|
|
</para>
|
|
<para>
|
|
<filename>indexer</filename> can cause bursts of intensive disk I/O during
|
|
indexing, and it might be desirable to limit its disk activity (and keep something
|
|
for other programs running on the same machine, such as <filename>searchd</filename>).
|
|
I/O throttling helps to do that. It works by enforcing a minimum guaranteed
|
|
delay between subsequent disk I/O operations performed by <filename>indexer</filename>.
|
|
Modern SATA HDDs are able to perform up to 70-100+ I/O operations per second
|
|
(that's mostly limited by disk heads seek time). Limiting indexing I/O
|
|
to a fraction of that can help reduce search performance degradation
|
|
caused by indexing.
|
|
</para>
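<para>
For example, max_iops = 40 enforces a delay of at least 1/40 of a second
(25 ms) between any two subsequent disk operations performed by
<filename>indexer</filename>.
</para>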
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_iops = 40
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-iosize"><title>max_iosize</title>
|
|
<para>
|
|
Maximum allowed I/O operation size, in bytes, for I/O throttling.
|
|
Optional, default is 0 (unlimited).
|
|
</para>
|
|
<para>
|
|
I/O throttling related option. It limits maximum file I/O operation
|
|
(read or write) size for all operations performed by <filename>indexer</filename>.
|
|
A value of 0 means that no limit is imposed.
|
|
Reads or writes that are bigger than the limit
|
|
will be split into several smaller operations, and counted as several operations
by the <link linkend="conf-max-iops">max_iops</link> setting. At the time of this
|
|
writing, all I/O calls should be under 256 KB (default internal buffer size)
|
|
anyway, so <code>max_iosize</code> values higher than 256 KB should not affect anything.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_iosize = 1048576
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-xmlpipe2-field"><title>max_xmlpipe2_field</title>
|
|
<para>
|
|
Maximum allowed field size for XMLpipe2 source type, bytes.
|
|
Optional, default is 2 MB.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_xmlpipe2_field = 8M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-write-buffer"><title>write_buffer</title>
|
|
<para>
|
|
Write buffer size, bytes.
|
|
Optional, default is 1 MB.
|
|
</para>
|
|
<para>
|
|
Write buffers are used to write both temporary and final index
|
|
files when indexing. Larger buffers reduce the number of required
|
|
disk writes. Memory for the buffers is allocated in addition to
|
|
<link linkend="conf-mem-limit">mem_limit</link>. Note that several
|
|
(currently up to 4) buffers for different files will be allocated,
|
|
proportionally increasing the RAM usage.
|
|
</para>
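<para>
For example, with write_buffer = 4M and all 4 buffers allocated, up to
4 x 4 = 16 MB would be used for write buffering, in addition to
<link linkend="conf-mem-limit">mem_limit</link>.
</para>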
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
write_buffer = 4M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-file-field-buffer"><title>max_file_field_buffer</title>
|
|
<para>
|
|
Maximum file field adaptive buffer size, bytes.
|
|
Optional, default is 8 MB, minimum is 1 MB.
|
|
</para>
|
|
<para>
|
|
File field buffer is used to load files referred to from
|
|
<link linkend="conf-sql-file-field">sql_file_field</link> columns.
|
|
This buffer is adaptive, starting at 1 MB at first allocation,
|
|
and growing in 2x steps until either file contents can be loaded,
|
|
or maximum buffer size, specified by <option>max_file_field_buffer</option>
|
|
directive, is reached.
|
|
</para>
|
|
<para>
|
|
Thus, if no file fields are specified, no buffer
|
|
is allocated at all. If all files loaded during indexing are under
|
|
(for example) 2 MB in size, but <option>max_file_field_buffer</option>
|
|
value is 128 MB, peak buffer usage would still be only 2 MB. However,
|
|
files over 128 MB would be entirely skipped.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_file_field_buffer = 128M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
<sect2 id="conf-on-file-field-error"><title>on_file_field_error</title>
|
|
<para>
|
|
How to handle IO errors in file fields.
|
|
Optional, default is <code>ignore_field</code>.
|
|
Introduced in version 2.0.2-beta.
|
|
</para>
|
|
<para>
|
|
When there is a problem indexing a file referenced by a file field
|
|
(<xref linkend="conf-sql-file-field"/>), <filename>indexer</filename> can
|
|
either index the document, assuming empty content in this particular field,
|
|
or skip the document, or fail indexing entirely. <option>on_file_field_error</option>
|
|
directive controls that behavior. The values it takes are:
|
|
<itemizedlist>
|
|
<listitem><code>ignore_field</code>, index the current document without field;</listitem>
|
|
<listitem><code>skip_document</code>, skip the current document but continue indexing;</listitem>
|
|
<listitem><code>fail_index</code>, fail indexing with an error message.</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
The problems that can arise are: open error, size error (file too big),
|
|
and data read error. Warning messages on any problem will be given at all times,
|
|
regardless of the phase and the <code>on_file_field_error</code> setting.
|
|
</para>
|
|
<para>
|
|
Note that with <option>on_file_field_error = skip_document</option>
|
|
documents will only be ignored if problems are detected during
|
|
an early check phase, and <b>not</b> during the actual file parsing
|
|
phase. <filename>indexer</filename> will open every referenced file
|
|
and check its size before doing any work, and then open it again
|
|
when doing actual parsing work. So in case a file goes away
|
|
between these two open attempts, the document will still be
|
|
indexed.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
on_file_field_error = skip_document
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
<sect1 id="confgroup-searchd"><title><filename>searchd</filename> program configuration options</title>
|
|
|
|
|
|
<sect2 id="conf-listen"><title>listen</title>
|
|
<para>
|
|
This setting lets you specify IP address and port, or Unix-domain
|
|
socket path, that <code>searchd</code> will listen on.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
The informal grammar for <code>listen</code> setting is:
|
|
<programlisting>
|
|
listen = ( address ":" port | port | path ) [ ":" protocol ]
|
|
</programlisting>
|
|
I.e. you can specify either an IP address (or hostname) and port
|
|
number, or just a port number, or Unix socket path. If you specify
|
|
port number but not the address, <code>searchd</code> will listen on
|
|
all network interfaces. Unix path is identified by a leading slash.
|
|
</para>
|
|
<para>
|
|
Starting with version 0.9.9-rc2, you can also specify a protocol
|
|
handler (listener) to be used for connections on this socket.
|
|
Supported protocol values are 'sphinx' (Sphinx 0.9.x API protocol)
|
|
and 'mysql41' (MySQL protocol used since 4.1 up to at least 5.1).
|
|
More details on MySQL protocol support can be found in
|
|
<xref linkend="sphinxql"/> section.
|
|
</para>
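<para>
For instance, with a <code>listen = localhost:9306:mysql41</code> line
in place, you should be able to connect using the regular MySQL client
(the index name 'test1' is hypothetical):
<programlisting>
$ mysql -h 127.0.0.1 -P 9306
mysql> SELECT * FROM test1 WHERE MATCH('hello');
</programlisting>
</para>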
|
|
<bridgehead>Examples:</bridgehead>
|
|
<programlisting>
|
|
listen = localhost
|
|
listen = localhost:5000
|
|
listen = 192.168.0.1:5000
|
|
listen = /var/run/sphinx.s
|
|
listen = 9312
|
|
listen = localhost:9306:mysql41
|
|
</programlisting>
|
|
<para>
|
|
There can be multiple listen directives, <code>searchd</code> will
|
|
listen for client connections on all specified ports and sockets. If
|
|
no <code>listen</code> directives are found then the server will listen
|
|
on all available interfaces using the default SphinxAPI port 9312.
|
|
Starting with 1.10-beta, it will also listen on default SphinxQL
|
|
port 9306. Both port numbers are assigned by IANA (see
|
|
<ulink url="http://www.iana.org/assignments/port-numbers">http://www.iana.org/assignments/port-numbers</ulink>
|
|
for details) and should therefore be available.
|
|
</para>
|
|
<para>
|
|
Unix-domain sockets are not supported on Windows.
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-address"><title>address</title>
|
|
<para>
|
|
Interface IP address to bind on.
|
|
Optional, default is 0.0.0.0 (ie. listen on all interfaces).
|
|
<b>DEPRECATED</b>, use <link linkend="conf-listen">listen</link> instead.
|
|
</para>
|
|
<para>
|
|
<code>address</code> setting lets you specify which network interface
|
|
<filename>searchd</filename> will bind to, listen on, and accept incoming
|
|
network connections on. The default value is 0.0.0.0 which means to listen
|
|
on all interfaces. At this time, you can <b>not</b> specify multiple interfaces.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
address = 192.168.0.1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-port"><title>port</title>
|
|
<para>
|
|
<filename>searchd</filename> TCP port number.
|
|
<b>DEPRECATED</b>, use <link linkend="conf-listen">listen</link> instead.
|
|
Used to be mandatory. Default port number is 9312.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
port = 9312
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-log"><title>log</title>
|
|
<para>
|
|
Log file name.
|
|
Optional, default is 'searchd.log'.
|
|
All <filename>searchd</filename> run time events will be logged in this file.
|
|
</para>
|
|
<para>
|
|
You can also use 'syslog' as the file name; in this case, the events will be sent to the syslog daemon.
To use the syslog option, Sphinx must be configured with '--with-syslog' at build time.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
log = /var/log/searchd.log
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-query-log"><title>query_log</title>
|
|
<para>
|
|
Query log file name.
|
|
Optional, default is empty (do not log queries).
|
|
All search queries will be logged in this file. The format is described in <xref linkend="query-log-format"/>.
|
|
</para>
|
|
<para>
|
|
In case of 'plain' format, you can use 'syslog' as the path to the log file.
In this case, all search queries will be sent to the syslog daemon with LOG_INFO priority,
prefixed with '[query]' instead of a timestamp.
To use the syslog option, Sphinx must be configured with '--with-syslog' at build time.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
query_log = /var/log/query.log
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-query-log-format"><title>query_log_format</title>
|
|
<para>
|
|
Query log format.
|
|
Optional, allowed values are 'plain' and 'sphinxql', default is 'plain'.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Starting with version 2.0.1-beta, two different log formats are supported.
|
|
The default one logs queries in a custom text format. The new one logs
|
|
valid SphinxQL statements. This directive allows you to switch between the two
|
|
formats on search daemon startup. The log format can also be altered
|
|
on the fly, using <code>SET GLOBAL query_log_format=sphinxql</code> syntax.
|
|
Refer to <xref linkend="query-log-format"/> for more discussion and format
|
|
details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
query_log_format = sphinxql
|
|
</programlisting>
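<para>
The same switch can also be performed at runtime over SphinxQL, without
restarting the daemon:
</para>
<programlisting>
SET GLOBAL query_log_format = sphinxql;
</programlisting>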
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-read-timeout"><title>read_timeout</title>
|
|
<para>
|
|
Network client request read timeout, in seconds.
|
|
Optional, default is 5 seconds.
|
|
<filename>searchd</filename> will forcibly close the client connections which fail to send a query within this timeout.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
read_timeout = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-client-timeout"><title>client_timeout</title>
|
|
<para>
|
|
Maximum time to wait between requests (in seconds) when using
|
|
persistent connections. Optional, default is five minutes.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
client_timeout = 3600
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-children"><title>max_children</title>
|
|
<para>
|
|
Maximum amount of children to fork (or in other words, concurrent searches to run in parallel).
|
|
Optional, default is 0 (unlimited).
|
|
</para>
|
|
<para>
|
|
Useful to control server load. There will be no more than this many concurrent
searches running at any given time. When the limit is reached, additional incoming
clients are dismissed with a temporary failure (SEARCHD_RETRY) status code
and a message stating that the server is maxed out.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_children = 10
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-pid-file"><title>pid_file</title>
|
|
<para>
|
|
<filename>searchd</filename> process ID file name.
|
|
Mandatory.
|
|
</para>
|
|
<para>
|
|
PID file will be re-created (and locked) on startup. It will contain
the head daemon process ID while the daemon is running, and it will be unlinked
on daemon shutdown. It's mandatory because Sphinx uses it internally
for a number of things: to check whether there already is a running instance
of <filename>searchd</filename>; to stop <filename>searchd</filename>;
to notify it that it should rotate the indexes. It can also be used by
various external automation scripts.
|
|
</para>
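<para>
For instance, an automation script could request index rotation by
signaling the process ID stored in the file (a sketch assuming the
example path below):
</para>
<programlisting>
kill -HUP `cat /var/run/searchd.pid`
</programlisting>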
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
pid_file = /var/run/searchd.pid
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-matches"><title>max_matches</title>
|
|
<para>
|
|
Maximum amount of matches that the daemon keeps in RAM for each index and can return to the client.
|
|
Optional, default is 1000.
|
|
</para>
|
|
<para>
|
|
Introduced in order to control and limit RAM usage, the <code>max_matches</code>
setting defines how many matches will be kept in RAM while searching each index.
Every match found will still be <emphasis>processed</emphasis>; but only the
best N of them will be kept in memory and returned to the client in the end.
Assume that the index contains 2,000,000 matches for the query. You rarely
(if ever) need to retrieve <emphasis>all</emphasis> of them. Rather, you need
to scan all of them, but only choose the "best" ones, at most, say, 500, by some criteria
|
|
(ie. sorted by relevance, or price, or anything else), and display those
|
|
500 matches to the end user in pages of 20 to 100 matches. And tracking
|
|
only the best 500 matches is much more RAM and CPU efficient than keeping
|
|
all 2,000,000 matches, sorting them, and then discarding everything but
|
|
the first 20 needed to display the search results page. <code>max_matches</code>
|
|
controls N in that "best N" amount.
|
|
</para>
|
|
<para>
|
|
This parameter noticeably affects per-query RAM and CPU usage.
|
|
Values of 1,000 to 10,000 are generally fine, but higher limits must be
|
|
used with care. Recklessly raising <code>max_matches</code> to 1,000,000
|
|
means that <filename>searchd</filename> will have to allocate and
|
|
initialize 1-million-entry matches buffer for <emphasis>every</emphasis>
|
|
query. That will obviously increase per-query RAM usage, and in some cases
|
|
can also noticeably impact performance.
|
|
</para>
|
|
<para>
|
|
<b>CAVEAT EMPTOR!</b> Note that there also is <b>another</b> place where this limit
|
|
is enforced. <code>max_matches</code> can be decreased on the fly
|
|
through the <link linkend="api-func-setlimits">corresponding API call</link>,
|
|
and the default value in the API is <b>also</b> set to 1,000. So in order
|
|
to retrieve more than 1,000 matches in your application, you will have
to change the configuration file, restart searchd, and set a proper limit
in the <link linkend="api-func-setlimits">SetLimits()</link> call.
|
|
Also note that you can not set the value in the API higher than the value
|
|
in the .conf file. This is prohibited in order to have some protection
|
|
against malicious and/or malformed requests.
|
|
</para>
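<para>
For instance, to actually retrieve up to 2,000 matches over SphinxQL,
the limit must be raised both in the configuration file and in the query
itself. A minimal sketch, assuming <code>max_matches = 2000</code> in
the config and an index named <code>test1</code>:
</para>
<programlisting>
SELECT * FROM test1 WHERE MATCH('example')
LIMIT 0,2000 OPTION max_matches=2000;
</programlisting>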
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_matches = 10000
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-seamless-rotate"><title>seamless_rotate</title>
|
|
<para>
|
|
Prevents <filename>searchd</filename> stalls while rotating indexes with huge amounts of data to precache.
|
|
Optional, default is 1 (enable seamless rotation).
|
|
</para>
|
|
<para>
|
|
Indexes may contain some data that needs to be precached in RAM.
|
|
At the moment, <filename>.spa</filename>, <filename>.spi</filename> and
|
|
<filename>.spm</filename> files are fully precached (they contain attribute data,
keyword index, and MVA data, respectively.)
|
|
Without seamless rotate, rotating an index tries to use as little RAM
|
|
as possible and works as follows:
|
|
<orderedlist>
|
|
<listitem><para>new queries are temporarily rejected (with "retry" error code);</para></listitem>
|
|
<listitem><para><filename>searchd</filename> waits for all currently running queries to finish;</para></listitem>
|
|
<listitem><para>old index is deallocated and its files are renamed;</para></listitem>
|
|
<listitem><para>new index files are renamed and required RAM is allocated;</para></listitem>
|
|
<listitem><para>new index attribute and dictionary data is preloaded to RAM;</para></listitem>
|
|
<listitem><para><filename>searchd</filename> resumes serving queries from new index.</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<para>
|
|
However, if there's a lot of attribute or dictionary data, then the preloading step
could take noticeable time - up to several minutes in case of preloading 1-5+ GB files.
|
|
</para>
|
|
<para>
|
|
With seamless rotate enabled, rotation works as follows:
|
|
<orderedlist>
|
|
<listitem><para>new index RAM storage is allocated;</para></listitem>
|
|
<listitem><para>new index attribute and dictionary data is asynchronously preloaded to RAM;</para></listitem>
|
|
<listitem><para>on success, old index is deallocated and both indexes' files are renamed;</para></listitem>
|
|
<listitem><para>on failure, new index is deallocated;</para></listitem>
|
|
<listitem><para>at any given moment, queries are served either from old or new index copy.</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<para>
|
|
Seamless rotate comes at the cost of higher <emphasis role="bold">peak</emphasis>
|
|
memory usage during the rotation (because both old and new copies of
|
|
<filename>.spa/.spi/.spm</filename> data need to be in RAM while
|
|
preloading new copy). Average usage stays the same.
|
|
</para>
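<para>
Rotation itself is requested externally. For instance, rebuilding all
indexes and telling the running daemon to pick them up (a sketch assuming
the default configuration file location):
</para>
<programlisting>
indexer --all --rotate
</programlisting>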
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
seamless_rotate = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-preopen-indexes"><title>preopen_indexes</title>
|
|
<para>
|
|
Whether to forcibly preopen all indexes on startup.
|
|
Optional, default is 1 (preopen everything).
|
|
</para>
|
|
<para>
|
|
Starting with 2.0.1-beta, the default value for this
|
|
option is now 1 (forcibly preopen all indexes). In prior
|
|
versions, it used to be 0 (use per-index settings).
|
|
</para>
|
|
<para>
|
|
When set to 1, this directive overrides and enforces
|
|
<link linkend="conf-preopen">preopen</link> on all indexes.
|
|
They will be preopened, no matter what the per-index
<code>preopen</code> setting is. When set to 0, per-index
|
|
settings can take effect. (And they default to 0.)
|
|
</para>
|
|
<para>
|
|
Pre-opened indexes avoid races between search queries
|
|
and rotations that can cause queries to fail occasionally.
|
|
They also make <filename>searchd</filename> use more file
|
|
handles. In most scenarios it's therefore preferred and
|
|
recommended to preopen indexes.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
preopen_indexes = 1
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-unlink-old"><title>unlink_old</title>
|
|
<para>
|
|
Whether to unlink .old index copies on successful rotation.
|
|
Optional, default is 1 (do unlink).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
unlink_old = 0
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
<sect2 id="conf-attr-flush-period"><title>attr_flush_period</title>
|
|
<para>
|
|
When calling <code>UpdateAttributes()</code> to update document attributes in
|
|
real-time, changes are first written to the in-memory copy of attributes
|
|
(<option>docinfo</option> must be set to <option>extern</option>).
|
|
Then, once <filename>searchd</filename> shuts down normally (via <code>SIGTERM</code>
|
|
being sent), the changes are written to disk.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>Starting with 0.9.9-rc1, it is possible to tell <filename>searchd</filename>
|
|
to periodically write these changes back to disk, to avoid them being lost. The interval
between those writes is set with <option>attr_flush_period</option>, in seconds.
|
|
</para>
|
|
<para>It defaults to 0, which disables the periodic flushing, but flushing will
|
|
still occur at normal shut-down.
|
|
</para>
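<para>
For example, an attribute update issued over SphinxQL (a sketch assuming
an index named <code>test1</code> with an integer attribute
<code>group_id</code>) is initially applied to the in-memory copy only,
and gets persisted either by the periodic flush or at shutdown:
</para>
<programlisting>
UPDATE test1 SET group_id = 123 WHERE id = 1;
</programlisting>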
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
attr_flush_period = 900 # persist updates to disk every 15 minutes
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-ondisk-dict-default"><title>ondisk_dict_default</title>
|
|
<para>
|
|
Instance-wide defaults for <link linkend="conf-ondisk-dict">ondisk_dict</link> directive.
|
|
Optional, default is 0 (precache dictionaries in RAM).
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This directive lets you specify the default value of
|
|
<link linkend="conf-ondisk-dict">ondisk_dict</link> for all the indexes
|
|
served by this copy of <filename>searchd</filename>. The per-index directive
takes precedence, and will overwrite this instance-wide default value,
allowing for fine-grained control.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
ondisk_dict_default = 1 # keep all dictionaries on disk
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-packet-size"><title>max_packet_size</title>
|
|
<para>
|
|
Maximum allowed network packet size.
|
|
Limits both query packets from clients, and response packets from remote agents in distributed environment.
|
|
Only used for internal sanity checks, does not directly affect RAM use or performance.
|
|
Optional, default is 8M.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_packet_size = 32M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mva-updates-pool"><title>mva_updates_pool</title>
|
|
<para>
|
|
Shared pool size for in-memory MVA updates storage.
|
|
Optional, default size is 1M.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<para>
|
|
This setting controls the size of the shared storage pool for updated MVA values.
|
|
Specifying 0 for the size disables MVA updates altogether. Once the pool size limit
is hit, MVA update attempts will result in an error. However, updates on regular
(scalar) attributes will still work. Due to internal technical difficulties,
currently it is <b>not</b> possible to store (flush) <b>any</b> updates on indexes
where MVAs were updated; though this might be implemented in the future.
In the meantime, MVA updates are intended to be used as a measure to quickly
catch up with the latest changes in the database until the next index rebuild;
|
|
not as a persistent storage mechanism.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mva_updates_pool = 16M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-crash-log-path"><title>crash_log_path</title>
|
|
<para>
|
|
Deprecated debugging setting, path (formerly prefix) for crash log files.
|
|
Introduced in version 0.9.9-rc1. Deprecated in version 2.0.1-beta,
|
|
as crash debugging information now gets logged into searchd.log
|
|
in text form, and separate binary crash logs are no longer needed.
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-filters"><title>max_filters</title>
|
|
<para>
|
|
Maximum allowed per-query filter count.
|
|
Only used for internal sanity checks, does not directly affect RAM use or performance.
|
|
Optional, default is 256.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_filters = 1024
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-filter-values"><title>max_filter_values</title>
|
|
<para>
|
|
Maximum allowed per-filter values count.
|
|
Only used for internal sanity checks, does not directly affect RAM use or performance.
|
|
Optional, default is 4096.
|
|
Introduced in version 0.9.9-rc1.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_filter_values = 16384
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-listen-backlog"><title>listen_backlog</title>
|
|
<para>
|
|
TCP listen backlog.
|
|
Optional, default is 5.
|
|
</para>
|
|
<para>
|
|
Windows builds currently (as of 0.9.9) can only process the requests
|
|
one by one. Concurrent requests will be enqueued by the TCP stack
|
|
on OS level, and requests that can not be enqueued will immediately
|
|
fail with "connection refused" message. listen_backlog directive
|
|
controls the length of the connection queue. Non-Windows builds
|
|
should work fine with the default value.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
listen_backlog = 20
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-read-buffer"><title>read_buffer</title>
|
|
<para>
|
|
Per-keyword read buffer size.
|
|
Optional, default is 256K.
|
|
</para>
|
|
<para>
|
|
For every keyword occurrence in every search query, there are
|
|
two associated read buffers (one for document list and one for
|
|
hit list). This setting lets you control their sizes, increasing
|
|
per-query RAM use, but possibly decreasing IO time.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
read_buffer = 1M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-read-unhinted"><title>read_unhinted</title>
|
|
<para>
|
|
Unhinted read size.
|
|
Optional, default is 32K.
|
|
</para>
|
|
<para>
|
|
When querying, some reads know in advance exactly how much data
|
|
is there to be read, but some currently do not. Most prominently,
|
|
hit list size is not currently known in advance. This setting
lets you control how much data to read in such cases. It will
impact hit list IO time, reducing it for lists larger than
the unhinted read size, but raising it for smaller lists. It will
<b>not</b> affect RAM use because the read buffer will already be
allocated. So it should not be greater than read_buffer.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
read_unhinted = 32K
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-max-batch-queries"><title>max_batch_queries</title>
|
|
<para>
|
|
Limits the amount of queries per batch.
|
|
Optional, default is 32.
|
|
</para>
|
|
<para>
|
|
Makes searchd perform a sanity check of the number of queries
|
|
submitted in a single batch when using <link linkend="multi-queries">multi-queries</link>.
|
|
Set it to 0 to skip the check.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
max_batch_queries = 256
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-subtree-docs-cache"><title>subtree_docs_cache</title>
|
|
<para>
|
|
Max common subtree document cache size, per-query.
|
|
Optional, default is 0 (disabled).
|
|
</para>
|
|
<para>
|
|
Limits RAM usage of a common subtree optimizer (see <xref linkend="multi-queries"/>).
|
|
At most this much RAM will be spent to cache document entries per each query.
|
|
Setting the limit to 0 disables the optimizer.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
subtree_docs_cache = 8M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-subtree-hits-cache"><title>subtree_hits_cache</title>
|
|
<para>
|
|
Max common subtree hit cache size, per-query.
|
|
Optional, default is 0 (disabled).
|
|
</para>
|
|
<para>
|
|
Limits RAM usage of a common subtree optimizer (see <xref linkend="multi-queries"/>).
|
|
At most this much RAM will be spent to cache keyword occurrences (hits) per each query.
|
|
Setting the limit to 0 disables the optimizer.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
subtree_hits_cache = 16M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-workers"><title>workers</title>
|
|
<para>
|
|
Multi-processing mode (MPM).
|
|
Optional; allowed values are none, fork, prefork, and threads.
|
|
Default is fork on Unix based systems, and threads on Windows.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Lets you choose how <filename>searchd</filename> processes multiple
|
|
concurrent requests. The possible values are:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>none</term>
|
|
<listitem><para>All requests will be handled serially, one-by-one.
|
|
Prior to 1.10-beta, this was the only mode available on Windows.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>fork</term>
|
|
<listitem><para>A new child process will be forked to handle every
|
|
incoming request. Historically, this is the default mode.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>prefork</term>
|
|
<listitem><para>On startup, <filename>searchd</filename> will pre-fork
|
|
a number of worker processes, and pass the incoming requests
|
|
to one of those children.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>threads</term>
|
|
<listitem><para>A new thread will be created to handle every
|
|
incoming request. This is the only mode compatible with
|
|
RT indexing backend.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Historically, <filename>searchd</filename> used a fork-based model,
which generally performs OK but spends a noticeable amount of CPU
in the fork() system call when there's a high number of (tiny) requests
|
|
per second. Prefork mode was implemented to alleviate that; with
|
|
prefork, worker processes are basically only created on startup
|
|
and re-created on index rotation, somewhat reducing fork() call
|
|
pressure.
|
|
</para>
|
|
<para>
|
|
Threads mode was implemented along with RT backend and is required
|
|
to use RT indexes. (Regular disk-based indexes work in all the
|
|
available modes.)
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
workers = threads
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-dist-threads"><title>dist_threads</title>
|
|
<para>
|
|
Max local worker threads to use for parallelizable requests (searching a distributed index; building a batch of snippets).
|
|
Optional, default is 0, which means to disable in-request parallelism.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
A distributed index can include several local indexes. <option>dist_threads</option>
|
|
lets you easily utilize multiple CPUs/cores for that (previously existing
|
|
alternative was to specify the indexes as remote agents, pointing searchd
|
|
to itself and paying some network overheads).
|
|
</para>
|
|
<para>
|
|
When set to a value N greater than 1, this directive will create up to
|
|
N threads for every query, and schedule the specific searches within these
|
|
threads. For example, if there are 7 local indexes to search and dist_threads
|
|
is set to 2, then 2 parallel threads would be created: one that sequentially
|
|
searches 4 indexes, and another one that searches the other 3 indexes.
|
|
</para>
|
|
<para>
|
|
In case of CPU bound workload, setting <option>dist_threads</option>
|
|
to 1x the number of cores is advised (creating more threads than cores
|
|
will not improve query time). In case of mixed CPU/disk bound workload
|
|
it might sometimes make sense to use more (so that all cores could be
|
|
utilized even when there are threads that wait for I/O completion).
|
|
</para>
|
|
<para>
|
|
Note that <option>dist_threads</option> does <b>not</b> require
|
|
threads MPM. You can perfectly use it with fork or prefork MPMs too.
|
|
</para>
|
|
<para>
|
|
Starting with version 2.0.1-beta, building a batch of snippets
|
|
with <option>load_files</option> flag enabled can also be parallelized.
|
|
Up to <option>dist_threads</option> threads will be created to process
|
|
those files. That speeds up snippet extraction when the total amount
|
|
of document data to process is significant (hundreds of megabytes).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
index dist_test
|
|
{
|
|
type = distributed
|
|
local = chunk1
|
|
local = chunk2
|
|
local = chunk3
|
|
local = chunk4
|
|
}
|
|
|
|
# ...
|
|
|
|
dist_threads = 4
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-binlog-path"><title>binlog_path</title>
|
|
<para>
|
|
Binary log (aka transaction log) files path.
|
|
Optional, default is build-time configured data directory.
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
Binary logs are used for crash recovery of RT index data that
|
|
would otherwise only be stored in RAM. When logging is enabled,
|
|
every transaction COMMIT-ted into RT index gets written into
|
|
a log file. Logs are then automatically replayed on startup
|
|
after an unclean shutdown, recovering the logged changes.
|
|
</para>
|
|
<para>
|
|
<option>binlog_path</option> directive specifies the binary log
|
|
files location. It should contain just the path; <option>searchd</option>
|
|
will create and unlink multiple binlog.* files in that path as necessary
|
|
(binlog data, metadata, and lock files, etc).
|
|
</para>
|
|
<para>
|
|
Empty value disables binary logging. That improves performance,
|
|
but puts RT index data at risk.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
binlog_path = # disable logging
|
|
binlog_path = /var/data # /var/data/binlog.001 etc will be created
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-binlog-flush"><title>binlog_flush</title>
|
|
<para>
|
|
Binary log transaction flush/sync mode.
|
|
Optional, default is 2 (flush every transaction, sync every second).
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
This directive controls how frequently the binary log will be flushed
|
|
to OS and synced to disk. Three modes are supported:
|
|
<itemizedlist>
|
|
<listitem><para>0, flush and sync every second. Best performance,
|
|
but up to 1 second worth of committed transactions can be lost
|
|
both on daemon crash, or OS/hardware crash.
|
|
</para></listitem>
|
|
<listitem><para>1, flush and sync every transaction. Worst performance,
|
|
but every committed transaction data is guaranteed to be saved.
|
|
</para></listitem>
|
|
<listitem><para>2, flush every transaction, sync every second.
|
|
Good performance, and every committed transaction is guaranteed
|
|
to be saved in case of daemon crash. However, in case of OS/hardware
|
|
crash up to 1 second worth of committed transactions can be lost.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
For those familiar with MySQL and InnoDB, this directive is entirely
|
|
similar to <option>innodb_flush_log_at_trx_commit</option>. In most
|
|
cases, the default hybrid mode 2 provides a nice balance of speed
|
|
and safety, with full RT index data protection against daemon crashes,
|
|
and some protection against hardware ones.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
binlog_flush = 1 # ultimate safety, low speed
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-binlog-max-log-size"><title>binlog_max_log_size</title>
|
|
<para>
|
|
Maximum binary log file size.
|
|
Optional, default is 0 (do not reopen binlog file based on size).
|
|
Introduced in version 1.10-beta.
|
|
</para>
|
|
<para>
|
|
A new binlog file will be forcibly opened once the current binlog file
|
|
reaches this limit. This achieves a finer granularity of logs and can yield
|
|
more efficient binlog disk usage under certain borderline workloads.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
binlog_max_log_size = 16M
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-collation-server"><title>collation_server</title>
|
|
<para>
|
|
Default server collation.
|
|
Optional, default is libc_ci.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Specifies the default collation used for incoming requests.
|
|
The collation can be overridden on a per-query basis.
|
|
Refer to <xref linkend="collations"/> section for the list of available collations and other details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
collation_server = utf8_ci
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-collation-libc-locale"><title>collation_libc_locale</title>
|
|
<para>
|
|
Server libc locale.
|
|
Optional, default is C.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Specifies the libc locale, affecting the libc-based collations.
|
|
Refer to <xref linkend="collations"/> section for the details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
collation_libc_locale = fr_FR
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-plugin-dir"><title>plugin_dir</title>
|
|
<para>
|
|
Trusted location for the dynamic libraries (UDFs).
|
|
Optional, default is empty (no location).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Specifies the trusted directory from which the
|
|
<link linkend="udf">UDF libraries</link> can be loaded. Requires
|
|
<link linkend="conf-workers">workers = thread</link> to take effect.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
workers = threads
|
|
plugin_dir = /usr/local/sphinx/lib
|
|
</programlisting>
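<para>
Once the trusted directory is configured, UDF libraries located there
can be attached and detached at runtime over SphinxQL (a sketch with
hypothetical library and function names):
</para>
<programlisting>
CREATE FUNCTION myfunc RETURNS INT SONAME 'udfexample.so';
DROP FUNCTION myfunc;
</programlisting>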
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-mysql-version-string"><title>mysql_version_string</title>
|
|
<para>
|
|
A server version string to return via MySQL protocol.
|
|
Optional, default is empty (return Sphinx version).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Several picky MySQL client libraries depend on a particular version
|
|
number format used by MySQL, and moreover, sometimes choose a different
|
|
execution path based on the reported version number (rather than the
|
|
indicated capabilities flags). For instance, Python MySQLdb 1.2.2 throws
|
|
an exception when the version number is not in X.Y.ZZ format; MySQL .NET
|
|
connector 6.3.x fails internally on version numbers 1.x along with
|
|
a certain combination of flags, etc. To work around that, you can use
|
|
<option>mysql_version_string</option> directive and have <filename>searchd</filename>
|
|
report a different version to clients connecting over MySQL protocol.
|
|
(By default, it reports its own version.)
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
mysql_version_string = 5.0.37
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-rt-flush-period"><title>rt_flush_period</title>
|
|
<para>
|
|
RT indexes RAM chunk flush check period, in seconds.
|
|
Optional, default is 0 (do not flush).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Actively updated RT indexes that nevertheless fully fit into their RAM chunks
|
|
can result in ever-growing binlogs, impacting disk use and crash
|
|
recovery time. With this directive the search daemon performs
|
|
periodic flush checks, and eligible RAM chunks can get saved,
|
|
enabling consequential binlog cleanup. See <xref linkend="rt-binlog"/>
|
|
for more details.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
rt_flush_period = 3600
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-thread-stack"><title>thread_stack</title>
|
|
<para>
|
|
Per-thread stack size.
|
|
Optional, default is 64K.
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
In the <code>workers = threads</code> mode, every request is processed
|
|
with a separate thread that needs its own stack space. By default, 64K per
|
|
thread are allocated for stack. However, extremely complex search requests
|
|
might eventually exhaust the default stack and require more. For instance,
|
|
a query that matches a few thousand keywords (either directly or through
|
|
term expansion) can eventually run out of stack. Previously, that resulted
|
|
in crashes. Starting with 2.0.1-beta, <filename>searchd</filename> attempts
|
|
to estimate the expected stack use, and blocks the potentially dangerous
|
|
queries. To process such queries, you can either increase the thread stack size
by using the <code>thread_stack</code> directive, or switch to a different
<code>workers</code> setting if that is possible.
|
|
</para>
|
|
<para>
|
|
A query with N levels of nesting is estimated to require approximately
|
|
30+0.12*N KB of stack, meaning that the default 64K is enough for queries
|
|
with up to 300 levels, 150K for up to 1000 levels, etc. If the estimated stack
requirement exceeds the configured limit, <filename>searchd</filename> fails
the query and reports the required stack size in the error message.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
thread_stack = 256K
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-expansion-limit"><title>expansion_limit</title>
|
|
<para>
|
|
The maximum number of expanded keywords for a single wildcard.
|
|
Optional, default is 0 (no limit).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
When doing substring searches against indexes built with
|
|
<code>dict = keywords</code> enabled, a single wildcard may
|
|
potentially result in thousands and even millions of matched
|
|
keywords (think of matching 'a*' against the entire Oxford
|
|
dictionary). This directive lets you limit the impact
|
|
of such expansions. Setting <code>expansion_limit = N</code>
|
|
restricts expansions to no more than N of the most frequent
|
|
matching keywords (per each wildcard in the query).
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
expansion_limit = 16
|
|
</programlisting>
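<para>
With that setting, a wildcard query such as the following (a sketch
assuming an index named <code>test1</code> built with
<code>dict = keywords</code> and prefix indexing enabled) will be
expanded into at most the 16 most frequent matching keywords:
</para>
<programlisting>
SELECT * FROM test1 WHERE MATCH('exam*');
</programlisting>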
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-compat-sphinxql-magics"><title>compat_sphinxql_magics</title>
|
|
<para>
|
|
Legacy SphinxQL quirks compatibility mode.
|
|
Optional, default is 1 (keep compatibility).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
Starting with version 2.0.1-beta, we're bringing SphinxQL in closer
|
|
compliance with standard SQL. However, existing applications must not
|
|
get broken, and <code>compat_sphinxql_magics</code> lets you upgrade
|
|
safely. It defaults to 1, which enables the compatibility mode.
|
|
However, <b>SphinxQL compatibility mode is now deprecated and
|
|
will be removed</b> once we complete bringing SphinxQL in line
|
|
with standard SQL syntax. So it's advised to update the applications
|
|
utilising SphinxQL and then switch the daemon to the new, more SQL
|
|
compliant mode by setting <code>compat_sphinxql_magics = 0</code>.
|
|
Please refer to <xref linkend="sphinxql-upgrading-magics"/>
|
|
for the details and update instructions.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
compat_sphinxql_magics = 0 # the future is now
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="conf-watchdog"><title>watchdog</title>
|
|
<para>
|
|
Threaded server watchdog.
|
|
Optional, default is 1 (watchdog enabled).
|
|
Introduced in version 2.0.1-beta.
|
|
</para>
|
|
<para>
|
|
A crashed query in <code>threads</code> multi-processing mode
|
|
(<code><link linkend="conf-workers">workers</link> = threads</code>)
|
|
can take down the entire server. With watchdog feature enabled,
|
|
<filename>searchd</filename> additionally keeps a separate lightweight
|
|
process that monitors the main server process, and automatically
|
|
restarts the latter in case of abnormal termination. Watchdog
|
|
is enabled by default.
|
|
</para>
|
|
<bridgehead>Example:</bridgehead>
|
|
<programlisting>
|
|
watchdog = 0 # disable watchdog
|
|
</programlisting>
|
|
</sect2>
|
|
|
|
|
|
</sect1>
|
|
|
|
|
|
</chapter>
|
|
|
|
<!--
|
|
<chapter id="developers"><title>Developer's corner</title>
|
|
|
|
<sect1 id="architecture-overview"><title>Sphinx architecture overview</title>
|
|
(to be added)
|
|
</sect1>
|
|
|
|
<sect1 id="adding-data-sources"><title>Adding new data source drivers</title>
|
|
(to be added)
|
|
</sect1>
|
|
|
|
<sect1 id="adding-data-sources"><title>API porting guidelines</title>
|
|
(to be added)
|
|
</sect1>
|
|
|
|
</chapter>
|
|
-->
|
|
|
|
<appendix id="changelog"><title>Sphinx revision history</title>
|
|
|
|
<sect1 id="rel111"><title>Version 2.0.1-beta, 22 apr 2011</title>
|
|
<bridgehead>New general features</bridgehead>
|
|
<itemizedlist>
|
|
<listitem><para>added remapping support to <link linkend="conf-blend-chars">blend_chars</link> directive</para></listitem>
|
|
<listitem><para>added multi-threaded snippet batches support (requires a batch sent via API, <link linkend="conf-dist-threads">dist_threads</link>, and <code>load_files</code>)</para></listitem>
|
|
<listitem><para>added collations (<link linkend="conf-collation-server">collation_server</link>, <link linkend="conf-collation-libc-locale">collation_libc_locale directives</link>)</para></listitem>
|
|
<listitem><para>added support for sorting and grouping on string attributes (<code>ORDER BY</code>, <code>GROUP BY</code>, <code>WITHIN GROUP ORDER BY</code>)</para></listitem>
|
|
<listitem><para>added UDF support (<link linkend="conf-plugin-dir">plugin_dir</link> directive; <link linkend="sphinxql-create-function">CREATE FUNCTION</link>, <link linkend="sphinxql-drop-function">DROP FUNCTION</link> statements)</para></listitem>
|
|
<listitem><para>added <link linkend="conf-query-log-format">query_log_format</link> directive, <link linkend="sphinxql-set">SET GLOBAL query_log_format | log_level = ...</link> statements; and connection id tracking</para></listitem>
|
|
<listitem><para>added <link linkend="conf-sql-column-buffers">sql_column_buffers</link> directive, fixed out-of-buffer column handling in ODBC/MS SQL sources</para></listitem>
|
|
<listitem><para>added <link linkend="conf-blend-mode">blend_mode</link> directive that enables indexing multiple variants of a blended sequence</para></listitem>
|
|
<listitem><para>added UNIX socket support to C, Ruby APIs</para></listitem>
|
|
<listitem><para>added ranged query support to <link linkend="conf-sql-joined-field">sql_joined_field</link></para></listitem>
|
|
<listitem><para>added <link linkend="conf-rt-flush-period">rt_flush_period</link> directive</para></listitem>
|
|
<listitem><para>added <link linkend="conf-thread-stack">thread_stack</link> directive</para></listitem>
|
|
<listitem><para>added SENTENCE, PARAGRAPH, ZONE operators (and <link linkend="conf-index-sp">index_sp</link>, <link linkend="conf-index-zones">index_zones</link> directives)</para></listitem>
|
|
<listitem><para>added keywords dictionary support (and <link linkend="conf-dict">dict</link>, <link linkend="conf-expansion-limit">expansion_limit</link> directives)</para></listitem>
|
|
<listitem><para>added <code>passage_boundary</code>, <code>emit_zones</code> options to snippets</para></listitem>
|
|
<listitem><para>added <link linkend="conf-watchdog">a watchdog process</link> in threaded mode</para></listitem>
|
|
<listitem><para>added persistent MVA updates</para></listitem>
|
|
<listitem><para>added crash dumps to <filename>searchd.log</filename>, deprecated <code>crash_log_path</code> directive</para></listitem>
|
|
<listitem><para>added id32 index support in id64 binaries (EXPERIMENTAL)</para></listitem>
|
|
<listitem><para>added SphinxSE support for DELETE and REPLACE on SphinxQL tables</para></listitem>
|
|
</itemizedlist>
|
|
<bridgehead>New SphinxQL features</bridgehead>
|
|
<itemizedlist>
|
|
<listitem><para>added new, more SQL compliant SphinxQL syntax; and a <link linkend="conf-compat-sphinxql-magics">compat_sphinxql_magics</link> directive</para></listitem>
|
|
<listitem><para>added <link linkend="expr-func-crc32">CRC32()</link>, <link linkend="expr-func-day">DAY()</link>, <link linkend="expr-func-month">MONTH()</link>, <link linkend="expr-func-year">YEAR()</link>, <link linkend="expr-func-yearmonth">YEARMONTH()</link>, <link linkend="expr-func-yearmonthday">YEARMONTHDAY()</link> functions</para></listitem>
|
|
<listitem><para>added <link linkend="expr-ari-ops">DIV, MOD, and % operators</link></para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-select">reverse_scan=(0|1)</link> option to SELECT</para></listitem>
|
|
<listitem><para>added support for MySQL packets over 16M</para></listitem>
|
|
<listitem><para>added dummy SHOW VARIABLES, SHOW COLLATION, and SET character_set_results support (to support handshake with certain client libraries and frameworks)</para></listitem>
|
|
<listitem><para>added <link linkend="conf-mysql-version-string">mysql_version_string</link> directive (to workaround picky MySQL client libraries)</para></listitem>
|
|
<listitem><para>added support for global filter variables, <link linkend="sphinxql-set">SET GLOBAL @uservar=(int_list)</link> </para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-delete">DELETE ... IN (id_list)</link> syntax support</para></listitem>
|
|
<listitem><para>added C-style comments syntax (for example, <code>SELECT /*!40000 some comment*/ id FROM test</code>)</para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-update">UPDATE ... WHERE id=X</link> syntax support</para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-multi-queries">SphinxQL multi-query support</link></para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-describe">DESCRIBE</link>, <link linkend="sphinxql-show-tables">SHOW TABLES</link> statements</para></listitem>
|
|
</itemizedlist>
|
|
<bridgehead>New command-line switches</bridgehead>
|
|
<itemizedlist>
|
|
<listitem><para>added <code>--print-queries</code> switch to <filename>indexer</filename> that dumps SQL queries it runs</para></listitem>
|
|
<listitem><para>added <code>--sighup-each</code> switch to <filename>indexer</filename> that rotates indexes one by one</para></listitem>
|
|
<listitem><para>added <code>--strip-path</code> switch to <filename>searchd</filename> that skips file paths embedded in the index(-es)</para></listitem>
|
|
<listitem><para>added <code>--dumpconfig</code> switch to <filename>indextool</filename> that dumps an index header in <filename>sphinx.conf</filename> format</para></listitem>
|
|
</itemizedlist>
|
|
<bridgehead>Major changes and optimizations</bridgehead>
|
|
<itemizedlist>
|
|
<listitem><para>changed default preopen_indexes value to 1</para></listitem>
|
|
<listitem><para>optimized English stemmer (results in 1.3x faster snippets and indexing with morphology=stem_en)</para></listitem>
|
|
<listitem><para>optimized snippets, 1.6x general speedup</para></listitem>
|
|
<listitem><para>optimized const-list parsing in SphinxQL</para></listitem>
|
|
<listitem><para>optimized full-document highlighting CPU/RAM use</para></listitem>
|
|
<listitem><para>optimized binlog replay (improved performance on K-list update)</para></listitem>
|
|
</itemizedlist>
|
|
<bridgehead>Bug fixes</bridgehead>
|
|
<itemizedlist>
|
|
<listitem><para>fixed #767, joined fields vs ODBC sources</para></listitem>
|
|
<listitem><para>fixed #757, wordforms shared by indexes with different settings</para></listitem>
|
|
<listitem><para>fixed #733, loading of indexes in formats prior to v.14</para></listitem>
|
|
<listitem><para>fixed #763, occasional snippets failures</para></listitem>
|
|
<listitem><para>fixed #648, occasionally missed rotations on multiple SIGHUPs</para></listitem>
|
|
<listitem><para>fixed #750, an RT segment merge leading to false positives and/or crashes in some cases</para></listitem>
|
|
<listitem><para>fixed #755, zones in snippets output</para></listitem>
|
|
<listitem><para>fixed #754, stopwords counting at snippet passage generation</para></listitem>
|
|
<listitem><para>fixed #723, fork/prefork index rotation in children processes</para></listitem>
|
|
<listitem><para>fixed #696, freeze on zero threshold in quorum operator</para></listitem>
|
|
<listitem><para>fixed #732, query escaping in SphinxSE</para></listitem>
|
|
<listitem><para>fixed #739, occasional crashes in MT mode on result set send</para></listitem>
|
|
<listitem><para>fixed #746, crash with a named list in SphinxQL option</para></listitem>
|
|
<listitem><para>fixed #674, AVG vs group order</para></listitem>
|
|
<listitem><para>fixed #734, occasional crashes attempting to report NULL errors</para></listitem>
|
|
<listitem><para>fixed #829, tail hits within field position modifier</para></listitem>
|
|
<listitem><para>fixed #712, missing query_mode, force_all_words snippet option defaults in Java API</para></listitem>
|
|
<listitem><para>fixed #721, added dupe removal on RT batch INSERT/REPLACE</para></listitem>
|
|
<listitem><para>fixed #720, potential extraneous highlighting after a blended keyword</para></listitem>
|
|
<listitem><para>fixed #702, exceptions vs star search</para></listitem>
|
|
<listitem><para>fixed #666, ext2 query grouping vs exceptions</para></listitem>
|
|
<listitem><para>fixed #688, WITHIN GROUP ORDER BY related crash</para></listitem>
|
|
<listitem><para>fixed #660, multi-queue batches vs dist_threads</para></listitem>
|
|
<listitem><para>fixed #678, crash on dict=keywords vs xmlpipe vs min_prefix_len</para></listitem>
|
|
<listitem><para>fixed #596, ECHILD vs scripted configs</para></listitem>
|
|
<listitem><para>fixed #653, dependency in expression, sorting, grouping</para></listitem>
|
|
<listitem><para>fixed #661, concurrent distributed searches vs workers=threads</para></listitem>
|
|
<listitem><para>fixed #646, crash on status query via UNIX socket</para></listitem>
|
|
<listitem><para>fixed #589, libexpat.dll missing from some Win32 build types</para></listitem>
|
|
<listitem><para>fixed #574, quorum match order</para></listitem>
|
|
<listitem><para>fixed multiple documentation issues (#372, #483, #495, #601, #623, #632, #654)</para></listitem>
|
|
<listitem><para>fixed that ondisk_dict did not affect RT indexes</para></listitem>
|
|
<listitem><para>fixed that string attributes check in indextool --check was erroneously sensitive to string data order</para></listitem>
|
|
<listitem><para>fixed a rare crash when using BEFORE operator</para></listitem>
|
|
<listitem><para>fixed an issue with multiforms vs BuildKeywords()</para></listitem>
|
|
<listitem><para>fixed an edge case in OR operator (emitted wrong hits order sometimes)</para></listitem>
|
|
<listitem><para>fixed aliasing in docinfo accessors that lead to very rare crashes and/or missing results</para></listitem>
|
|
<listitem><para>fixed a syntax error on a short token at the end of a query</para></listitem>
|
|
<listitem><para>fixed id64 filtering and performance degradation with range filters</para></listitem>
|
|
<listitem><para>fixed missing rankers in libsphinxclient</para></listitem>
|
|
<listitem><para>fixed missing SPH04 ranker in SphinxSE</para></listitem>
|
|
<listitem><para>fixed column names in sql_attr_multi sample (works with example.sql now)</para></listitem>
|
|
<listitem><para>fixed an issue with distributed local+remote setup vs aggregate functions</para></listitem>
|
|
<listitem><para>fixed case sensitive columns names in RT indexes</para></listitem>
|
|
<listitem><para>fixed a crash vs strings from multiple indexes in result set</para></listitem>
|
|
<listitem><para>fixed blended keywords vs snippets</para></listitem>
|
|
<listitem><para>fixed secure_connection vs MySQL protocol vs MySQL.NET connector</para></listitem>
|
|
<listitem><para>fixed that Python API did not work with Python 2.3</para></listitem>
|
|
<listitem><para>fixed overshort_step vs snippets</para></listitem>
|
|
<listitem><para>fixed keyword statistics vs dist_threads searching</para></listitem>
|
|
<listitem><para>fixed multiforms vs query parsing (vs quorum)</para></listitem>
|
|
<listitem><para>fixed missed quorum words vs RT segments</para></listitem>
|
|
<listitem><para>fixed blended keywords occasionally skipping extra character when querying (eg "abc[]")</para></listitem>
|
|
<listitem><para>fixed Python API to handle int32 values</para></listitem>
|
|
<listitem><para>fixed prefix and infix indexing of joined fields</para></listitem>
|
|
<listitem><para>fixed MVA ranged query</para></listitem>
|
|
<listitem><para>fixed missing blended state reset on document boundary</para></listitem>
|
|
<listitem><para>fixed a crash on missing index while replaying binlog</para></listitem>
|
|
<listitem><para>fixed an error message on filter values overrun</para></listitem>
|
|
<listitem><para>fixed passage duplication in snippets in weight_order mode</para></listitem>
|
|
<listitem><para>fixed select clauses over 1K vs remote agents</para></listitem>
|
|
<listitem><para>fixed overshort accounting vs soft-whitespace tokens</para></listitem>
|
|
<listitem><para>fixed rotation vs workers=threads</para></listitem>
|
|
<listitem><para>fixed schema issues vs distributed indexes</para></listitem>
|
|
<listitem><para>fixed blended-escaped sequence parsing issue</para></listitem>
|
|
<listitem><para>fixed MySQL IN clause (values order etc)</para></listitem>
|
|
<listitem><para>fixed that post_index did not execute when 0 documents were successfully indexed</para></listitem>
|
|
<listitem><para>fixed field position limit vs many hits</para></listitem>
|
|
<listitem><para>fixed that joined fields missed an end marker at field end</para></listitem>
|
|
<listitem><para>fixed that xxx_step settings were missing from .sph index header</para></listitem>
|
|
<listitem><para>fixed libsphinxclient missing request cleanup in sphinx_query() (eg after network errors)</para></listitem>
|
|
<listitem><para>fixed that index_weights were ignored when grouping</para></listitem>
|
|
<listitem><para>fixed multi wordforms vs blend_chars</para></listitem>
|
|
<listitem><para>fixed broken MVA output in SphinxQL</para></listitem>
|
|
<listitem><para>fixed a few RT leaks</para></listitem>
|
|
<listitem><para>fixed an issue with RT string storage going missing</para></listitem>
|
|
<listitem><para>fixed an issue with repeated queries vs dist_threads</para></listitem>
|
|
<listitem><para>fixed an issue with string attributes vs buffer overrun in SphinxQL</para></listitem>
|
|
<listitem><para>fixed unexpected character data warnings within ignored xmlpipe tags</para></listitem>
|
|
<listitem><para>fixed a crash in snippets with NEAR syntax query</para></listitem>
|
|
<listitem><para>fixed passage duplication in snippets</para></listitem>
|
|
<listitem><para>fixed libsphinxclient SIGPIPE handling</para></listitem>
|
|
<listitem><para>fixed libsphinxclient vs VS2003 compiler bug</para></listitem>
|
|
</itemizedlist>
|
|
</sect1>
|
|
|
|
<sect1 id="rel110"><title>Version 1.10-beta, 19 jul 2010</title>
|
|
<itemizedlist>
|
|
<listitem><para>added RT indexes support (<xref linkend="rt-indexes"/>)</para></listitem>
|
|
<listitem><para>added prefork and threads support (<link linkend="conf-workers">workers</link> directives)</para></listitem>
|
|
<listitem><para>added multi-threaded local searches in distributed indexes (<link linkend="conf-dist-threads">dist_threads</link> directive)</para></listitem>
|
|
<listitem><para>added common subquery cache (<link linkend="conf-subtree-docs-cache">subtree_docs_cache</link>,
|
|
<link linkend="conf-subtree-hits-cache">subtree_hits_cache</link> directives)</para></listitem>
|
|
<listitem><para>added string attributes support (<link linkend="conf-sql-attr-string">sql_attr_string</link>,
|
|
<link linkend="conf-sql-field-string">sql_field_string</link>,
|
|
<link linkend="conf-xmlpipe-attr-string">xml_attr_string</link>,
|
|
<link linkend="conf-xmlpipe-field-string">xml_field_string</link> directives)</para></listitem>
|
|
<listitem><para>added indexing-time word counter (<link linkend="conf-sql-attr-str2wordcount">sql_attr_str2wordcount</link>,
|
|
<link linkend="conf-sql-field-str2wordcount">sql_field_str2wordcount</link> directives)</para></listitem>
|
|
<listitem><para>added <link linkend="sphinxql-call-snippets">CALL SNIPPETS()</link>,
|
|
<link linkend="sphinxql-call-keywords">CALL KEYWORDS()</link> SphinxQL statements</para></listitem>
|
|
<listitem><para>added <option>field_weights</option>, <option>index_weights</option> options to
|
|
SphinxQL <link linkend="sphinxql-select">SELECT</link> statement</para></listitem>
|
|
<listitem><para>added insert-only SphinxQL-talking tables to SphinxSE (connection='sphinxql://host[:port]/index')</para></listitem>
|
|
<listitem><para>added <option>select</option> option to SphinxSE queries</para></listitem>
|
|
<listitem><para>added backtrace on crash to <filename>searchd</filename></para></listitem>
|
|
<listitem><para>added SQL+FS indexing, aka loading files by names fetched from SQL
|
|
(<link linkend="conf-sql-file-field">sql_file_field</link> directive)</para></listitem>
|
|
<listitem><para>added a watchdog in threads mode to <filename>searchd</filename></para></listitem>
|
|
<listitem><para>added automatic row phantoms elimination to index merge</para></listitem>
|
|
<listitem><para>added hitless indexing support (hitless_words directive)</para></listitem>
|
|
<listitem><para>added --check, --strip-path, --htmlstrip, --dumphitlist ... --wordid switches to <link linkend="ref-indextool">indextool</link></para></listitem>
|
|
<listitem><para>added --stopwait, --logdebug switches to <link linkend="ref-searchd">searchd</link></para></listitem>
|
|
<listitem><para>added --dump-rows, --verbose switches to <link linkend="ref-indexer">indexer</link></para></listitem>
|
|
<listitem><para>added "blended" characters indexing support (<link linkend="conf-blend-chars">blend_chars</link> directive)</para></listitem>
|
|
<listitem><para>added joined/payload field indexing (<link linkend="conf-sql-joined-field">sql_joined_field</link> directive)</para></listitem>
|
|
<listitem><para>added <link linkend="api-func-flushattributes">FlushAttributes() API call</link></para></listitem>
|
|
<listitem><para>added query_mode, force_all_words, limit_passages, limit_words, start_passage_id, load_files, html_strip_mode,
|
|
allow_empty options, and %PASSAGE_ID% macro in before_match, after_match options
|
|
to <link linkend="api-func-buildexcerpts">BuildExcerpts()</link> API call</para></listitem>
|
|
<listitem><para>added @groupby/@count/@distinct columns support to SELECT (but not to expressions)</para></listitem>
|
|
<listitem><para>added query-time keyword expansion support (<link linkend="conf-expand-keywords">expand_keywords</link> directive,
|
|
<link linkend="api-func-setrankingmode">SPH_RANK_SPH04</link> ranker)</para></listitem>
|
|
<listitem><para>added query batch size limit option (<link linkend="conf-max-batch-queries">max_batch_queries</link> directive; was hardcoded)</para></listitem>
|
|
<listitem><para>added SINT() function to expressions</para></listitem>
|
|
<listitem><para>improved SphinxQL syntax error reporting</para></listitem>
|
|
<listitem><para>improved expression optimizer (better constant handling)</para></listitem>
|
|
<listitem><para>improved dash handling within keywords (no longer treated as an operator)</para></listitem>
|
|
<listitem><para>improved snippets (better passage selection/trimming, around option now a hard limit)</para></listitem>
|
|
<listitem><para>optimized index format that yields ~20-30% smaller indexes</para></listitem>
|
|
<listitem><para>optimized sorting code (indexing time 1-5% faster on average; 100x faster in worst case)</para></listitem>
|
|
<listitem><para>optimized searchd startup time (moved .spa preindexing to indexer), added a progress bar</para></listitem>
|
|
<listitem><para>optimized queries against indexes with many attributes (eliminated redundant copying)</para></listitem>
|
|
<listitem><para>optimized 1-keyword queries (performance regression introduced in 0.9.9)</para></listitem>
|
|
<listitem><para>optimized SphinxQL protocol overheads, and performance on bigger result sets</para></listitem>
|
|
<listitem><para>optimized unbuffered attributes writes on index merge</para></listitem>
|
|
<listitem><para>changed attribute handling, duplicate names are strictly forbidden now</para></listitem>
|
|
<listitem><para>fixed that SphinxQL sessions could stall shutdown</para></listitem>
|
|
<listitem><para>fixed consts with leading minus in SphinxQL</para></listitem>
|
|
<listitem><para>fixed AND/OR precedence in expressions</para></listitem>
|
|
<listitem><para>fixed #334, AVG() on integers was not computed in floats</para></listitem>
|
|
<listitem><para>fixed #371, attribute flush vs 2+ GB files</para></listitem>
|
|
<listitem><para>fixed #373, segfault on distributed queries vs certain libc versions</para></listitem>
|
|
<listitem><para>fixed #398, stopwords not stopped in prefix/infix indexes</para></listitem>
|
|
<listitem><para>fixed #404, erroneous MVA failures in indextool --check</para></listitem>
|
|
<listitem><para>fixed #408, segfault on certain query batches (regular scan, plus a scan with MVA groupby)</para></listitem>
|
|
<listitem><para>fixed #431, occasional shutdown hangs in preforked workers</para></listitem>
|
|
<listitem><para>fixed #436, trunk checkout builds vs Solaris sh</para></listitem>
|
|
<listitem><para>fixed #440, escaping vs parentheses declared as valid in charset_table</para></listitem>
|
|
<listitem><para>fixed #442, occasional non-aligned free in MVA indexing</para></listitem>
|
|
<listitem><para>fixed #447, occasional crashes in MVA indexing</para></listitem>
|
|
<listitem><para>fixed #449, pconn busyloop on aborted clients on certain arches</para></listitem>
|
|
<listitem><para>fixed #465, build issue on Alpha</para></listitem>
|
|
<listitem><para>fixed #468, build issue in libsphinxclient</para></listitem>
|
|
<listitem><para>fixed #472, multiple stopword files failing to load</para></listitem>
|
|
<listitem><para>fixed #489, buffer overflow in query logging</para></listitem>
|
|
<listitem><para>fixed #493, Python API assertion after error returned from Query()</para></listitem>
|
|
<listitem><para>fixed #500, malformed MySQL packet when sending MVAs</para></listitem>
|
|
<listitem><para>fixed #504, SIGPIPE in libsphinxclient</para></listitem>
|
|
<listitem><para>fixed #506, better MySQL protocol commands support in SphinxQL (PING etc)</para></listitem>
|
|
<listitem><para>fixed #509, indexing ranged results from stored procedures</para></listitem>
|
|
</itemizedlist>
|
|
</sect1>
|
|
|
|
<sect1 id="rel099"><title>Version 0.9.9-release, 02 dec 2009</title>
|
|
<itemizedlist>
<listitem><para>added Open, Close, Status calls to libsphinxclient (C API)</para></listitem>
<listitem><para>added automatic persistent connection reopening to PHP, Python APIs</para></listitem>
<listitem><para>added 64-bit value/range filters, fullscan mode support to SphinxSE</para></listitem>
<listitem><para>MAJOR CHANGE, our IANA-assigned ports are now 9312 (search API) and 9306 (SphinxQL), respectively (goodbye, trusty 3312); see the sketch after this list</para></listitem>
<listitem><para>MAJOR CHANGE, erroneous filters now fail with an error (they were silently ignored before)</para></listitem>
<listitem><para>optimized unbuffered .spa writes on merge</para></listitem>
<listitem><para>optimized 1-keyword queries ranking in extended2 mode</para></listitem>
<listitem><para>fixed #441 (IO race under highly concurrent load on a preopened index)</para></listitem>
<listitem><para>fixed #434 (distributed indexes were not searchable via MySQL protocol)</para></listitem>
<listitem><para>fixed #317 (indexer MVA progress counter)</para></listitem>
<listitem><para>fixed #398 (stopwords not removed from search query)</para></listitem>
<listitem><para>fixed #328 (broken cutoff)</para></listitem>
<listitem><para>fixed #250 (now quoting paths w/spaces when installing Windows service)</para></listitem>
<listitem><para>fixed #348 (K-list was not updated on merge)</para></listitem>
<listitem><para>fixed #357 (destination index was not K-list-filtered on merge)</para></listitem>
<listitem><para>fixed #369 (precaching .spi files over 2 GB)</para></listitem>
<listitem><para>fixed #438 (missing boundary proximity matches)</para></listitem>
<listitem><para>fixed #371 (.spa flush in case of files over 2 GB)</para></listitem>
<listitem><para>fixed #373 (crashes on distributed queries via mysql proto)</para></listitem>
<listitem><para>fixed critical bugs in hit merging code</para></listitem>
<listitem><para>fixed #424 (ordinals could be misplaced during indexing in case of bitfields etc)</para></listitem>
<listitem><para>fixed #426 (failing SE build on Solaris; thanks to Ben Beecher)</para></listitem>
<listitem><para>fixed #423 (typo in SE caused crash on SHOW STATUS)</para></listitem>
<listitem><para>fixed #363 (handling of read_timeout over 2147 seconds)</para></listitem>
<listitem><para>fixed #376 (minor error message mismatch)</para></listitem>
<listitem><para>fixed #413 (minus in SphinxQL)</para></listitem>
<listitem><para>fixed #417 (floats w/o leading digit in SphinxQL)</para></listitem>
<listitem><para>fixed #403 (typo in SetFieldWeights name in Java API)</para></listitem>
<listitem><para>fixed index rotation vs persistent connections</para></listitem>
<listitem><para>fixed backslash handling in SphinxQL parser</para></listitem>
<listitem><para>fixed uint unpacking vs. PHP 5.2.9 (possibly other versions)</para></listitem>
<listitem><para>fixed #325 (filter settings sent from SphinxSE)</para></listitem>
<listitem><para>fixed #352 (removed mysql wrapper around close() in SphinxSE)</para></listitem>
<listitem><para>fixed #389 (display error messages through SphinxSE status variable)</para></listitem>
<listitem><para>fixed linking with port-installed iconv on OS X</para></listitem>
<listitem><para>fixed negative 64-bit unpacking in PHP API</para></listitem>
<listitem><para>fixed #349 (escaping backslash in query emulation mode)</para></listitem>
<listitem><para>fixed #320 (disabled multi-query route when select items differ)</para></listitem>
<listitem><para>fixed #353 (better quorum counts check)</para></listitem>
<listitem><para>fixed #341 (merging of trailing hits; maybe other ranking issues too)</para></listitem>
<listitem><para>fixed #368 (partially; @field "" caused crashes; now resets field limit)</para></listitem>
<listitem><para>fixed #365 (field mask was leaking on field-limited terms)</para></listitem>
<listitem><para>fixed #339 (updated debug query dumper)</para></listitem>
<listitem><para>fixed #361 (added SetConnectTimeout() to Java API)</para></listitem>
<listitem><para>fixed #338 (added missing fullscan to mode check in Java API)</para></listitem>
<listitem><para>fixed #323 (added floats support to SphinxQL)</para></listitem>
<listitem><para>fixed #340 (support listen=port:proto syntax too)</para></listitem>
<listitem><para>fixed #332 (\r is legal SphinxQL space now)</para></listitem>
<listitem><para>fixed xmlpipe2 K-lists</para></listitem>
<listitem><para>fixed #322 (safety gaps in mysql protocol row buffer)</para></listitem>
<listitem><para>fixed #313 (return keyword stats for empty indexes too)</para></listitem>
<listitem><para>fixed #344 (invalid checkpoints after merge)</para></listitem>
<listitem><para>fixed #326 (missing CLOCK_xxx on FreeBSD)</para></listitem>
</itemizedlist>
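
<para>
For API clients, the port change above is a one-line adjustment. Here is a
minimal PHP sketch (assuming the bundled sphinxapi.php, a hypothetical index
named "test1", and a hypothetical attribute "group_id") that also shows the
new fail-loudly filter behavior:
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312); // was 3312 in pre-0.9.9 versions

// a filter on a misspelled or missing attribute is now reported
// as an error instead of being silently ignored
$cl->SetFilter("group_id", array(1, 2, 3));
$res = $cl->Query("hello world", "test1");
if ($res === false)
    print("query failed: " . $cl->GetLastError() . "\n");
]]></programlisting>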
</sect1>


<sect1 id="rel099rc2"><title>Version 0.9.9-rc2, 08 apr 2009</title>

<itemizedlist>
<listitem><para>added IsConnectError(), Open(), Close() calls to Java API (bug #240)</para></listitem>
<listitem><para>added <link linkend="conf-read-buffer">read_buffer</link>, <link linkend="conf-read-unhinted">read_unhinted</link> directives</para></listitem>
<listitem><para>added checks for build options returned by mysql_config (builds on Solaris now)</para></listitem>
<listitem><para>added fixed-RAM index merge (bug #169)</para></listitem>
<listitem><para>added logging of chained queries count in case of (optimized) multi-queries</para></listitem>
<listitem><para>added <link linkend="sort-expr">GEODIST()</link> function</para></listitem>
<listitem><para>added <link linkend="ref-searchd">--status switch to searchd</link></para></listitem>
<listitem><para>added MySpell (OpenOffice) affix file support (bug #281)</para></listitem>
<listitem><para>added <link linkend="conf-odbc-dsn">ODBC support</link> (both Windows and UnixODBC)</para></listitem>
<listitem><para>added support for @id in IN() (bug #292)</para></listitem>
<listitem><para>added support for <link linkend="api-func-setselect">aggregate functions</link> in GROUP BY (namely AVG, MAX, MIN, SUM); see the sketch after this list</para></listitem>
<listitem><para>added <link linkend="sphinxse-snippets">MySQL UDF that builds snippets</link> using searchd</para></listitem>
<listitem><para>added <link linkend="conf-write-buffer">write_buffer</link> directive (defaults to 1M)</para></listitem>
<listitem><para>added <link linkend="conf-xmlpipe-fixup-utf8">xmlpipe_fixup_utf8</link> directive</para></listitem>
<listitem><para>added suggestions sample</para></listitem>
<listitem><para>added microsecond precision int64 timer (bug #282)</para></listitem>
<listitem><para>added <link linkend="conf-listen-backlog">listen_backlog directive</link></para></listitem>
<listitem><para>added <link linkend="conf-max-xmlpipe2-field">max_xmlpipe2_field</link> directive</para></listitem>
<listitem><para>added <link linkend="sphinxql">initial SphinxQL support</link> to mysql41 handler, SELECT .../SHOW WARNINGS/STATUS/META are handled</para></listitem>
<listitem><para>added support for different network protocols, and mysql41 protocol</para></listitem>
<listitem><para>added <link linkend="api-func-setrankingmode">fieldmask ranker</link>, updated SphinxSE list of rankers</para></listitem>
<listitem><para>added <link linkend="conf-mysql-ssl">mysql_ssl_xxx</link> directives</para></listitem>
<listitem><para>added <link linkend="ref-searchd">--cpustats (requires clock_gettime()) and --status switches</link> to searchd</para></listitem>
<listitem><para>added performance counters, <link linkend="api-func-status">Status()</link> API call</para></listitem>
<listitem><para>added <link linkend="conf-overshort-step">overshort_step</link> and <link linkend="conf-stopword-step">stopword_step</link> directives</para></listitem>
<listitem><para>added <link linkend="extended-syntax">strict order operator</link> (aka operator before, eg. "one &lt;&lt; two &lt;&lt; three")</para></listitem>
<listitem><para>added <link linkend="ref-indextool">indextool</link> utility, moved --dumpheader there, added --debugdocids, --dumphitlist options</para></listitem>
<listitem><para>added own RNG, reseeded on @random sort query (bug #183)</para></listitem>
<listitem><para>added <link linkend="extended-syntax">field-start and field-end modifiers support</link> (syntax is "^hello world$"; field-end requires reindex)</para></listitem>
<listitem><para>added MVA attribute support to IN() function</para></listitem>
<listitem><para>added <link linkend="sort-expr">AND, OR, and NOT support</link> to expressions</para></listitem>
<listitem><para>improved logging of (optimized) multi-queries (now logging chained query count)</para></listitem>
<listitem><para>improved handshake error handling, fixed protocol version byte order (omg)</para></listitem>
<listitem><para>updated SphinxSE to protocol 1.22</para></listitem>
<listitem><para>allowed phrase_boundary_step=-1 (trick to emulate keyword expansion)</para></listitem>
<listitem><para>removed SPH_MAX_QUERY_WORDS limit</para></listitem>
<listitem><para>fixed CLI search vs documents missing from DB (bug #257)</para></listitem>
<listitem><para>fixed libsphinxclient results leak on subsequent sphinx_run_queries call (bug #256)</para></listitem>
<listitem><para>fixed libsphinxclient handling of zero max_matches and cutoff (bug #208)</para></listitem>
<listitem><para>fixed over-64K string reads (eg. big snippets) in Java API (bug #181)</para></listitem>
<listitem><para>fixed Java API 2nd Query() after network error in 1st Query() call (bug #308)</para></listitem>
<listitem><para>fixed typo-class bugs in SetFilterFloatRange (bug #259), SetSortMode (bug #248)</para></listitem>
<listitem><para>fixed missing @@relaxed support (bug #276), fixed missing error on @nosuchfield queries, documented @@relaxed</para></listitem>
<listitem><para>fixed UNIX socket permissions to 0777 (bug #288)</para></listitem>
<listitem><para>fixed xmlpipe2 crash on schemas with no fields, added better document structure checks</para></listitem>
<listitem><para>fixed (and optimized) expr parser vs IN() with huge (10K+) args count</para></listitem>
<listitem><para>fixed double EarlyCalc() in fullscan mode (minor performance impact)</para></listitem>
<listitem><para>fixed phrase boundary handling in some cases (on buffer end, on trailing whitespace)</para></listitem>
<listitem><para>fixed several issues in snippets (aka excerpts) generation</para></listitem>
<listitem><para>fixed inline attrs vs id64 index corruption</para></listitem>
<listitem><para>fixed head searchd crash on config re-parse failure</para></listitem>
<listitem><para>fixed handling of numeric keywords with leading zeroes such as "007" (bug #251)</para></listitem>
<listitem><para>fixed junk in SphinxSE status variables (bug #304)</para></listitem>
<listitem><para>fixed wordlist checkpoints serialization (bug #236)</para></listitem>
<listitem><para>fixed unaligned docinfo id access (bug #230)</para></listitem>
<listitem><para>fixed GetRawBytes() vs oversized blocks (headers with over 32K charset_table should now work, bug #300)</para></listitem>
<listitem><para>fixed buffer overflow caused by an overlong destination wordform, updated tests</para></listitem>
<listitem><para>fixed IF() return type (was always int, is deduced now)</para></listitem>
<listitem><para>fixed legacy queries vs. special chars vs. multiple indexes</para></listitem>
<listitem><para>fixed write-write-read socket access pattern vs Nagle vs delays vs FreeBSD (oh wow)</para></listitem>
<listitem><para>fixed exceptions vs query-parser issue</para></listitem>
<listitem><para>fixed late calc vs @weight in expressions (bug #285)</para></listitem>
<listitem><para>fixed early lookup/calc vs filters (bug #284)</para></listitem>
<listitem><para>fixed emulated MATCH_ANY queries (empty proximity and phrase queries are allowed now)</para></listitem>
<listitem><para>fixed MATCH_ANY ranker vs fields with no matches</para></listitem>
<listitem><para>fixed index file size vs inplace_enable (bug #245)</para></listitem>
<listitem><para>fixed that old logs were not closed on USR1 (bug #221)</para></listitem>
<listitem><para>fixed handling of '!' alias to NOT operator (bug #237)</para></listitem>
<listitem><para>fixed error handling vs query steps (step failure was not reported)</para></listitem>
<listitem><para>fixed querying vs inline attributes</para></listitem>
<listitem><para>fixed stupid bug in escaping code, fixed EscapeString() and made it static</para></listitem>
<listitem><para>fixed parser vs @field -keyword, foo|@field bar, "" queries (bug #310)</para></listitem>
</itemizedlist>
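
<para>
To illustrate the aggregate functions item above, here is a sketch using
the PHP API (the attribute names "price" and "category_id", and the index
name "products", are hypothetical):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// compute per-group average price; AVG/MAX/MIN/SUM are all accepted
$cl->SetSelect("*, AVG(price) AS avgprice");
$cl->SetGroupBy("category_id", SPH_GROUPBY_ATTR, "@count desc");

$res = $cl->Query("", "products"); // empty query performs a full scan
if ($res !== false)
    foreach ($res["matches"] as $docid => $match)
        printf("category=%d avgprice=%.2f\n",
            $match["attrs"]["category_id"], $match["attrs"]["avgprice"]);
]]></programlisting>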
</sect1>


<sect1 id="rel099rc1"><title>Version 0.9.9-rc1, 17 nov 2008</title>

<itemizedlist>
<listitem><para>added <link linkend="conf-min-stemming-len">min_stemming_len</link> directive</para></listitem>
<listitem><para>added <link linkend="api-func-isconnecterror">IsConnectError()</link> API call (helps distinguish API-side vs remote errors)</para></listitem>
<listitem><para>added duplicate log messages filter to searchd</para></listitem>
<listitem><para>added --nodetach debugging switch to searchd</para></listitem>
<listitem><para>added blackhole agents support for debugging/testing (<link linkend="conf-agent-blackhole">agent_blackhole</link> directive)</para></listitem>
<listitem><para>added <link linkend="conf-max-filters">max_filters</link>, <link linkend="conf-max-filter-values">max_filter_values</link> directives (were hardcoded before)</para></listitem>
<listitem><para>added int64 expression evaluation path, automatic inference, and BIGINT() enforcer function</para></listitem>
<listitem><para>added crash handler for debugging (<link linkend="conf-crash-log-path">crash_log_path</link> directive)</para></listitem>
<listitem><para>added MS SQL (aka SQL Server) source support (Windows only, <link linkend="conf-mssql-winauth">mssql_winauth</link> and <link linkend="conf-mssql-unicode">mssql_unicode</link> directives)</para></listitem>
<listitem><para>added indexer-side column unpacking feature (<link linkend="conf-unpack-zlib">unpack_zlib</link>, <link linkend="conf-unpack-mysqlcompress">unpack_mysqlcompress</link> directives)</para></listitem>
<listitem><para>added nested brackets and NOTs support to <link linkend="extended-syntax">query language</link>, rewritten query parser</para></listitem>
<listitem><para>added persistent connections support (<link linkend="api-func-open">Open()</link> and <link linkend="api-func-close">Close()</link> API calls); see the sketch after this list</para></listitem>
<listitem><para>added <link linkend="conf-index-exact-words">index_exact_words</link> feature, and exact form operator to query language ("hello =world")</para></listitem>
<listitem><para>added status variables support to SphinxSE (SHOW STATUS LIKE 'sphinx_%')</para></listitem>
<listitem><para>added <link linkend="conf-max-packet-size">max_packet_size</link> directive (was hardcoded at 8M before)</para></listitem>
<listitem><para>added UNIX socket support, and multi-interface support (<link linkend="conf-listen">listen</link> directive)</para></listitem>
<listitem><para>added star-syntax support to <link linkend="api-func-buildexcerpts">BuildExcerpts()</link> API call</para></listitem>
<listitem><para>added inplace inversion of .spa and .spp (<link linkend="conf-inplace-enable">inplace_enable</link> directive, 1.5-2x less disk space for indexing)</para></listitem>
<listitem><para>added builtin Czech stemmer (morphology=stem_cz)</para></listitem>
<listitem><para>added <link linkend="sort-expr">IDIV(), NOW(), INTERVAL(), IN() functions</link> to expressions</para></listitem>
<listitem><para>added index-level early-reject based on filters</para></listitem>
<listitem><para>added MVA updates feature (<link linkend="conf-mva-updates-pool">mva_updates_pool</link> directive)</para></listitem>
<listitem><para>added select-list feature with computed expressions support (see <link linkend="api-func-setselect">SetSelect()</link> API call, test.php --select switch), protocol 1.22</para></listitem>
<listitem><para>added integer expressions support (2x faster than float)</para></listitem>
<listitem><para>added multiforms support (multiple source words in wordforms file)</para></listitem>
<listitem><para>added <link linkend="api-func-setrankingmode">legacy rankers</link> (MATCH_ALL/MATCH_ANY/etc), removed legacy matching code (everything runs on V2 engine now)</para></listitem>
<listitem><para>added <link linkend="extended-syntax">field position limit</link> modifier to field operator (syntax: @title[50] hello world)</para></listitem>
<listitem><para>added killlist support (<link linkend="conf-sql-query-killlist">sql_query_killlist</link> directive, --merge-killlists switch)</para></listitem>
<listitem><para>added on-disk SPI support (<link linkend="conf-ondisk-dict">ondisk_dict</link> directive)</para></listitem>
<listitem><para>added indexer IO stats</para></listitem>
<listitem><para>added periodic .spa flush (<link linkend="conf-attr-flush-period">attr_flush_period</link> directive)</para></listitem>
<listitem><para>added config reload on SIGHUP</para></listitem>
<listitem><para>added per-query attribute overrides feature (see <link linkend="api-func-setoverride">SetOverride()</link> API call); protocol 1.21</para></listitem>
<listitem><para>added signed 64bit attrs support (<link linkend="conf-sql-attr-bigint">sql_attr_bigint</link> directive)</para></listitem>
<listitem><para>improved HTML stripper to also skip PIs (&lt;? ... ?&gt;, such as &lt;?php ... ?&gt;)</para></listitem>
<listitem><para>improved excerpts speed (up to 50x faster on big documents)</para></listitem>
<listitem><para>fixed a short window of searchd inaccessibility on startup (it started listen()ing too early before)</para></listitem>
<listitem><para>fixed .spa loading on systems where read() is 2GB capped</para></listitem>
<listitem><para>fixed infixes vs morphology issues</para></listitem>
<listitem><para>fixed backslash escaping, added backslash to EscapeString()</para></listitem>
<listitem><para>fixed handling of over-2GB dictionary files (.spi)</para></listitem>
</itemizedlist>
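
<para>
The persistent connections item above cuts per-query connection overhead.
A minimal PHP sketch (assuming the bundled sphinxapi.php and a hypothetical
index named "test1"):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

if (!$cl->Open())
    die("open failed: " . $cl->GetLastError());

// every query below reuses the single TCP connection
foreach (array("one", "two", "three") as $keyword)
    $cl->Query($keyword, "test1");

$cl->Close();
]]></programlisting>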
</sect1>


<sect1 id="rel0981"><title>Version 0.9.8.1, 30 oct 2008</title>

<itemizedlist>
<listitem><para>added configure script to libsphinxclient</para></listitem>
<listitem><para>changed proximity/quorum operator syntax to require whitespace after the length</para></listitem>
<listitem><para>fixed potential head process crash on SIGPIPE during "maxed out" message</para></listitem>
<listitem><para>fixed handling of incomplete remote replies (caused over-degraded distributed results, in rare cases)</para></listitem>
<listitem><para>fixed sending of big remote requests (caused distributed requests to fail, in rare cases)</para></listitem>
<listitem><para>fixed FD_SET() overflow (caused searchd to crash on startup, in rare cases)</para></listitem>
<listitem><para>fixed MVA vs distributed indexes (caused loss of 1st MVA value in result set)</para></listitem>
<listitem><para>fixed tokenizing of exceptions terminated by specials (eg. "GPS AT&amp;T" in extended mode)</para></listitem>
<listitem><para>fixed buffer overrun in stemmer on overlong tokens occasionally emitted by proximity/quorum operator parser (caused crashes on certain proximity/quorum queries)</para></listitem>
<listitem><para>fixed wordcount ranker (could be dropping hits)</para></listitem>
<listitem><para>fixed --merge feature (numerous different fixes, caused broken indexes)</para></listitem>
<listitem><para>fixed --merge-dst-range performance</para></listitem>
<listitem><para>fixed prefix/infix generation for stopwords</para></listitem>
<listitem><para>fixed ignore_chars vs specials</para></listitem>
<listitem><para>fixed misplaced F_SETLKW check (caused certain build types, eg. RPM build on FC8, to fail)</para></listitem>
<listitem><para>fixed dictionary-defined charsets support in spelldump, added \x-style wordchars support</para></listitem>
<listitem><para>fixed Java API to properly send long strings (over 64K; eg. long document bodies for excerpts)</para></listitem>
<listitem><para>fixed Python API to accept offset/limit of 'long' type</para></listitem>
<listitem><para>fixed default ID range (that filtered out all 64-bit values) in Java and Python APIs</para></listitem>
</itemizedlist>
</sect1>


<sect1 id="rel098"><title>Version 0.9.8, 14 jul 2008</title>

<bridgehead>Indexing</bridgehead>

<itemizedlist>
<listitem><para>added support for 64-bit document and keyword IDs, --enable-id64 switch to configure</para></listitem>
<listitem><para>added support for floating point attributes</para></listitem>
<listitem><para>added support for bitfields in attributes, <link linkend="conf-sql-attr-bool">sql_attr_bool</link> directive and bit-widths part in <link linkend="conf-sql-attr-uint">sql_attr_uint</link> directive</para></listitem>
<listitem><para>added support for multi-valued attributes (MVA)</para></listitem>
<listitem><para>added metaphone preprocessor</para></listitem>
<listitem><para>added libstemmer library support, provides stemmers for a number of additional languages</para></listitem>
<listitem><para>added xmlpipe2 source type, which supports arbitrary fields and attributes</para></listitem>
<listitem><para>added word form dictionaries, <link linkend="conf-wordforms">wordforms</link> directive (and spelldump utility)</para></listitem>
<listitem><para>added tokenizing exceptions, <link linkend="conf-exceptions">exceptions</link> directive</para></listitem>
<listitem><para>added an option to fully remove element contents to HTML stripper, <link linkend="conf-html-remove-elements">html_remove_elements</link> directive</para></listitem>
<listitem><para>added HTML entities decoder (with full XHTML1 set support) to HTML stripper</para></listitem>
<listitem><para>added per-index HTML stripping settings, <link linkend="conf-html-strip">html_strip</link>, <link linkend="conf-html-index-attrs">html_index_attrs</link>, and <link linkend="conf-html-remove-elements">html_remove_elements</link> directives</para></listitem>
<listitem><para>added IO load throttling, <link linkend="conf-max-iops">max_iops</link> and <link linkend="conf-max-iosize">max_iosize</link> directives</para></listitem>
<listitem><para>added SQL load throttling, <link linkend="conf-sql-ranged-throttle">sql_ranged_throttle</link> directive</para></listitem>
<listitem><para>added an option to index prefixes/infixes for given fields only, <link linkend="conf-prefix-fields">prefix_fields</link> and <link linkend="conf-infix-fields">infix_fields</link> directives</para></listitem>
<listitem><para>added an option to ignore certain characters (instead of just treating them as whitespace), <link linkend="conf-ignore-chars">ignore_chars</link> directive</para></listitem>
<listitem><para>added an option to increment word position on phrase boundary characters, <link linkend="conf-phrase-boundary">phrase_boundary</link> and <link linkend="conf-phrase-boundary-step">phrase_boundary_step</link> directives</para></listitem>
<listitem><para>added --merge-dst-range switch (and filters) to index merging feature (--merge switch)</para></listitem>
<listitem><para>added <link linkend="conf-mysql-connect-flags">mysql_connect_flags</link> directive (eg. to reduce MySQL network traffic and/or indexing time)</para></listitem>
<listitem><para>improved ordinals sorting; now runs in fixed RAM</para></listitem>
<listitem><para>improved handling of documents with zero/NULL ids, now skipping them instead of aborting</para></listitem>
</itemizedlist>
<bridgehead>Search daemon</bridgehead>

<itemizedlist>
<listitem><para>added an option to unlink old index on successful rotation, <link linkend="conf-unlink-old">unlink_old</link> directive</para></listitem>
<listitem><para>added an option to keep index files open at all times (fixes subtle races on rotation), <link linkend="conf-preopen">preopen</link> and <link linkend="conf-preopen-indexes">preopen_indexes</link> directives</para></listitem>
<listitem><para>added an option to profile searchd disk I/O, --iostats command-line option</para></listitem>
<listitem><para>added an option to rotate index seamlessly (fully avoids query stalls), <link linkend="conf-seamless-rotate">seamless_rotate</link> directive</para></listitem>
<listitem><para>added HTML stripping support to excerpts (uses per-index settings)</para></listitem>
<listitem><para>added 'exact_phrase', 'single_passage', 'use_boundaries', 'weight_order' options to <link linkend="api-func-buildexcerpts">BuildExcerpts()</link> API call</para></listitem>
<listitem><para>added distributed attribute updates propagation</para></listitem>
<listitem><para>added distributed retries on master node side</para></listitem>
<listitem><para>added log reopen on SIGUSR1</para></listitem>
<listitem><para>added --stop switch (sends SIGTERM to running instance)</para></listitem>
<listitem><para>added Windows service mode, and --servicename switch</para></listitem>
<listitem><para>added Windows --rotate support</para></listitem>
<listitem><para>improved log timestamping, now with millisecond precision</para></listitem>
</itemizedlist>
<bridgehead>Querying</bridgehead>

<itemizedlist>
<listitem><para>added extended engine V2 (faster, cleaner, better; SPH_MATCH_EXTENDED2 mode)</para></listitem>
<listitem><para>added ranking modes support (V2 engine only; <link linkend="api-func-setrankingmode">SetRankingMode()</link> API call)</para></listitem>
<listitem><para>added quorum searching support to query language (V2 engine only; example: "any three of all these words"/3)</para></listitem>
<listitem><para>added query escaping support to query language, and <link linkend="api-func-escapestring">EscapeString()</link> API call</para></listitem>
<listitem><para>added multi-field syntax support to query language (example: "@(field1,field2) something"), and @@relaxed field checks option</para></listitem>
<listitem><para>added optional star-syntax ('word*') support in keywords, <link linkend="conf-enable-star">enable_star</link> directive (for prefix/infix indexes only)</para></listitem>
<listitem><para>added full-scan support (query must be fully empty; can perform block-reject optimization)</para></listitem>
<listitem><para>added COUNT(DISTINCT(attr)) calculation support, <link linkend="api-func-setgroupdistinct">SetGroupDistinct()</link> API call</para></listitem>
<listitem><para>added group-by on MVA support, <link linkend="api-func-setarrayresult">SetArrayResult()</link> PHP API call</para></listitem>
<listitem><para>added per-index weights feature, <link linkend="api-func-setindexweights">SetIndexWeights()</link> API call</para></listitem>
<listitem><para>added geodistance support, <link linkend="api-func-setgeoanchor">SetGeoAnchor()</link> API call (see the sketch after this list)</para></listitem>
<listitem><para>added result set sorting by arbitrary expressions in run time (eg. "@weight+log(price)*2.5"), SPH_SORT_EXPR mode</para></listitem>
<listitem><para>added result set sorting by @custom compile-time sorting function (see src/sphinxcustomsort.inl)</para></listitem>
<listitem><para>added result set sorting by @random value</para></listitem>
<listitem><para>added result set merging for indexes with different schemas</para></listitem>
<listitem><para>added query comments support (3rd arg to <link linkend="api-func-query">Query()</link>/<link linkend="api-func-addquery">AddQuery()</link> API calls, copied verbatim to query log)</para></listitem>
<listitem><para>added keyword extraction support, <link linkend="api-func-buildkeywords">BuildKeywords()</link> API call</para></listitem>
<listitem><para>added binding field weights by name, <link linkend="api-func-setfieldweights">SetFieldWeights()</link> API call</para></listitem>
<listitem><para>added optional limit on query time, <link linkend="api-func-setmaxquerytime">SetMaxQueryTime()</link> API call</para></listitem>
<listitem><para>added optional limit on found matches count (4th arg to <link linkend="api-func-setlimits">SetLimits()</link> API call, so-called 'cutoff')</para></listitem>
</itemizedlist>
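
<para>
The geodistance and expression-sorting items above combine naturally. A
sketch using the PHP API (the latitude/longitude attribute names, the index
name "places", and the anchor point are hypothetical; coordinates must be
given in radians):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// anchor point; makes the magic @geodist attribute available
$cl->SetGeoAnchor("lat", "lon", 0.9534, -2.1029);

// rank by a blend of text relevance and distance to the anchor
$cl->SetSortMode(SPH_SORT_EXPR, "@weight + 1000000/(1+@geodist)");

$res = $cl->Query("pizza", "places");
]]></programlisting>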
<bridgehead>APIs and SphinxSE</bridgehead>

<itemizedlist>
<listitem><para>added pure C API (libsphinxclient)</para></listitem>
<listitem><para>added Ruby API (thanks to Dmytro Shteflyuk)</para></listitem>
<listitem><para>added Java API</para></listitem>
<listitem><para>added SphinxSE support for MVAs (use varchar), floats (use float), 64bit docids (use bigint)</para></listitem>
<listitem><para>added SphinxSE options "floatrange", "geoanchor", "fieldweights", "indexweights", "maxquerytime", "comment", "host" and "port"; and support for "expr:CLAUSE"</para></listitem>
<listitem><para>improved SphinxSE max query size (using MySQL condition pushdown), up to 256K now</para></listitem>
</itemizedlist>
<bridgehead>General</bridgehead>

<itemizedlist>
<listitem><para>added scripting (shebang syntax) support to config files (example: #!/usr/bin/php in the first line); see the sketch after this list</para></listitem>
<listitem><para>added unified config handling and validation to all programs</para></listitem>
<listitem><para>added unified documentation</para></listitem>
<listitem><para>added .spec file for RPM builds</para></listitem>
<listitem><para>added automated testing suite</para></listitem>
<listitem><para>improved index locking, now fcntl()-based instead of buggy file-existence-based</para></listitem>
<listitem><para>fixed unaligned RAM accesses, now works on SPARC and ARM</para></listitem>
</itemizedlist>
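
<para>
The shebang scripting item deserves a short example. With it, the config
file itself can be an executable script whose standard output is then parsed
as the actual configuration. A sketch (the host names and generated settings
are hypothetical, and the emitted sources are abridged for brevity):
</para>

<programlisting><![CDATA[
#!/usr/bin/php
<?php
// emit one source section per database host; indexer/searchd will
// parse whatever this script prints as if it were sphinx.conf
foreach (array("db1.example.com", "db2.example.com") as $i => $host)
{
    print("source src$i\n{\n");
    print("\ttype = mysql\n");
    print("\tsql_host = $host\n");
    print("\tsql_query = SELECT id, title FROM documents\n");
    print("}\n\n");
}
]]></programlisting>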
<bridgehead id="rel098-fixes-since-rc2">Changes and fixes since 0.9.8-rc2</bridgehead>

<itemizedlist>
<listitem><para>added pure C API (libsphinxclient)</para></listitem>
<listitem><para>added Ruby API</para></listitem>
<listitem><para>added SetConnectTimeout() PHP API call</para></listitem>
<listitem><para>added allowed type check to UpdateAttributes() handler (bug #174); see the sketch after this list</para></listitem>
<listitem><para>added defensive MVA checks on index preload (protection against broken indexes, bug #168)</para></listitem>
<listitem><para>added sphinx-min.conf sample file</para></listitem>
<listitem><para>added --without-iconv switch to configure</para></listitem>
<listitem><para>removed redundant -lz dependency in searchd</para></listitem>
<listitem><para>removed erroneous "xmlpipe2 deprecated" warning</para></listitem>
<listitem><para>fixed EINTR handling in piped read (bug #166)</para></listitem>
<listitem><para>fixed query time to be finalized before logging and sending to client (bug #153)</para></listitem>
<listitem><para>fixed attribute updates vs full-scan early-reject index (bug #149)</para></listitem>
<listitem><para>fixed gcc warnings (bug #160)</para></listitem>
<listitem><para>fixed mysql connection attempt vs pgsql source type (bug #165)</para></listitem>
<listitem><para>fixed 32-bit wraparound when preloading over 2 GB files</para></listitem>
<listitem><para>fixed "out of memory" message vs over 2 GB allocs (bug #116)</para></listitem>
<listitem><para>fixed unaligned RAM access detection on ARM (where unaligned reads do not crash but produce wrong results)</para></listitem>
<listitem><para>fixed missing full scan results in some cases</para></listitem>
<listitem><para>fixed several bugs in --merge, --merge-dst-range</para></listitem>
<listitem><para>fixed @geodist vs MultiQuery and filters, @expr vs MultiQuery</para></listitem>
<listitem><para>fixed GetTokenEnd() vs 1-grams (was causing crash in excerpts)</para></listitem>
<listitem><para>fixed sql_query_range to handle empty strings in addition to NULL strings (Postgres specific)</para></listitem>
<listitem><para>fixed morphology=none vs infixes</para></listitem>
<listitem><para>fixed case-sensitive attribute names in UpdateAttributes()</para></listitem>
<listitem><para>fixed ext2 ranking vs. stopwords (now using atompos from query parser)</para></listitem>
<listitem><para>fixed EscapeString() call</para></listitem>
<listitem><para>fixed escaped specials (now handled as whitespace if not in charset)</para></listitem>
<listitem><para>fixed schema minimizer (now handles type/size mismatches)</para></listitem>
<listitem><para>fixed word stats in extended2; stemmed form is now returned</para></listitem>
<listitem><para>fixed spelldump case folding vs dictionary-defined character sets</para></listitem>
<listitem><para>fixed Postgres BOOLEAN handling</para></listitem>
<listitem><para>fixed enforced "inline" docinfo on empty indexes (normally ok, but index merge was really confused)</para></listitem>
<listitem><para>fixed rare count(distinct) out-of-bounds issue (it occasionally caused too high @distinct values)</para></listitem>
<listitem><para>fixed hangups on documents with id=DOCID_MAX in some cases</para></listitem>
<listitem><para>fixed rare crash in tokenizer (prefixed synonym vs. input stream eof)</para></listitem>
<listitem><para>fixed query parser vs "aaa (bbb ccc)|ddd" queries</para></listitem>
<listitem><para>fixed BuildExcerpts() request in Java API</para></listitem>
<listitem><para>fixed Postgres-specific memory leak</para></listitem>
<listitem><para>fixed handling of overshort keywords (less than min_word_len)</para></listitem>
<listitem><para>fixed HTML stripper (now emits space after indexed attributes)</para></listitem>
<listitem><para>fixed 32-field case in query parser</para></listitem>
<listitem><para>fixed rare count(distinct) vs. querying multiple local indexes vs. reusable sorter issue</para></listitem>
<listitem><para>fixed sorting of negative floats in SPH_SORT_EXTENDED mode</para></listitem>
</itemizedlist>
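
<para>
Related to the UpdateAttributes() type-check item above, here is a minimal
update sketch using the PHP API (the attribute name "folder_id", the document
IDs, and the index name "test1" are hypothetical):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// set folder_id=42 on documents 10 and 11; returns the number of
// actually updated documents, or -1 on error (eg. disallowed type)
$updated = $cl->UpdateAttributes("test1", array("folder_id"),
    array(10 => array(42), 11 => array(42)));
if ($updated < 0)
    print("update failed: " . $cl->GetLastError() . "\n");
]]></programlisting>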
</sect1>


<sect1 id="rel097"><title>Version 0.9.7, 02 apr 2007</title>

<itemizedlist>
<listitem><para>added support for <option>sql_str2ordinal_column</option></para></listitem>
<listitem><para>added support for up to 5 sort-by attrs (in extended sorting mode); see the sketch after this list</para></listitem>
<listitem><para>added support for separate groups sorting clause (in group-by mode)</para></listitem>
<listitem><para>added support for on-the-fly attribute updates (PRE-ALPHA; will change heavily; use for preliminary testing ONLY)</para></listitem>
<listitem><para>added support for zero/NULL attributes</para></listitem>
<listitem><para>added support for 0.9.7 features to SphinxSE</para></listitem>
<listitem><para>added support for n-grams (alpha, 1-grams only for now)</para></listitem>
<listitem><para>added support for warnings reported to client</para></listitem>
<listitem><para>added support for exclude-filters</para></listitem>
<listitem><para>added support for prefix and infix indexing (see <option>max_prefix_len</option>, <option>max_infix_len</option>)</para></listitem>
<listitem><para>added <option>@*</option> syntax to reset current field to query language</para></listitem>
<listitem><para>added removal of duplicate entries in query index order</para></listitem>
<listitem><para>added PHP API workarounds for PHP signed/unsigned braindamage</para></listitem>
<listitem><para>added locks to avoid two concurrent indexers working on same index</para></listitem>
<listitem><para>added check for existing attributes vs. <option>docinfo=none</option> case</para></listitem>
<listitem><para>improved groupby code a lot (better precision, and up to 25x faster in extreme cases)</para></listitem>
<listitem><para>improved error handling and reporting</para></listitem>
<listitem><para>improved handling of broken indexes (reports error instead of hanging/crashing)</para></listitem>
<listitem><para>improved <option>mmap()</option> limits for attributes and wordlists (now able to map over 4 GB on x64 and over 2 GB on 32-bit systems where possible)</para></listitem>
<listitem><para>reduced <option>malloc()</option> pressure in head daemon (search time should not degrade with time any more)</para></listitem>
<listitem><para>improved <filename>test.php</filename> command line options</para></listitem>
<listitem><para>improved error reporting (distributed query, broken index etc issues now reported to client)</para></listitem>
<listitem><para>changed default network packet size to be 8M, added extra checks</para></listitem>
<listitem><para>fixed division by zero in BM25 on 1-document collections (in extended matching mode)</para></listitem>
<listitem><para>fixed <filename>.spl</filename> files getting unlinked</para></listitem>
<listitem><para>fixed crash in schema compatibility test</para></listitem>
<listitem><para>fixed UTF-8 Russian stemmer</para></listitem>
<listitem><para>fixed requested matches count when querying distributed agents</para></listitem>
<listitem><para>fixed signed vs. unsigned issues everywhere (ranged queries, CLI search output, and obtaining docid)</para></listitem>
<listitem><para>fixed potential crashes vs. negative query offsets</para></listitem>
<listitem><para>fixed 0-match docs vs. extended mode vs. stats</para></listitem>
<listitem><para>fixed group/timestamp filters being ignored if querying from older clients</para></listitem>
<listitem><para>fixed docs to mention <option>pgsql</option> source type</para></listitem>
<listitem><para>fixed issues with explicit '&amp;' in extended matching mode</para></listitem>
<listitem><para>fixed wrong assertion in SBCS encoder</para></listitem>
<listitem><para>fixed crashes with no-attribute indexes after rotate</para></listitem>
</itemizedlist>
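
<para>
A short PHP sketch of the extended sorting mode mentioned above (the
attribute name "price" and the index name "test1" are hypothetical):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// up to 5 sort-by attributes may be listed in the clause
$cl->SetSortMode(SPH_SORT_EXTENDED, "@weight DESC, price ASC, @id DESC");
$res = $cl->Query("hello", "test1");
]]></programlisting>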
</sect1>


<sect1 id="rel097rc2"><title>Version 0.9.7-rc2, 15 dec 2006</title>

<itemizedlist>
<listitem><para>added support for extended matching mode (query language); see the sketch after this list</para></listitem>
<listitem><para>added support for extended sorting mode (sorting clauses)</para></listitem>
<listitem><para>added support for SBCS excerpts</para></listitem>
<listitem><para>added <option>mmap()ing</option> for attributes and wordlist (improves search time, speeds up <option>fork()</option> greatly)</para></listitem>
<listitem><para>fixed attribute name handling to be case insensitive</para></listitem>
<listitem><para>fixed default compiler options to simplify post-mortem debugging (added <option>-g</option>, removed <option>-fomit-frame-pointer</option>)</para></listitem>
<listitem><para>fixed rare memory leak</para></listitem>
<listitem><para>fixed "hello hello" queries in "match phrase" mode</para></listitem>
<listitem><para>fixed issue with excerpts, texts and overlong queries</para></listitem>
<listitem><para>fixed logging of multiple index names (no longer tokenized)</para></listitem>
<listitem><para>fixed trailing stopword not flushed from tokenizer</para></listitem>
<listitem><para>fixed boolean evaluation</para></listitem>
<listitem><para>fixed pidfile being wrongly <option>unlink()ed</option> on <option>bind()</option> failure</para></listitem>
<listitem><para>fixed <option>--with-mysql-includes/libs</option> (they conflicted with well-known paths)</para></listitem>
<listitem><para>fixes for 64-bit platforms</para></listitem>
</itemizedlist>
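
<para>
A quick sketch of the extended matching mode (query language) item above,
using the PHP API (the field names and the index name "test1" are
hypothetical):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// field limits, a phrase, and a boolean OR in a single query
$cl->SetMatchMode(SPH_MATCH_EXTENDED);
$res = $cl->Query('@title "hello world" | @body example', "test1");
]]></programlisting>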
</sect1>


<sect1 id="rel097rc"><title>Version 0.9.7-rc1, 26 oct 2006</title>

<itemizedlist>
<listitem><para>added alpha index merging code</para></listitem>
<listitem><para>added an option to decrease <option>max_matches</option> per-query</para></listitem>
<listitem><para>added an option to specify IP address for searchd to listen on</para></listitem>
<listitem><para>added support for unlimited amount of configured sources and indexes</para></listitem>
<listitem><para>added support for group-by queries; see the sketch after this list</para></listitem>
<listitem><para>added support for /2 range modifier in charset_table</para></listitem>
<listitem><para>added support for arbitrary amount of document attributes</para></listitem>
<listitem><para>added logging filter count and index name</para></listitem>
<listitem><para>added <option>--with-debug</option> option to configure to compile in debug mode</para></listitem>
<listitem><para>added <option>-DNDEBUG</option> when compiling in default mode</para></listitem>
<listitem><para>improved search time (added doclist size hints, in-memory wordlist cache, and used VLB coding everywhere)</para></listitem>
<listitem><para>improved (refactored) SQL driver code (adding new drivers should be very easy now)</para></listitem>
<listitem><para>improved excerpts generation</para></listitem>
<listitem><para>fixed issue with empty sources and ranged queries</para></listitem>
<listitem><para>fixed querying purely remote distributed indexes</para></listitem>
<listitem><para>fixed suffix length check in English stemmer in some cases</para></listitem>
<listitem><para>fixed UTF-8 decoder for codes over U+20000 (for CJK)</para></listitem>
<listitem><para>fixed UTF-8 encoder for 3-byte sequences (for CJK)</para></listitem>
<listitem><para>fixed overshort (less than <option>min_word_len</option>) words prepended to next field</para></listitem>
<listitem><para>fixed source connection order (indexer does not connect to all sources at once now)</para></listitem>
<listitem><para>fixed line numbering in config parser</para></listitem>
<listitem><para>fixed some issues with index rotation</para></listitem>
</itemizedlist>
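
<para>
A minimal group-by sketch for the item above, using the PHP API (the
attribute name "group_id" and the index name "test1" are hypothetical):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

// return one best match per group_id value,
// with groups ordered by their match counts
$cl->SetGroupBy("group_id", SPH_GROUPBY_ATTR, "@count desc");
$res = $cl->Query("hello", "test1");
]]></programlisting>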
</sect1>


<sect1 id="rel096"><title>Version 0.9.6, 24 jul 2006</title>

<itemizedlist>
<listitem><para>added support for empty indexes</para></listitem>
<listitem><para>added support for multiple sql_query_pre/post/post_index</para></listitem>
<listitem><para>fixed timestamp ranges filter in "match any" mode</para></listitem>
<listitem><para>fixed configure issues with --without-mysql and --with-pgsql options</para></listitem>
<listitem><para>fixed building on Solaris 9</para></listitem>
</itemizedlist>
</sect1>


<sect1 id="rel096rc1"><title>Version 0.9.6-rc1, 26 jun 2006</title>

<itemizedlist>
<listitem><para>added boolean queries support (experimental, beta version)</para></listitem>
<listitem><para>added simple file-based query cache (experimental, beta version)</para></listitem>
<listitem><para>added storage engine for MySQL 5.0 and 5.1 (experimental, beta version)</para></listitem>
<listitem><para>added GNU style <filename>configure</filename> script</para></listitem>
<listitem><para>added new searchd protocol (all binary, and should be backwards compatible)</para></listitem>
<listitem><para>added distributed searching support to searchd</para></listitem>
<listitem><para>added PostgreSQL driver</para></listitem>
<listitem><para>added excerpts generation; see the sketch after this list</para></listitem>
<listitem><para>added <option>min_word_len</option> option to index</para></listitem>
<listitem><para>added <option>max_matches</option> option to searchd, removed hardcoded MAX_MATCHES limit</para></listitem>
<listitem><para>added initial documentation, and a working <filename>example.sql</filename></para></listitem>
<listitem><para>added support for multiple sources per index</para></listitem>
<listitem><para>added soundex support</para></listitem>
<listitem><para>added group ID ranges support</para></listitem>
<listitem><para>added <option>--stdin</option> command-line option to search utility</para></listitem>
<listitem><para>added <option>--noprogress</option> option to indexer</para></listitem>
<listitem><para>added <option>--index</option> option to search</para></listitem>
<listitem><para>fixed UTF-8 decoder (3-byte codepoints did not work)</para></listitem>
<listitem><para>fixed PHP API to handle big result sets faster</para></listitem>
<listitem><para>fixed config parser to handle empty values properly</para></listitem>
<listitem><para>fixed redundant <code>time(NULL)</code> calls in time-segments mode</para></listitem>
</itemizedlist>
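
<para>
The excerpts feature above can be exercised with a few lines of PHP
(the document text, keywords, and index name "test1" are hypothetical;
the index only supplies tokenization settings here):
</para>

<programlisting><![CDATA[
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);

$docs = array("this is my test text to be highlighted");
$opts = array("before_match" => "<b>", "after_match" => "</b>");

$snippets = $cl->BuildExcerpts($docs, "test1", "test text", $opts);
if ($snippets)
    print($snippets[0] . "\n");
]]></programlisting>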
</sect1>


</appendix>


</book>