unaccent
unaccent is a text search dictionary that removes accents (diacritical marks) from lexemes. It is a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the usual behavior of dictionaries. This allows accent-independent processing for full-text search.
The current implementation of unaccent cannot be used as a normalization dictionary for a thesaurus dictionary.
This module is considered "trusted", that is, it can be installed by non-superusers who have CREATE privilege on the current database.
1. Configuration
The unaccent dictionary accepts the following options:
• RULES is the base name of the file containing the list of translation rules. This file must be stored in $SHAREDIR/tsearch_data/. Its name must end with .rules (not included in the RULES parameter).
The rules file has the following format:
• Each line represents a pair consisting of an accented character and an unaccented character. The first character will be translated to the second. For example:
À A
Á A
Â A
Ã A
Ä A
Å A
Æ AE
The two characters must be separated by whitespace, and any leading or trailing whitespace on a line will be ignored.
• Alternatively, if a line contains only a single character, instances of that character will be deleted; this is useful in languages where accents are represented as separate characters.
• In practice, each "character" can be any string that does not contain whitespace, so the unaccent dictionary can also be used for other types of string substitution beyond removing diacritical marks.
• As with other text search configuration files, rules files must be stored in UTF-8 encoding. When loaded, the data is automatically converted to the current database encoding. Any lines containing characters that cannot be translated are silently ignored, so rules files can contain rules that are not applicable to the current encoding.
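As a sketch of the format described above, a hypothetical rules file named my_rules.rules might combine the three kinds of line: character pairs, a lone character to delete, and a longer string substitution. The file name and every rule in it are illustrative assumptions, not part of the standard distribution.

```text
é e
ñ n
ß ss
´
colour color
```

Under these assumed rules, é and ñ are translated to e and n, ß is expanded to ss, a free-standing ´ is deleted, and the string colour is rewritten as color.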
2. Usage
Installing the unaccent extension creates a text search template called unaccent and a dictionary called unaccent based on that template. The unaccent dictionary has the default parameter setting RULES='unaccent', which causes the dictionary to use the standard unaccent.rules file. To change this parameter, alter the dictionary, for example:
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
Alternatively, you can create new dictionaries based on the template.
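A minimal sketch of the second approach, assuming a rules file exists at $SHAREDIR/tsearch_data/my_rules.rules. The dictionary name my_unaccent and the rules name my_rules are hypothetical:

```sql
-- my_unaccent and my_rules are illustrative names, not shipped with the module.
-- RULES takes the base name only; the .rules suffix is added automatically.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);
```

Creating a separate dictionary leaves the stock unaccent dictionary untouched, so existing configurations that reference it keep their behavior.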
To test the dictionary, you can try:
test=# select ts_lexize('unaccent','Hôtel');
ts_lexize
-----------
{Hotel}
(1 row)
Here is an example showing how to insert the unaccent dictionary into a text search configuration:
test=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
test=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
test=# select to_tsvector('fr','Hôtels de la Mer');
to_tsvector
-------------------
'hotel':1 'mer':4
(1 row)
test=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
?column?
----------
t
(1 row)
test=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
ts_headline
------------------------
<b>Hôtel</b> de la Mer
(1 row)
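One way to inspect how each token passes through the dictionary list is the built-in ts_debug function. The query below is a minimal sketch that assumes the fr configuration from the example above has been created. Its output is omitted here because it depends on the installed rules and stemmer.

```sql
-- Assumes the fr configuration shown above exists in the current database.
-- For each token, lists the dictionaries consulted and the resulting lexemes.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('fr', 'Hôtel de la Mer');
```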
3. Functions
The unaccent() function removes accents (diacritical marks) from a given string. It is essentially a wrapper around unaccent-type dictionaries, but it can be used outside the normal text search context.
unaccent([dictionary regdictionary, ] string text) returns text
If the dictionary parameter is omitted, the text search dictionary named unaccent in the same schema as the unaccent() function is used.
For example:
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
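Because unaccent() works on plain text, it can also be applied to both sides of a comparison for accent-insensitive matching. The following is a sketch only: the inline row set and its column name are made up for illustration, and it assumes the standard unaccent.rules file is in use.

```sql
-- places(name) is a hypothetical inline data set used only for this example.
-- Both the stored value and the search pattern are unaccented before comparing.
SELECT name
FROM (VALUES ('Hôtel de la Mer'),
             ('Hotel du Lac'),
             ('Café Noir')) AS places(name)
WHERE unaccent(name) ILIKE unaccent('%hôtel%');
```

With the standard rules, both hotel rows should match, since ô is translated to o on each side of the comparison.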