Skip to main content
Version: 1.0.16

bloom

bloom provides a Bloom filter-based index access method.

A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set.

In the context of an index access method, it can quickly exclude non-matching tuples through signatures whose size is determined at index creation time.

The signature is a lossy representation of the indexed attributes, and therefore may produce false positives, meaning it is possible to report that an element is in the set when it is not. Therefore, index search results must be rechecked using the actual attribute values from the heap items. Larger signatures reduce the probability of false positives and decrease the number of unnecessary heap accesses, but they obviously make the index larger and slower to scan.

This type of index is most useful when a table has many attributes and queries may test any combination of them. Traditional btree indexes are faster than bloom indexes, but require many btree indexes to support all possible queries, whereas with bloom indexes only one is needed. However, note that bloom indexes only support equality queries, while btree indexes can also perform inequality and range searches.

1. Parameters

bloom indexes accept the following parameters in their WITH clause:

length The number of bits per signature (index entry), which is rounded up to the nearest multiple of 16. The default is 80 bits, and the maximum is 4096 bits.

col1 — col32 The number of bits generated from each indexed column. Each parameter name indicates the index column number it controls. The default is 2 bits, and the maximum is 4095 bits. Parameters for index columns that are not actually used are ignored.

2. Examples

Here is an example of creating a bloom index:

test=## CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)

test-## WITH (length=80, col1=2, col2=2, col3=4);

CREATE INDE

This index is created with 80-bit signatures, where attributes i1 and i2 are mapped to 2 bits, and attribute i3 is mapped to 4 bits. The length, col1, and col2 specifications can be omitted since they all have default values.

Here is a more complete example of bloom index definition and usage, also compared with an equivalent btree. The bloom index is smaller and more efficient than a btree index.

test-## (random() * 1000000)::int as i1,

test-## (random() * 1000000)::int as i2,

test-## (random() * 1000000)::int as i3,

test-## (random() * 1000000)::int as i4,

test-## (random() * 1000000)::int as i5,

test-## (random() * 1000000)::int as i6

test-## FROM

test-## generate_series(1,10000000);

SELECT 10000000

A sequential scan on this large table takes a long time:

test=## EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------

Gather (cost=1000.00..127220.72 rows=250 width=24) (actual time=414.282..415.411 rows=0 loops=1)

Workers Planned: 2

Workers Launched: 2

-> Parallel Seq Scan on tbloom (cost=0.00..126195.72 rows=104 width=24) (actual time=410.361..410.361 rows=0 loop

s=3)

Filter: ((i2 = 898732) AND (i5 = 123451))

Rows Removed by Filter: 3333333

Planning Time: 0.176 ms

Execution Time: 415.484 ms

(8 rows)

Even if a btree index is defined, the result will still be a sequential scan:

test=## CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);

CREATE INDEX

test=## SELECT pg_size_pretty(pg_relation_size('btreeidx'));

pg_size_pretty

----------------

386 MB

(1 row)


test=## EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------

Gather (cost=1000.00..127195.10 rows=1 width=24) (actual time=220.290..221.186 rows=0 loops=1)

Workers Planned: 2

Workers Launched: 2

-> Parallel Seq Scan on tbloom (cost=0.00..126195.00 rows=1 width=24) (actual time=216.318..216.319 rows=0 loops=

3)

Filter: ((i2 = 898732) AND (i5 = 123451))

Rows Removed by Filter: 3333333

Planning Time: 0.165 ms

Execution Time: 221.241 ms

(8 rows)

For handling this type of search, bloom is better than btree:

test=## CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);

CREATE INDEX

test=## SELECT pg_size_pretty(pg_relation_size('bloomidx'));

pg_size_pretty

----------------

153 MB

(1 row)

Now, the main problem with btree search is: when the search conditions do not constrain the leading index columns, btree is inefficient.

A better strategy for btree is to create a separate index on each column. Then the planner will choose a plan like this:

test=## CREATE INDEX btreeidx1 ON tbloom (i1);
CREATE INDEX

test=## CREATE INDEX btreeidx2 ON tbloom (i2);
CREATE INDEX

test=## CREATE INDEX btreeidx3 ON tbloom (i3);
CREATE INDEX

test=## CREATE INDEX btreeidx4 ON tbloom (i4);
CREATE INDEX

test=## CREATE INDEX btreeidx5 ON tbloom (i5);
CREATE INDEX

test=## CREATE INDEX btreeidx6 ON tbloom (i6);
CREATE INDEX

test=## EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------

Bitmap Heap Scan on tbloom (cost=9.28..13.29 rows=1 width=24) (actual time=0.659..0.664 rows=0 loops=1)

Recheck Cond: ((i5 = 123451) AND (i2 = 898732))

-> BitmapAnd (cost=9.28..9.28 rows=1 width=0) (actual time=0.657..0.660 rows=0 loops=1)

-> Bitmap Index Scan on btreeidx5 (cost=0.00..4.51 rows=10 width=0) (actual time=0.627..0.627 rows=5 loops=

1)

Index Cond: (i5 = 123451)

-> Bitmap Index Scan on btreeidx2 (cost=0.00..4.52 rows=11 width=0) (actual time=0.025..0.025 rows=10 loops

=1)

Index Cond: (i2 = 898732)

Planning Time: 0.416 ms

Execution Time: 3.074 ms

(9 rows)

Although this query runs faster than with any single index, we pay a price in index size.

Each single-column btree index occupies 2 MB, so the total space exceeds 12 MB, which is 8 times the space used by the bloom index.

3 Limitations

  • The module only includes operator classes for int4 and text.

  • Search only supports the = operator. However, future support could be added for arrays with union and intersection operations.

  • The bloom access method does not support UNIQUE indexes.

  • The bloom access method does not support searching for NULL values.