
I'm preparing to deploy a Rails app on Heroku that requires full text search. Up to now I've been running it on a VPS using MySQL with Sphinx.

However, if I want to use Sphinx or Solr on Heroku, I'd need to pay for an add-on.

I notice that PostgreSQL (the DB used on Heroku) has built-in full text search capability.

Is there a reason I couldn't use Postgres's full-text search? Is it slower than Sphinx or is there some other major limitation?

Ethan

6 Answers


Edit, 2016 — Why not both?

If you're interested in Postgres vs. Lucene, why not both? Check out the ZomboDB extension for Postgres, which integrates Elasticsearch as a first-class index type. Still a fairly early project but it looks really promising to me.

(Technically not available on Heroku, but still worth looking at.)


Disclosure: I'm a cofounder of the Websolr and Bonsai Heroku add-ons, so my perspective is a bit biased toward Lucene.

My read on Postgres full-text search is that it is pretty solid for straightforward use cases, but there are a number of reasons why Lucene (and thus Solr and ElasticSearch) is superior both in terms of performance and functionality.

For starters, jpountz provides a truly excellent technical answer to the question, Why is Solr so much faster than Postgres? It's worth a couple of read-throughs to really digest.

I also commented on a recent RailsCast episode comparing relative advantages and disadvantages of Postgres full-text search versus Solr. Let me recap that here:

Pragmatic advantages to Postgres

  • Reuse an existing service that you're already running instead of setting up and maintaining (or paying for) something else.
  • Far superior to the fantastically slow SQL LIKE operator (a minimal query sketch follows this list).
  • Less hassle keeping data in sync since it's all in the same database — no application-level integration with some external data service API.
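
To make the LIKE comparison concrete, here is a minimal sketch of a Postgres full-text query from Django's ORM (the same stack used in the benchmark answer further down); the Post model and its title/body fields are hypothetical:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# Hypothetical model: Post(title, body). A body__icontains filter would become a
# LIKE '%...%' scan; the annotation below uses Postgres tsvector/tsquery instead,
# with weights so title matches rank above body matches.
vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("full text search")

results = (
    Post.objects
    .annotate(search=vector, rank=SearchRank(vector, query))
    .filter(search=query)
    .order_by("-rank")
)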

Advantages to Solr (or ElasticSearch)

Off the top of my head, in no particular order…

  • Scale your indexing and search load separately from your regular database load.
  • More flexible term analysis for things like accent normalizing, linguistic stemming, N-grams, and markup removal, plus other features such as spellcheck and "rich content" (e.g., PDF and Word) extraction (see the analyzer sketch after this list).
  • Solr/Lucene can do everything on the Postgres full-text search TODO list just fine.
  • Much better and faster term relevancy ranking, efficiently customizable at search time.
  • Probably faster search performance for common terms or complicated queries.
  • Probably more efficient indexing performance than Postgres.
  • Better tolerance for change in your data model by decoupling indexing from your primary data store.
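
As a rough sketch of what that term-analysis flexibility looks like, here is an Elasticsearch index with a custom analyzer (accent folding, stemming, and edge N-grams) created through the official Python client; the index name, fields, analyzer name, and node URL are all illustrative, and the mapping format assumes a 7.x-era cluster:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local node

# Custom analysis chain: lowercase, strip accents, stem English words,
# then emit edge n-grams so prefix searches match too.
body = {
    "settings": {
        "analysis": {
            "filter": {
                "edge_2_15": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15},
            },
            "analyzer": {
                "folded_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "porter_stem", "edge_2_15"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "folded_english"},
            "body": {"type": "text", "analyzer": "folded_english"},
        },
    },
}

es.indices.create(index="posts", body=body)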

Clearly I think a dedicated search engine based on Lucene is the better option here. Basically, you can think of Lucene as the de facto open source repository of search expertise.

But if your only other option is the LIKE operator, then Postgres full-text search is a definite win.

Nick Zadrozny
  • In Postgres 9.x you can speed up LIKE searches using a trigram index. – a_horse_with_no_name Jun 04 '12 at 07:08
  • Thanks, O nameless equine, that's interesting. Looks like pg_trgm with `LIKE` is a not-unreasonable quick-and-dirty search. – Nick Zadrozny Jun 04 '12 at 14:35
  • What's the difference between Websolr and Bonsai? – Ethan Jun 05 '12 at 01:00
  • Under the advantages to PostgreSQL you are leaving out the most critical one by far: your search index never goes out of sync, because you can update it with a trigger. This is a MASSIVE advantage over every separate search solution, where each update to your data requires syncing those changes with the standalone search engine. With a trigger, the index syncing is totally automatic whenever the data changes, even if it changes from another code base outside of your Rails app (Go, Node, Java, another Rails app). HUUUUUGE pragmatic win. (A trigger sketch follows these comments.) – brightball Aug 21 '14 at 17:59
  • More often than not you are going to want to stick with Postgres if you're already familiar with the DB. Don't think too hard, unless you enjoy massive maintenance nightmares. –  Jan 23 '16 at 16:59
  • TheOldHag + aramisbear: just edited to add a note about ZomboDB, which integrates the utility of ES while letting Postgres handle integration/synchronization. Looks pretty neat. – Nick Zadrozny Jan 28 '16 at 21:57
  • My little contribution: I don't know how it was done before, but now you can separate indexing and search load into different roles across multiple nodes in an Elasticsearch cluster. It's also very easy to maintain, and the REST API is a plus if you want to expose your data to multiple remote services. These are usually the kinds of criteria I look at when making a decision. – Alex Apr 04 '17 at 10:05
  • @Alex I'd be interested to learn more about what you're referring to with separating indexing and search. Some sharding topologies can help with that by splitting write vs. read indices. Otherwise my understanding is that indexing happens on all replicas of a shard in parallel, more or less. – Nick Zadrozny Apr 13 '17 at 16:42
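
To make the trigger point above concrete, here is a rough sketch of a trigger-maintained tsvector column plus a GIN index, run from Python with psycopg2; the posts table, its title/body columns, and the connection string are hypothetical, and tsvector_update_trigger is the stock helper that ships with Postgres:

import psycopg2

DDL = """
ALTER TABLE posts ADD COLUMN IF NOT EXISTS tsv tsvector;

-- Backfill existing rows once.
UPDATE posts
   SET tsv = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));

CREATE INDEX IF NOT EXISTS posts_tsv_idx ON posts USING GIN (tsv);

-- From here on, every INSERT/UPDATE keeps tsv in sync, no matter which
-- code base (Rails, Go, Node, ...) performs the write.
CREATE TRIGGER posts_tsv_update
    BEFORE INSERT OR UPDATE ON posts
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
"""

conn = psycopg2.connect("dbname=myapp")  # hypothetical connection string
with conn:                               # commits on success
    with conn.cursor() as cur:
        cur.execute(DDL)
conn.close()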

Since I just went through the effort of comparing Elasticsearch (1.9) against Postgres FTS, I figured I should share my results, since they're somewhat more current than the ones @gustavodiazjaimes cites.

My main concern with Postgres was that it did not have faceting built in, but that's trivial to build yourself. Here's my example (in Django):

from django.db.models import Count

# vector_search is a SearchVectorField on YourModel; `query` is the search term.
results = YourModel.objects.filter(vector_search=query)

# Facets are just grouped counts per book from the same queryset.
facets = (results
    .values('book')
    .annotate(total=Count('book'))
    .order_by('book'))

I'm using Postgres 9.6 and Elasticsearch 1.9 (through Haystack on Django). Here's a comparison between Elasticsearch and Postgres across 16 different types of queries.

    es_times  pg_times  es_times_faceted  pg_times_faceted
0   0.065972  0.000543          0.015538          0.037876
1   0.000292  0.000233          0.005865          0.007130
2   0.000257  0.000229          0.005203          0.002168
3   0.000247  0.000161          0.003052          0.001299
4   0.000276  0.000150          0.002647          0.001167
5   0.000245  0.000151          0.005098          0.001512
6   0.000251  0.000155          0.005317          0.002550
7   0.000331  0.000163          0.005635          0.002202
8   0.000268  0.000168          0.006469          0.002408
9   0.000290  0.000236          0.006167          0.002398
10  0.000364  0.000224          0.005755          0.001846
11  0.000264  0.000182          0.005153          0.001667
12  0.000287  0.000153          0.010218          0.001769
13  0.000264  0.000231          0.005309          0.001586
14  0.000257  0.000195          0.004813          0.001562
15  0.000248  0.000174          0.032146          0.002246
                  count      mean       std       min       25%       50%       75%       max
es_times           16.0  0.004382  0.016424  0.000245  0.000255  0.000266  0.000291  0.065972
pg_times           16.0  0.000209  0.000095  0.000150  0.000160  0.000178  0.000229  0.000543
es_times_faceted   16.0  0.007774  0.007150  0.002647  0.005139  0.005476  0.006242  0.032146
pg_times_faceted   16.0  0.004462  0.009015  0.001167  0.001580  0.002007  0.002400  0.037876

To get Postgres to these speeds for faceted searches, I had to use a GIN index on the field together with a SearchVectorField, which is Django specific, but I'm sure other frameworks have a similar vector type.
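
For reference, the Django side of that setup looks roughly like the following; the model is a guess at the schema implied by the snippets above, with the GIN index declared directly on the SearchVectorField:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class YourModel(models.Model):
    book = models.CharField(max_length=200)
    text = models.TextField()
    # Populated separately (e.g. with SearchVector updates or a DB trigger)
    # and queried with filter(vector_search=...) as shown earlier.
    vector_search = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["vector_search"])]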

One other consideration is that pg 9.6 now supports phrase matching, which is huge.
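
In Django terms (2.2 or newer, which is more recent than the setup benchmarked above), a phrase query is just a different search_type and maps to phraseto_tsquery on the Postgres side; the model and phrase here are illustrative:

from django.contrib.postgres.search import SearchQuery

# Matches the words as an ordered, adjacent phrase rather than independent terms.
phrase = SearchQuery("full text search", search_type="phrase")
results = YourModel.objects.filter(vector_search=phrase)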

My takeaway is that Postgres is, for most cases, going to be preferable, since it offers:

  1. a simpler stack
  2. no search back-end API wrapper dependencies to contend with (thinking-sphinx, django-sphinx, haystack, etc.). These can be a drag, since they might not support the features your search back-end does (e.g. haystack faceting/aggregates).
  3. similar performance and features (for my needs)
yekta
  • For those of us who don't use Django, could you update your answer to include the actual Postgres queries that were performed to get the facets? Did you manage to run your main query and the facet query in one fetch, or did you have to do separate queries? – ccleve Mar 17 '19 at 04:22
  • If I'm reading your comparison right, on equivalent hardware pg was 20x faster than es? – davidtgq Apr 06 '19 at 13:31

I found this amazing comparison and want to share it:

Full Text Search In PostgreSQL

                      LIKE predicate   PostgreSQL / GIN   Sphinx Search   Apache Lucene   Inverted index
Time to Build Index   none             40 min             6 min           9 min           high
Index Storage         none             532 MB             533 MB          1071 MB         101 MB
Query Speed           90+ seconds      20 ms              8 ms            80 ms           40 ms

gustavodiazjaimes

Postgres's full-text search has amazing capabilities in the areas of stemming, ranking/boosting, synonym handling, and fuzzy searching, among others, but no support for faceted search.

So, if Postgres is already in your stack and you don't need faceting, it's worth trying it first to get the huge benefits of easy index syncing and a leaner stack, before reaching for Lucene-based solutions; at least if your whole app is not built around search.
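
As a small illustration of the fuzzy-search side, Postgres's pg_trgm extension is exposed in Django as TrigramSimilarity; the Author model and threshold here are hypothetical, and the extension has to be enabled first (CREATE EXTENSION pg_trgm):

from django.contrib.postgres.search import TrigramSimilarity

# Rank rows by trigram similarity to a possibly misspelled search term.
matches = (
    Author.objects
    .annotate(similarity=TrigramSimilarity("name", "Katherin Stevens"))
    .filter(similarity__gt=0.3)
    .order_by("-similarity")
)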

Devi
  • Faceting can be achieved on your own, though; you *probably* don't need it as a "feature" from your FTS. I addressed this in my answer above with an example. – yekta Jul 12 '17 at 07:16

Some more recent results for synthetic customer data (10 million records).

[benchmark chart image not reproduced]

Eugene Lycenok

PostgreSQL's FTS functionality is mature and fairly fast at lookups. It's worth a look for sure.

Scott Marlowe