
I have a simple count query that can use an Index Only Scan, but it still takes so long in PostgreSQL!

I have a cars table with a type column (bigint) and an active column (boolean), plus a multi-column index on those two columns:

CREATE TABLE cars
(
    id         BIGSERIAL NOT NULL
        CONSTRAINT cars_pkey PRIMARY KEY,
    type       BIGINT NOT NULL,
    name       VARCHAR(500) NOT NULL,
    active     BOOLEAN DEFAULT TRUE NOT NULL,
    created_at TIMESTAMP(0) WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP(0) WITH TIME ZONE DEFAULT NOW(),
    deleted_at TIMESTAMP(0) WITH TIME ZONE
);

CREATE INDEX cars_type_active_index ON cars (type, active);

I inserted 950k rows of test data; type = 1 accounts for 600k of them:

INSERT INTO cars (type, name) (SELECT 1, 'car-name' FROM generate_series(1,600000));
INSERT INTO cars (type, name) (SELECT 2, 'car-name' FROM generate_series(1,200000));
INSERT INTO cars (type, name) (SELECT 3, 'car-name' FROM generate_series(1,100000));
INSERT INTO cars (type, name) (SELECT 4, 'car-name' FROM generate_series(1,50000));

Let's run VACUUM ANALYZE and force PostgreSQL to use an Index Only Scan:

VACUUM ANALYSE;
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;

OK, here is a simple query on type and active:

EXPLAIN (VERBOSE, BUFFERS, ANALYSE) 
SELECT count(*) 
FROM cars 
WHERE type = 1 AND active = true;

Result:

Aggregate  (cost=24805.70..24805.71 rows=1 width=0) (actual time=4460.915..4460.918 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=2806
  ->  Index Only Scan using cars_type_active_index on public.cars  (cost=0.42..23304.23 rows=600590 width=0) (actual time=0.051..2257.832 rows=600000 loops=1)
        Output: type, active
        Index Cond: ((cars.type = 1) AND (cars.active = true))
        Filter: cars.active
        Heap Fetches: 0
        Buffers: shared hit=2806
Planning time: 0.213 ms
Execution time: 4461.002 ms
(11 rows)

Looking at the EXPLAIN result:

  • It used an Index Only Scan. With an index-only scan, PostgreSQL sometimes still needs to fetch the table heap to check tuple visibility, depending on the visibility map. But I already ran VACUUM ANALYZE, and you can see Heap Fetches: 0, so reading the index alone is enough to answer this query.

  • The index is quite small and fits entirely in the buffer cache (Buffers: shared hit=2806), so PostgreSQL does not need to fetch any pages from disk. (A quick catalog check for both points is sketched below.)
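
For reference, both points can be sanity-checked from the system catalogs. This is just a quick sketch using built-in functions (pg_class, pg_size_pretty, and pg_relation_size are standard; the exact numbers will vary per install):

-- How many of the table's pages the visibility map marks all-visible
SELECT relallvisible, relpages FROM pg_class WHERE relname = 'cars';

-- Actual on-disk size of the multi-column index
SELECT pg_size_pretty(pg_relation_size('cars_type_active_index'));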

From there, I can't understand why PostgreSQL takes that long (4.5s) to answer the query: 1M rows is not a big number, everything is already cached in memory, and the data in the index is all visible, so it does not need to fetch the heap.

PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

I tested it on Docker 17.09.1-ce, on a 2015 MacBook Pro.

I am still new to PostgreSQL and trying to map my knowledge to real-world cases. Thanks so much!

  • A partial index might be more efficient for this query: `CREATE INDEX cars_type_active_index ON cars(type) where active;` – a_horse_with_no_name May 02 '18 at 11:37
  • Yes, creating a partial index can help a little, but not much, because almost 99% of the table is active. I am more curious about why the Index Only Scan is that slow when everything is already cached in memory and the data is all new. – Rey May 02 '18 at 13:20
  • Here is the EXPLAIN result with the partial index:

    Aggregate  (cost=24099.88..24099.89 rows=1 width=0) (actual time=4565.551..4565.554 rows=1 loops=1)
      Output: count(*)
      Buffers: shared hit=7250
      ->  Index Scan using cars_type_on_active_index on public.cars  (cost=0.42..22606.33 rows=597423 width=0) (actual time=0.025..2324.765 rows=600000 loops=1)
            Output: id, type, name, active, created_at, updated_at, deleted_at
            Index Cond: (cars.type = 1)
            Buffers: shared hit=7250
    Planning time: 0.095 ms
    Execution time: 4565.599 ms

    – Rey May 03 '18 at 02:54

1 Answer


It seems I found the reason: it is not a PostgreSQL problem, it is caused by running in Docker. When I run directly on my Mac, the query takes around 100 ms, which is fast enough.
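
One simple way to compare the two environments is to run the identical query in psql with timing turned on (a minimal sketch; \timing is a standard psql meta-command):

\timing on
SELECT count(*) FROM cars WHERE type = 1 AND active = true;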

Another thing I figured out is why PostgreSQL was still using a seq scan instead of an index-only scan (which is why I had to disable enable_seqscan and enable_bitmapscan in my test):

  • The size of the table is not that big compared to the size of the index. The more columns I add to the table (or the longer those columns get), the bigger the table becomes relative to the index, and the more likely the index is to be used.
  • random_page_cost defaults to 4; my disk is quite fast, so I can set it to 1-2, which helps the planner estimate costs more accurately (see the sketch after this list).
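
As a minimal sketch of that tuning (the value 1.5 here is an assumption for SSD-class storage, not a measured optimum):

SET random_page_cost = 1.5;  -- session only; ALTER SYSTEM SET random_page_cost = 1.5; persists it
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM cars
WHERE type = 1 AND active = true;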