What's new in the Postgres 16 query planner (citusdata.com)
265 points by clairegiordano on Feb 9, 2024 | hide | past | favorite | 146 comments


I really wish the Postgres query planner would gain the ability to replan a query midway through execution...

Frequently the most pathological queries (ie. the dreadfully slow ones) are because the query planner didn't have some knowledge required of the data distribution and couldn't accurately estimate the cost of some approach to planning the query. This can easily have a 1000x impact on execution time (ie. 1s rather than 1ms).

You will never have 100% accurate table stats - there is always some odd joint distribution you will not capture.

So instead, allow the query to start, and if progress isn't as fast as the planner expects, feed current progress info back to the planner (pages scanned, tuples matching), and replan with that new data. If the updated plan shows it's quicker to discard current results and restart with a new approach, do that.

Unfortunately, postgres does streaming queries (ie. the first results of a query are sent back to the client before the query is done), which means that significant infrastructural changes would be needed to allow a midway change of plan. Any new plan would need to keep track of which results had already been sent to the client so they aren't resent. Postgres also allows a client to, midway through a query, request that the query reverse direction and re-return previous results in reverse order. That adds a lot of complexity.


(blog author and Postgres committer here) I personally think this would be nice to have. However, I think the part about sending of tuples to the client is even more tricky than you've implied above. It's worse because a new plan does not even guarantee that the same tuples are returned. e.g. if you wrote: SELECT * FROM table LIMIT 10, there's no ORDER BY so which tuples are returned is non-deterministic. It may be easier to do something like queue up X tuples and just start sending those out when the queue is full. When the queue is full, we can say that it's too late to replan and we're locked into the current plan. People can set X to what they want to increase the time that the plan can change at the expense of using more memory and higher latency to get the first tuple.


Or it could work in a way where the planner has access to data about previous runs of each query, and uses that data to change plans that were proven bad during execution. This way, the first execution would be slow, but the planner could self-learn and do better next time. SQL Server has a bunch of similar features in its query optimizer https://learn.microsoft.com/en-us/sql/relational-databases/p....

I'm not sure Postgres has infrastructure to do that, though, because it doesn't have shared plan cache, for example.


Also, many queries might be so slow they never complete, and therefore never populate the cache. (think those queries run by a data scientist with 50 JOIN's)


I'm sure there are reasons the implementation might not be easy, but conceptually this seems fixable. You just need a lower bound of "how bad can this plan be?" and that doesn't require completing the query, just observing that it's been running for a long time and/or used unreasonable amounts of resources.

Also, is the problem with the 50 join query that the planner screws it up, or that it's fundamentally doing too much?


You'd still need analyze to gather table statistics to have the planner produce plans prior to getting any feedback from the executor. So, before getting feedback, the quality of the plans needn't be worse than they are today.


> 50 JOINS

And no indexes.


For many queries, even setting X=1 would probably have big benefits. If it takes far longer than expected to find the first result, it's probably time for a new plan.

Implementing only the X=1 case would also dramatically simplify the design of such a feature. Limiting it to read only queries would also make everything much simpler.


I agree. The primary area where bad estimates bite us is estimating that some path will return 1 row. When we join to that, Nested Loop looks like a great option. What could be faster than joining to 1 row?! It just does not go well when 1 row turns into more than 1. We'd realise the 1-row estimate was wrong by the time we got to row 2, so queuing 1 row would likely cover the majority of cases.
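As a sketch of the failure mode being described (table and column names invented for illustration), two correlated columns in a WHERE clause are a classic way to get a bogus 1-row estimate, and extended statistics are one existing remedy:

    -- city and zip are strongly correlated, but per-column selectivities
    -- are multiplied independently, so the planner may estimate 1 matching
    -- row and pick a Nested Loop for the join.
    EXPLAIN ANALYZE
    SELECT o.*
    FROM addresses a
    JOIN orders o ON o.address_id = a.id
    WHERE a.city = 'Springfield' AND a.zip = '12345';

    -- Extended statistics can teach the planner about the correlation:
    CREATE STATISTICS addr_city_zip (dependencies) ON city, zip FROM addresses;
    ANALYZE addresses;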


I have no knowledge how common queries with ORDER BY vs no ORDER BY are, but that sounds like a first implementation which only works if ORDER BY is present would be easier and still useful? Or do you think that's not common enough to justify the effort?


You have to remember that because the query has an ORDER BY, it does not mean the rows come out in a deterministic order. There'd need to be at least an ORDER BY column that provably contains unique values. Of course, you could check for that, but then I don't think that's the end of the complexity. Things like SKIP LOCKED skip over rows which we can't immediately lock. If the first time we couldn't lock the lowest order row and output the 2nd, then aborted, replanned, then next time the 1st row wasn't locked, we'd then output the rows in the wrong order. It's probably possible to figure out all these cases and not do it when there's some hazard, but it sounds very tricky and bug-prone to me.


However, at least some ORDER BY queries will have a final sorting step before the client sees the first row. (If you're sorting on a column with an index, it may be able to produce the data in sorted order; if not, though, the final sort seems unavoidable.)

Which means that the capacity to buffer up data until it's all produced is present. But it might still be awkward to make it the default.


I think the first step to making improvements in this area is to have the planner err on the side of caution more often. Today it's quite happy to join using a Nested Loop when it thinks the outer side of the join contains a single row. The planner has various means of judging how certain it is that the 1-row estimate will hold true during execution. An equality condition on a column with a unique index, for example, is a way we can be certain of getting <= 1 row. If, on the other hand, the selectivity estimate concludes that 1 row will match some WHERE clause containing several columns with independent statistics, the certainty level goes down.

It seems silly not to swap the join order and switch to a hash join in the uncertain case. Best case, we have to build a hash table to store 1 row. Probing that won't be very expensive, and could even be optimised further to skip hashing if we continually probe the same bucket for N probes. The cost of additional rows over the estimated 1 row scales much more linearly than the quadratic scaling we'd have gotten with Nested Loop.

I imagine a setting which controls how much risk the planner is willing to take would allow users to maintain the status quo of the current costing model. I'd imagine not many would want that, however.
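Until a planner risk setting like that exists, the blunt instrument available today is the enable_* GUCs, scoped to a single transaction; a sketch:

    BEGIN;
    -- Discourage the risky plan shape for just this query; the planner
    -- can still fall back to a Nested Loop if no other join method is
    -- possible (e.g. for non-equi joins).
    SET LOCAL enable_nestloop = off;
    SELECT ...;  -- the misestimated query
    COMMIT;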


SELECT x, random() FROM foo ORDER BY x;

SELECT foo.x, bar.y FROM foo WHERE bar.z > random() ORDER BY foo.x;

SELECT x FROM foo ORDER BY x; -- Meanwhile, concurrent updates are happening to foo, so re-running the query gets a different result set.

"We choose to do this thing and the others, not because they are easy, but because we ask 'how hard can it be?'"


> It may be easier to do something like queue up X tuples and just start sending those out when the queue is full. When the queue is full, we can say that it's too late to replan and we're locked into the current plan. People can set X to what they want to increase the time that the plan can change at the expense of using more memory and higher latency to get the first tuple.

Maybe warrant a new SQL syntax, like

    select * from table limit 10 queue X


I think another way to think about it is to allow 'long planning' queries. I.e. where it is allowed to spend a second, or maybe a few seconds choosing the best plan. That may involve collecting more statistics or running a query for a little bit.


I've considered things like this before but not had time to take it much beyond that. The idea was that the planner could run with all expensive optimisations disabled on first pass, then re-run if the estimated total cost of the plan was above some threshold with more expensive optimisations enabled. It does seem pretty silly to worry about producing a plan in a millisecond for say, an OLAP query that's going to take 6 hours to complete. On the other hand, we don't want to slow down the planner too much for a query that executes in 0.1 milliseconds.

There'd be a few hurdles to get over before we could get such a feature. The planner currently has a habit of making changes to the parsed query, so we'd either need to not do that, or make a copy of it before modifying it. The former would be best.


I don't think it's so much a matter of "expensive" as "unknowable".

For instance, if you have a left join (and let's say it can't use an index for whatever reason) - the optimal plan will probably be different if almost every row has a matching row(s) in the join table than if only a few do.


I think the join search would remain at the same level of exhaustiveness for all levels of optimisation. I imagined we'd maybe want to disable optimisations that apply more rarely or are most expensive to discover when the planner "optimisation level" was set to lower settings. I suppose that would be things like LEFT JOIN removals and self-join removals. However, PostgreSQL does not have very many expensive optimisations that rarely help, so having an optimisation level might be more of a way of introducing more new optimisations that help fewer queries. Because we don't have a plan cache, there's been a focus on keeping the planner lean and having it not go to too much effort to optimise queries that are poorly written, for example.


> Postgres also allows a client to, midway through a query, request that the query reverse direction and re-return previous results in reverse order.

What is that useful for?


Paginating with an open cursor, perhaps?


Exactly. Web services of the 90's would let a user run a query which returns say 1 million rows, but view them in pages of perhaps 10 rows with a next page/previous page link. The query runs only once, and the million row result set is never created in full - a bit more of it is generated each time the user clicks next page. And if the user clicks previous page, the query needs to be reversible to regenerate previous results.

Very few people use that functionality today to my knowledge. Keeping a query 'open' uses hundreds of megabytes of RAM on the server, so most applications want the query done and ended right away.
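For reference, that next/previous-page pattern maps onto a scrollable cursor, roughly like this (table name assumed):

    BEGIN;
    DECLARE pages SCROLL CURSOR FOR
        SELECT * FROM results ORDER BY id;
    FETCH FORWARD 10 FROM pages;   -- page 1
    FETCH FORWARD 10 FROM pages;   -- page 2
    MOVE BACKWARD 20 IN pages;     -- rewind to before the first row
    FETCH FORWARD 10 FROM pages;   -- page 1 again
    COMMIT;

FETCH BACKWARD on such a cursor is what forces the executor to be able to run the plan in reverse.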


If the sort order isn't fully determined by a query, can the query plan influence the result order? If so, what you're suggesting might be nearly impossible. The new query wouldn't be able to just skip the first N results, it would have to match each individual row against a dictionary of previously sent ones.


> If the sort order isn't fully determined by a query, can the query plan influence the result order?

Yes. When running a query, PostgreSQL won't make any effort to provide a stable order of rows beyond what's specified in the ORDER BY.


You may be interested in this paper (and the papers it references): https://arxiv.org/pdf/1902.08291


> and if progress isn't as fast as the planner expects, feed current progress info back to the planner (pages scanned, tuples matching), and replan with that new data.

That would require keeping track of those stats in every query execution. That has a price that may or may not be worth it.

And yes, you could make that behavior an option, but, for better or for worse, PostgreSQL tends to be opposed to having queries indicate how it should do its work.


Alternatively, associate some confidence value with the statistics and make conservative choices when confidence is low and/or the expected difference is small.

Sometimes a sequential scan is faster than index lookups for each item. But the sequential scan is a risky gamble, whereas the index lookups have robust performance characteristics. It's not always clear which choice is the conservative one, but often it is.


Yeah, this is similar to some semi-baked ideas I was talking about in https://www.postgresql.org/message-id/CAApHDvo2sMPF9m=i+YPPU...

I think it should always be clear which option would scale better for additional rows over the estimated row count. We should always know this because we already cost for N rows, so it's possible to cost for N+1 rows and use the additional cost to calculate how the plan choice will scale when faced with more rows than expected.


What I think could potentially be done is allow threshold-based alternate plans. For a pseudo example, “if subquery A returns 8 records or fewer, use this plan for subquery B, else use that plan.” It’s an explicit admission that the query planner doesn’t have enough information to make a good decision up front, but can easily be made at a later point in time while in the middle of execution.


Yes, I think so too. There is some element of this idea in the current version of PostgreSQL. However, it does not go as far as deferring the decision until execution. It's for choosing the cheapest version of a subplan once the plan has been generated for the next query level up. See fix_alternative_subplan() in setrefs.c. Likely it would be possible to expand that and have the finished plan contain the alternatives and switch between them according to which one is cheaper for the number of rows that previous executions have seen. Maybe tagging on some additional detail about how many rows is the crossover point where one becomes cheaper than the other, so that the executor can choose without having to think too hard, would be a good idea.


I use this tool to visualize my queries: https://explain.dalibo.com/ (there's also https://www.pgexplain.dev/, last time the output was less nice, but now both look the same)


The tool is great and I use it, but I don't really have a deep enough understanding to know how to fix issues in my approach from what looks bad in the plan.


It's pretty hard to tell if a plan is good or bad from EXPLAIN without using the ANALYZE option. With EXPLAIN ANALYZE you can see where the time is being spent, so you can get an idea of which part of the plan to focus on.

To know if it's a bad plan, it does take quite a bit of knowledge as you need to know what alternative plans could have been used instead. It takes quite a bit of time to learn that stuff. You need to know what PostgreSQL is capable of. Some computer science knowledge helps here as you'll know, for example, when a hash join is a good way to join a table vs a nested loop.

As for fixing a plan you've identified as bad, that also takes quite a bit of experience. If you understand the EXPLAIN ANALYZE output well, that's a good start. Looking for places where the estimated rows differ from the actual can be key. Having an understanding of how Postgres performed the row estimations helps. That's not something that comes easily without looking at the source code, unfortunately. Understanding the tools you have to change the plan is useful. Perhaps that's CREATE STATISTICS, or adjusting the stats targets on existing single-column stats. Or maybe creating a new index. Having a test environment that allows you to experiment is very useful too.
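A made-up plan fragment showing the estimate-vs-actual gap to look for:

    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

    -- Nested Loop  (cost=0.42..8.45 rows=1 width=64)
    --              (actual time=0.1..9500.3 rows=54321 loops=1)
    --
    -- "rows=1" estimated vs "rows=54321" actual is the red flag: the
    -- join strategy was chosen for one row but ran against 54 thousand.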


To be honest, you can get quite a lot of mileage out of EXPLAIN by focusing on which indices the plan selected (or did not select). For this reason it's really worth understanding your indices and what value they add (or remove!).


There's also https://www.pgmustard.com, which gives you a bit more hints and information on the possible optimizations.


I read your profile and see that you are a CTO of a fintech. Given that, by what method do you navigate that tool's [explain.dalibo.com] assertion of "It is recommended not to send any critical or sensitive information"?

Is there an explain plan sanitizer that is helpful for this situation?


You can download the whole visualizer as a simple html file and use it this way. No need to obfuscate or sanitize anything at all.

https://github.com/dalibo/pev2


I do the same, but because it became tedious to insert the query & query plan manually, I wrote pev2-cli [1].

[1] https://github.com/derhuerst/pev2-cli


Thank you for sharing - I misunderstood - this is a javascript visualizer, and has no additional analytic capability beyond visualization, or did I miss something?


Yeah, you’re right. This is the visualizer that’s linked in the parent comment. I just wanted to point out that you can use it locally without sending your explain plan to anyone. Sorry for the confusion, I’ll edit my original comment.


Whatever the domain, a query isn't necessarily critical or sensitive. It only is if it contains personal information (eg querying by a bank account number or a name), or if the query itself is part of your competitive advantage (unlikely)


Query planner improvements are always welcome, it's a very important part of the DB. Though of course most of the time you notice it is when it's not doing what you want ;-).

One part of this I found rather frustrating is the JIT in newer Postgres versions. The heuristics on when to use it appear not at all robust to me. I've seen this with a rather typical ORM-generated query that is pretty straightforward but pulls in a lot of other tables via joins. It runs in a few milliseconds without the JIT, but the JIT spent 1-1.5 seconds doing its thing on top of that, making it incredibly slow for tiny amounts of data.

I know now to just disable the JIT, but this feature can give a pretty terrible impression to users that don't know enough yet to figure out why it's slow. I like Postgres a lot, but enabling the JIT just seems far too dangerous as a default setting to me.


> One part of this I found rather frustrating is the JIT in newer Postgres versions. The heuristics on when to use appear not robust at all to me.

(Author of the blog here and Postgres committer). I very much agree that the code to decide if JIT should be used or not needs work. For PG16, it only takes into account the estimated total cost of the plan and does not take into account how many expressions need to be compiled. It's quite fast to compile a few expressions, but if you're querying a partitioned table with hundreds of partitions and the plan contains all those partitions, then the JIT compiler has a lot of work to do; the number of expressions is not considered in the costing. A colleague and I do have some code to improve this. Unsure if it'll make PG17 at this stage.
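In the meantime, the existing cost thresholds are the knobs for working around this; for example:

    -- Raise the thresholds so JIT only kicks in for genuinely expensive
    -- plans, or disable it entirely (settable per session or in
    -- postgresql.conf):
    SET jit_above_cost = 1000000;
    SET jit_optimize_above_cost = 5000000;
    SET jit_inline_above_cost = 5000000;
    -- or simply:
    SET jit = off;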


If it's not ready for everyone, probably it shouldn't have been made a default, don't you think?


Perhaps, but it might be harsh to say it was the wrong decision when it was made, as partitioned tables are far more optimised now than when JIT was first worked on. It seems to me that most of the people who have issues with slow JIT times are seeing them with partitioned tables, where JIT is slow due to having to compile large numbers of expressions. However, maybe this is the place for me to find out that's not always the case. The JIT costing is likely to get an overhaul soon, and if all goes to plan, JIT will be considered per plan node rather than per plan.


FWIW, I've seen planning+query times triple with JIT on, with no partitioned tables involved. It just takes forever to JIT sometimes. (This was with Postgres 13 and 14, IIRC.)

Update: I checked some old IRC logs, and found a query that took 1476 ms without JIT and 8754 ms with JIT on. And that is execution time, not planning time!


Was there an indication of the number of functions compiled? There is work ongoing in this area, so feedback on this topic is very welcome on the PostgreSQL mailing lists.


Not that I can remember, no. I don't have the details anymore; I delegated it to others to report to -perform, and then they never did. :-)


One other thing about JIT that I feel is pretty crazy is that the generated code is not cached. I mean it's the most expensive part of the query execution a lot of the time, how come it's not cached? I couldn't find good reasons for this looking through Postgres mailing lists discussion around JIT.

Disabling JIT is the way to go for OLTP workloads.


There's some information about why that does not happen in https://www.postgresql.org/message-id/20211104234742.ao2qzqf...

In particular:

> The immediate goal is to be able to generate JITed code/LLVM-IR that doesn't contain any absolute pointer values. If the generated code doesn't change regardless of any of the other contents of ExprEvalStep, we can still cache the JIT optimization / code emission steps - which are the expensive bits.

A colleague is working on getting this patch into shape. So we might see some caching work get done after the relative pointer work is in.


Unlike, say, MSSQL or Oracle, PG does not cache plans at all. I think this is mostly due to its multiprocess architecture versus just sharing in-memory plans between threads. In MSSQL a plan can take a while to optimize, including JITting if needed, but it doesn't matter that much because all plans are cached, so when that statement comes in again the plan is ready to go.


> I think this is mostly due to its multiprocess architecture vs just sharing in memory plans between threads

You can share stuff with a multiprocess architecture just fine (either through IPC or just plain shared memory + synchronization)

It's true that threads share memory by default, but processes can opt into sharing memory if they wish. And it appears that Postgres already makes use of shared memory for some things

https://www.instaclustr.com/blog/postgresql-docker-and-share...

https://stackoverflow.com/questions/32930787/understanding-p...

(random links from Google just to illustrate the point)


PG shares data between processes, not code, as far as I know, and definitely not any plans. Sharing JITed code is not straightforward, as pointers will be different per process.


https://www.postgresql.org/docs/current/plpgsql-implementati...

Seems like these will get cached at some point.


Only within a session, unless something has changed there. That means no sharing between clients, or even from one connection to the next from the same client.

MSSQL caches plans globally and can be used across sessions and connections if the statement text is the same.


I believe the JIT is pretty much a failure, yes. It was well-meant, but LLVM just isn't the right tool for this. I've turned it off globally. (I don't use any ORMs, so it's not simply about strange query patterns.)

Query parallelization, on the other hand, can actually be useful—and most importantly, rarely hurts.


We hit a curious bug recently in production with the JIT.

I had updated a couple of packages via apt, and then all of a sudden a bigger query we run every 5 minutes was failing. Or rather, Postgres just started silently hanging up the connection mid query execution without even putting anything in the logs.

Took me a while of manually running EXPLAIN to see that the variations of the query that ended up using the JIT broke, while those that didn't were fine. Disabled the JIT and everything was OK again.


Did you try using prepared statement so the compilation is done once and compiled results are reused each time that query is run?


There's a separate can of worms with prepared statements. Two main are:

- parameters are opaque to the planner, so it prefers (or is even forced?) to choose generic over specific plans

- it doesn't play nice with pg_bouncer in transaction mode


Pgbouncer supports prepared statements these days.


As far as I understand prepared statements don't help here as the JIT output is not saved but generated for each execution. In this case I'm also using an ORM (EF Core) which doesn't expose the ability to prepare statements.


Most optimisers cache the execution plan based on a hash of the query, hence reuse when using prepared statements vs not.

Oracle has an option to detect literals (CURSOR_SHARING) and essentially replace them with binds internally, in order to increase performance and stop the query pool getting filled up with the same statement(s).
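In Postgres, plan reuse for prepared statements (as opposed to JIT-output reuse, which doesn't happen) can be steered explicitly; a sketch with an assumed table:

    PREPARE get_user(text) AS
        SELECT * FROM users WHERE email = $1;
    EXECUTE get_user('alice@example.com');

    -- By default the planner switches to a cached generic plan after a
    -- few executions if it costs no worse on average; PG12+ lets you
    -- force either behaviour:
    SET plan_cache_mode = force_generic_plan;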


Postgres does not currently reuse JIT-compiled code; JIT runs on each execution of the query. This may change in the next few years, likely starting with tuple deforming, as that's fairly reusable: per table, for any query.


Only in the same session...


I disabled JIT after it became default for our installation (~1TB data). Nice try, useful sometimes, but as a default? No, thanks.


I'd be interested to know how often these changes have an effect in real queries. The "Use Limit instead of Unique to implement DISTINCT, when possible" change in particular feels like it would only apply to very silly queries.

Do the PostgreSQL developers have any source of information about this?


I expect it will have effect fairly often - DISTINCT is something less experienced developers often add to fix their bad queries, and generally the first thing I do when I start to improve performance is to rewrite it to not have to do that. So if these improvements to DISTINCT make it more robust to bad queries, then a lot is gained.

It probably won't fix all issues, but any improvements are welcome.


(Author of the blog and that feature here) This one did crop up on the pgsql-hackers mailing list. I very much agree that it's unlikely to apply very often, but the good thing was that detecting when it's possible is as simple as checking if a pointer is NULL. So, it's very simple to detect, most likely does not apply very often, but can provide significant performance increases when it can be applied.
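If I read the feature correctly, the shape it targets is where every DISTINCT column is pinned to a constant, so at most one distinct row can exist; a sketch:

    -- All DISTINCT columns are equated to constants, so PG16 can stop
    -- after the first matching row (Limit) instead of reading every
    -- match and deduplicating (Unique/Sort):
    SELECT DISTINCT a, b FROM t WHERE a = 1 AND b = 2;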


It would be nice to also optimize SELECT DISTINCT foo FROM bar. It is usually very slow on big tables and we have to fall back to a recursive CTE. This comes up a lot with admin builders for filters (<select> dropdowns).
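For anyone curious, the recursive-CTE workaround usually looks something like this (assuming an index on bar(foo)):

    WITH RECURSIVE distinct_foo AS (
        (SELECT foo FROM bar ORDER BY foo LIMIT 1)
        UNION ALL
        SELECT (SELECT b.foo FROM bar b
                WHERE b.foo > d.foo
                ORDER BY b.foo LIMIT 1)
        FROM distinct_foo d
        WHERE d.foo IS NOT NULL
    )
    SELECT foo FROM distinct_foo WHERE foo IS NOT NULL;
    -- Each iteration does one index probe for the smallest value above
    -- the previous one, i.e. a hand-rolled skip scan.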


SELECT DISTINCT has seen quite a bit of work over the past few years. As of PG15, SELECT DISTINCT can use parallel query. I imagine that might help for big tables.

I assume the recursive CTEs comment is skip scanning using an index and looking for the first value higher than the previously seen value?

Certainly skip scans would be nice. There has been some work in this area, but not recently. As far as I recall some other infrastructure needed to go in first to make it easier for the query planner to understand when skip scanning would be useful.


The problem is that ORMs have a habit of making very silly queries, and developers insist they cannot write SQL to fix that, because it is somehow impure :-) I doubt this is a very _common_ issue, but I'm not surprised if it shows up every now and then.


> developers insist they cannot write SQL to fix that, because it is somehow impure :-)

Because then they lose many of the benefits of why they used an ORM in the first place. Though I am a big fanboy of JOOQ for exactly this reason.


Where I used to work we allowed duplicate email addresses in the user table for legacy reasons, but we did not want any new ones entered in the db, so we ran a "select distinct email from users where email = ?" query before creating new users. I don't think we had more than 100 rows with the same email, though. Most of the duplicates were test users which could have been removed, but I digress.


I think it would be really nice to have a "strict mode" (for app testing), where PostgreSQL returns an error if an index would improve the query asymptotically and it doesn't exist (only based on the query itself, not statistics).

And a "CREATE INDICES FOR <sql>" command to create the indices (for app upgrades), plus an automatic index creation mode (for interactive and development use).

In general, the system should be architected so that asymptotically suboptimal execution never happens.


Why wouldn't they implement hints?


There is a pg_hint_plan extension. I think the danger with hints is that they might only be correct when written. If the table sizes or data skew changes, they might make things worse. I don't have a link to hand, but last time I recall a discussion on hints there was no general objection to them, providing the implementation could be done in a way that didn't force the planner's hand too strongly and still allowed it to adapt to the underlying data changing. For example, indicating there's a correlation between two columns, rather than specifying a given predicate matches 10 rows.
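For reference, pg_hint_plan hints ride along in a comment ahead of the query, roughly like this (tables and aliases assumed):

    /*+ Leading((a b)) HashJoin(a b) */
    EXPLAIN SELECT *
    FROM a JOIN b ON a.id = b.a_id
    WHERE a.flag = true;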


>I think the danger with hints is that they might only be correct when written.

Not "correct when written", but "scaling as written". That means if you force the execution that scales linearly or quadratically, that's what you get all the time. If the row number increases, you know what will happen. You can monitor that ahead of time and plan for the increase.

On the other hand without the hints, you don't know when and how the plan will change without testing. At some random point postgres can decide to do something terribly stupid and at that point you get to figure out what happened and how to fix that in an emergency mode. Do you know how to adjust the right statistics? Do you need to change the indexes? Do you know how long that will take?


I had this happen for the first time to some prod jobs the other day in spark. We made a pretty normal update to a join with an additional condition, our integration tests which run local Spark succeeded. But something about it running on the cluster… it was generating a completely different query plan than it ran locally. We eventually had to rewrite the whole query to work around it because it was trying to broadcast a 3TB table and couldn’t be talked out of it.


There certainly are valid reasons for this. For example, adding a join condition with an OR clause. The only join operator that supports non-equi joins is Nested Loop. If you went from a Hash or Merge join to that, then you'd likely notice some performance degradation. If you have a link to anywhere you've asked for help on this, then I'd be interested to see more details.


I see you’re definitely familiar with the space. I think the condition was using ‘or array_contains’ in the join.

This was really the only resource I found acknowledging it. It sounds like it has do with presumption of nulls (e.g. spark can’t assume they won’t be there) but it would be great to be able to say “don’t worry spark I promise there are no nulls/if there are just disregard”) https://kb.databricks.com/sql/disable-broadcast-when-broadca...

The way we go around this feels so brutish. Literally just did two separate joins and then unioned the results. The recommendation to use ‘not exists’ couldn’t be applied as array_contains must be using ‘in’ under the hood and couldn’t be changed.


UNION is certainly one way to eliminate the OR condition.

In theory, a hash join is possible with a condition like `ON t1.a = t2.a OR t1.b = t2.b`, but Hash Join would need to build two hash tables and only probe the 2nd one if the first lookup found nothing. I don't know if there are any RDBMSs that allow multiple hash tables in a hash join.
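The UNION rewrite, sketched on an assumed schema; plain UNION (not UNION ALL) deduplicates rows that match both arms of the OR:

    SELECT t1.id, t2.id
    FROM t1 JOIN t2 ON t1.a = t2.a
    UNION
    SELECT t1.id, t2.id
    FROM t1 JOIN t2 ON t1.b = t2.b;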


Yeah it feels like for some arbitrarily sized array contains join having something like a bloom filter involved could help reduce the search space. But I’m not a db engineer so perhaps the specific implementation details of that would turn out to make it a bad idea.


We don't use PostgreSQL (yet), but have had issues with random production outages due to the planner suddenly deciding to do dumb stuff like full table scans instead of using an index.

To avoid having to sprinkle hints all over, we've added a background job that forces recalculation of statistics once a week or so...
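In Postgres terms (since that's the thread topic), such a job can be as simple as running ANALYZE on the hot tables; a sketch with hypothetical table names:

```sql
-- Recompute planner statistics; cheap enough to run from cron or a scheduler.
ANALYZE orders;
ANALYZE order_items;

-- Autovacuum normally does this automatically; lowering its threshold for a
-- busy table is an alternative to a scheduled job:
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.02);
```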


> I think the danger with hints is that they might only be correct when written. If the table sizes or data skew changes, they might make things worse.

They will work in prod in the way the engineer expects. The current planner can also change its mood in unpredictable ways and often generates sub-optimal plans for complex queries, because it can't reason about exactly what a specific subquery will return, and you learn about it when queries start running very slowly in production in the middle of the night.


(Postgres committer and blog author here) Personally, I don't have any objection to hints. The resolution of any statistics is never going to be high enough to always be accurate enough for all cases. I think it would be good to give DBAs a better way to coax the planner into making or not making a certain decision. It would also be nice if the planner was a little more risk-averse. Currently, it's happy to do things like Nested Loop join because it thinks some complex WHERE clause will only match 1 row. Nested Loop works best for that, but if there are 2 rows, then generally, any other join type is better, especially so when the inner side of the join is expensive.


One way to look at this is that the most accurate way to "estimate" how fast a certain plan would run is to actually run it on the full dataset. But that obviously doesn't make sense, as the optimizer is expected to come up with a plan in a matter of milliseconds (or less for simple queries), and you don't want your "optimizer stats" to be as big as the whole dataset itself. So the optimizer has limited information, by design, and it has to come up with _something_ in a very short amount of time.

I don't know much about the Postgres optimizer, but I imagine that in addition to table/column stats, it also uses structural info as its inputs, like the existence of (enabled & valid) constraints, for example. If the optimizer knows that some column never has NULLs or is guaranteed to be unique, all kinds of transformations & shortcuts become possible.

(There are plenty of large big-vendor ERP/CRM/etc apps out there that do not use DB constraints for the sake of "portability"... not fun to work with these).


> I don't know much about the Postgres optimizer, but I imagine that in addition to table/column stats, it also uses structural info as its inputs, like the existence of (enabled & valid) constraints, for example.

Here's one that surprised me when I found out about it years ago, because I'd never really given it thought: There's a correlation statistic on columns for how well the values in that column match the row order on disk, which can influence a few different things.

In my case a query that retrieved a ton of data with an ORDER BY was using a sort and taking like two hours to run (a data source for an ETL process). It turned out that, because of a really bad correlation, Postgres was refusing to use the index, because the random access would be even slower, so it did a table scan and then a sort. After figuring this out and discovering the CLUSTER command (which reorders the data on disk to match an index), it did an index scan and didn't need to sort at the end, was able to start streaming results immediately, and finished the entire query in like ten minutes.

Just a nice example of where the obvious "use query hints to make it use the index" would have been the worst option, instead figuring out why postgres didn't want to use it and fixing that resulted in something much better.
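For anyone wanting to check this on their own tables, a sketch (table and index names hypothetical):

```sql
-- correlation near 1.0 or -1.0 means the column values match the disk order;
-- near 0 means an index scan implies random heap access.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'events';

-- Rewrite the table in index order, then refresh the statistics.
-- Note: CLUSTER takes an ACCESS EXCLUSIVE lock while it runs.
CLUSTER events USING events_created_at_idx;
ANALYZE events;
```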


> Just a nice example of where the obvious "use query hints to make it use the index" would have been the worst option, instead figuring out why postgres didn't want to use it and fixing that resulted in something much better.

I come to the opposite conclusion. Clustering a table results in an access exclusive lock, and due to MVCC the ordering isn’t permanent.

Here, you as the engineer know you’d like to use a sorted index to stream results out even if the overall query end to end is slower due to the I/O cost. In my opinion there should be a way to express this within the query.


I did test that and like I said, it was strictly worse - went from something like a 2 hour runtime to 5+ hours. Random disk access and not being able to take advantage of the disk cache really is that bad.

Streaming results doesn't mean anything if the total runtime is that much worse, it just means the overall system will take hours longer to complete. And the good version doesn't need to continuously run CLUSTER, the correlation just has to be high enough for it to be the better choice, so we settled on running it once a week - only takes like 2 minutes to run the CLUSTER.

You should be thinking of it the other way around: my final 10 minute result was the ideal situation I wanted, but when circumstances were bad for it, the postgres query planner was smart enough to tell it wouldn't work and switch to the 2-hour plan instead of blindly following the original plan and taking 5 hours.


> you don't want your "optimizer stats" to be as big as the whole dataset itself. So optimizer has limited information, by design, and it has to come up with _something_ in a very short amount of time.

This is very true. PostgreSQL does not do any proactive plan caching, so it's important that the planner remains fast. It is possible to adjust the statistics target to control the size of the histograms and most-common-values lists. Upping that can be useful for OLAP-type workloads.

> I imagine that in addition to table/column stats, is also uses structural info as its inputs, like existence of (enabled & valid) constraints for example.

Yes. Foreign key constraints are used to assist with join selectivity estimations. PG17 (when released) should be able to make more use of NOT NULL constraints to improve plans.


I think Oracle style hints are not a good thing to have - especially because you have to change the query itself which sometimes isn't possible in a production environment. Additionally, for me they quite frequently made things worse after minor Oracle upgrades.

I would prefer having "externally attached" hints for a query (e.g. identified by its queryid), like Oracle's stored outlines.


I do wonder if one could eventually just turn off nestloops in such a case (e.g. inner side contains a seqscan), like the JOB paper recommended. Yes, it will have marginally higher estimated cost, but the upside is _much_ safer query plans when the statistics are off.


That could be useful if there was a way to just disable non-parameterized nested loop, however enable_nestloop=0 also disables parameterized nested loops. Parameterized nested loops are useful to avoid sorting or hashing some large relation when only a small subset of that relation is likely to have a join partner. This is even more true when you consider that since PG14, Memoize exists to act as a cache between Nested Loop and its inner subnode to cache previously looked-up values.

It's also important to consider that with enable_nestloop=0, when Nested Loop must be used (e.g. for a CROSS JOIN), the cost penalty that's added to reduce the chances of Nested Loop being used can dilute the costs so much that the query planner can then go on to make poor subsequent choices later in planning, due to the costs for each method of implementing the subsequent operation being so relatively close to each other that the slightly cheaper one might not even be considered. See add_path() and STD_FUZZ_FACTOR. So, running enable_nestloop=0 in production is not without risk.
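If someone does want to experiment with this, scoping it to a single query avoids most of that risk; a sketch (query is hypothetical):

```sql
-- SET LOCAL only lasts until the end of the transaction, so the rest of the
-- session keeps the default planner behaviour.
BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM orders o JOIN customers c ON c.id = o.customer_id;
COMMIT;
```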


Yeah, nestloop with a cheap inner path (e.g. a lookup into a unique index) should be just fine, so I don't think nestloops as a whole should be banned. (Also, I believe Postgres is pretty much the only place I've seen the concept of a parameterized path described; it's not talked much about in academia, although it is probably really hard to make an index-aware System R planner without it.)

I wondered whether it would be possible just to add a fixed fuzz to every row estimate, say five rows. It would essentially mean you can never get this issue of a small undercount causing a plan disaster. Overestimating slightly is basically never a big issue as far as I know.

(I should perhaps have considered this when I was actually making a query planner in a previous life, but there were more than enough other things to worry about :-) )


It's a bit complex to explain here, but I describe an idea I've been considering in https://www.postgresql.org/message-id/CAApHDvo2sMPF9m=i+YPPU...


Yes, a lot of this seems interesting to go into. I really hope that at some point, someone would find the resources to just try a lot of dumb stuff and see what works in practice. I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb, and there's no good reason why it should be preferred over everything else.

Another avenue is of course trying to avoid the issue to begin with, e.g. through the recent “translation grids” of Müller and Moerkotte for better join selectivities. But I doubt anyone is going to be finding a silver bullet for this anytime soon, so reducing plan risk somehow seems very worthwhile.


> I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb

Yeah, I think it was probably a mistake to always assume there's zero correlation between columns, but what value is better to use as a default? At least extended statistics allows the correlations of multiple columns to be gathered now. That probably means we'd be less likely to reconsider changing the default assumption of zero correlation when multiplying selectivities.
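For reference, a sketch of the extended statistics mentioned here (table and column names hypothetical, for a pair of correlated columns like city and zip):

```sql
-- Without this, the planner multiplies the two selectivities as if
-- independent, underestimating rows where the columns are correlated.
CREATE STATISTICS addr_city_zip (dependencies, ndistinct)
    ON city, zip FROM addresses;
ANALYZE addresses;  -- the stats are only populated at the next ANALYZE
```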


Yeah, I don't have all the answers, my point is that it would be worthwhile to try similarly dumb stuff and see if it works just as well and could be more robust in some places :-)

Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right? Out of curiosity; how do you reconcile multiple selectivities? If you have a multi-column histogram on (a,b) and one on (b,c) and one on (c), and you need to figure out the selectivity of WHERE a=? AND b=? AND c=? from those three? I looked into this at some point, and academia presented me with a nightmare of second-order cone programming and stuff. :-) (I never implemented any of it before leaving the database world.)


> Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right?

Yeah, no extended statistics for join quals yet.

> how do you reconcile multiple selectivities?

Looking at https://doxygen.postgresql.org/extended__stats_8c.html#a3f10... it seems the aim is to find the stats that cover the largest number of clauses, tiebreaking on the statistics with the fewest keys. For your example both of those are the same, so it seems that which stats are applied comes down to the order the stats appear in the stats list. That list is ordered by OID, which does not seem ideal, as a dump and restore could result in the stats getting a different OID. Sorting that list by statistics name might be better. That's what we do for triggers, which seems like a good idea as it gives the user some ability to control the trigger fire order.


OK, so basically ad-hoc/greedy. That can get you into, well, issues :-) There are methods where you first adjust away impossible statistics (if you have quals a,b, then P(a)P(b) >= P(a AND b), but if your estimates come from multiple places this might not hold[1] -- so you need to fudge one or more of those before proceeding), and then combine them optimally using Newton-Raphson. It's pretty neat. But it does require a matrix library and stuff :-)

[1] Similarly, you can sometimes guarantee stuff like P(a AND b) <= 1/n due to a unique multi-column index, which can also inform some of your sub-estimates.


What is the JOB paper? A quick Google search didn't find it, but I'd be interested to see the recommendations.



It would be neat if you could at least provide expressions (that can't hit any actual tables) to compute bounds for how many rows are expected to come back from any particular row source.


Indeed, hints are super-useful (essential!) for applying quick fixes when something unexpected suddenly happens in the optimizer's magic. Or you just want to instruct/nudge the optimizer towards doing the right thing, if you know the shape of your data and optimizer can't see it or doesn't act on it correctly for some reason. The downside is that people who don't really know what exactly they want to achieve, will start applying incomplete sets of hints in random locations, based on Internet searches. And sometimes you'd even get lucky, that single index hint makes the problem go away - for a while!

And few people tend to remove hints after DB version upgrades, where the optimizer magic (or your table stats) have improved. Now you're limiting optimizer's choices. That's been a problem (by now) for decades in the Oracle world with lots of legacy SQL code full of random hints where some of them aren't even valid anymore, but others still are - and limit optimizer's choices.

At least Oracle folks had enough at some point and introduced an "optimizer_ignore_hints" parameter [1], so all legacy hints that were added 20 years ago just get ignored - and the modern optimizer does a much better job getting things right.

I do regularly use hints in SQL tuning and troubleshooting experiments, just to verify and prove that a better plan is theoretically and physically possible - and if yes, then go from there. When the (deliberately placed) hints make the query faster, the next step is to check the hinted, faster plan's optimizer cost estimate and drill down from there: Why did optimizer think that the other plan was cheaper or why did the optimizer think that the faster plan was more expensive. Then you end up with measuring row-count misestimates, etc...

But yes, plenty of people (including myself) have made legacy Oracle apps run much more efficiently and faster just by globally telling the DB to stop paying attention to all the old random hints lingering on and gathering object stats using the modern settings & defaults (doesn't always work though).

[1] https://docs.oracle.com/en/database/oracle/oracle-database/1...


> At least Oracle folks had enough at some point and introduced an "optimizer_ignore_hints" parameter [1], so all legacy hints that were added 20 years ago just get ignored - and the modern optimizer does a much better job getting things right.

I think the general attitude in the Postgres community has been from a purist point of view. When you have a codebase around 40 years old, you do have to think carefully about what you put into it, as it might not be that easy to take it out again. However, yes, I do think hints would be useful for Postgres, providing they're done well. It would be good to at least have something to assist with selectivity estimations. Those are at least not directly forcing the planner into a single choice. New planner/executor smarts, such as Memoize, added in PG14, could still be considered after upgrading an older pre-PG14 application with such hints, but perhaps not if the hint told the planner that it must nested-loop join these two tables.


If not hints and manual plans then we need a way to "freeze" automatic optimizer plans however optimal they are after each tested release.

The optimizer killing your heavily-loaded production because it randomly switched a frequent query to a bad plan is pretty annoying. People mostly put up with this because for a single-instance DB you can easily maintain 4x spare CPU capacity but it's impossible at scale.


It sounds like it would have to be an opt-in feature which could be applied per query, as otherwise wouldn't it be equally as annoying if the planner didn't adapt to the table data changing?

What may be better is if the executor provided feedback to the planner to tell it the new plan was worse than the old one. With that, you might be able to recover much more quickly and be less likely to get a midnight phone call. The tricky part would be deciding when the planner should retry the new plan again. Also, other factors that influence the plan's execution time, such as locking, would be a source of noise for any sub-system that was monitoring this.


> wouldn't it be equally as annoying if the planner didn't adapt to the table data changing?

Not really, because at scale tables don't change suddenly unless you release something -- and that will re-generate the plans after the perf testing.

Whatever changes slowly accumulate over time will be covered in the next release which normally should be very frequent, daily or more.


> It sounds like it would have to be an opt-in feature which could be applied per query

That's called a hint :)


Not exactly because a hint contains information that may become obsolete. A directive that freezes a known good auto-plan until the next release will not become obsolete as long as you keep releases rolling.


What I’d like is the ability to hand postgres an execution plan, whether that is a plan I retrieved from Postgres’s own planner, or one I wrote by hand (possibly a mix of the two).

Give the API a scary name if you want, but getting 3 AM calls because the planner suddenly decided to go off the rails, or seeing pages of convoluted SQL because in (current version) that's what it takes to get the plan you need, is not fun.

It’s long past time for database developers to accept that the high level is not always the right solution. It should be in 95 or 99% of cases, and the better it is the better my life is, but sometimes you’ve got to write exactly what you need.

Most language designers are fine with it and will allow either embedding or calling into lower-level language (down to hand-crafted assembly), it would be great if database devs could get on with that program.


Store hints and index definitions in the same place and read from it for migrations. Problem solved in most practical scenarios.


The one I'd love to tell the planner is that a table holds transactions in time, and that it should not expect that today's data is empty because it was empty 10 hours ago. It's an extremely common pattern, it makes any statistics gathering based on percentage of data changed dubious pretty quickly, and harms a whole lot of real queries, because in data like this, people care the most about the recent data.

There are ways to organize data to minimize the issue, but it'd be so much nicer if we could just teach the optimizer that this is the way the data is shaped.


> The one I'd love to tell the planner is that a table holds transactions in time, and that it should not expect that today's data is empty because it was empty 10 hours ago. It's an extremely common pattern, it makes any statistics gathering based on percentage of data changed dubious pretty quickly, and harms a whole lot of real queries, because in data like this, people care the most about the recent data.

It's not a hint, but PostgreSQL does have something that can help with cases like that.

In some cases, to obtain selectivity estimates, the planner will probe a btree index to find the actual lower and/or upper bound. For this to apply, a btree index must exist and you have to be using the >, >=, < or <= operators. The planner will probe the index if the query is comparing the indexed column to a value that's known to the planner and that value falls on the first or last histogram bucket. This can help when your statistics are slightly out of date and you're querying some column which stores a monotonically increasing or decreasing value.
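A sketch of where this applies, with hypothetical names:

```sql
-- A plain btree index on the timestamp column is all that's needed.
CREATE INDEX transactions_created_at_idx ON transactions (created_at);

-- Even if ANALYZE hasn't run since rows arrived, the planner can probe the
-- index for the real maximum when this comparison value lands on the last
-- histogram bucket, instead of assuming today's data doesn't exist.
SELECT count(*) FROM transactions
WHERE created_at >= now() - interval '1 hour';
```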


I feel the best abstraction for hints would be to declare on tables how large you expect them to be -- and even throw errors if query plans with a good scaling cannot be found.

Say I could declare "assume this table will grow very large", "assume this table will be a small enum table".

And then it would use that information instead of actual table size to guide planning AND throw an error for any query doing a full table scan on a declared-to-be-large table -- so that missing indices can be detected instantly, not after running in prod for some days/weeks.

Google Data Store has this property and it is a joy to work with for a backend developer.

What I am usually after is NOT the fastest plan, but the most consistent and robust plan across test and prod environments.


Table row count is a small part of it, what matters is cardinality and fanout from joins after predicates have been pushed down as far as they can. (Assuming there are sane indexes.)

If the database can't see that your predicates will restrict the set of rows at a certain point in the join graph, it is likely to decide to join too much too early with huge table scans.

Bad join order and join strategy is at the heart of most bad plans once you already have indexes in place that cover the expected joins and lookups.


Oh, that sounds really cool. I like declaring expected size, but "throw on certain behaviors" would be something I’d love in MS SQL.


IMO hints need to be provided out of band, that is, not in the SQL query itself. To do this it is necessary to have a way to address every table source, in every sub-query; then one can have hints as a pile of {table source, hint}. Not that this solves the problem of hints rotting, but being able to separate them from the text of the query at least keeps the query clean, and makes it possible to have different sets of hints for different contexts and different RDBMS versions.


I guess that the solution to this problem can be automated. The DB or an extension to the DB or application code can run the query without hints sometimes and compare the result with the version with hints. If the hinted version is still faster, good. If it is slower, it's time to tell the DBA. Or switch to the unhinted query automatically if it's faster for a large enough number of times.


I suspect the ideological problem with hints is that if the planner is producing a poor query, then the correct place to fix that is in the planner.

While I agree with this viewpoint, the problem is that most people don't want to be a Postgres dev. To actually enable people to fix the planner, it would have to be exposed as a runtime service. And unless there was a lot of diligence, the planner script would quickly degrade into an unmaintainable mess (low blow: just like most schemas).


I don't think anybody disagrees that the correct place to fix that is the planner, people just want an escape hatch so that when poor queries happen they can do _something_ instead of waiting for a fix to be written, a release to include it, and upgrading their database. It's a pretty reasonable ask, I think!


Related discussion

Why PostgreSQL doesn't have query hints

https://news.ycombinator.com/item?id=2179433 (60 comments, 2011)

The official stance from the Postgres wiki https://wiki.postgresql.org/wiki/OptimizerHintsDiscussion:

> We are not interested in implementing hints in the exact ways they are commonly implemented on other databases.

> Problems with existing Hint systems: Poor application code maintainability, Interference with upgrades, Encouraging bad DBA habits, Does not scale with data size

I don't fault their stance, but it's frustrating when Postgres picks a stupid plan and can't be convinced to do something reasonable.


> when Postgres picks a stupid plan and can't be convinced to do something reasonable.

In my experience it can always be convinced to make a reasonable plan, but it's not always trivial. Sometimes it's just adding an index, sometimes it's entirely rewriting a query


Agreed. I've found this setting to be difficult: https://postgresqlco.nf/doc/en/param/default_statistics_targ...

Too low == bad query plan. Oddly, too high == bad query plan.
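The setting can also be raised for just the problem column rather than globally, which keeps planning cheap elsewhere; a sketch with hypothetical names:

```sql
-- Session-level experiment before changing anything persistent:
SET default_statistics_target = 500;

-- Or raise it only for the column whose estimates are off (max 10000):
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 2000;
ANALYZE orders;  -- takes effect at the next ANALYZE
```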


There's also "the query plan you want won't do what you expect, so I'm giving you the next best thing until you tell me the right way to store the data": https://news.ycombinator.com/item?id=39311536


A friend of mine is a Microsoft DBA for mid-sized companies and was proclaiming how you can't do anything serious with Postgres. He said he was shocked to discover it didn't even have a query planner.

Leaving mocking him to one side for a moment - is there any plausibility to his broader claim that MSSQL can handle things at a scale where Postgres would be a poor choice? My gut instinct is that this is nonsense but I'm not a DBA by a long stretch.


Yes there is. If what you need is a database that will do pretty much anything well enough, then MSSQL and Oracle are going to manage it. They solve this by throwing money and hardware (more money) at the problem until it works. There's some clever stuff happening in there of course, but fundamentally they've just had much more engineering work over a long time. They can scale out more than Postgres can reasonably do.

That said, Postgres is catching up, and arguably MySQL/MariaDB has always had a good story here. Scale-out options are improving all the time. The landscape has also changed in other ways too, now you can easily have a multi-terabyte Postgres cluster on a small number of machines serving large traffic volumes, and then put your "big data" into a more specialist database. The old world of shoving everything on MSSQL/Oracle may be a bit old-school.


I develop for MSSQL extensively and PG is missing some things that can be a little surprising.

He might have been referring to the fact that PG doesn't cache query plans or have a way to lock them in. PG replans every statement unless you manually use prepared statements, and that only works per connection. MSSQL will cache plans and reuse them, and has done so for a very long time. Consequently its planner can take more time planning. Also, MSSQL has hints, and you can even lock a plan.
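For completeness, the per-connection mechanism looks like this (hypothetical table):

```sql
-- Named prepared statement; lives only in this connection.
PREPARE get_user (int) AS
    SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);

-- By default Postgres plans the first five executions individually and may
-- then switch to a cached generic plan; since PG12 this can be forced:
SET plan_cache_mode = force_generic_plan;
```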

PG really needs hints, optimizers are great and all but sometimes I know better and I want to make sure it listens to me.

Also, PG has no true clustered indexes; all tables are heaps. That's something most people use all the time in MSSQL: usually your primary key is also set as the clustered index, so that the table IS the index and any lookup on the key has no indirection. Interestingly, SQLite is the opposite: tables always have a clustered index whether you make one or not. MSSQL gives you the choice between heap and index-organized tables.


> Also PG has no true clustered indexes all tables are heaps which is something most use all the time in MSSQL, usually your primary key is also set as the clustered index so that the table IS the index and any lookup on the key has no indirection.

This is true, but I believe if you have an index-organised table then subsequent indexes would have to reference the primary key. With PostgreSQL, indexes can effectively point to the record in the heap by using the block number and item pointer in the block. This should mean, in theory, index lookups are faster in PostgreSQL than non-clustered index lookups in SQL Server.

I was wondering, is there a concept of Bitmap Index Scans in SQL Server?

PostgreSQL is able to effectively bitwise-AND and bitwise-OR bitmap index scan results from multiple indexes to obtain the subset of tuple identifiers (ctids) or blocks (in lossy mode) that should be scanned in the heap. Effectively, that allows indexes on single columns to be used when the query has equality conditions on multiple columns which are indexed individually. Does SQL Server allow this for heap tables? In theory, supporting both allows DBAs to choose, but I wonder how well each is optimised. It may lead to surprises if certain query plan shapes are no longer possible when someone switches to an IOT.
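A sketch of what that looks like on the Postgres side (names hypothetical):

```sql
CREATE INDEX people_first_name_idx ON people (first_name);
CREATE INDEX people_last_name_idx  ON people (last_name);

EXPLAIN SELECT * FROM people
WHERE first_name = 'Ada' AND last_name = 'Lovelace';
-- One possible plan shape, combining both single-column indexes:
--   Bitmap Heap Scan on people
--     -> BitmapAnd
--          -> Bitmap Index Scan on people_first_name_idx
--          -> Bitmap Index Scan on people_last_name_idx
```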


>This is true, but I believe if you have an index-organised table then subsequent indexes would have to reference the primary key. With PostgreSQL, indexes can effectively point to the record in the heap by using the block number and item pointer in the block. This should mean, in theory, index lookups are faster in PostgreSQL than non-clustered index lookups in SQL Server.

Yes, this is a tradeoff: if you primarily access by the clustered index it's faster; if you access by a secondary index it's slightly slower.

Clustered indexes work very well for tables that have a single access pattern, they can also save significant space and I/O since data is not copied in both heap and index.

Having the choice is great. PostgreSQL forces heaps and SQLite forces clustered indexes; MSSQL lets you do either based on your design choice.

>I was wondering, is there a concept of Bitmap Index Scans in SQL Server?

Yes you can see the description here: https://learn.microsoft.com/en-us/sql/relational-databases/s...


> Also PG has no true clustered indexes all tables are heaps which is something most use all the time in MSSQL

You can rewrite a table in PG to be clustered [0], but it a. locks the table b. is a one-shot, so you have to periodically redo it

> Interesting SQLite is the opposite tables always have clustered index whether you make one or not

AFAICT [1] you have to explicitly make a table `WITHOUT ROWID` to get a Clustered Index in SQLite.

> usually your primary key is also set as the clustered index so that the table IS the index and any lookup on the key has no indirection.

The main problem with this is that there is a tendency by people unfamiliar with proper schema design (so, most) to use a UUID – usually v4 – as the PK. This causes no end of performance issues for RDBMSs with and without clustered indexes, but since InnoDB also uses a clustered index, and MySQL is the most-installed RDBMS (modulo SQLite), it happens a lot.

This isn't the fault of the RDBMS; clustered index has some great advantages, as you point out. But it's the reality of the current situation.

[0]: https://www.postgresql.org/docs/current/sql-cluster.html

[1]: https://www.sqlite.org/withoutrowid.html


>You can rewrite a table in PG to be clustered [0], but it a. locks the table b. is a one-shot, so you have to periodically redo it

That is not a true clustered index or index-organized table; the table is still a heap, and there is still a secondary index that is the b-tree. With a real clustered index the table itself is the b-tree: there is no heap and no secondary index, unless you have other indexed fields, as obviously there can be only one clustered index.

This may be a terminology thing Oracle calls them index organized tables while MSSQL calls them clustered indexes while PG uses clustered for something that is not actually an index.

>AFAICT [1] you have to explicitly make a table `WITHOUT ROWID` to get a Clustered Index in SQLite.

The Row ID IS the clustered index unless you specifically say it's one of your columns. In SQLite there are no heaps, only b-tree indexes.

>The main problem with this is there is a tendency by people unfamiliar with proper schema design (so, most) to use a UUID – usually v4 – as the PK. This causes no end of performance issues for RDBMS with and without clustered index, but since InnoDB also uses a clustered index, and MySQL is the most-installed RDBMS (modulo SQLite), it happens a lot.

Since most of your keys are indexed, this is typically an issue either way, since a random UUID will cause fragmentation in clustered or non-clustered indexes, requiring page splits etc.


> That is not a true clustered index or index organized table, the table is still a heap and there is still a secondary index that is the b-tree.

Fair point.

The speed increase is still decent [0], although if you don't need most of the tuples (and ignoring TOAST-ed columns), you could achieve similar results with a simple REINDEX.

> Since most of your key are indexed typically this is an issue either way since that will cause fragmentation in clustered or non clustered indexes on a random UUID requiring page splits etc.

Agreed. I keep trying to convince Postgres fanboys that it matters, with relatively little success. I think the tide is slowly turning as "thought leaders" have begun publishing blog posts on this.

[0]: https://gist.github.com/stephanGarland/0ba0d6348ca0dedae8b4c...


Postgres has a query planner, I mean this entire post is about improvements to it. So I think there either was some miscommunication or your friend doesn't know anything about Postgres.

There are very large Postgres databases that seem to work fine, so Postgres can certainly scale. But SQL Server also has some features that Postgres doesn't, and if those are important for you it might work better for your use case. They are in the end different databases with different strengths and weaknesses.


I have used both for OLTP and data warehousing and both are fine.

I started writing this to say that I would recommend my company move to Postgres if it weren't for vendor provided applications that required SQL Server but then I realized how much work it would be for me to replace the things MS includes like reporting services, integration services, jobs, AD integration, service broker (notify/listen lacks message types). I don't use analysis services any more but when I did that would have been hard to replace too.

This stuff is how they get you. I have no clue how long it would take me to replace all of this but it would not be a good ROI to spend a year replacing what you already have.


AWS' Aurora seems to be handling things pretty well tbh, and is meant as a drop-in replacement for PostgreSQL and MySQL.


Aurora uses the native Postgres planner, I believe, probably with some minor enhancements.


There is this extension which gives you some flexibility: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...


> He said he was shocked to discover it didn't even have a query planner.

How did he discover that?


Why is this released by citusdata instead on postgresql.org? Is this a paid feature only or an open source addition?


Because the poster (who also wrote some of the optimizations in question) works for Citus Data.


Just to clarify. I'm the author of the blog. I work for Microsoft in the Postgres open-source team. All the work mentioned in the blog is in PostgreSQL 16, which is open-source.


Answers it nicely, thank you!


thanks for the work that your team does!


There was no subtext in my question, I just wanted to know if this is a paid-only feature.


But when is it going to be able to use indexes to speed up `IS NOT DISTINCT FROM` queries? ;)



