That doesn't address anything in the second half of the post though, starting with this pull quote:
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group: all the values are unknown.
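This grouping behaviour is easy to check directly. A minimal sketch using SQLite through Python's `sqlite3` module (the table `t` and its contents are made up for illustration):

```python
import sqlite3

# In-memory database; SQLite behaves like the engines the post tested.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (None,), (None,)])

# NULL = NULL evaluates to unknown, so this predicate matches no NULL rows...
print(con.execute("SELECT count(*) FROM t WHERE x = x").fetchone()[0])  # 2

# ...yet SELECT DISTINCT collapses all the NULLs into a single group.
print(con.execute("SELECT count(*) FROM (SELECT DISTINCT x FROM t)").fetchone()[0])  # 2
```

So the NULLs are never "equal" under `=`, but DISTINCT still returns one row for them, exactly as the pull quote describes.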
So NULLs are not comparable but they are part of the same set.
It doesn't treat them as equal, it treats them as one group. It does this because the result is more useful.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
I believe this confusion comes from conflating the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, it enters NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLs part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ..." returns several rows for known/measurable values but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing: several rows for known/measurable values but only one row for "UNKNOWN/ERROR".
This is where the foundation of relational database semantics in set theory shows through. You can’t model the behaviour with pure boolean logic.
In the SQL spec, unique indexes consider nulls distinct by default because you’re adding a tuple to the relation, and this is done by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
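Both halves of that distinction can be seen side by side; a sketch in SQLite via Python's `sqlite3` (the table `m` and its column name are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (reading REAL UNIQUE)")

# Two NULLs do not violate the UNIQUE constraint: for the index,
# each NULL is treated as distinct from the others.
con.execute("INSERT INTO m VALUES (NULL)")
con.execute("INSERT INTO m VALUES (NULL)")

# But the same two NULLs land in one group under the set projection.
rows = con.execute("SELECT DISTINCT reading FROM m").fetchall()
print(rows)  # [(None,)]
```

Two rows go in without a constraint violation; one group comes out.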
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
That's because "different" and "distinct" don't mean the same thing.
Two unknown values are assumed to be different, but they are not distinct from each other.
For example, take two boxes; in each box is a die, and the value of the box is the value shown on the die inside. You don't know the value since you can't see the die, and it may even change as you manipulate the box, so it is unknown: NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguishable, so, not distinct. All you need to know is that you hold two boxes with dice in them; which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.
>That's because "different" and "distinct" don't mean the same thing.
The literal definition of distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement of 25C the same as another of 25C? No: these measurements are approximations of the actual values, which are not equal to each other. They are distinct; they have just been lumped into the same 25C group due to the resolution of the measurement. Yet equality works just fine on that value in SQL.
I have used SQL for a long time; null handling is weird, inconsistent, and a waste of time. For all the language bugs due to the existence of null, at least I can count on null=null and not have to write garbage like value=param or (param is null and value is null).
I wanted to briefly reinforce this point with the fact that SQL has multiple equality operators - there is both `=` and `IS NOT DISTINCT FROM`. The latter operator will treat null values as equal, in alignment with the `DISTINCT` and `DISTINCT ON` operators.
It is extremely easy, using partial unique indexes and the proper operators, to treat nulls as non-distinct values, and the approach we have allows some very important functionality that aligns with standard statistical data treatment, which was influential in how SQL logic was originally designed.
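For illustration, here is how the two comparison operators differ, sketched with Python's `sqlite3`. SQLite has long spelled the null-safe comparison `IS`; `IS NOT DISTINCT FROM` is the standard spelling used by e.g. PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Plain `=` yields NULL (unknown) when either side is NULL,
# so in a WHERE clause the row would be filtered out.
print(con.execute("SELECT NULL = NULL").fetchone()[0])  # None

# The null-safe comparison treats two NULLs as equal; SQLite spells it `IS`.
print(con.execute("SELECT NULL IS NULL").fetchone()[0])  # 1
```

The first query returns NULL (Python `None`), the second returns true.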
An unknown measurement isn't a measurement value; it's a statement of (lack of) knowledge about a measurement, and it doesn't tell you what the measurement is. Knowledge about a measurement is as different from the measurement as the measurement itself is from the thing measured.
Whether two unknown measurements are the same is unknown.
Whether two measurements of 25C are the same is unknown; these are just values recorded in a database. 25 is a value, null is a value.
The values in the db are the same in both cases, which is what I would like my db language to deal with, without making assumptions about what that value actually means.
I see no value in treating null as special when it comes to equality in a SQL db; in fact, in my experience it is a hindrance that it does so.
The SQL null is a database-specific keyword and not something that's part of the domain of your measurements. If you want some kind of sentinel value for your measurements that doesn't have the "unknown" meaning of SQL null, then you should use your own sentinel value and not reuse null for that purpose
Sentinel values suck, especially when the language already has a perfectly good one built in.
Is 0 for a temp measurement an unknown sentinel or an actual measurement? How about 2,147,483,647? Great, probably not a measurement, but now it's always included in greater-than queries; same with the max negative value in less-than queries.
Null separates the value into its own distinct group and prevents it from being included in range queries, since it is not an actual numeric value, while most languages still allow you to compare equality using standard operators.
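A quick sketch of that range-query behaviour, using Python's `sqlite3` with a made-up `temps` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temps (c REAL)")
con.executemany("INSERT INTO temps VALUES (?)", [(-5.0,), (0.0,), (25.0,), (None,)])

# The NULL row is excluded from BOTH halves of a range split:
# each comparison evaluates to unknown, so the row is filtered either way.
above = con.execute("SELECT count(*) FROM temps WHERE c > 10").fetchone()[0]
below = con.execute("SELECT count(*) FROM temps WHERE c <= 10").fetchone()[0]
print(above, below)  # 1 2 -- one of the four rows (the NULL) is in neither bucket
```

A numeric sentinel like 2,147,483,647 would instead land in one of the two buckets, which is exactly the greater-than/less-than problem described above.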
Sum types would be great in SQL, but currently we get a union of the SQL type and null, so null for sentinel values it is, except for having to use weird syntax to compare it.
NULL is a statement that a value is not available, for unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually-appropriate value, e.g., using COALESCE(), to provide it.
It's a convenience shortcut that allows more complicated data models to be represented in simpler table structures than a fully normalized NULL-free data model would require, and that provides information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a result set via a query with JOINs, etc.
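A minimal COALESCE sketch via Python's `sqlite3` (the `readings` table and the default of 0.0 are invented; the contextually-appropriate default is up to the query author):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (val REAL)")
con.executemany("INSERT INTO readings VALUES (?)", [(579.13,), (None,)])

# COALESCE returns its first non-NULL argument, so the replacement
# happens at query time without touching the stored data.
print(con.execute("SELECT coalesce(val, 0.0) FROM readings").fetchall())
# [(579.13,), (0.0,)]
```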
I do not agree; it is most certainly a value that is stored in the database.
>NULL is a statement that a value is not available, for unspecified reasons.
Again I disagree: it is a value that denotes the value is of a different type than specified in the column's data type. The meaning of that value is for the user to decide. The system is literally storing a value that can be compared against using special equality syntax (IS NULL, IS DISTINCT FROM, etc.).
The actual column definition is a sum type defining the possible values in the column; it is a constraint on the values:
columnA int null
columnB int not null
If someone asks you what the value of a column in a result is when null do you say "I don't know" or do you say "null"?
For all these statements about what null means philosophically, and all the history about why it is treated the way it is in SQL, there is little compelling argument for why the equals operator should never return true when comparing nulls, forcing you to use a separate syntax to properly compare null values for equality, other than that it's for historical reasons and changing it would be difficult.
This adds no value over typical programming languages, where the normal equality operator can be used; it is weird, and a source of confusion and of even more bugs than typical null handling creates.
I would prefer a database with fully fleshed-out sum types rather than marking a column nullable; then a column could be marked as, say, a number + string + special sentinel type, or whatever combination of types makes sense for my application. If that db existed, I am sure its equality operator would properly compare type and value to give a consistent binary result rather than the nonsense that is SQL null equality.
Your statements about using COALESCE don't seem compelling to me (maybe I am misunderstanding): to search for a value efficiently with an index, you must use that value. I should be searching with column IS NULL, not COALESCE(column, [special value]) = [special value], which would be extremely inefficient.
Many languages have null-coalescing operators and still use the standard equality operators for null. Coalescing to a special numerical value in a numerical column to represent a sentinel is again a waste of time that leads to strange greater-than/less-than issues. Given a type system that allows nulls, I would rather use IS NULL than COALESCE (that would be a further step backward), but even better would be = NULL, getting rid of the IS operator and its variants.
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
If it is not recognizably different, then it is the same in that context, correct?
Two measurements of 25C are not recognizably different, therefore they are equal, correct, regardless of whether the actual temperatures are the same?
Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.
Having null != null has never been intuitive to me, especially since every other programming language treats them as equal. I am not hoping this gets changed; I know SQL is too far along for that, but I can still complain about it and agree it's weird.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parameterized SQL when your db doesn't support "is not distinct from", which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predicate exists for this very reason: to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151
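For reference, the portable fallback pattern in parameterized form; a sketch using Python's `sqlite3` with a made-up table `t` and helper name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])

def count_matching(param):
    # Fallback for engines without IS NOT DISTINCT FROM: bind the same
    # parameter twice so a NULL parameter matches NULL rows explicitly.
    sql = "SELECT count(*) FROM t WHERE v = ? OR (? IS NULL AND v IS NULL)"
    return con.execute(sql, (param, param)).fetchone()[0]

print(count_matching(1))     # 1
print(count_matching(None))  # 1 -- plain `v = ?` alone would match nothing here
```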
Yes, ergonomics dictates some weird behavior for nulls.
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
If you're compromising on your high-minded and unorthodox purist concept for ergonomics, you may as well finish the job of ergonomics and just use normal nulls where X=X is true.
It's possible that this is due to the underlying implementation.
In a unique column you'll normally have an index, so NULL becomes a special value in the index; but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed and every row has to be compared with every other row.
If you're including possibly NULL columns in a distinct or group by and you want to treat them in a particular way, use the COALESCE() or NVL() or whatever similar function to give a real value to the NULL for that purpose.
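A sketch of that approach with Python's `sqlite3` (the `obs` table and the 'UNKNOWN' label are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (grp TEXT)")
con.executemany("INSERT INTO obs VALUES (?)", [("a",), ("a",), (None,), (None,)])

# Substituting a label before grouping makes the intended NULL
# treatment explicit instead of relying on the engine's default.
rows = con.execute(
    "SELECT coalesce(grp, 'UNKNOWN') AS g, count(*) FROM obs GROUP BY g ORDER BY g"
).fetchall()
print(rows)  # [('UNKNOWN', 2), ('a', 2)]
```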
Yeah nothing wrong with “select distinct” itself if it’s used correctly for its intended reasons.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.