That doesn't address anything in the second half of the post though, starting with this pull quote:
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group: all the values are unknown.
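This grouping behaviour is easy to check directly. A minimal sketch using SQLite through Python's `sqlite3` module (the table `t` and its contents are made up for illustration):

```python
import sqlite3

# In-memory database; SQLite behaves like the engines the post tested.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (None,), (None,)])

# NULL = NULL evaluates to unknown, so this predicate matches no NULL rows...
print(con.execute("SELECT count(*) FROM t WHERE x = x").fetchone()[0])  # 2

# ...yet SELECT DISTINCT collapses all the NULLs into a single group.
print(con.execute("SELECT count(*) FROM (SELECT DISTINCT x FROM t)").fetchone()[0])  # 2
```

So the NULLs are never "equal" under `=`, but DISTINCT still returns one row for them, exactly as the pull quote describes.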
So NULLs are not comparable but they are part of the same set.
It doesn't treat them as equal, it treats them as one group. It does this because the result is more useful.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
I believe this confusion comes from conflating the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, it enters NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLs part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ..." returns several rows for known/measurable values but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing: several rows for known/measurable values but only one row for "UNKNOWN/ERROR".
This is where the foundation of relational database semantics in set theory shows through. You can’t model the behaviour with pure boolean logic.
In the SQL spec, unique indexes consider nulls distinct by default because you’re adding a tuple to the relation, and this is done by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
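Both halves of that distinction can be seen side by side; a sketch in SQLite via Python's `sqlite3` (the table `m` and its column name are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (reading REAL UNIQUE)")

# Two NULLs do not violate the UNIQUE constraint: for the index,
# each NULL is treated as distinct from the others.
con.execute("INSERT INTO m VALUES (NULL)")
con.execute("INSERT INTO m VALUES (NULL)")

# But the same two NULLs land in one group under the set projection.
rows = con.execute("SELECT DISTINCT reading FROM m").fetchall()
print(rows)  # [(None,)]
```

Two rows go in without a constraint violation; one group comes out.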
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
That's because "different" and "distinct" don't mean the same thing.
Two unknown values are assumed to be different, but they are not distinct from each other.
For example, take two boxes; in each box is a die, and the value of the box is the value shown on the die inside. You don't know the value since you can't see the die, and it may even change as you manipulate the box, so it is unknown: NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguishable, so, not distinct. All you need to know is that you hold two boxes with dice in them; which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.
>That's because "different" and "distinct" don't mean the same thing.
The literal definition of distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement of 25C the same as another of 25C? No: these measurements are approximations of the actual values, which are not equal to each other. They are distinct; they have just been lumped into the same 25C group due to the resolution of the measurement. Yet equality works just fine on that value in SQL.
I have used SQL for a long time; null handling is weird, inconsistent, and a waste of time. For all the language bugs due to the existence of null, at least I can count on null=null and not have to write garbage like value=param or (param is null and value is null).
I wanted to briefly reinforce this point with the fact that SQL has multiple equality operators - there is both `=` and `IS NOT DISTINCT FROM`. The latter operator will treat null values as equal, in alignment with the `DISTINCT` and `DISTINCT ON` operators.
It is extremely easy, using partial unique indexes and the proper operators, to treat nulls as non-distinct values, and the approach we have allows some very important functionality that aligns with standard statistical data treatment, which was influential in how SQL logic was originally designed.
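For illustration, here is how the two comparison operators differ, sketched with Python's `sqlite3`. SQLite has long spelled the null-safe comparison `IS`; `IS NOT DISTINCT FROM` is the standard spelling used by e.g. PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Plain `=` yields NULL (unknown) when either side is NULL,
# so in a WHERE clause the row would be filtered out.
print(con.execute("SELECT NULL = NULL").fetchone()[0])  # None

# The null-safe comparison treats two NULLs as equal; SQLite spells it `IS`.
print(con.execute("SELECT NULL IS NULL").fetchone()[0])  # 1
```

The first query returns NULL (Python `None`), the second returns true.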
An unknown measurement isn't a measurement value; it's a statement of (lack of) knowledge about a measurement, and it doesn't tell you what the measurement is. Knowledge about a measurement is as different from the measurement as the measurement itself is from the thing measured.
Whether two unknown measurements are the same is unknown.
Whether two measurements of 25C are the same is unknown; these are just values recorded in a database. 25 is a value, null is a value.
The values in the db are the same in both cases, which is what I would like my db language to deal with, without making assumptions about what that value actually means.
I see no value in treating null as special when it comes to equality in a SQL db; in fact, in my experience it is a hindrance that it does so.
The SQL null is a database-specific keyword and not something that's part of the domain of your measurements. If you want some kind of sentinel value for your measurements that doesn't have the "unknown" meaning of SQL null, then you should use your own sentinel value and not reuse null for that purpose
Sentinel values suck, especially when the language already has a perfectly good one built in.
Is 0 for a temp measurement an unknown sentinel or an actual measurement? How about 2,147,483,647? Great, probably not a measurement, but now it's always included in greater-than queries; same with the max negative value in less-than queries.
Null separates the value into its own distinct group and prevents it from being included in range queries, since it is not an actual numeric value, while most languages still allow you to compare equality using standard operators.
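A quick sketch of that range-query behaviour, using Python's `sqlite3` with a made-up `temps` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temps (c REAL)")
con.executemany("INSERT INTO temps VALUES (?)", [(-5.0,), (0.0,), (25.0,), (None,)])

# The NULL row is excluded from BOTH halves of a range split:
# each comparison evaluates to unknown, so the row is filtered either way.
above = con.execute("SELECT count(*) FROM temps WHERE c > 10").fetchone()[0]
below = con.execute("SELECT count(*) FROM temps WHERE c <= 10").fetchone()[0]
print(above, below)  # 1 2 -- one of the four rows (the NULL) is in neither bucket
```

A numeric sentinel like 2,147,483,647 would instead land in one of the two buckets, which is exactly the greater-than/less-than problem described above.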
Sum types would be great in SQL, but currently we get a union of the SQL type and null, so null for sentinel values it is, except for having to use weird syntax to compare it.
NULL is a statement that a value is not available, for unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually-appropriate value, e.g., using COALESCE(), to provide it.
It's a convenience shortcut that allows more complicated data models to be represented in simpler table structures than a fully normalized NULL-free data model would require, and that provides information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a result set via a query with JOINs, etc.
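A minimal COALESCE sketch via Python's `sqlite3` (the `readings` table and the default of 0.0 are invented; the contextually-appropriate default is up to the query author):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (val REAL)")
con.executemany("INSERT INTO readings VALUES (?)", [(579.13,), (None,)])

# COALESCE returns its first non-NULL argument, so the replacement
# happens at query time without touching the stored data.
print(con.execute("SELECT coalesce(val, 0.0) FROM readings").fetchall())
# [(579.13,), (0.0,)]
```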
I do not agree; it is most certainly a value that is stored in the database.
>NULL is a statement that a value is not available, for unspecified reasons.
Again I disagree: it is a value that denotes the value is of a different type than specified in the column's data type. The meaning of that value is for the user to decide. The system is literally storing a value that can be compared against using special equality syntax (IS NULL, IS DISTINCT FROM, etc.).
The actual column definition is a sum type defining the possible values in the column; it is a constraint on the values:
columnA int null
columnB int not null
If someone asks you what the value of a column in a result is when null do you say "I don't know" or do you say "null"?
For all these statements about what null means philosophically, and all the history about why it is treated the way it is in SQL, there is little compelling argument for why the equals operator should never return true when comparing nulls, forcing you to use a separate syntax to properly compare null values for equality, other than that it's for historical reasons and changing it would be difficult.
This adds no value over typical programming languages, where the normal equality operator can be used; it is weird, and a source of confusion and of even more bugs than typical null handling creates.
I would prefer a database with fully fleshed-out sum types rather than marking a column nullable; then a column could be marked as, say, a number + string + special sentinel type, or whatever combination of types makes sense for my application. If that db existed, I am sure its equality operator would properly compare type and value to give a consistent binary result rather than the nonsense that is SQL null equality.
Your statements about using COALESCE don't seem compelling to me (maybe I am misunderstanding): to search for a value efficiently with an index, you must use that value. I should be searching with column IS NULL, not COALESCE(column, [special value]) = [special value], which would be extremely inefficient.
Many languages have null-coalescing operators and still use the standard equality operators for null. Coalescing to a special numerical value in a numerical column to represent a sentinel is again a waste of time that leads to strange greater-than/less-than issues. Given a type system that allows nulls, I would rather use IS NULL than COALESCE (that would be a further step backward), but even better would be = NULL, getting rid of the IS operator and its variants.
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
If it is not recognizably different, then it is the same in that context, correct?
Two measurements of 25C are not recognizably different, therefore they are equal, correct, regardless of whether the actual temperatures are the same?
Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.
Having null != null has never been intuitive to me, especially since every other programming language treats them as equal. I am not hoping this gets changed; I know SQL is too far along for that, but I can still complain about it and agree it's weird.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parameterized SQL when your db doesn't support "is not distinct from", which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predicate exists for this very reason: to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151
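For reference, the portable fallback pattern in parameterized form; a sketch using Python's `sqlite3` with a made-up table `t` and helper name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])

def count_matching(param):
    # Fallback for engines without IS NOT DISTINCT FROM: bind the same
    # parameter twice so a NULL parameter matches NULL rows explicitly.
    sql = "SELECT count(*) FROM t WHERE v = ? OR (? IS NULL AND v IS NULL)"
    return con.execute(sql, (param, param)).fetchone()[0]

print(count_matching(1))     # 1
print(count_matching(None))  # 1 -- plain `v = ?` alone would match nothing here
```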
Yes, ergonomics dictates some weird behavior for nulls.
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
If you're compromising on your high-minded and unorthodox purist concept for ergonomics, you may as well finish the job of ergonomics and just use normal nulls where X=X is true.
It's possible that this is due to the underlying implementation.
In a unique column you'll normally have an index, so NULL becomes a special value in the index; but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed and every row has to be compared with every other row.
If you're including possibly NULL columns in a distinct or group by and you want to treat them in a particular way, use the COALESCE() or NVL() or whatever similar function to give a real value to the NULL for that purpose.
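A sketch of that approach with Python's `sqlite3` (the `obs` table and the 'UNKNOWN' label are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (grp TEXT)")
con.executemany("INSERT INTO obs VALUES (?)", [("a",), ("a",), (None,), (None,)])

# Substituting a label before grouping makes the intended NULL
# treatment explicit instead of relying on the engine's default.
rows = con.execute(
    "SELECT coalesce(grp, 'UNKNOWN') AS g, count(*) FROM obs GROUP BY g ORDER BY g"
).fetchall()
print(rows)  # [('UNKNOWN', 2), ('a', 2)]
```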
Yeah nothing wrong with “select distinct” itself if it’s used correctly for its intended reasons.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.