You sound like you mostly write ORM-like OLTP data retrieval. As somebody who's ...

petalmind · on April 14, 2023

I wonder what the problem is, how exactly is it falling apart: a) you need to write a lot of "clean up" sub-queries and fragments or b) the question you ask is more complicated and does not fit into the algebra of SQL.

If it's more the first: if your dataset was perfectly cleaned up (say, 3NF, with all data values cleaned up) — how longer would SQL go for you without falling apart?

As I write this comment I realize that it could be worded as "is the problem with the logic or the data"? Thanks,

orthoxerox · on April 14, 2023

I've had to help some of our data scientists debug their queries after we migrated to Hive 3, and it's mostly the way large queries are structured. You often want to add a single column to the table that is referenced by additional columns (like, convert date of birth to age, then create age buckets and an is_minor flag, calculate the age when they opened their first account).

CTEs are the best way to do this but get clunky and repetitive as you have to give each one a name. CROSS LATERAL JOIN is completely illegible. Something like LET column = expression would have improved that:

    FROM customers
    LET age = EXTRACT(YEAR FROM current_date() - dob)
    LET age_bucket = CASE ... END
    LET date_1st_ac = (SELECT MIN ...)
    ...
    SELECT ...

The other problem lies in the fact that an SQL query is one huge query. You have to take it apart to sanity-check the intermediate result: okay, let me quickly turn this CTE into the main query and run it, no, let me try again with another CTE, okay, found it, now I have to convert everything back to run the whole query again.

In comparison, Spark dataframes are all made equal, I can name any one of them, examine their schema, run them to look at the intermediate result, all without introducing optimization barriers.

vkazanov · on April 14, 2023

I'd say there are problems on 3 levels:

1. All the little things, inconsistencies and annoyances. You know, NULL behaviour, identifier usage, things that you just have to memorise (everything is non-orthogonal).

2. Language-level. No real abstraction, no code reuse, weird sublanguages for trivial stuff, no libraries/modules/packages.

3. Standard-level. I mean, this is not standard anybody can use. It is unreadable! I mean I can go a very long way if necessary - used to reading the original C++ and C stds. But ANSI SQL... A bottomless pit. Databases do implement subsets of something ANSI-ish but these vary a lot.

All of this results in templated queries, 1000s of lines of queries that are only slightly different and no chance for a truly db-agnostic language. I mean... Have you seen write-once 3000 LOC queries? This is what I can "fall apart"!

I like the original Codd's relational alebra, even love it. It is both simple and practical. Most databases use RelAlg-like internal representations that optimisers work on for a reason!

But we're stuck with SQL and it will never go away.

timacles · on April 15, 2023

SQL is just too old and too powerful to be replaced by something. There might be implementations that improve on some subset of it's functionality. But nothing can possibly cover everything it does without itself becoming kludgy feeling