SQL Pipe gives headaches but comes with benefits
14 March 2025
Databricks Runtime 16.2 introduced a new SQL pipe syntax that reverses the traditional SQL query structure. This feature lets you chain query operations with a pipe operator (|>), constructing queries step by step. While powerful, it initially feels awkward for those who have written SQL the same way for years.
In this article, I explore why the new syntax can be a headache for experienced SQL users, why it is still worth using (especially for dynamic queries), and how to leverage it in an object-oriented way with Python.
Classic SQL

Pipe SQL

The “Headache” Factor
SQL users may be surprised when they first encounter the pipe syntax: instead of the familiar SELECT … FROM … WHERE order, a pipe query starts with the FROM clause and chains the other parts after it. In other words, you write queries in execution order (FROM, then WHERE, then SELECT, and so on), not in the traditional textual order.
For example, a standard SQL query:
SELECT *
FROM users
WHERE age > 30;
would be written in pipe form as:
FROM users
|> WHERE age > 30;
For anyone used to writing the SELECT clause first, this feels unnatural.
One Reddit user humorously dubbed it the “weird inverted Yoda form” of SQL. Just as Yoda in Star Wars speaks with reversed grammar, the pipeline syntax forces us to write the query backward compared to everyday SQL habits. It does make logical sense — databases internally process the FROM and WHERE before SELECT, so the pipeline mirrors that execution order.
However, it conflicts with how humans have been trained to write SQL for decades, hence the initial headache. Long-time SQL developers might need to unlearn old habits to get comfortable writing FROM first and SELECT last.
Why is it worthwhile despite the headache?
You gain flexibility in constructing queries on the fly by breaking a query into chained logical pieces.
-- 1. Define variables for parts of the query
DECLARE base_table STRING DEFAULT 'users';
DECLARE age_filter STRING DEFAULT 'age > 30';
-- 2. Build the pipeline query string using concatenation
DECLARE pipeline_sql STRING;
SET VARIABLE pipeline_sql =
'FROM ' || base_table ||
' |> WHERE ' || age_filter ||
' |> SELECT name, age';
-- You can inspect the constructed query
SELECT pipeline_sql; -- FROM users |> WHERE age > 30 |> SELECT name, age
-- 3. Execute the dynamically constructed SQL query
EXECUTE IMMEDIATE pipeline_sql;
In the code above, we first declare some SQL variables: base_table holds the table name (in this case, 'users'), and age_filter holds the filter condition.
We then construct a variable pipeline_sql by concatenating the parts into one string. The result of the concatenation would be a query string like:
FROM users
|> WHERE age > 30
|> SELECT name, age
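The same assembly works just as well on the Python side. Here is a minimal sketch of a helper that joins a FROM clause and any number of pipe stages with the |> operator (the function name build_pipe_query is illustrative, not part of any Databricks API):

```python
def build_pipe_query(base_table: str, *stages: str) -> str:
    """Join a FROM clause and subsequent pipe stages with ' |> '."""
    return " |> ".join([f"FROM {base_table}", *stages])

query = build_pipe_query("users", "WHERE age > 30", "SELECT name, age")
print(query)  # FROM users |> WHERE age > 30 |> SELECT name, age
```

Because each stage is just a string, stages can be appended conditionally before the final join, which is exactly the flexibility the dynamic SQL example above relies on.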
Object-oriented query building in Python
Instead of manually concatenating parts of a SQL string (and worrying about spaces, commas, and where to insert clauses), you can represent the query as an object in your code and produce the pipeline SQL from that object.
Consider using a simple Python class to build an SQL pipe query. Each part of the query can be an attribute, and the class can have a method to create and execute the pipeline string.
Here is an example:
%python
class QueryBuilder:
    def __init__(self, table):
        self.table = table   # Table name for the FROM clause
        self.filter = None   # WHERE condition (optional)
        self.limit = None    # LIMIT value (optional)

    def set_filter(self, condition):
        """Specify a WHERE condition."""
        self.filter = condition

    def set_limit(self, n):
        """Specify a LIMIT for the query."""
        self.limit = n

    def build_pipeline_sql(self):
        """Assemble the SQL pipe string from the configured parts."""
        query = f"FROM {self.table}"
        if self.filter:
            query += f" |> WHERE {self.filter}"
        if self.limit:
            query += f" |> LIMIT {self.limit}"
        return query

    def execute(self):
        """Execute the built query using the Spark session (Databricks)."""
        sql_query = self.build_pipeline_sql()
        print(f"Executing: {sql_query}")
        return spark.sql(sql_query)
# Example usage:
qb = QueryBuilder("users")
qb.set_filter("age > 30")
qb.set_limit(10)
result_df = qb.execute()
result_df.show()
'''
Executing: FROM users |> WHERE age > 30 |> LIMIT 10
+-----+---+
| name|age|
+-----+---+
|Alice| 31|
|Cindy| 35|
+-----+---+
'''
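A small design variation worth considering: have the setters return self so calls can be chained fluently. The sketch below (class name FluentQueryBuilder is my own, and Spark execution is omitted) shows the idea:

```python
class FluentQueryBuilder:
    def __init__(self, table):
        self.parts = [f"FROM {table}"]  # stages collected in order

    def where(self, condition):
        self.parts.append(f"WHERE {condition}")
        return self  # returning self enables method chaining

    def limit(self, n):
        self.parts.append(f"LIMIT {n}")
        return self

    def build(self):
        return " |> ".join(self.parts)

sql = FluentQueryBuilder("users").where("age > 30").limit(10).build()
print(sql)  # FROM users |> WHERE age > 30 |> LIMIT 10
```

The fluent style mirrors the pipe syntax itself: each method call appends one stage, in the order it will execute.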
Conclusion
Databricks’s new SQL pipe syntax initially feels awkward, particularly for experienced SQL users who are used to writing queries in the classic style.
The FROM-first order can be a headache; it’s like learning to speak SQL in reverse. However, once the initial discomfort fades, the benefits become clear.
Pipe syntax enables more modular query construction, which is especially useful when queries must be generated at runtime.
Good news! You can download the code used in this article here.