Feature Store at Scale with Azure Databricks Unity Catalog

27 February 2025

 

Addressing customer pain points

Many organizations face the challenge of hyper-personalization at scale in their data and AI initiatives.

Customers often ask:

  • How can we optimize data preparation, computation, and storage for large-scale hyper-personalization?

Let’s explore a real-world scenario where we tackled this problem using Azure Databricks and Unity Catalog.

 

Key challenges of scale:

• 25,000+ features used for ML solutions.

• Years of feature history requiring storage and efficient querying.

• Multiple dimensions of feature calculation, such as:

  ◦ Different time windows.

  ◦ Complex aggregation functions.

  ◦ Handling NULL values and edge cases.

 

Basics of feature engineering

At its core, feature engineering transforms raw data into meaningful inputs for machine learning models. While the basics are straightforward, scaling these operations to thousands of features and large datasets is the real challenge.
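
To make this concrete, here is a minimal sketch of the kind of transformation involved. The table and column names (transactions, customer_id, amount) are illustrative, not taken from the project described here:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    raw = spark.read.table("transactions")  # hypothetical raw data

    features = (
        raw.groupBy("customer_id")
        .agg(
            F.count(F.lit(1)).alias("txn_count"),       # simple volume feature
            F.sum("amount").alias("total_amount"),      # monetary feature
            F.avg("amount").alias("avg_amount"),        # behavioral feature
        )
        # Make NULL handling explicit so models see consistent values
        .fillna({"total_amount": 0.0, "avg_amount": 0.0})
    )

Each of these steps is trivial on its own; the difficulty is repeating them reliably for tens of thousands of features over years of history.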


We identified the following critical needs to address this challenge:

• Scalability to handle massive datasets.

• Fast processing for real-time or near-real-time scenarios.

• Support for multiple programming languages.

• A robust data governance model.

• Integration with MLOps frameworks.


Unity Catalog integration: a real game-changer

 

Unity Catalog became the cornerstone of our solution. It provided:

• Data and ML model governance with centralized access control.

• Auditing, lineage, and data discovery across Databricks workspaces.

• Traceability by creating robust links between data and ML assets.
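
As an illustration, the snippet below sketches how a feature table can be registered under Unity Catalog governance using the Databricks Feature Engineering client (available on Databricks ML runtimes). The three-level catalog.schema.table name and the source table are placeholders:

    from databricks.feature_engineering import FeatureEngineeringClient
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    features = spark.read.table("staged_customer_features")  # hypothetical

    fe = FeatureEngineeringClient()

    # The Unity Catalog path and description below are illustrative.
    fe.create_table(
        name="ml_catalog.features.customer_features",
        primary_keys=["customer_id"],
        df=features,
        description="Customer-level transaction features",
    )

Once registered, the table inherits Unity Catalog's access control, lineage, and discovery, so the same governed features can serve both training and inference.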

 

This foundation allowed us to implement additional components, such as:

• Configuration tables for data preprocessing.

• Delta tables for intermediate datasets and ML results.

• Monitoring metadata for models and features.

• Naming conventions for experiments and schemas.

 

How does Elitmind maximize Feature Store benefits?

 

Our tailored implementation of a feature store delivers transformative benefits:

• Feature reusability: Consistent definitions for training and inference, reducing redundancy.

• Model deployment at scale: Efficiently operationalize large-scale ML models.

• Cost optimization: Balance performance and cost through strategic resource allocation.

• Discoverability: Automated lineage tracking helps teams understand feature origins and transformations effortlessly.

• Faster time-to-value: Accelerate your AI initiatives with ready-to-use features.

 

Performance optimization

 

Scaling feature engineering in the cloud comes with its challenges. We implemented several best practices to ensure optimal performance:

• Cluster optimization: Select appropriately sized clusters to avoid disk spill.

• Partition management: Tune the number of shuffle partitions.

• Efficient DataFrame operations: Use withColumns() to add multiple columns instead of chaining withColumn() calls, as in the sketch below.
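
A sketch of the difference, with illustrative table and column names (withColumns() is available from PySpark 3.3):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.table("transactions")  # hypothetical table

    # Chaining withColumn() adds one projection per call, which bloats
    # the query plan for wide feature sets:
    #   df = df.withColumn("f1", ...).withColumn("f2", ...)  # and so on

    # withColumns() adds all columns in a single projection.
    # The feature expressions below are illustrative.
    df = df.withColumns({
        "amount_log": F.log1p("amount"),
        "is_weekend": F.dayofweek("txn_date").isin(1, 7).cast("int"),
        "amount_bucket": (F.col("amount") / 100).cast("int"),
    })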

For example, in one of our projects we optimized the Azure Data Factory pipeline that triggers feature store calculations. By running feature calculations in parallel on Databricks clusters, we reduced the processing time for a large dataset from over 12 hours to just 1 hour and 27 minutes.

 

Addressing scale limitations

 

Even with Azure Databricks’ robust capabilities, certain limitations require innovative approaches. Elitmind’s expertise ensures:

• Feature table size management: Splitting features across multiple tables to stay under the 10,000-column limit (see the sketch after this list).

• Networking constraints resolution: Mitigating IP address limitations with serverless clusters; in one case, we implemented VNet configurations tailored to large-scale feature engineering workloads.

• Azure quota management: Monitoring and adjusting quotas while minimizing cost impact. For a client with stringent cost requirements, we implemented a cost-aware cluster scaling strategy.
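
For the first point, here is a minimal sketch of splitting a wide feature set into multiple Delta tables that share a primary key. The key name, chunk size, and table names are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    features = spark.read.table("wide_features")  # hypothetical wide table

    KEY = "customer_id"
    MAX_COLS = 9_999  # headroom under the 10,000-column limit

    feature_cols = [c for c in features.columns if c != KEY]
    chunks = [feature_cols[i:i + MAX_COLS]
              for i in range(0, len(feature_cols), MAX_COLS)]

    # Write each chunk as its own Delta table, keeping the key in every part.
    for i, cols in enumerate(chunks):
        (features.select(KEY, *cols)
            .write.format("delta")
            .mode("overwrite")
            .saveAsTable(f"ml_catalog.features.customer_features_part{i}"))

The shared key lets downstream jobs join the parts back together when a model needs features from more than one table.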

 

Advanced feature engineering

 

We implemented a metadata-driven approach to feature engineering, which makes this solution even more flexible. Through a user interface (CSV or JSON files, a Databricks App, or a PowerApp), we control how features are created: the configuration defines the queries, filters, and aggregation functions to apply, as in the sketch below.
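
A simplified sketch of the idea: each configuration entry names a feature and describes the filter and aggregation to apply. The JSON shape, column names, and function mapping are assumptions for illustration, not the project's actual schema:

    import json
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    raw = spark.read.table("transactions")  # hypothetical source

    # Hypothetical metadata; in practice it could come from a CSV/JSON file
    # or a config table maintained through a Databricks App or PowerApp.
    feature_config = json.loads("""
    [
      {"name": "spend_30d", "column": "amount", "agg": "sum",
       "filter": "txn_date >= date_sub(current_date(), 30)"},
      {"name": "txns_30d", "column": "1", "agg": "count",
       "filter": "txn_date >= date_sub(current_date(), 30)"}
    ]
    """)

    AGG_FUNCS = {"sum": F.sum, "count": F.count, "avg": F.avg,
                 "min": F.min, "max": F.max}

    # One aggregation expression per config entry: the CASE expression
    # applies the row filter, the mapped function performs the aggregation.
    exprs = [
        AGG_FUNCS[c["agg"]](
            F.expr(f"CASE WHEN {c['filter']} THEN {c['column']} END")
        ).alias(c["name"])
        for c in feature_config
    ]

    features = raw.groupBy("customer_id").agg(*exprs)

Adding a new feature then means adding a configuration row rather than writing new pipeline code.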


Our expertise extends beyond basic feature definitions to advanced calculations such as rolling statistics. By generating rolling aggregates (e.g., sum, min, max, and custom trends) across multiple periods (e.g., 3, 6, 12 months), we enabled richer data insights for predictive modeling. This was achieved using automated configurations and workflows that minimized manual effort.
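
A sketch of how such rolling aggregates can be generated from a monthly-grain table; the table and column names (monthly_customer_spend, month, spend) are illustrative:

    from pyspark.sql import SparkSession, Window
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    monthly = spark.read.table("monthly_customer_spend")  # hypothetical

    # Index each month so a range window can look back N-1 months.
    df = monthly.withColumn(
        "month_idx",
        F.months_between(F.col("month"),
                         F.to_date(F.lit("1970-01-01"))).cast("int"),
    )

    for months in (3, 6, 12):
        w = (Window.partitionBy("customer_id")
             .orderBy("month_idx")
             .rangeBetween(-(months - 1), 0))
        df = df.withColumns({
            f"spend_sum_{months}m": F.sum("spend").over(w),
            f"spend_max_{months}m": F.max("spend").over(w),
        })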


Why choose Elitmind?

Elitmind combines deep technical expertise with practical experience to deliver scalable, cost-effective solutions for metadata-driven feature engineering.

Our approach ensures:

• End-to-end support, from design to deployment.

• Alignment with best practices for Azure Databricks and Unity Catalog.

• Accelerated AI project timelines with measurable business outcomes.

 

Let’s discuss your needs

Ready to solve your feature engineering challenges? Contact our AI/ML Lead and Databricks Solutions Architect Champion, Przemysław Wiesiołek, to explore how Elitmind’s solutions can drive your hyper-personalization efforts.

Book a time in his calendar today and start transforming your data into actionable insights!

Is the sky the limit in the cloud? With Elitmind’s expertise, the possibilities are endless.