Snowpark-Optimized Warehouses: Production-Ready ML Training and Other Memory-Intensive Operations

With Snowpark, our customers have begun to leverage Snowflake for more complex data engineering and data science workloads using languages such as Java and Python. This new wave of developers using Snowflake often requires more flexibility in the underlying compute infrastructure to unlock memory-intensive operations on large data sets such as ML training.

To support these workloads in production, we’re excited to launch Snowpark-optimized warehouses in general availability in all Snowflake regions across AWS, Azure, and GCP. 

Demo: Running 200 forecasts in 10 minutes using XGBoost and Snowpark-optimized warehouses

Snowpark-optimized warehouses have compute nodes with 16x the memory and 10x the local cache compared with standard warehouses. The larger memory helps unlock memory-intensive use cases on large data sets such as ML training, ML inference, data exports from object storage, and other memory-intensive analytics that could not previously be accommodated in standard warehouses. 
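A Snowpark-optimized warehouse is created with the same DDL as a standard warehouse, with the warehouse type set accordingly; the warehouse name and size below are illustrative, not prescribed:

```sql
-- Illustrative name and size; WAREHOUSE_TYPE selects the
-- high-memory Snowpark-optimized node configuration.
CREATE OR REPLACE WAREHOUSE snowpark_opt_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';
```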

As a result, data teams can now run end-to-end ML pipelines in Snowflake in a fully managed manner without having to use additional systems or move data across governance boundaries.

Snowpark-optimized warehouses also inherit all the benefits of Snowflake virtual warehouses:

  • Fully managed: Snowflake oversees the maintenance, security patching, tuning, and delivery of the latest performance enhancements transparently
  • Elastic: Elastic scaling of compute supports virtually any number of users, jobs, or data with multi-tenant security and resource isolation
  • Reliable: Industry-leading SLA is consistently upheld
  • Secure: Governance controls are applied across all workloads without trade-offs

Since the new warehouse option was announced in public preview in November 2022, we’ve rolled out performance improvements, increased region availability, and made behind-the-scenes stability improvements.

The 10x larger local cache on each Snowpark-optimized warehouse node accelerates subsequent runs by reusing cached artifacts (Python packages, JARs, intermediate results, etc.) across executions. With these performance improvements, Snowpark developers get more out of each compute credit and process large data sets more efficiently. We have also invested in improving the performance of the most popular Python libraries by adding Joblib multiprocessing support in Snowpark for Python stored procedures.
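To illustrate the pattern Joblib support enables: inside a stored procedure, Joblib's `Parallel`/`delayed` API fans independent model fits out across worker processes. The sketch below reproduces that fan-out with the standard library's `concurrent.futures` so it runs anywhere; the toy `fit_one` function is a stand-in for a real per-group training routine, not Snowflake code:

```python
from concurrent.futures import ProcessPoolExecutor

def fit_one(series):
    """Toy stand-in for a per-group model fit (e.g., one forecast model).

    Here the "model" is just the mean of the series.
    """
    return sum(series) / len(series)

def fit_all(groups, max_workers=4):
    """Fan independent fits out across processes, Joblib-style,
    and collect one result per group in input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, groups))

if __name__ == "__main__":
    groups = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
    print(fit_all(groups))  # → [2.0, 15.0, 5.0]
```

With Joblib itself the inner call would be `Parallel(n_jobs=...)(delayed(fit_one)(g) for g in groups)`; the fan-out structure is the same.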

In addition to unlocking single-node ML training use cases, Snowpark-optimized warehouses also include optimizations for multi-node use cases. When UDFs are run on a warehouse with multiple nodes (size L or larger), Snowflake leverages the full power of the warehouse by parallelizing computations through redistribution of rows between nodes in the warehouse. Statistics on UDF execution progress are used to balance the distribution of work among compute nodes and maximize parallelism.
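Conceptually, this redistribution amounts to splitting rows into batches, applying the UDF to each batch on a different node, and recombining the results. The minimal local sketch below is not Snowflake's implementation, just an illustration of that shape with a hypothetical per-row UDF:

```python
from concurrent.futures import ProcessPoolExecutor

def my_udf(row):
    """Hypothetical per-row UDF; here it simply squares the value."""
    return row * row

def apply_udf_batch(batch):
    """Apply the UDF to every row in one batch (one 'node's share)."""
    return [my_udf(r) for r in batch]

def apply_udf_parallel(rows, nodes=4):
    """Redistribute rows into one batch per node and process in parallel."""
    batches = [rows[i::nodes] for i in range(nodes)]
    with ProcessPoolExecutor(max_workers=nodes) as pool:
        results = pool.map(apply_udf_batch, batches)
    # Row order is not preserved across batches, matching the
    # unordered semantics of applying a UDF over a table.
    return [x for batch in results for x in batch]
```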

Since moving to public preview, we have seen the adoption of a variety of memory-intensive use cases by customers such as Spring Oaks Capital and Innovid.

Customer success stories

Spring Oaks Capital is a national financial technology company that focuses on the acquisition of consumer credit portfolios. The data science team evaluates millions of records to provide predictions that give their team the insights needed to optimize their debt pricing and purchasing strategies. One of their machine learning models runs every morning to provide call centers with prioritized call lists based on expected conversion. 

To ensure the highest levels of productivity with the latest set of features, Spring Oaks needs to compute large amounts of feature data reliably every morning. Watch an overview of the architecture that has given Spring Oaks an 8x performance improvement over its prior solution.

Innovid, which powers advertising delivery, personalization, and measurement for the world’s largest brands, has also been using Snowpark-optimized warehouses. Innovid collects approximately 6 billion data points from over 1 billion ads each day. Using Snowpark-optimized warehouses, the data science team is able to process these very large data sets and train ML models to provide sophisticated solutions in cross-platform ad serving, data-driven creative, and converged TV measurements for their global client base. Read more about Innovid’s experience using Snowpark for ML.

How to get started

You can get started with Snowpark-optimized warehouses by following the usage instructions in our documentation and quickstart guide, which include step-by-step setup instructions and product details. We’re continuously looking for ways to improve, so if you have any questions or feedback about the product, make sure to let us know in the Snowflake Forums community.

The post “Snowpark-Optimized Warehouses: Production-Ready ML Training and Other Memory-Intensive Operations” appeared first on Snowflake.