Introduction
Microsoft Fabric has emerged as a comprehensive platform not only for Power BI and data analytics but also for building robust end-to-end data ingestion pipelines. The platform offers capabilities such as Data Pipelines and Spark notebooks for constructing sophisticated data workflows.
Most enterprise data pipelines incorporate Spark Notebook code for:
- Data ingestion using dynamic metadata-driven architecture
- Transformation processes for individual artifacts like Dimensions and Facts
- Complex data processing workflows
However, many data engineers encounter a common performance challenge: notebooks that execute quickly in interactive development sessions become significantly slower when deployed in Data Pipeline activities. This article shares practical insights and proven solutions to optimize notebook execution performance in Microsoft Fabric Data Pipelines.
The Performance Challenge
The Problem: Notebooks running in interactive sessions (development mode) execute faster than the same notebooks when integrated into Data Pipeline activities.
The Impact: This performance degradation can significantly affect:
- Overall pipeline execution time
- Resource utilization efficiency
- Operational costs
- Data delivery SLAs
Let’s explore four common scenarios with their root causes and practical solutions.
Test Environment Setup
For this analysis, I used two simple test notebooks:
- Notebook1 and Notebook2: Basic notebooks that print the current timestamp (see the sketch after this list)
- Execution time in interactive mode: Milliseconds
- Purpose: Isolate performance issues from complex business logic
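For reference, here is a minimal sketch of the single cell each test notebook contained. The exact code is illustrative; the point is that the notebook body itself costs only milliseconds, so any measurable activity duration comes from session startup and pipeline overhead.

```python
# Minimal test cell (illustrative): the notebook does nothing but print
# the current timestamp, so measured activity time reflects session startup
# and pipeline overhead rather than the code itself.
from datetime import datetime

print(f"Notebook executed at: {datetime.now().isoformat()}")
```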
Scenario 1: Single Pipeline, Single Notebook Activity
Configuration
- Total Execution Time: ~41 seconds
- Spark Session Startup: ~20 seconds
- Notebook1 Activity: ~21 seconds
🔍 Issue Analysis
The notebook takes significantly longer to execute compared to interactive mode.
💡 Root Cause
When executing a Data Pipeline, Microsoft Fabric needs time to:
- Provision compute resources
- Initialize the Spark session
- Load necessary libraries and configurations
In interactive mode, the Spark session is already running and warm.
⚡ Solution
This is the baseline scenario with limited optimization opportunities. The Spark session startup time is unavoidable for the first execution.
Key Takeaway: Accept this as the baseline performance cost for pipeline execution.
Scenario 2: Single Pipeline, Multiple Notebook Activities
Configuration
Before Optimization:
- Total Execution Time: ~62 seconds
- Spark Session Startup: ~20 seconds (×2)
- Notebook1 Activity: ~21 seconds
- Notebook2 Activity: ~21 seconds
🔍 Issue Analysis
The pipeline creates a new Spark session for each notebook activity instead of reusing the existing session.
💡 Root Causes
- High concurrency for pipelines disabled: the workspace Spark setting that lets a pipeline run multiple notebooks in one session (“for pipeline running multiple notebooks”) is not enabled
- Different Default Lakehouses: Notebooks have different default lakehouse configurations, forcing separate sessions
⚡ Solutions
Solution 1: Enable High Concurrency
Navigate to Workspace settings → Data Engineering/Science → Spark settings → High concurrency, and enable the option for pipelines running multiple notebooks.
Solution 2: Standardize Default Lakehouses
Ensure all notebooks in your pipeline use the same Default Lakehouse (a code-based alternative is sketched after these steps):
- Open each notebook
- Navigate to lakehouse settings
- Set the same default lakehouse for all notebooks
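If you prefer to pin the default lakehouse in code rather than through the UI, Fabric notebooks also support setting it via the `%%configure` magic, run as the first cell of the notebook. A minimal sketch, where the lakehouse name and the optional IDs are placeholders you would replace with your own:

```python
%%configure
{
    "defaultLakehouse": {
        "name": "SharedLakehouse",
        "id": "<lakehouse-id>",
        "workspaceId": "<workspace-id>"
    }
}
```

Keeping this configuration identical across all notebooks in the pipeline is what allows them to share a single session.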
📊 Performance Improvement
After Optimization:
- Total Execution Time: ~42 seconds
- Performance Gain: ~20 seconds (32% improvement)
- Spark Session Reuse: ✅ Achieved
Scenario 3: Multiple Pipelines with Nested Notebook Activities
Configuration
Common Architecture:
- PL-MainPipeline: Orchestrates the entire workflow
- PL-Pipeline1: Executes Notebook1
- PL-Pipeline2: Executes Notebook2
Performance:
- Total Execution Time: ~62 seconds
- Issue: Spark session not reused across different pipeline contexts
🔍 Issue Analysis
Even with proper configuration from Scenario 2, execution time reverts to ~62 seconds because sub-pipelines operate in different execution contexts.
💡 Root Cause
Multiple sub-pipelines cannot directly share Spark sessions due to isolation boundaries between different pipeline execution contexts.
⚡ Solution: Session Tags
Implement session tags to enable Spark session sharing across pipeline boundaries:
- For each Notebook Activity, navigate to: Notebook Activity → Settings → Advanced Settings → Session Tag
- Set a consistent session tag (e.g., “SharedComputation”)
- Apply consistently: use the same session tag across ALL notebook activities in ALL pipelines
🎯 Result
This configuration enables Data Pipelines to recognize and reuse existing Spark sessions, eliminating redundant session creation.
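If you control the orchestration from a notebook rather than from sub-pipelines, a related technique is to run the child notebooks from a single driver notebook with `notebookutils.notebook.runMultiple` (formerly `mssparkutils`), which executes them inside the driver’s already-running Spark session. A minimal sketch, assuming the sample notebook names from this article:

```python
# Driver notebook cell (illustrative). notebookutils is available by default
# in Fabric notebooks, so no import is required there.
# Both child notebooks run inside this notebook's existing Spark session,
# so no additional session startup cost is paid.
results = notebookutils.notebook.runMultiple(["Notebook1", "Notebook2"])
print(results)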
Scenario 4: Custom Spark Environments with Public Libraries
Configuration Challenge
Using custom Spark environments with public libraries that aren’t pre-installed.
🔍 Issue Analysis
Performance Impact Example:
- Standard session startup: 40–50 seconds
- With custom libraries: 4–5 minutes
- Root cause: Library installation during session initialization
💡 Strategic Approach
When to Use Custom Environments:
- ✅ Frequently used libraries across multiple notebooks
- ✅ Libraries required for most pipeline executions
- ✅ Consistent library versions needed
When to Use Inline Installation (see the sketch after this list):
- ✅ Libraries used in only 1–2 notebooks
- ✅ Infrequently executed notebooks
- ✅ Experimental or temporary library needs
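For the inline route, a library can be installed with a `%pip` magic near the top of the notebook. The package is scoped to the current session and reinstalled on every run, which is why this only makes sense for infrequently used libraries. A minimal sketch, with `semver` standing in for whatever package you actually need:

```python
# Inline, session-scoped installation (illustrative package name).
# The install cost is paid on every run, so reserve this pattern
# for libraries used in only a few, infrequently executed notebooks.
%pip install semver
```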
Best Practices Summary
🎯 Configuration Checklist
- Enable high concurrency for multiple notebook execution
- Standardize default lakehouses across all notebooks
- Implement consistent session tags for cross-pipeline sharing
- Optimize library installation strategy
📋 Monitoring & Validation
- Measure baseline performance before optimization
- Monitor Spark session reuse in pipeline logs (a quick check is sketched after this list)
- Track execution time improvements
- Validate cost reduction metrics
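A simple way to confirm session reuse is to print the Spark application ID at the start of each notebook: if the ID is identical across the notebook activities in one pipeline run, they shared a session. A minimal sketch (the `spark` session object is predefined in Fabric notebooks):

```python
# Print the Spark application ID (illustrative reuse check).
# Matching IDs across notebook activities within one pipeline run
# indicate the activities reused a single Spark session.
print(f"Spark application ID: {spark.sparkContext.applicationId}")
```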
🔧 Troubleshooting Tips
- Session not reusing: Check lakehouse consistency
- Cross-pipeline issues: Verify session tag implementation
- Library conflicts: Review custom environment configuration
- Performance regression: Audit recent configuration changes
Conclusion
Optimizing notebook execution in Microsoft Fabric Data Pipelines requires a systematic approach to session management and resource utilization. The strategies outlined in this article can deliver significant performance improvements:
- Immediate impact: 20–30% improvement with basic optimizations
- Advanced optimization: 65–70% improvement with comprehensive implementation
- Cost benefits: Reduced compute resource consumption and faster data delivery
The key is understanding how Spark sessions are created, managed, and shared across different pipeline contexts. By implementing these optimization strategies, data engineering teams can build more efficient, cost-effective data pipelines while maintaining reliability and scalability.
About This Analysis
All execution times are based on sample notebooks printing timestamps. Results were validated across multiple configurations and production environments. Individual results may vary based on specific use cases, data volumes, and infrastructure configurations.
Connect with me: arif.ansari@aewee.com, arif@zekibi.com
Consulting Services: Data Analytics & Engineering Solutions
LinkedIn: https://www.linkedin.com/in/arifhusen-ansari-8636b762
Company Page: https://www.linkedin.com/company/aewee