While working with many customers who have implemented Azure Synapse Analytics and a data management landing zone, we have seen several of them face challenges in enabling seamless hybrid connectivity across Azure Synapse services and data integration. This blog covers a few nuances of Managed Virtual Network and how data flows end to end between Azure Synapse and on-premises.
Managed Virtual Network is a new capability that eases network configuration when creating services such as Azure Data Factory, Azure Synapse Analytics and Azure Purview. It builds on Private Link and Private Endpoints. Private Endpoints give many Azure services a network interface in a customer Virtual Network, and each Azure resource gets a private IP address from that Virtual Network's address space. This provides secure access to specific Azure resources over ExpressRoute or site-to-site (S2S) VPN, and from within Azure.
With Managed VNet for Azure Synapse, there are a few additional aspects:
- Azure Synapse Managed private endpoints.
This manages the private endpoints from the Synapse workspace to other Azure resources, for example ADLS Gen2, Cosmos DB and Azure SQL Database. Traffic between the Azure Synapse workspace and those resources traverses the Microsoft backbone network over Private Link.
- Managed Virtual Networks are not visible to users. They are managed by Azure Synapse workspace.
- Azure Synapse Private Link Hubs
Synapse provides Synapse Studio, a web UI for creating artifacts such as notebooks, pipelines and SQL scripts. Private endpoint connections are added to Private Link hubs so that Synapse Studio can be accessed over a private IP address.
- Private DNS
Azure Private DNS can manage DNS name resolution for Azure resources that have private endpoint connections. This is key to enabling hybrid connections from on-prem and is covered in more detail later.
Please review the below articles for more information.
- Azure Private Endpoint DNS Configuration for on-premises workloads.
- Data exfiltration protection for Azure Synapse Analytics workspaces – if configuring for Synapse.
- Review guidance on network security for Azure Synapse Analytics.
Managed Virtual Network in action
An example Managed Virtual Network is depicted below to illustrate a few key components and flows.
There are three main flows:
- Connectivity from On-prem to Azure Synapse Workspace.
Data Analysts and Data Engineering teams who are creating pipelines and notebooks connect to Synapse Studio, Serverless and Dedicated pools and securely access data stored on ADLS Gen2 and other Azure Data sources.
- Connectivity from Azure Synapse workspace to Azure Data lake Gen2.
Azure Synapse SQL Engines and Spark pools connect to ADLS Gen2 for data exploration and data processing.
- Connectivity from Azure Synapse Workspace to On-prem data sources
Azure Synapse Pipelines and Mapping Dataflows need to connect to on-prem data sources such as SAP, file servers, Oracle and SQL Server databases for ETL.
These scenarios can get complicated for a very large team spread across different geolocations, accessing many environments (Dev, QA, Prod, etc.) and connecting to many data lakes (possibly tens or even hundreds). For them to work seamlessly, a few infrastructure components are required: on-prem DNS, a DNS server in Azure and Private DNS.
Before the above flows can be achieved, the following infrastructure components must be in place.
- ExpressRoute or site-to-site VPN connection established between on-premises and the Azure Virtual Network gateway.
- Private Link and private endpoint resources created. The Synapse workspace owner initiates a private link to the target resource, for example a storage account, and the owner of that resource must approve the connection. Once approved, the private endpoint is created.
- DNS servers configured and A record entries created. Three DNS servers are required, as shown in the diagram above. Individual A records are sufficient if there are only a few Azure resources.
- On-prem DNS
This uses conditional forwarding for the Azure resource domains to custom DNS in Azure.
DNS conditional forwarding is needed when tens or hundreds of Azure resource names have to be resolved. Instead of configuring an A record for each resource, a conditional forwarder per Azure resource domain sends all matching lookups to the custom DNS. 10.15.0.4 is the private IP address of the custom DNS in the Azure VNet.
- Custom DNS or Proxy DNS
This is a custom DNS server (hosted in a VM) provisioned in the Azure VNet that forwards DNS lookup queries to Azure DNS (168.63.129.16). This is needed because Azure DNS can only resolve DNS lookup queries that originate from VNets linked to the private zones, so on-prem queries have to be relayed from inside such a VNet.
- Private DNS
A private DNS zone is created for each Azure resource type. This is needed for Azure DNS to resolve Azure domain names to private IP addresses.
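To make the three-tier lookup concrete, here is a minimal Python sketch that simulates the chain for a single name. It is an illustration only and not how any of the DNS servers are actually configured: the forwarded zones, the `privatelink.` zone names, the `contosolake` and `contosows` hosts and their private IP addresses are assumptions chosen for the example. Only 10.15.0.4 comes from the setup described above.

```python
from typing import Dict, Mapping, Optional

CUSTOM_DNS_IP = "10.15.0.4"  # custom DNS VM in the Azure VNet (from the text)

# On-prem DNS: conditional forwarders, public zone suffix -> forwarder IP.
# The zone list is an assumption for this example.
ONPREM_FORWARDERS: Dict[str, str] = {
    "dfs.core.windows.net": CUSTOM_DNS_IP,
    "sql.azuresynapse.net": CUSTOM_DNS_IP,
    "dev.azuresynapse.net": CUSTOM_DNS_IP,
}

# Private DNS: one zone per resource type, host label -> private IP.
# Hosts and addresses are hypothetical.
PRIVATE_ZONES: Dict[str, Dict[str, str]] = {
    "privatelink.dfs.core.windows.net": {"contosolake": "10.15.1.5"},
    "privatelink.sql.azuresynapse.net": {"contosows": "10.15.1.6"},
}


def match_zone(fqdn: str, zones: Mapping[str, object]) -> Optional[str]:
    """Return the longest zone suffix that contains fqdn, or None."""
    name = fqdn.rstrip(".").lower()
    hits = [z for z in zones if name == z or name.endswith("." + z)]
    return max(hits, key=len) if hits else None


def resolve_from_onprem(fqdn: str) -> Optional[str]:
    """Walk the on-prem -> custom DNS -> Azure DNS chain for one name."""
    name = fqdn.rstrip(".").lower()
    zone = match_zone(name, ONPREM_FORWARDERS)
    if zone is None:
        return None  # no forwarder: on-prem DNS would resolve it as usual
    # The custom DNS at ONPREM_FORWARDERS[zone] relays the query to Azure DNS,
    # which follows the CNAME into the matching privatelink zone.
    host = name[: -(len(zone) + 1)]
    return PRIVATE_ZONES.get("privatelink." + zone, {}).get(host)
```

With these sample records, `resolve_from_onprem("contosolake.dfs.core.windows.net")` returns the hypothetical private address `10.15.1.5`, while a name outside the forwarded zones returns `None` and would follow the normal on-prem resolution path.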
Hybrid Connectivity end to end
Connectivity from On-prem to Azure Synapse Workspace
With the above infrastructure components and configuration in place, on-prem workstations and developer machines can resolve Azure Synapse Studio, SQL domain names and data lake endpoints to private IP addresses.
As you can see above, when an on-prem machine connects to Synapse Studio and the user navigates to the data lake, both the Studio and ADLS Gen2 domain names are resolved to private IP addresses.
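One way to verify this from an on-prem workstation is to resolve a name and check that every returned address falls in a private range. The sketch below uses only the Python standard library. The function names are ours, and the `resolver` parameter exists purely so the check can be exercised without network access. Treat it as a rough diagnostic rather than a definitive test, since it only inspects IPv4 results.

```python
import ipaddress
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def resolves_privately(
    hostname: str, resolver: Callable[[str], List[str]] = resolve_ipv4
) -> bool:
    """True if the name resolves and every address is in a private range."""
    addresses = resolver(hostname)
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

Calling `resolves_privately("<your-storage-account>.dfs.core.windows.net")` from a machine behind the on-prem DNS should return `True` once conditional forwarding and the private DNS zones are in place; a `False` result suggests the name is still resolving to a public endpoint.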
Connectivity from Azure Synapse workspace to Azure Data lake Gen2
With Managed Private Endpoints in the Synapse workspace, traffic between the Synapse compute engines (SQL Serverless, Dedicated and Spark pools) and the data lake traverses the Microsoft backbone to reach the private endpoints. In the example below, a notebook running on a Spark pool connects to a data lake that has a private endpoint.
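From the notebook's point of view nothing changes: it still addresses the data lake by its regular `dfs.core.windows.net` name, and the managed private endpoint determines the network path. As a small sketch, the helper below builds the `abfss://` URI that a Spark read would use. The helper name and the account, container and path values are hypothetical.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build an ADLS Gen2 URI in the abfss://container@account form."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


# Hypothetical account, container and folder names.
uri = abfss_uri("contosolake", "raw", "/sales/2024")

# In a Synapse notebook, where a `spark` session is already available,
# the URI could then be read with, for example:
#   df = spark.read.parquet(uri)
```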
Connectivity from Azure Synapse Workspace to On-prem data sources
With Managed Virtual Network, the pipeline integration runtime is provisioned in the Managed Virtual Network and is managed (patching, NSGs, firewall) by the Azure platform. In the connection tests below, the Synapse pipeline connects to an on-prem file server using a self-hosted integration runtime (SHIR) that is hosted on-premises, and to the data lake private endpoint using the Auto Resolve Integration Runtime, which is hosted in the Managed Virtual Network.
Connection to an On-prem file server with self hosted integration runtime
Connection to Data Lake with Auto resolve integration runtime
Thus, with Managed Virtual Networks and Private Endpoints, hybrid data management and data security are further simplified, enabling a seamless data estate spanning on-premises and Azure.