Handling DNS Propagation Delays in Terraform

Introduction
Azure Private Endpoints have a sneaky timing problem in Terraform. The endpoint exists, the DNS zone appears updated, and Terraform moves on as soon as the resource is created - but your system may still resolve the old address, or not resolve the name at all! That short DNS propagation delay can silently break automation.
For example, when I deployed an Azure Key Vault with public access disabled, Terraform immediately tried to create a secret (azurerm_key_vault_secret) inside it. The vault hostname still resolved to the public address, so the request landed on the wrong endpoint and failed with a 403 Forbidden.
Sure, re-running the apply command or pipeline worked… but that was just duct tape. I wanted something smarter. My search led me deep into the wilderness of AzureRM provider internals and local-exec provisioners - which, surprise surprise, turned out to be just a fancier kind of duct tape that suited my needs.
Alternatives
Before I went all in on provisioners, I tested a couple of “cleaner” options.
time_sleep
There is a time provider that lets you wait between steps.
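resource "time_sleep" "wait_for_dns" {
  # Start the timer only after the private endpoint exists
  # (assumes it is declared as azurerm_private_endpoint.this).
  depends_on      = [azurerm_private_endpoint.this]
  create_duration = "120s"
}

resource "azurerm_key_vault_secret" "this" {
  name         = "example-secret"
  value        = "example-value"
  key_vault_id = azurerm_key_vault.this.id
  depends_on   = [time_sleep.wait_for_dns]
}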
This works if you always know how long DNS will take. The problem is… you don’t. Sometimes it is 20 seconds, sometimes 3 minutes. Hardcoding timeouts felt flaky and slow, because you are always waiting for the maximum time.
AzAPI with retry
You can also use the AzAPI provider, which has a configurable retry mechanism: you can tell it to retry on error messages related to public access being disabled. Unfortunately, it takes real effort to work out which resource type and provider resource can test whether the data plane is reachable over private access, and which errors to retry on.
During my testing it took a lot of effort even for a single resource, so it is not a generic approach. Still, it is probably the cleanest one, because you stay within a native Terraform provider talking to the Azure API, and I advise you to try this before going for other solutions.
AzAPI and Key Vault Secrets
There is a caveat with Key Vault, which Microsoft is aware of and considers fine: you can actually create or update a secret through the management plane, bypassing firewall rules and data-plane RBAC. So anyone with roles like Owner or Contributor can update a secret from the public internet, even on a Key Vault with public access disabled. Imagine someone changing the connection string for your application, redirecting user data to a remote database.
So, to create a secret in a Key Vault with public access disabled, this is all you need, and you don’t have to wait for the private endpoint DNS to propagate… 🥺
resource "azapi_resource" "create_secret" {
  type      = "Microsoft.KeyVault/vaults/secrets@2023-02-01"
  name      = "example-secret"
  parent_id = azurerm_key_vault.this.id
  body = {
    properties = {
      value = "example-value"
    }
  }
}
Pipeline retry
Another trick is to catch the error and retry the failing step in your CI/CD pipeline (Azure DevOps, GitHub Actions, etc.). That is easy to do, but it spreads the logic outside Terraform. I did not like the idea of having half the workaround in Terraform and half in the pipeline.
Provider
You could also solve this with a custom Terraform provider, which would be the cleanest and most generic cross-platform approach. A provider could handle DNS resolution natively in Go, retry until the record points to the expected IP, and expose the result as just another Terraform dependency. The trade-off is effort vs. payoff.
For my use case, a self-contained module was the pragmatic sweet spot. But if you’re building something reusable for your org or the community, a provider might be worth the investment. This provider might be a good starting point: bendrucker/dns-validation
Design goals
Provisioners are often considered a “last resort” - they are imperative, fragile, and not really “Terraform.” And all of that is true. Everything depends on how good your script is and how your system handles it, unlike with providers.
For example, this solution will fail on macOS: I didn’t implement support for it, and getent, which is a Linux command, will probably fail there.
Before writing scripts, I set some constraints to avoid common problems with local-exec provisioners:
- Must run on both Linux and Windows build agents/laptops.
- Reusable - should be a Terraform module, not copy-paste snippets.
- No extra dependencies - should work with commands already available in CI/CD environments regardless of distro and PowerShell version…
- Works everywhere - Azure, AWS, on-premises, etc.
Solution
The idea is simple: wrap everything in a null_resource that runs a script, confirm that DNS resolves to the Private Endpoint IP, and only then let Terraform continue.
resource "null_resource" "this" {
  triggers = var.triggers

  provisioner "local-exec" {
    interpreter = local.is_linux ? [] : ["PowerShell", "-Command", ""]
    command     = local.is_linux ? replace(local.linux_command, "\r\n", "\n") : local.windows_command
  }
}
- triggers are evaluated on each apply, so the null_resource reruns when its inputs change.
- The interpreter switches between Linux (plain shell, which is the default behaviour, so the array is empty) and Windows (PowerShell).
- By isolating this in one place instead of using local-exec directly in azurerm resources, I could wrap it into a Terraform module that anyone can use.
Detecting the OS
The first problem: Terraform does not expose the operating system directly. But there is a neat trick:
On Linux, paths look like /home/runner/… On Windows, paths start with a drive letter like C:/…
So you can check if the root path starts with a drive letter. If it does not, you are on Linux:
locals {
  is_linux = length(regexall("^[a-zA-Z]:", abspath(path.root))) == 0
}
This boolean decides which script to run.
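To see how the check behaves, here is a minimal sketch that applies the same expression to a few hard-coded sample paths instead of abspath(path.root). The sample paths and the local and output names are made up for illustration.

```hcl
locals {
  # Hypothetical sample values standing in for abspath(path.root).
  sample_roots = {
    linux   = "/home/runner/work/infra"
    windows = "C:/agent/_work/1/s"
    macos   = "/Users/dev/infra"
  }

  # Same test as is_linux: no leading drive letter means "not Windows".
  detected_as_linux = {
    for os, root in local.sample_roots :
    os => length(regexall("^[a-zA-Z]:", root)) == 0
  }
}

output "detected_as_linux" {
  value = local.detected_as_linux
}
```

The Linux and macOS paths evaluate to true and the Windows path to false. In other words, the check really means "not Windows", so a macOS machine is sent down the Linux branch even though that script depends on getent.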
Linux logic
On Linux, I used getent because it’s pretty universal across multiple distros. The script basically says:
- Resolve the hostname.
- If it matches the expected IP, confirm it multiple times in a row (important if you’re hitting multiple DNS servers).
- If it does not match the expected IP, sleep a bit.
- Exit once it’s stable or the timeout is hit.
Here’s the script:
linux_command = <<-EOT
  timeout_seconds=${var.timeout_seconds}
  sleep_seconds=${var.sleep_seconds}
  # Poll until the time budget is used up.
  while [ $timeout_seconds -gt 0 ]; do
    ip=$(getent hosts "${local.dns_name}" | awk '{ print $1 }')
    timeout_seconds=$((timeout_seconds - sleep_seconds))
    if [ "$ip" = "${var.ip_address}" ]; then
      # First match: re-check several times in a row before trusting it.
      for attempt in $(seq 1 ${var.successful_attempts}); do
        ip=$(getent hosts "${local.dns_name}" | awk '{ print $1 }')
        if [ "$ip" != "${var.ip_address}" ]; then
          # A stale answer came back, return to the outer polling loop.
          break
        fi
        if [ $attempt -eq ${var.successful_attempts} ]; then
          # Stable result: wait one more interval, then report success.
          sleep $sleep_seconds
          exit 0
        fi
        sleep 1
      done
    fi
    sleep $sleep_seconds
  done
  # Timed out without a stable match.
  exit 1
EOT
It’s not bulletproof (with multiple DNS servers you can still hit a mix of responses), but it was good enough without installing extra tools to query each DNS server individually.
Windows logic
On Windows, the main headache was DNS caching. By default, PowerShell will just happily resolve from cache, so you keep hitting stale results.
The trick was to use Clear-DnsClientCache before every lookup. Unlike ipconfig /flushdns, this doesn’t require elevated rights.
windows_command = <<-EOT
  $timeoutSeconds = ${var.timeout_seconds}
  $sleepSeconds = ${var.sleep_seconds}
  Do {
    # Drop cached answers so every lookup goes to the DNS server.
    Clear-DnsClientCache
    $ip = ((Resolve-DNSName -Name "${local.dns_name}" -DnsOnly) | Where-Object {$_.Type -eq "A"}).IPAddress
    [int]$timeoutSeconds = [int]$timeoutSeconds - [int]$sleepSeconds
    if ($ip -eq "${var.ip_address}") {
      # First match: re-check several times in a row before trusting it.
      for ($attempt = 1; $attempt -le ${var.successful_attempts}; $attempt++) {
        Clear-DnsClientCache
        $ip = ((Resolve-DNSName -Name "${local.dns_name}" -DnsOnly) | Where-Object {$_.Type -eq "A"}).IPAddress
        if ($ip -ne "${var.ip_address}") {
          break
        }
        if ($attempt -eq ${var.successful_attempts}) {
          # Stable result: wait one more interval, then report success.
          Start-Sleep -Seconds $sleepSeconds
          exit 0
        }
        Start-Sleep -Seconds 1
      }
    }
    Start-Sleep -Seconds $sleepSeconds
  } Until (0 -gt $timeoutSeconds)
  # Timed out without a stable match.
  exit 1
EOT
Variables
To keep it flexible, I exposed the basics as variables:
- dns_name → the hostname to resolve.
- ip_address → the expected Private Endpoint IP.
- timeout_seconds → how long to wait.
- sleep_seconds → how long to wait between retries.
- successful_attempts → how many consecutive good results before success.
That way I can tweak it per resource.
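As a rough idea of what the module's variable declarations could look like, here is a minimal sketch. The variable names come from the list above and from the null_resource; the default values, descriptions and validation rules are assumptions for illustration, not the module's actual settings.

```hcl
variable "dns_name" {
  type        = string
  description = "Hostname or URI to resolve."
}

variable "ip_address" {
  type        = string
  description = "Expected Private Endpoint IP."
}

variable "timeout_seconds" {
  type        = number
  default     = 300 # assumed default
  description = "How long to wait before giving up."
}

variable "sleep_seconds" {
  type        = number
  default     = 10 # assumed default
  description = "Pause between retries."

  validation {
    condition     = var.sleep_seconds > 0
    error_message = "sleep_seconds must be greater than 0."
  }
}

variable "successful_attempts" {
  type        = number
  default     = 3 # assumed default
  description = "Consecutive matching lookups required for success."

  validation {
    condition     = var.successful_attempts >= 1
    error_message = "successful_attempts must be at least 1."
  }
}

variable "triggers" {
  type        = map(string)
  default     = {}
  description = "Values that rerun the check when they change."
}
```

The two validation blocks guard against inputs the scripts cannot handle. Both scripts subtract sleep_seconds from the remaining timeout on every pass, so a value of 0 would loop forever, and with successful_attempts set to 0 the confirmation loop never runs, so the check could only time out.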
Dealing with outputs from resources
Unfortunately, resource outputs are inconsistent: sometimes you get a bare hostname, sometimes a URL with a trailing slash, and sometimes a URL without one.
To simplify usage of the module, I added a regex to the DNS name. It trims all unnecessary elements like protocols, slashes and ports and leaves only the hostname, which is what DNS resolution needs, so I can pass a hostname or URI without cleaning it up myself.
locals {
  dns_name = regex("^(?:https?://)?(?:[^@\\n]+@)?([^:/\\n]+)", trimspace(var.dns_name))[0]
}
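A quick way to check the pattern is to run it over a few input shapes. This is a minimal sketch: the sample hostnames and the local and output names are invented for illustration, while the pattern itself is the one shown above.

```hcl
locals {
  hostname_pattern = "^(?:https?://)?(?:[^@\\n]+@)?([^:/\\n]+)"

  # Hypothetical inputs in the shapes resource outputs tend to have.
  sample_inputs = [
    "https://kv-demo.vault.azure.net/",
    "https://kv-demo.vault.azure.net",
    "kv-demo.vault.azure.net:443",
    "  kv-demo.vault.azure.net  ",
  ]

  # regex() returns a list of capture groups, [0] picks the hostname.
  normalized = [
    for value in local.sample_inputs :
    regex(local.hostname_pattern, trimspace(value))[0]
  ]
}

output "normalized" {
  value = local.normalized
}
```

All four inputs normalize to kv-demo.vault.azure.net. The optional groups strip the protocol and any user-info prefix, and the capture stops at the first colon or slash.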
After wrapping everything into a module, the usage looks like this:
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}

data "azurerm_client_config" "current" {}

resource "random_string" "this" {
  length  = 8
  special = false
  upper   = false
}

resource "azurerm_resource_group" "this" {
  name     = "rg-demo-${random_string.this.result}"
  location = "polandcentral"
}

resource "azurerm_key_vault" "this" {
  name                          = "kv-demo-${random_string.this.result}"
  location                      = "polandcentral"
  resource_group_name           = azurerm_resource_group.this.name
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "standard"
  purge_protection_enabled      = false
  soft_delete_retention_days    = 7
  public_network_access_enabled = false
  rbac_authorization_enabled    = true

  network_acls {
    default_action = "Deny"
    bypass         = "None"
  }
}

resource "azurerm_private_endpoint" "this" {
  name                = "pe-demo-${random_string.this.result}"
  location            = "polandcentral"
  resource_group_name = "dns-demo-prd"
  subnet_id           = var.subnet_id

  private_service_connection {
    name                           = "psc-demo-${random_string.this.result}"
    is_manual_connection           = false
    private_connection_resource_id = azurerm_key_vault.this.id
    subresource_names              = ["vault"]
  }

  private_dns_zone_group {
    name                 = "demo-dns-zone-group-${random_string.this.result}"
    private_dns_zone_ids = [var.private_dns_zone_id]
  }
}

module "wait_for_dns" {
  source = "../wait_for_dns"

  triggers = {
    ip = azurerm_private_endpoint.this.private_service_connection[0].private_ip_address
  }

  dns_name   = azurerm_key_vault.this.vault_uri
  ip_address = azurerm_private_endpoint.this.private_service_connection[0].private_ip_address
}

resource "azurerm_key_vault_secret" "this" {
  name         = "example-secret"
  value        = "example-value"
  key_vault_id = azurerm_key_vault.this.id
  depends_on   = [module.wait_for_dns]
}
As you can see, I am passing dns_name, which is the hostname to be resolved, ip_address, which is the target IP, and triggers, which tells Terraform when to run the null_resource provisioner again - in this case, when the IP address of the private endpoint changes.
After that, simply add depends_on with the module reference to all dependent resources.
DNS Is Not Enough
When creating a resource with a private endpoint using the AzureRM provider in Terraform, simply waiting for DNS to resolve to the private IP is not sufficient.
The AzureRM provider uses Go’s default HTTP client, which caches connections in a pool keyed by the target address (domain name or IP). During resource creation, the client may initially resolve the resource FQDN to its public IP because the private endpoint is not yet ready. Later, when the provider attempts to create a secret or perform other operations, it may reuse an existing idle connection from the pool - still pointing to the public IP. This can result in 403 errors, even though DNS now correctly resolves to the private endpoint.
PS C:\Users\piotr> Get-NetTCPConnection
LocalAddress   LocalPort  RemoteAddress  RemotePort  State        AppliedSetting  OwningProcess
------------   ---------  -------------  ----------  -----        --------------  -------------
192.168.50.11  52270      20.215.26.76   443         Established  Internet        35728
192.168.50.11  52269      20.215.26.76   443         Established  Internet        35728
# 20.215.26.76 is the Key Vault public address. These connections were opened during Key Vault provisioning.
The core issue here is connection reuse, not DNS resolution. Idle connections in Go’s HTTP client persist for 90 seconds by default (so if your DNS propagation takes less than 90 seconds, you will likely hit this issue) and do not automatically re-resolve the hostname. As a result, the provider continues using the old connection pointing to the public IP.
Depending on which resources and providers you are using, you may need to account for this by increasing the module’s wait time so Go can dispose of the old connections before Terraform proceeds. You could also try to detect whether the connection still exists, but there is no proper way to kill it.
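One way to add that extra wait without touching the module is a time_sleep between the DNS check and the first data-plane call. This is only a sketch under assumptions: the resource name and the 100-second value (the 90-second idle timeout plus an arbitrary margin) are illustrative, and it helps only if the provider leaves the pooled connection untouched during the pause.

```hcl
# Hypothetical extra pause so idle connections to the public IP can expire.
resource "time_sleep" "idle_connection_drain" {
  depends_on = [module.wait_for_dns]

  # 90s default idle timeout plus an arbitrary 10s margin (assumption).
  create_duration = "100s"

  # Wait again whenever the private endpoint IP changes.
  triggers = {
    ip = azurerm_private_endpoint.this.private_service_connection[0].private_ip_address
  }
}

resource "azurerm_key_vault_secret" "this" {
  name         = "example-secret"
  value        = "example-value"
  key_vault_id = azurerm_key_vault.this.id
  depends_on   = [time_sleep.idle_connection_drain]
}
```

The cost is the same as with any fixed sleep: every run that creates or replaces the endpoint pays the full delay, even when no stale connection exists.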
Unfortunately, fixing this at the provider level would require a major change to the AzureRM provider. For now, updates are being implemented on a per-resource basis, as shown in this pull request: PR #30352
Summary
Provisioners are duct tape, but sometimes duct tape is what keeps your pipeline moving. This cross-platform DNS-wait module saved me from flaky runs - and might save you too. Just don’t forget to look for cleaner options before reaching for it.
Let me know if you encounter similar issues with DNS propagation delays. I’m considering making a public provider for this workflow.
You can find example code here: terraform-example