High availability (HA) and disaster recovery (DR)
HCP Vault Dedicated's highly available, single-tenant data plane architecture enables HCP-managed Vault Enterprise clusters to remain operational independent of the HCP Control Plane. This architecture is designed to maximize the availability of clusters managed by HCP.
The sections below provide additional information about the high availability and disaster recovery posture of HCP Vault Dedicated:
- 3-Node Highly Available Clusters
- Data Resiliency
- Encryption Key Ownership
- Platform Outages
- Support Coverage
- Incident Response
- Disaster Recovery Use Cases Not in Scope
3-node HA clusters
All production-tier HCP Vault Dedicated clusters (i.e. Starter, Standard, and Plus) consist of 3 highly available Vault nodes leveraging Vault Enterprise's performance standby capability. HCP has robust monitoring in place to regularly check the health of the cluster and recovery mechanisms to restore cluster availability in the event of a disruption.
Data resiliency
All HCP Vault Dedicated nodes have attached encrypted volumes. Automated snapshots are taken daily for production-grade clusters and stored in encrypted blob storage in the control plane. Users can initiate additional snapshots on demand from the UI. Snapshots currently reside within the US only.
Encryption key ownership
A unique Key Management Service (KMS) cryptographic key is used for automatic unsealing of HCP Vault Dedicated clusters and encrypting all cluster snapshots. This key is managed in the organization's dedicated HCP tenant using the cloud provider’s KMS and is configured to be trusted by the HCP Vault Dedicated compute instances. This key is managed using carefully crafted, secure policies and all usage is audited. The key is not shared between clusters or tenants.
Platform outages
HCP platform outages do not impact the availability of running clusters. The HCP API and UI may be affected, but already running clusters will remain operational and serve client requests. During a platform outage, cluster management operations such as snapshots, API lock, and admin token generation will be unavailable.
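To confirm that a running cluster is still serving requests while the HCP API or UI is degraded, you can query the cluster directly rather than going through the platform. The following is a minimal sketch, assuming the cluster's public or private address is reachable from where the script runs. It interprets the default status codes of Vault's `/v1/sys/health` endpoint. The helper names (`describe_health`, `handles_client_requests`, `probe`) are hypothetical and not part of any HashiCorp tooling.

```python
import urllib.error
import urllib.request

# Default HTTP status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "initialized, unsealed, and active",
    429: "unsealed and standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}

# Assumption for this sketch: active, standby, and performance standby nodes
# are treated as able to handle (or forward) client requests.
SERVING_CODES = {200, 429, 473}


def describe_health(status_code: int) -> str:
    """Translate a sys/health status code into a readable state."""
    return HEALTH_STATES.get(status_code, "unknown")


def handles_client_requests(status_code: int) -> bool:
    """Return True if the node state indicates it can take client traffic."""
    return status_code in SERVING_CODES


def probe(cluster_addr: str, timeout: float = 5.0) -> int:
    """Query sys/health and return the raw HTTP status code."""
    url = cluster_addr.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx codes (429, 473, 503, ...) carry the node state.
        return err.code
```

For example, `describe_health(probe("https://<your-cluster-address>:8200"))` reports the state of whichever node answers the request; the address shown is a placeholder.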
Recommended practice for Vault administrators
The admin token generated in the HCP Portal should be used for initial configuration or emergency access only.
You can mitigate the risk of not being able to generate admin tokens during a platform outage by setting up appropriate authentication within Vault to generate tokens that provide the necessary administrative access.
Refer to the authentication method documentation for more information on configuring Vault auth methods.
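As an illustration of the recommendation above, the sketch below builds the three Vault HTTP API requests that would enable the userpass auth method, write an ACL policy, and create a user bound to that policy. It is a sketch under stated assumptions, not a recommended production setup: the policy body, the `outage-admin` policy name, and the helper functions are hypothetical, and the `admin` namespace is assumed to be the one your administrators operate in. Run it ahead of time with an admin token, while the HCP Portal is still available.

```python
import json
import urllib.request

# Hypothetical placeholder policy; scope it to what your administrators need.
EXAMPLE_POLICY_HCL = '''
path "sys/auth/*" {
  capabilities = ["create", "read", "update", "delete", "sudo"]
}
'''


def plan_break_glass_auth(username, password, policy_name="outage-admin"):
    """Return (method, path, body) tuples for the Vault HTTP API calls."""
    return [
        # Enable the userpass auth method at the path "userpass".
        ("POST", "/v1/sys/auth/userpass", {"type": "userpass"}),
        # Create or update the ACL policy the user will receive.
        ("PUT", f"/v1/sys/policies/acl/{policy_name}",
         {"policy": EXAMPLE_POLICY_HCL}),
        # Create the user and attach the policy to tokens it logs in with.
        ("POST", f"/v1/auth/userpass/users/{username}",
         {"password": password, "token_policies": [policy_name]}),
    ]


def to_request(cluster_addr, admin_token, method, path, body,
               namespace="admin"):
    """Wrap one planned call in a urllib Request (nothing is sent here)."""
    return urllib.request.Request(
        url=cluster_addr.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "X-Vault-Token": admin_token,
            "X-Vault-Namespace": namespace,
            "Content-Type": "application/json",
        },
    )
```

Each request can be sent with `urllib.request.urlopen`. During a platform outage, the user can then obtain a token by posting the password to `/v1/auth/userpass/login/<username>` on the cluster itself, without relying on the HCP Portal.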
Cluster deletion and snapshot restore
Once a cluster is deleted, all affiliated resources (including audit logs) are deleted with it, except snapshots, which are retained for 30 days after deletion of a cluster. Snapshots are only retained for Starter, Standard, and Plus tiers. Snapshots can be used to recover a deleted cluster, including restoring to a different region. To request a restore from a snapshot, file a support ticket.
Warning
Once a cluster is deleted in Azure, all affiliated resources are deleted with it, including audit logs and snapshots. It is currently not possible to restore a deleted cluster that was hosted in Azure.
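The retention rules in this section can be summarized as a small check. This is an illustrative sketch only: the function names are hypothetical, the tier and provider strings are assumed spellings, and treating the 30-day boundary as inclusive is an assumption rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Snapshots are retained for 30 days after a cluster is deleted.
RETENTION = timedelta(days=30)

# Tiers for which snapshots are retained after deletion.
TIERS_WITH_RETAINED_SNAPSHOTS = {"starter", "standard", "plus"}


def restore_deadline(deleted_at: datetime) -> datetime:
    """Last moment at which a retained snapshot is expected to exist."""
    return deleted_at + RETENTION


def snapshot_restorable(tier: str, provider: str,
                        deleted_at: datetime, now: datetime) -> bool:
    """Apply the deletion rules described above to one deleted cluster."""
    if provider.lower() == "azure":
        # Snapshots are deleted together with Azure-hosted clusters.
        return False
    if tier.lower() not in TIERS_WITH_RETAINED_SNAPSHOTS:
        return False
    # Assumption: the deadline itself still counts as within retention.
    return now <= restore_deadline(deleted_at)


deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(restore_deadline(deleted).date())  # 2024-01-31
```

A restore still requires a support ticket; the check only indicates whether a snapshot should still exist when the ticket is filed.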
Support coverage
Clusters are monitored 24/7 with on-call staff available to debug production cluster issues. All production-grade clusters are coupled with either Silver or Gold level support.
Incident response
Incident response times are stipulated in the support agreement. HashiCorp will use commercially reasonable efforts to maximize the availability of HashiCorp Cloud Platform services and provides uptime guarantees based on service level agreements (SLAs). Audit logs include key metrics that capture activity and performance. A full list of metrics is available in the documentation.
Disaster recovery use cases not in scope
While HCP Vault Dedicated covers a large amount of DR functionality through HA, we currently do not support cross-region failover or cross-region disaster recovery replication. Production-grade clusters are isolated to three nodes within the same region. If a cluster goes down due to an outage in a region, it cannot be automatically failed over to another region. DR replication is not available at this time.