Version: 2.0 (Production Ready)
Author: Sumon | Senior DevOps & Cloud Infrastructure Engineer
Platform: OpenStack (Kolla-Ansible Deployment)
Language: Simple English with Real-World Context
Focus: Practical Commands, Real Scenarios, No Theory Fluff
💡 Story from the Field: Last month, I deployed OpenStack using Kolla-Ansible for a client who needed to host 50+ microservices. The installation was smooth, but the real challenge started after: "How do I connect VMs to the internet?", "Why can't my developers create networks?", "How do I ensure database VMs get enough RAM?". This guide is born from those real conversations. Every command here has been tested in my lab and production environments. Let's walk through this together, step by step.
- Environment Setup and Admin Access
- Provider Network Configuration
- Self-Service Network Configuration
- Flavor Engineering and Resource Management
- Image Management with Glance
- Security Groups and KeyPair Setup
- Instance Launch and Daily Operations
- Floating IP and External Connectivity
- Block Storage Management with Cinder
- Daily Operations and Troubleshooting
- Backup and Maintenance Procedures
Imagine you just finished a Kolla-Ansible deployment. The terminal shows "Deployment successful!" but now what? You need to talk to OpenStack. This section is your first handshake with the cloud.
- Access to the controller node (SSH or console)
- admin-openrc.sh file generated by Kolla-Ansible
- Basic Linux command line knowledge
- Network connectivity to OpenStack endpoints
Step 1: Source the admin credentials
# Navigate to Kolla config directory (common locations)
cd /etc/kolla/
# OR for multi-node setups, check your deployment node
cd ~/kolla-ansible/
# Load admin environment variables
source admin-openrc.sh
# Verify environment is loaded
echo $OS_USERNAME
echo $OS_AUTH_URL
Step 2: Check official documentation for your version
# Always verify your OpenStack version first
openstack --version
# Cross-check with official docs
# Visit: https://docs.openstack.org/release-notes/
# Search for your version (e.g., "2024.1 Caracal")
# Look for: "Client Compatibility" section
Step 3: Test basic connectivity
# List all available OpenStack services
openstack endpoint list
# Check service health
openstack compute service list
openstack network agent list
openstack volume service list
# Quick health check script (run manually, no automation)
openstack token issue
openstack catalog list
openstack extension list --network
What to look for:
- All services show enabled and :-) status
- No DOWN states in network agents
- Token issue returns valid credentials
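This eyeball check can be scripted. A minimal sketch that counts down agents from `openstack network agent list -f value` style output — the `count_down_agents` helper is mine, not an OpenStack tool, and the sample output is inlined where the real CLI call would go:

```shell
# count_down_agents: count lines containing a DOWN/false state word in
# `openstack network agent list -f value` style output. Sketch only -
# column layout can differ between releases, so treat it as a hint.
count_down_agents() {
  awk 'tolower($0) ~ /(^| )(down|false)( |$)/ { n++ } END { print n+0 }'
}

# Demo on canned output; in real use pipe the CLI in instead:
#   openstack network agent list -f value | count_down_agents
sample='agent-1 DHCP agent ctl0 nova :-) UP neutron-dhcp-agent
agent-2 L3 agent net0 nova XXX DOWN neutron-l3-agent'
printf '%s\n' "$sample" | count_down_agents   # prints 1
```

Anything above zero means a closer look at `openstack network agent list` before you touch anything else.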
Morning check routine:
# 1. Source environment (always first step)
source /etc/kolla/admin-openrc.sh
# 2. Quick service health
openstack compute service list --long
# 3. Check resource usage
openstack hypervisor stats show
openstack hypervisor show <hypervisor-name> # includes uptime on recent versions
# 4. Review recent events
openstack server list --all-projects --limit 10
openstack volume list --all-projects --limit 10
When adding new admin users:
# Create new admin user
openstack user create --domain default --password-prompt new-admin
# Create admin project (if needed)
openstack project create --domain default --description "Admin Operations" admin-ops
# Assign admin role
openstack role add --project admin-ops --user new-admin admin
# Generate openrc file for new user
echo "export OS_USERNAME=new-admin" > new-admin-openrc.sh
echo "export OS_PASSWORD='their-password'" >> new-admin-openrc.sh
echo "export OS_PROJECT_NAME=admin-ops" >> new-admin-openrc.sh
echo "export OS_USER_DOMAIN_NAME=Default" >> new-admin-openrc.sh
echo "export OS_PROJECT_DOMAIN_NAME=Default" >> new-admin-openrc.sh
echo "export OS_AUTH_URL=http://<controller>:5000/v3" >> new-admin-openrc.sh
echo "export OS_IDENTITY_API_VERSION=3" >> new-admin-openrc.sh
Rotate admin password (security best practice):
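Hand-editing those echo lines for every user invites drift, both when adding users and when rotating credentials. A small generator keeps openrc files consistent — `write_openrc` is a hypothetical helper of my own, and it deliberately prompts for the password instead of embedding it:

```shell
# write_openrc: emit a Keystone v3 openrc file for a user/project pair.
# Sketch - the function and its no-stored-password convention are this
# guide's invention, not an OpenStack tool. Adjust the auth URL to your
# controller.
write_openrc() {  # usage: write_openrc <user> <project> <auth-url> <outfile>
  local user=$1 project=$2 auth=$3 out=$4
  cat > "$out" <<EOF
export OS_USERNAME=$user
export OS_PROJECT_NAME=$project
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default
export OS_AUTH_URL=$auth
export OS_IDENTITY_API_VERSION=3
# Password deliberately not stored; run: read -s -p "Password: " OS_PASSWORD
# and export OS_PASSWORD before using this file.
EOF
}

# Example:
write_openrc new-admin admin-ops http://controller:5000/v3 new-admin-openrc.sh
```

After a password rotation, the same call regenerates the file; only the out-of-band password changes.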
# Change password via CLI
openstack user set --password-prompt admin
# Update all openrc files with new password
# (Manual step - distribute securely to team)
# Verify new credentials work
source admin-openrc.sh
openstack token issue
Backup environment configuration:
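The tar-and-scp steps below can be wrapped with simple local retention so old archives don't pile up. A sketch — `backup_cfg` and the 30-day window are my conventions, not Kolla's, and the off-node copy is left to the caller:

```shell
# backup_cfg: date-stamped tarball of config paths plus local retention.
# Sketch only - the 30-day retention is an assumption; scp the result
# off-node separately.
backup_cfg() {  # usage: backup_cfg <dest-dir> <path>...
  local dest=$1; shift
  local out="$dest/openstack-config-backup-$(date +%Y%m%d).tar.gz"
  mkdir -p "$dest"
  tar -czf "$out" "$@" || return 1
  # Drop local archives older than 30 days
  find "$dest" -name 'openstack-config-backup-*.tar.gz' -mtime +30 -delete
  echo "$out"
}

# Demo with throwaway paths; real use would be something like:
#   backup_cfg /backups /etc/kolla ~/kolla-ansible/inventory
mkdir -p /tmp/demo-cfg && echo "x" > /tmp/demo-cfg/globals.yml
backup_cfg /tmp/demo-backups /tmp/demo-cfg
```

The function prints the archive path, so it composes cleanly with the scp step that follows.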
# Backup critical config files
tar -czf openstack-config-backup-$(date +%Y%m%d).tar.gz \
/etc/kolla/ \
/etc/ansible/ \
~/kolla-ansible/inventory/
# Store backup off-node
scp openstack-config-backup-*.tar.gz backup-server:/openstack-backups/
🛠️ Field Tip: I keep a "first-aid" file on my laptop with common recovery commands. When a 3 AM alert comes, I don't want to search docs. I copy-paste from my tested commands. Build your own cheat sheet.
Your client says: "Our web servers need public IPs directly, no NAT." That's a provider network. It maps VMs directly onto your physical network. Think of it like giving each VM its own phone line instead of a shared extension.
- Physical network interface configured on compute nodes (e.g., eth1, bond0)
- Kolla globals.yml configured with provider_networks
- VLAN or flat network setup on physical switches
- Admin privileges in OpenStack
Before typing any command:
# 1. Identify your OpenStack version
openstack --version
# 2. Visit official Neutron documentation
# URL pattern: https://docs.openstack.org/neutron/<version>/admin/config-routing.html
# Example for Caracal: https://docs.openstack.org/neutron/caracal/admin/config-routing.html
# 3. Search for "provider networks" in docs
# Key sections to read:
# - "Provider network types"
# - "Flat network configuration"
# - "VLAN network configuration"
# 4. Check Kolla-Ansible specific docs
# URL: https://docs.openstack.org/kolla-ansible/latest/
# Search: "provider network configuration"
Step 1: Verify physical network mapping in Kolla
# Check Kolla configuration (on deployment node)
grep -A 10 "provider_networks" /etc/kolla/globals.yml
# Expected output example:
# provider_networks:
# - network_type: "flat"
# interface: "eth1"
# ranges:
# - "192.168.10.100:192.168.10.200"
Step 2: Create provider network (flat type example)
# Source admin environment
source /etc/kolla/admin-openrc.sh
# Create the external provider network
openstack network create \
--share \
--external \
--provider-physical-network provider \
--provider-network-type flat \
public-provider
# Create subnet with allocation pool
openstack subnet create \
--network public-provider \
--allocation-pool start=192.168.10.100,end=192.168.10.200 \
--dns-nameserver 8.8.8.8 \
--dns-nameserver 1.1.1.1 \
--gateway 192.168.10.1 \
--subnet-range 192.168.10.0/24 \
public-provider-subnet
Step 3: For VLAN-based provider network
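Both the flat and VLAN subnets in this section define allocation pools, and typos there (pool outside the subnet, start above end) produce confusing Neutron errors. A pre-flight check in plain bash — `ip2int` and `pool_check` are my helpers, IPv4 only:

```shell
# Pre-flight sanity check for IPv4 allocation pools. Sketch helpers,
# not OpenStack commands: verify the pool sits inside the CIDR and
# print its size before running `openstack subnet create`.
ip2int() {  # dotted quad -> integer
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a<<24) | (b<<16) | (c<<8) | d ))
}

pool_check() {  # usage: pool_check <cidr> <start> <end>; prints size or FAIL
  local cidr=$1 start=$2 end=$3
  local net=${cidr%/*} bits=${cidr#*/}
  local n s e mask
  n=$(ip2int "$net"); s=$(ip2int "$start"); e=$(ip2int "$end")
  mask=$(( 0xFFFFFFFF << (32 - bits) & 0xFFFFFFFF ))
  if [ $(( s & mask )) -ne $(( n & mask )) ] || \
     [ $(( e & mask )) -ne $(( n & mask )) ] || [ "$s" -gt "$e" ]; then
    echo FAIL; return 1
  fi
  echo $(( e - s + 1 ))
}

pool_check 192.168.10.0/24 192.168.10.100 192.168.10.200  # prints 101
```

A FAIL here before the subnet create saves a round of delete-and-retry against the API.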
# Create VLAN provider network (example VLAN 100)
openstack network create \
--share \
--external \
--provider-physical-network provider \
--provider-network-type vlan \
--provider-segment 100 \
public-vlan-100
# Create corresponding subnet
openstack subnet create \
--network public-vlan-100 \
--allocation-pool start=10.100.1.50,end=10.100.1.200 \
--gateway 10.100.1.1 \
--subnet-range 10.100.1.0/24 \
public-vlan-100-subnet
Step-by-step via Horizon:
1. Login to Horizon dashboard (http://<controller-ip>)
2. Navigate: Admin → Network → Networks
3. Click "Create Network"
4. Fill Network tab:
- Name: public-provider
- Admin State: UP ✓
- Create Subnet: ✓ (checked)
- External Network: ✓ (CRITICAL - must check)
- Shared: ✓
- Network Type: flat (or vlan)
- Physical Network: provider (must match Kolla config)
- (If VLAN) Segmentation ID: 100
5. Click Next for Subnet tab:
- Network Address: 192.168.10.0/24
- IP Version: IPv4
- Gateway IP: 192.168.10.1
- Disable Gateway: ✗ (unchecked)
6. Allocation Pools section:
- Add pool: 192.168.10.100,192.168.10.200
7. DNS Nameservers:
- Add: 8.8.8.8
- Add: 1.1.1.1
8. Click "Create"
Immediate verification after creation:
# List networks and check external flag
openstack network list --long | grep public-provider
# Expected: Router: External should show "True"
# Check subnet details
openstack subnet show public-provider-subnet
# Verify network agents are handling this network
openstack network agent list --network public-provider
# Test from compute node (SSH to compute node first)
# Check OVS bridges (if using OVS)
sudo ovs-vsctl show
# Look for br-ex bridge with physical interface
# Check Linux bridge (if using linuxbridge)
brctl show
End-to-end connectivity test:
# 1. Create a test instance on provider network (see Section 7)
# 2. Once instance is running, get its IP
openstack server list
# 3. From your laptop (outside OpenStack), ping the instance
ping <instance-public-ip>
# 4. SSH to instance (if security group allows)
ssh -i your-key.pem ubuntu@<instance-public-ip>
Launch instance directly on provider network:
# Create instance with provider network
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network public-provider \
--key-name my-key \
--security-group web-sg \
web-server-01
# Wait for active state
openstack server show web-server-01 -c status -f value
# Should return: ACTIVE
Assign multiple IPs from provider network:
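That status check is something you end up re-running by hand all day. A generic poller helps; `wait_for_status` is my sketch, and in real use the polled command would be `openstack server show web-server-01 -c status -f value`:

```shell
# wait_for_status: poll a command until it prints the wanted value or
# the timeout (seconds) expires. Sketch helper, not an OpenStack tool.
wait_for_status() {  # usage: wait_for_status <want> <timeout-s> <cmd...>
  local want=$1 timeout=$2; shift 2
  local elapsed=0 got
  while [ "$elapsed" -lt "$timeout" ]; do
    got=$("$@" 2>/dev/null) || true
    [ "$got" = "$want" ] && return 0
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  return 1
}

# Real use (assumption - server name from this guide):
#   wait_for_status ACTIVE 120 openstack server show web-server-01 -c status -f value
wait_for_status ACTIVE 5 echo ACTIVE && echo "server is up"   # prints: server is up
```

Because the command is passed as arguments, the same helper works for volume status, image status, and resize states later in this guide.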
# Create additional port on same network
openstack port create \
--network public-provider \
--fixed-ip subnet=public-provider-subnet,ip-address=192.168.10.105 \
extra-ip-port
# Attach to existing instance
openstack server add port web-server-01 extra-ip-port
Monitor provider network usage:
# Check IP allocation in subnet
openstack subnet show public-provider-subnet -c allocation_pools -f yaml
# List all ports on provider network
openstack port list --network public-provider -c Name -c "Fixed IP Addresses" -c Status
# Find unused allocated IPs (for cleanup)
openstack port list --network public-provider --status DOWN -c Name
Troubleshooting provider network issues:
Scenario: VM created but no network connectivity
# Step 1: Check port status
openstack port list --server web-server-01
# Step 2: Check if port has IP assigned
openstack port show <port-id> -c fixed_ips -f yaml
# Step 3: Check security group rules attached to the port
openstack port show <port-id> -c security_group_ids -f value
openstack security group rule list <sg-id>
# Step 4: Check compute node networking (SSH to compute node)
# For OVS:
sudo ovs-vsctl list-ports br-ex
sudo ovs-ofctl dump-flows br-ex | grep <vm-mac>
# For Linux Bridge (bridges are named brq<network-id>):
sudo brctl show
sudo tcpdump -i <physical-interface> -n icmp # Test ping traffic
# Step 5: Check Neutron agent logs (via Kolla)
docker exec -it neutron_openvswitch_agent tail -100 /var/log/kolla/neutron/neutron-openvswitch-agent.log
Update provider network configuration:
# Add more IPs to allocation pool (requires subnet update)
openstack subnet set \
public-provider-subnet \
--allocation-pool start=192.168.10.100,end=192.168.10.250
# Change DNS servers
# Change DNS servers (clear the old ones first, then add)
openstack subnet set \
public-provider-subnet \
--no-dns-nameservers \
--dns-nameserver 8.8.4.4 \
--dns-nameserver 9.9.9.9
🛠️ Field Tip: Provider networks are powerful but dangerous. One misconfigured VLAN can bring down the physical network. Always test new provider networks with a "canary" VM first. I keep a test project just for network validation before giving access to developers.
Your development team wants to create their own isolated networks for testing. They don't need public IPs for every VM. This is where self-service (tenant) networks shine. Think apartment building: each tenant has private space, but shares one main entrance (router) to the street.
- Provider network already configured (as external gateway)
- Neutron L3 agent running on controller/network node
- DHCP agent enabled for automatic IP assignment
- Project/user with network creation permissions
Pre-configuration research:
# 1. Identify networking plugin (OVS vs LinuxBridge)
grep "mechanism_drivers" /etc/kolla/neutron-server/ml2_conf.ini
# 2. Visit official docs for your plugin
# OVS: https://docs.openstack.org/neutron/<version>/admin/config-ovs.html
# LinuxBridge: https://docs.openstack.org/neutron/<version>/admin/config-linuxbridge.html
# 3. Key concepts to understand from docs:
# - Router architecture (L3 agent)
# - DHCP agent configuration
# - Security group integration
# - VXLAN/GRE overlay networks (if used)
# 4. Check Kolla-Ansible networking guide
# URL: https://docs.openstack.org/kolla-ansible/latest/networking-guide.html
Step 1: Create private self-service network
# Source environment with project credentials (not admin)
source demo-openrc.sh # or your project's openrc file
# Create private network (VXLAN example)
openstack network create private-network
# Create subnet with private range
openstack subnet create \
--network private-network \
--subnet-range 10.0.0.0/24 \
--dns-nameserver 8.8.8.8 \
private-subnet
Step 2: Create router and connect networks
# Create router
openstack router create project-router
# Set external gateway (connects to provider network)
openstack router set project-router --external-gateway public-provider
# Add private subnet to router (enables routing)
openstack router add subnet project-router private-subnet
Step 3: For VLAN-based self-service network
# Create VLAN network (requires VLAN range configured in ML2)
openstack network create \
--provider-network-type vlan \
--provider-physical-network provider \
--provider-segment 200 \
vlan-private-net
# Create subnet
openstack subnet create \
--network vlan-private-net \
--subnet-range 10.10.0.0/24 \
vlan-private-subnet
# Connect to router (same router can handle multiple networks)
openstack router add subnet project-router vlan-private-subnet
Step-by-step via Horizon:
1. Login as project user (not admin)
2. Navigate: Project → Network → Networks
3. Click "Create Network"
4. Network tab:
- Name: private-network
- Admin State: UP ✓
- Create Subnet: ✓
- External Network: ✗ (UNCHECKED - critical)
- Shared: ✗ (usually unchecked for private nets)
- Network Type: vxlan (or vlan if configured)
- (If VLAN) Physical Network: provider, Segmentation ID: 200
5. Subnet tab:
- Network Address: 10.0.0.0/24
- IP Version: IPv4
- Gateway IP: (leave blank for auto)
- Disable Gateway: ✗
6. Allocation Pools: (optional, auto if blank)
7. DNS Nameservers: 8.8.8.8
8. Click "Create"
9. Now create router: Project → Network → Routers → "Create Router"
10. Router details:
- Name: project-router
- External Network: public-provider
11. After router created, go to Interfaces tab → "Add Interface"
12. Select subnet: private-subnet → Submit
Verify self-service network setup:
# List networks and check router association
openstack network list
openstack router list
openstack router show project-router
# Check router interfaces
openstack router show project-router -c interfaces_info -f yaml
# Verify DHCP is working (check for DHCP port)
openstack port list --network private-network | grep dhcp
# Test DHCP from within a VM (after launching one)
# SSH to VM, then run:
ip addr show # Should show IP from 10.0.0.0/24 range
cat /etc/resolv.conf # Should show DNS servers
End-to-end connectivity test:
# 1. Launch instance on private network
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network private-network \
--key-name my-key \
test-vm-private
# 2. Wait for ACTIVE state
openstack server show test-vm-private -c status -f value
# 3. Get private IP
openstack server show test-vm-private -c addresses -f value
# Output example: private-network=10.0.0.5
# 4. Test internal connectivity (from another VM on same network)
# SSH to first VM, then ping second VM's private IP
# 5. Test external connectivity (requires Floating IP - see Section 8)
Create multiple isolated networks for microservices:
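Once several networks and VMs are in play, scripts constantly need the fixed IP out of the `addresses` field, in the `network=ip` format shown above. A parsing sketch — `server_ip` is my helper, and it assumes the classic `net=ip; net2=ip` formatting of `openstack server show <name> -c addresses -f value`:

```shell
# server_ip: extract the IP for a given network from an "addresses"
# value like "private-network=10.0.0.5; public-provider=192.168.10.101".
# Sketch - assumes the semicolon-separated "net=ip" output format.
server_ip() {  # usage: server_ip <network-name> <addresses-string>
  local net=$1 addr=$2
  printf '%s\n' "$addr" | tr ';' '\n' | \
    sed -n "s/^[[:space:]]*$net=\([0-9.]*\).*/\1/p"
}

# Real use (assumption):
#   server_ip private-network "$(openstack server show test-vm-private -c addresses -f value)"
server_ip private-network "private-network=10.0.0.5; public-provider=192.168.10.101"  # prints 10.0.0.5
```

Newer releases can also emit addresses as JSON (`-f json`), which is more robust if you have jq available; this plain-text version covers the default output.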
# Network for frontend services
openstack network create frontend-net
openstack subnet create --network frontend-net --subnet-range 10.1.0.0/24 frontend-subnet
openstack router add subnet project-router frontend-subnet
# Network for backend services
openstack network create backend-net
openstack subnet create --network backend-net --subnet-range 10.2.0.0/24 backend-subnet
openstack router add subnet project-router backend-subnet
# Launch VMs on respective networks
openstack server create --image ubuntu-22.04 --key-name my-key --network frontend-net --flavor m2.small frontend-01
openstack server create --image ubuntu-22.04 --key-name my-key --network backend-net --flavor db.medium backend-db-01
Connect networks without router (for advanced isolation):
# Create a port on frontend network
openstack port create --network frontend-net frontend-port
# Attach this port to a VM that also has backend network
openstack server add port frontend-01 frontend-port
# Now frontend-01 can route between networks (if IP forwarding enabled)
# This is advanced - use with caution
Monitor self-service network health:
# Check DHCP agent status
openstack network agent list --agent-type dhcp
# Check L3 agent status (for routing)
openstack network agent list --agent-type l3
# Monitor IP usage in private subnets
openstack subnet show private-subnet -c allocation_pools -c ip_version -f yaml
# Find ports without associated instances (cleanup candidates)
openstack port list --status DOWN --device-owner "" -c Name -c Status
Troubleshooting self-service network issues:
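Stale DOWN ports from the cleanup check above are also a frequent red herring during troubleshooting, so it pays to batch the check. A sketch that filters `openstack port list -f value` style output — `down_ports` is my helper, and it assumes status is the final column, which holds for the default columns on recent releases:

```shell
# down_ports: print port IDs whose last column is DOWN, from
# `openstack port list -f value` output. Sketch - verify the column
# order on your release before deleting anything it reports.
down_ports() {
  awk '$NF == "DOWN" { print $1 }'
}

# Real use (assumption):
#   openstack port list --network private-network -f value | down_ports
printf '%s\n' \
  'p-111 web-port fa:16:3e:aa:bb:cc ip=10.0.0.5 ACTIVE' \
  'p-222 stale-port fa:16:3e:dd:ee:ff ip=10.0.0.9 DOWN' | down_ports  # prints p-222
```

Review the list before acting on it; a port can legitimately be DOWN while an instance is shut off.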
Scenario: VM gets IP but cannot ping gateway
# Step 1: Verify router has interface on subnet
openstack router show project-router -c interfaces_info -f yaml
# Should show subnet_id matching your private-subnet
# Step 2: Check router namespace on network node (advanced)
# SSH to node running L3 agent
# Find router namespace (named qrouter-<router-uuid>, not by router name)
ip netns | grep qrouter
# Execute ping from router namespace
sudo ip netns exec qrouter-<router-uuid> ping -c 2 10.0.0.1
# Step 3: Check iptables rules in namespace
sudo ip netns exec qrouter-<router-uuid> iptables -L -n -v | grep 10.0.0
# Step 4: Verify security groups allow ICMP
openstack security group rule list --protocol icmp
Scenario: DHCP not assigning IPs
# Check DHCP agent logs (via Kolla)
docker exec -it neutron_dhcp_agent tail -100 /var/log/kolla/neutron/neutron-dhcp-agent.log
# Check for dnsmasq processes
ps aux | grep dnsmasq | grep <network-uuid>
# Restart DHCP agent (last resort)
# Kolla runs agents in containers; restart on the node hosting the agent:
docker restart neutron_dhcp_agent
Scale self-service networks:
# Add more private subnets to same network (multi-subnet)
openstack subnet create \
--network private-network \
--subnet-range 10.0.1.0/24 \
private-subnet-2
# Router automatically handles routing between subnets on same network
# For large deployments, consider VLAN ranges
# Update ML2 config (requires Kolla reconfigure):
# In /etc/kolla/globals.yml:
# network_vlan_ranges: "provider:100:200"
# Then reconfigure:
# kolla-ansible reconfigure -t neutron
🛠️ Field Tip: Self-service networks are where developers spend most of their time. I keep a "network-101" cheat sheet for my team: "Need internet? Attach a Floating IP. Need to talk to the DB? Use security groups. Network slow? Check router CPU." Simple rules prevent 80% of support tickets.
Your team asks: "Why does my small test VM cost the same as production?" Flavors answer this. They're not just CPU/RAM numbers—they're cost controls, performance guarantees, and resource policies. I once saved a client 40% on cloud costs just by right-sizing flavors.
- Admin privileges to create flavors
- Understanding of workload requirements (CPU, RAM, disk patterns)
- Knowledge of underlying hardware capabilities
- Nova scheduler configuration awareness
Pre-flavor design research:
# 1. Check Nova flavor documentation
# URL: https://docs.openstack.org/nova/<version>/admin/flavors.html
# 2. Key sections to study:
# - "Flavor extra specs" for advanced tuning
# - "Resource provider inventories" for custom resources
# - "Aggregate flavors" for hardware-specific placement
# 3. Check your hypervisor capabilities
# SSH to compute node:
virsh capabilities | grep -A 10 "<cpu>"
# 4. Review existing flavors (avoid duplication)
openstack flavor list --long
Step 1: Create basic flavors
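Before typing flavor creates one by one as below, note that a whole catalog scales poorly by hand. A dry-run generator that prints the commands from a simple table is safer to review first — the table format and `flavor_cmds` are my invention, not an OpenStack feature:

```shell
# flavor_cmds: read "name vcpus ram_mb disk_gb" lines and print the
# matching `openstack flavor create` commands. Dry run by design -
# review the output, then pipe it to bash. Sketch helper.
flavor_cmds() {
  while read -r name vcpus ram disk; do
    [ -z "$name" ] && continue
    case $name in \#*) continue ;; esac   # skip comment lines
    echo "openstack flavor create --vcpus $vcpus --ram $ram --disk $disk --is-public true $name"
  done
}

flavor_cmds <<'EOF'
# name        vcpus ram   disk
m2.small.web  2     4096  20
m4.medium.db  4     16384 40
EOF
```

Keeping the table in version control gives you a reviewable flavor catalog for free.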
# Small web server flavor
openstack flavor create \
--id 1 \
--vcpus 2 \
--ram 4096 \
--disk 20 \
--swap 0 \
--ephemeral 0 \
--is-public true \
m2.small.web
# Medium database flavor
openstack flavor create \
--id 3 \
--vcpus 4 \
--ram 16384 \
--disk 40 \
--swap 4096 \
--ephemeral 100 \
--is-public true \
m4.medium.db
Step 2: Add advanced properties (RX/TX Factor, CPU policy)
# Set network throttling (RX/TX Factor)
# 0.5 = 50% of base NIC speed, 2.0 = 200% (if hardware supports)
# Note: rxtx_factor is a create-time flag, not an extra spec, and only
# some network drivers honor it:
# openstack flavor create --vcpus 4 --ram 16384 --disk 40 --rxtx-factor 0.5 m4.medium.db
# Dedicated CPU pinning (for high-performance workloads)
openstack flavor set \
--property hw:cpu_policy=dedicated \
--property hw:cpu_thread_policy=prefer \
m4.medium.db
# NUMA topology awareness
openstack flavor set \
--property hw:numa_nodes=1 \
m4.medium.db
Step 3: Create flavor with custom resources (GPU, SR-IOV)
# For GPU workloads (requires GPU resource provider configured)
openstack flavor create \
--id 10 \
--vcpus 8 \
--ram 32768 \
--disk 100 \
--is-public true \
gpu.large
# Add GPU resource request (the standard resource class is VGPU;
# site-specific classes look like resources:CUSTOM_GPU=1)
openstack flavor set \
--property resources:VGPU=1 \
gpu.large
# For SR-IOV networking (high-performance NICs)
openstack flavor set \
--property hw:vif_multiattach_supported=true \
--property hw:pci_numa_affinity_policy=required \
gpu.large
Step-by-step via Horizon:
1. Login as admin user
2. Navigate: Admin → Compute → Flavors
3. Click "Create Flavor"
4. Details tab:
- Flavor Name: m4.medium.db
- Flavor ID: 3 (must be unique integer)
- Memory (MB): 16384
- VCPUs: 4
- Root Disk (GB): 40
- Ephemeral Disk (GB): 100
- Swap Disk (MB): 4096
- RX/TX Factor: 0.5
5. Flavor Access tab:
- Public: ✓ (or select specific projects)
6. Click "Create Flavor"
7. After creation, click flavor name → "Edit Extra Specs"
8. Add key-value pairs:
- Key: hw:cpu_policy, Value: dedicated
- Key: hw:numa_nodes, Value: 1
9. Click "Save"
Verify flavor configuration:
# List all flavors with details
openstack flavor list --long
# Show specific flavor properties
openstack flavor show m4.medium.db -f yaml
# Check extra specs
openstack flavor show m4.medium.db -c properties -f yaml
# Verify flavor access (only meaningful for non-public flavors)
openstack flavor access list m4.medium.db
Test flavor with actual instance launch:
# Launch test instance with new flavor
openstack server create \
--image ubuntu-22.04 \
--flavor m4.medium.db \
--network private-network \
--key-name my-key \
flavor-test-vm
# Wait for active, then verify resources inside VM
# SSH to VM and run:
nproc # Should show 4 CPUs
free -m # Should show ~16GB RAM
df -h / # Should show 40GB root; ephemeral appears as a second disk (e.g., /dev/vdb)
swapon --show # Should show 4GB swap
Right-size flavors based on monitoring data:
# 1. Monitor actual resource usage (via Prometheus/Grafana or simple CLI)
# Example: Check CPU usage pattern for web servers
openstack server list --name web-server-* -f value -c ID | while read id; do
echo "=== $id ==="
openstack server show $id -c flavor -f value
# Correlate with monitoring data externally
done
# 2. Create optimized flavors based on data
# Example: If web servers use <1GB RAM consistently:
openstack flavor create \
--id 2 \
--vcpus 2 \
--ram 2048 \
--disk 20 \
--is-public true \
m2.small.web.optimized # RAM reduced from 4GB based on usage data
# 3. Migrate instances to new flavor (requires resize)
openstack server resize --flavor m2.small.web.optimized web-server-01
openstack server confirm-resize web-server-01
Flavor quotas for project management:
# Set flavor-specific quotas for a project
openstack quota set \
--cores 20 \
--ram 81920 \
--instances 10 \
demo-project
# Check quota usage
openstack quota show demo-project -c cores -c ram -c instances -f yaml
Audit and cleanup unused flavors:
# Find flavors not used in last 90 days (requires external monitoring)
# Simple approach: List flavors with zero instances
openstack flavor list -f value -c ID | while read -r fid; do
count=$(openstack server list --all-projects --flavor $fid -f value | wc -l)
if [ $count -eq 0 ]; then
echo "Flavor $fid has no instances"
fi
done
# Document flavors before deletion
openstack flavor list -f csv > flavor-inventory-$(date +%Y%m%d).csv
# Delete unused flavor (careful!)
openstack flavor delete m2.small.legacy
Update flavors without downtime (advanced):
# Cannot modify existing flavor properties directly
# Strategy: Create new flavor, migrate instances
# 1. Create updated flavor
openstack flavor create \
--id 4 \
--vcpus 4 \
--ram 8192 \
--disk 40 \
--is-public true \
m2.small.v2 # vCPUs doubled, RAM raised from 4GB to 8GB
# 2. Migrate instances one by one
openstack server resize --flavor m2.small.v2 web-server-01
# Test application on resized VM
openstack server confirm-resize web-server-01
# 3. After all migrated, delete old flavor
openstack flavor delete m2.small.web
Troubleshooting flavor issues:
Scenario: Instance fails to launch with "No valid host was found"
# 1. Check flavor requirements vs hypervisor capabilities
openstack flavor show m4.medium.db -c properties -f yaml
# 2. Check compute node resource inventory
openstack hypervisor show <compute-node> -f yaml
# 3. Verify extra specs match hardware
# Example: If flavor requires dedicated CPUs but host doesn't support:
# Look for "hw:cpu_policy=dedicated" in flavor vs compute node capabilities
# 4. Check Nova scheduler logs
docker exec -it nova_scheduler tail -100 /var/log/kolla/nova/nova-scheduler.log
Scenario: Network performance not matching RX/TX Factor
# 1. Verify factor is applied
openstack flavor show m4.medium.db -c properties -f yaml | grep rxtx
# 2. Test actual network throughput from VM
# Inside VM:
iperf3 -c <test-server> -t 30
# 3. Check Neutron QoS configuration (if used)
openstack network qos policy list
openstack network qos rule list <policy-id>
# 4. Verify physical NIC capabilities on compute node
ethtool <physical-interface> | grep Speed
🛠️ Field Tip: Flavors are your cost control lever. I tag every flavor with a cost estimate in the description: "m4.medium.db - ~$120/month". When developers request new flavors, I ask: "What workload pattern does this match?" This simple question prevents flavor sprawl. Keep a "flavor catalog" document with use cases for each flavor.
Your team needs Ubuntu 22.04 for new microservices. Do you download it from a random website? No. You use trusted cloud images, verify checksums, and upload with proper metadata. I once prevented a security incident just by enforcing image checksum verification.
- Admin or image-creator role in OpenStack
- Access to download images (internet or internal mirror)
- Storage backend configured for Glance (file, Ceph, Swift)
- Understanding of image formats (qcow2, raw, etc.)
Pre-image management research:
# 1. Check Glance documentation for your version
# URL: https://docs.openstack.org/glance/<version>/admin/
# 2. Key sections:
# - "Supported disk formats" (qcow2 recommended)
# - "Image properties" (critical for VM boot)
# - "Storage backends" (know your backend type)
# 3. Verify Glance backend status
openstack image backend info # If supported in your version
# 4. Check existing images to avoid duplicates
openstack image list --public
openstack image list --private
Step 1: Download and verify official cloud image
# Create working directory
mkdir -p ~/openstack-images/ubuntu
cd ~/openstack-images/ubuntu
# Download Ubuntu 22.04 cloud image
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
# Download checksum file
wget https://cloud-images.ubuntu.com/jammy/current/SHA256SUMS
# Verify checksum (CRITICAL STEP)
sha256sum -c SHA256SUMS --ignore-missing
# Must output: jammy-server-cloudimg-amd64.img: OK
# Optional: Compress if large (saves Glance storage)
qemu-img convert -f qcow2 -O qcow2 -c \
jammy-server-cloudimg-amd64.img \
jammy-server-cloudimg-compressed.qcow2
Step 2: Upload image to Glance with proper metadata
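A note on the verification step above: internal mirrors often publish a bare hash string rather than a SHA256SUMS file, in which case `sha256sum -c` has nothing to check against. A direct compare works — `verify_sha256` is my wrapper around coreutils, not a standard tool:

```shell
# verify_sha256: compare a file's SHA-256 against an expected hex digest.
# Sketch wrapper - for when you have a raw hash rather than a SHA256SUMS file.
verify_sha256() {  # usage: verify_sha256 <file> <expected-hex>
  local got
  got=$(sha256sum "$1" | awk '{print $1}')
  if [ "$got" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got $got)" >&2
    return 1
  fi
}

# Demo with a throwaway file; real use:
#   verify_sha256 jammy-server-cloudimg-amd64.img <published-sha256>
printf 'demo' > /tmp/demo.img
verify_sha256 /tmp/demo.img "$(sha256sum /tmp/demo.img | awk '{print $1}')"
```

Because it returns non-zero on mismatch, it slots straight into upload scripts: verify, then `openstack image create` only on success.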
# Source admin environment
source /etc/kolla/admin-openrc.sh
# Upload with essential properties
openstack image create "Ubuntu 22.04 LTS" \
--file jammy-server-cloudimg-compressed.qcow2 \
--disk-format qcow2 \
--container-format bare \
--property os_distro=ubuntu \
--property os_version=22.04 \
--property hw_disk_bus=virtio \
--property hw_scsi_model=virtio-scsi \
--property hw_qemu_guest_agent=yes \
--public
# For Windows images (example)
openstack image create "Windows Server 2022" \
--file win2022-cloud.qcow2 \
--disk-format qcow2 \
--container-format bare \
--property os_distro=windows \
--property os_version=2022 \
--property hw_disk_bus=virtio \
--property hw_vif_model=virtio \
--property windows_license_type=evaluation \
--public
Step 3: Upload from URL (for large images or automation)
# --file only reads local paths; for an image hosted on an internal
# HTTP server, download first (or use Glance's web-download import
# method, if interoperable image import is enabled on your deployment)
wget http://internal-mirror/centos-stream-9.qcow2
openstack image create "CentOS Stream 9" \
--container-format bare \
--disk-format qcow2 \
--file centos-stream-9.qcow2 \
--property os_distro=centos \
--property os_version=stream9 \
--public
Step-by-step via Horizon:
1. Login as admin or user with image_create permission
2. Navigate: Project → Compute → Images (or Admin → System → Images)
3. Click "Create Image"
4. Information tab:
- Name: Ubuntu 22.04 LTS
- Description: (optional) Official cloud image, verified 2024-03
- Visibility: Public (or Project/Shared)
5. Image Location tab:
- Image Source: File (or "Import from URL" if configured)
- File: Browse and select jammy-server-cloudimg-compressed.qcow2
6. Format tab:
- Disk Format: QCOW2 - QEMU Emulator
- Container Format: Bare - No container metadata
7. Properties tab (critical for VM boot):
- Click "Add property"
- Key: hw_disk_bus, Value: virtio
- Add: hw_scsi_model = virtio-scsi
- Add: hw_qemu_guest_agent = yes
- Add: os_distro = ubuntu
- Add: os_version = 22.04
8. Click "Create Image"
9. Monitor upload progress in Images list
Verify image upload and properties:
# List images and check status
openstack image list | grep Ubuntu
# Check image details (critical properties)
openstack image show "Ubuntu 22.04 LTS" -f yaml
# Verify essential properties exist:
# - disk_format: qcow2
# - container_format: bare
# - hw_disk_bus: virtio
# - status: active
# Check image size and checksum
openstack image show "Ubuntu 22.04 LTS" -c size -c checksum -f yaml
Test image by launching instance:
# Launch minimal test instance
openstack server create \
--image "Ubuntu 22.04 LTS" \
--flavor m1.tiny \
--network private-network \
--key-name my-key \
image-test-vm
# Wait for ACTIVE state
openstack server show image-test-vm -c status -f value
# If instance fails to boot, check console log
openstack console log show image-test-vm --lines 50
Update images with security patches (golden image pattern):
# 1. Launch instance from current image
openstack server create \
--image "Ubuntu 22.04 LTS" \
--flavor m2.small \
--network private-network \
--key-name admin-key \
image-builder-01
# 2. SSH and apply updates
ssh -i admin-key.pem ubuntu@<builder-ip>
sudo apt update && sudo apt upgrade -y
sudo apt install cloud-init qemu-guest-agent -y
sudo shutdown -h now
# 3. Create new image from updated instance
openstack server image create \
--name "Ubuntu 22.04 LTS - Patched 2024-03" \
image-builder-01
# 4. Wait for image to become active
openstack image list | grep "Patched 2024-03"
# 5. Test new image before deprecating old
# 6. Update flavor/image references in automation scripts
Manage image lifecycle (deprecate old versions):
# Tag old image as deprecated (a custom property - Glance has no
# built-in deprecation, and this does not change the image's real status)
openstack image set \
--property deprecated=true \
"Ubuntu 22.04 LTS"
# Update description to guide users
openstack image set \
--description "Deprecated - Use 'Ubuntu 22.04 LTS - Patched 2024-03' instead" \
"Ubuntu 22.04 LTS"
# After 90 days, delete old image (if no instances using it)
# First, verify no instances use it:
openstack server list --image <old-image-id> --all-projects
# If empty, safe to delete:
openstack image delete "Ubuntu 22.04 LTS"
Audit image usage and storage:
# List images with size and usage count
openstack image list -f value -c ID | while read -r id; do
name=$(openstack image show $id -c name -f value)
size=$(openstack image show $id -c size -f value)
instances=$(openstack server list --all-projects --image $id -f value | wc -l)
echo "$name,$size,$instances"
done > image-usage-report.csv
# Check Glance storage backend usage (if Ceph)
ceph df | grep images
# Monitor image upload/download metrics (if Prometheus enabled)
# Query: glance_image_uploads_total, glance_image_downloads_total
Troubleshooting image issues:
Scenario: Instance fails to boot with "No bootable device"
# 1. Check image properties
openstack image show <image-name> -c properties -f yaml
# 2. Verify critical properties exist:
# hw_disk_bus=virtio (most common fix)
# hw_scsi_model=virtio-scsi (for SCSI)
# If missing, update image:
openstack image set \
--property hw_disk_bus=virtio \
<image-name>
# 3. Check console output for boot errors
openstack console log show <instance-name> --lines 100
# 4. Verify image file integrity (re-download if checksum mismatch)

Scenario: Image upload fails with "Disk limit exceeded"
# 1. Check the Glance storage limits (e.g. user_storage_quota, image_size_cap in glance-api.conf)
grep -E "user_storage_quota|image_size_cap" /etc/kolla/glance-api/glance-api.conf
# 2. Check actual storage usage
df -h /var/lib/glance  # or your Glance backend path
# 3. Clean up old/unused images first
# 4. Or raise the limit in glance-api.conf and reconfigure:
kolla-ansible reconfigure -t glance

Optimize image storage:
# Use compressed qcow2 format (saves 30-70% space)
qemu-img convert -f qcow2 -O qcow2 -c source.img compressed.img
# For frequently used images, enable caching in Nova
# In /etc/kolla/nova/nova.conf on compute nodes:
# [libvirt]
# images_type = qcow2
# images_rbd_pool = images # if using Ceph
# Reconfigure Nova after config change:
kolla-ansible reconfigure -t nova

🛠️ Field Tip: I maintain an "image catalog" spreadsheet: Image Name, Version, Source URL, Checksum, Upload Date, Deprecation Date, Use Case. When a developer asks "Which Ubuntu image should I use?", I share the catalog link. This simple document reduced "wrong image" incidents by 90%. Also, always verify checksums—never skip this step.
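That checksum habit is easy to script. A minimal pre-upload check, as a sketch (the image path and expected hash are placeholders for your own values):

```sh
# Compare a downloaded image against its published SHA-256 checksum
# before uploading it to Glance. Non-zero exit on mismatch.
verify_image_checksum() {
  image="$1" expected="$2"
  actual=$(sha256sum "$image" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH: got $actual, want $expected" >&2
    return 1
  fi
}
```

Call it as `verify_image_checksum ubuntu-22.04.img <hash-from-release-page>` and only run `openstack image create` when it prints `checksum OK`.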
Your VM is running but you can't SSH in. Or worse, it's accessible from anywhere. Security groups are your virtual firewall. KeyPairs are your secure door keys. I once stopped a brute-force attack in minutes just by tightening security group rules.
- Project user permissions (for self-service) or admin
- Understanding of network protocols (TCP, UDP, ICMP)
- SSH key management knowledge
- Awareness of application port requirements
Pre-configuration research:
# 1. Check Neutron security group documentation
# URL: https://docs.openstack.org/neutron/<version>/admin/config-security-groups.html
# 2. Key concepts:
# - Stateful firewall (return traffic auto-allowed)
# - Rule direction (ingress vs egress)
# - Default deny policy (explicit allow needed)
# 3. Review existing security groups
openstack security group list
# 4. Check default security group rules
openstack security group show default -c rules -f yaml

Step 1: Create and configure KeyPair
# Generate new key pair (if you don't have one)
openstack keypair create --private-key prod-admin-key.pem prod-admin-key
# Secure the private key file (CRITICAL)
chmod 600 prod-admin-key.pem
# Never share this file. Never commit to git.
# Import existing public key (if you have one)
openstack keypair create \
--public-key ~/.ssh/id_rsa.pub \
existing-key-name
# List and verify keys
openstack keypair list
openstack keypair show prod-admin-key -f yaml

Step 2: Create security group for web server
# Create security group
openstack security group create web-sg \
--description "Security group for public web servers"
# Allow HTTP (port 80) from anywhere
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 80 \
--remote-ip 0.0.0.0/0 \
web-sg
# Allow HTTPS (port 443) from anywhere
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 443 \
--remote-ip 0.0.0.0/0 \
web-sg
# Allow SSH only from management network (example: 10.10.10.0/24)
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 22 \
--remote-ip 10.10.10.0/24 \
web-sg
# Allow ICMP (ping) for monitoring
openstack security group rule create \
--ingress \
--protocol icmp \
web-sg
# Allow outbound traffic (usually needed)
openstack security group rule create \
--egress \
--protocol tcp \
--dst-port 1:65535 \
--remote-ip 0.0.0.0/0 \
web-sg

Step 3: Create security group for database (internal only)
openstack security group create db-sg \
--description "Internal database servers - no public access"
# Allow MySQL (3306) only from app servers subnet
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 3306 \
--remote-ip 10.0.1.0/24 \
db-sg
# Allow PostgreSQL (5432) from same subnet
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 5432 \
--remote-ip 10.0.1.0/24 \
db-sg
# NO SSH from public - only from jump host
openstack security group rule create \
--ingress \
--protocol tcp \
--dst-port 22 \
--remote-ip 10.10.10.50/32 \
db-sg
# Allow outbound for updates
# (--dst-port takes a single port or a min:max range, not a comma list)
openstack security group rule create \
--egress \
--protocol tcp \
--dst-port 443 \
--remote-ip 0.0.0.0/0 \
db-sg
# Repeat with --dst-port 80 if plain-HTTP mirrors are used

Step-by-step via Horizon:
1. Login to project dashboard
2. Navigate: Project → Network → Security Groups
3. Click "Create Security Group"
- Name: web-sg
- Description: Public web servers
- Click "Create"
4. Click "Manage Rules" for web-sg
5. Add rule: HTTP
- Rule: HTTP
- Remote: CIDR
- CIDR: 0.0.0.0/0
- Click "Add"
6. Add rule: HTTPS (same as HTTP, port 443)
7. Add rule: SSH
- Rule: Custom TCP Rule
- Port Range: 22
- Remote: CIDR
- CIDR: 10.10.10.0/24 # Management network
- Click "Add"
8. Add rule: ICMP (for ping)
- Rule: All ICMP
- Remote: CIDR
- CIDR: 0.0.0.0/0
9. For KeyPair: Project → Compute → Key Pairs → "Create Key Pair"
- Name: prod-admin-key
- Click "Create Key Pair"
- Download and secure the .pem file immediately
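With SSH locked to the jump host (as in db-sg above), day-to-day access is easiest through SSH's ProxyJump. A sketch of the client-side config; host names, IPs, and key paths are illustrative:

```
# ~/.ssh/config - reach internal servers only via the jump host the rules allow
Host jump
    HostName 10.10.10.50
    User ubuntu
    IdentityFile ~/.ssh/prod-admin-key.pem

Host db-*
    User ubuntu
    IdentityFile ~/.ssh/prod-admin-key.pem
    ProxyJump jump
```

Then `ssh db-01` transparently hops through 10.10.10.50, matching the /32 rule created for db-sg.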
Verify security group configuration:
# List rules for a security group
openstack security group rule list web-sg -f table
# Verify the management-network SSH rule exists
# (rule list has limited server-side filters, so grep the output)
openstack security group rule list web-sg | grep "10.10.10.0/24"
# Check which instances use this security group (no server-side filter; loop and grep)
for s in $(openstack server list -f value -c Name); do
  openstack server show $s -c security_groups -f value | grep -q web-sg && echo $s
done

Test security group rules:
# 1. Launch test instance with web-sg
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network private-network \
--security-group web-sg \
--key-name prod-admin-key \
security-test-vm
# 2. Get instance IP
openstack server show security-test-vm -c addresses -f value
# 3. From management network (10.10.10.0/24), test SSH
ssh -i prod-admin-key.pem ubuntu@<instance-ip>
# Should succeed
# 4. From public internet, test SSH (should fail)
# From your laptop (outside management network):
ssh -i prod-admin-key.pem ubuntu@<instance-ip>
# Should timeout or reject
# 5. Test HTTP from anywhere
curl http://<instance-ip>
# Should connect if web server running

Apply security groups during instance launch:
# Single security group
openstack server create \
--security-group web-sg \
...
# Multiple security groups (combined rules)
openstack server create \
--security-group web-sg \
--security-group monitoring-sg \
...

Update security groups for running instances:
# Add security group to running instance
openstack server add security group web-server-01 monitoring-sg
# Remove security group
openstack server remove security group web-server-01 old-sg
# Changes apply immediately (no reboot needed)

Emergency lockdown procedure:
# Scenario: Suspected breach on web-server-01
# 1. Revoke all public access immediately
openstack security group rule delete <rule-id-for-public-ssh>
# 2. Or replace security group with lockdown group
openstack security group create lockdown-sg \
--description "Emergency lockdown - no inbound"
# Allow only admin jump host
openstack security group rule create \
--ingress --protocol tcp --dst-port 22 \
--remote-ip 10.10.10.50/32 \
lockdown-sg
# Apply to instance
openstack server remove security group web-server-01 web-sg
openstack server add security group web-server-01 lockdown-sg
# 3. Investigate, then restore appropriate rules

Audit security group rules regularly:
# Export all security group rules for review
openstack security group rule list -f csv > security-rules-audit-$(date +%Y%m%d).csv
# Find overly permissive rules (0.0.0.0/0 on sensitive ports)
# (rule list has limited server-side filters, so grep the CSV output)
openstack security group rule list -f csv | grep "0.0.0.0/0" | grep -E "22:22|3306:3306|5432:5432|6379:6379"
# Document rule purpose (add description when creating)
openstack security group rule create \
--description "SSH from jump host only - JIRA-1234" \
...

Rotate SSH keys periodically:
# 1. Generate new key pair
openstack keypair create --private-key prod-admin-key-v2.pem prod-admin-key-v2
# 2. Update authorized_keys on all instances (via config management)
# Example Ansible task:
# - name: Update admin SSH key
# authorized_key:
# user: ubuntu
# key: "{{ lookup('file', 'prod-admin-key-v2.pub') }}"
# state: present
# 3. Test new key works
ssh -i prod-admin-key-v2.pem ubuntu@<instance-ip>
# 4. Remove old key from instances
# 5. Delete old keypair from OpenStack
openstack keypair delete prod-admin-key

Troubleshooting security group issues:
Scenario: Cannot SSH to instance, but security group allows it
# 1. Verify instance has correct security group
openstack server show web-server-01 -c security_groups -f yaml
# 2. Check if rule is ingress (incoming) not egress
openstack security group rule list web-sg | grep 22
# 3. Verify remote IP matches your source
# If rule allows 10.10.10.0/24 but you're connecting from 192.168.1.5, it will block
# 4. Check instance OS firewall (iptables/ufw)
# SSH via console if network blocked:
openstack console url show web-server-01
# Then inside VM:
sudo ufw status
sudo iptables -L -n
# 5. Check Neutron security group implementation
# On compute node:
sudo ovs-ofctl dump-flows br-int | grep <vm-port-uuid>

Scenario: KeyPair not working for SSH
# 1. Verify keypair is associated with instance
openstack server show web-server-01 -c key_name -f value
# 2. Check private key permissions (must be 600)
ls -l prod-admin-key.pem
chmod 600 prod-admin-key.pem # If needed
# 3. Verify public key was injected (check instance console)
openstack console log show web-server-01 --lines 20 | grep -i authorized
# 4. For cloud-init images, check user-data didn't override keys
# Check /var/log/cloud-init-output.log inside VM

🛠️ Field Tip: I use a "security group template" for each app type. Web servers get web-sg-template, databases get db-sg-template. When a new project starts, I copy the template instead of creating from scratch. This ensures consistency and prevents "I opened port 22 to 0.0.0.0/0 for testing and forgot to close it" incidents. Also, always add a description to every rule with a ticket number (e.g., "JIRA-5678"). Future-you will thank present-you.
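One more recurring check from the troubleshooting steps above: whether your source address actually falls inside a rule's CIDR. A pure-shell sketch (IPv4 only, no external dependencies):

```sh
# Test whether an IPv4 source address falls inside a rule's CIDR.
ip_to_int() {
  IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
  echo $(( o1*16777216 + o2*65536 + o3*256 + o4 ))
}
in_cidr() {
  ip="$1" net="${2%/*}" bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
in_cidr 10.10.10.7 10.10.10.0/24 && echo "allowed"
in_cidr 192.168.1.5 10.10.10.0/24 || echo "blocked by rule"
```

If `in_cidr <your-ip> <rule-cidr>` fails, the rule is working as written and you are simply connecting from the wrong network.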
You've configured networks, flavors, images, security. Now launch the VM. But launching is easy—managing it day-to-day is where the real work happens. This section covers the full lifecycle: launch, access, monitor, resize, backup, delete.
- All previous sections completed (networks, flavors, images, security)
- Sufficient quota in project (CPU, RAM, instances, volumes)
- SSH client configured with key pairs
- Understanding of cloud-init for initial configuration
Pre-launch research:
# 1. Check Nova documentation for instance operations
# URL: https://docs.openstack.org/nova/<version>/user/
# 2. Key sections:
# - "Boot an instance" (basic launch)
# - "Server actions" (resize, rescue, etc.)
# - "Metadata and user data" (cloud-init)
# 3. Verify resource availability
openstack quota show
openstack hypervisor stats show
# 4. Check image and flavor compatibility
openstack image show ubuntu-22.04 -c disk_format -c container_format -f yaml
openstack flavor show m2.small -c vcpus -c ram -c disk -f yaml

Step 1: Basic instance launch
# Launch with minimal required parameters
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network private-network \
--key-name prod-admin-key \
--security-group web-sg \
web-server-01
# Wait for active state (polling)
# Poll until ACTIVE, aborting on ERROR so the loop cannot spin forever
while true; do
  status=$(openstack server show web-server-01 -c status -f value)
  [ "$status" = "ACTIVE" ] && break
  [ "$status" = "ERROR" ] && { echo "Launch failed - inspect the fault field"; break; }
  echo "Waiting for instance to become active..."
  sleep 10
done

Step 2: Launch with advanced options (user data, metadata)
# Create cloud-init user data file
cat > user-data.yaml << 'EOF'
#cloud-config
package_update: true
packages:
- nginx
- fail2ban
users:
- name: appuser
groups: sudo
shell: /bin/bash
sudo: ['ALL=(ALL) NOPASSWD:ALL']
ssh_authorized_keys:
- ssh-rsa AAAAB3... user@example.com
runcmd:
- systemctl enable nginx
- systemctl start nginx
EOF
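cloud-init silently ignores user data it cannot parse, so it is worth validating the YAML before launching. A quick sketch (assumes python3 with the PyYAML module installed):

```sh
# Fail fast on malformed cloud-init YAML instead of debugging a booted VM.
check_user_data() {
  python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1])); print("YAML OK")' "$1"
}
```

Run `check_user_data user-data.yaml` before the server create; a traceback here is far cheaper than a half-configured instance.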
# Launch with user data and metadata
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network private-network \
--key-name prod-admin-key \
--security-group web-sg \
--user-data user-data.yaml \
--property environment=production \
--property application=web-frontend \
--property owner=team-alpha \
web-server-01

Step 3: Launch with multiple networks and volumes
# Create additional network port for management
openstack port create \
--network management-network \
--fixed-ip subnet=management-subnet,ip-address=10.10.10.100 \
web-server-01-mgmt-port
# Launch instance with primary network
openstack server create \
--image ubuntu-22.04 \
--flavor m2.small \
--network private-network \
--key-name prod-admin-key \
--security-group web-sg \
web-server-01
# Attach management port after launch
openstack server add port web-server-01 web-server-01-mgmt-port
# Create and attach boot volume (for persistent root disk)
openstack volume create --size 20 --image ubuntu-22.04 web-server-01-root
openstack server add volume web-server-01 web-server-01-root

Step-by-step via Horizon:
1. Navigate: Project → Compute → Instances → "Launch Instance"
2. Details tab:
- Instance Name: web-server-01
- Description: Production web server - Team Alpha
3. Source tab:
- Select Boot Source: Image
- Image: ubuntu-22.04
- Create New Volume: No (or Yes for persistent root)
4. Flavor tab:
- Select: m2.small
5. Networks tab:
- Select: private-network (click + to add)
6. Security Groups tab:
- Select: web-sg (click + to add)
7. Key Pair tab:
- Select: prod-admin-key
8. Configuration tab (advanced):
- Customization Script: Paste cloud-init YAML or browse file
- Metadata: Add key-value pairs:
* environment = production
* application = web-frontend
9. Click "Launch Instance"
10. Monitor progress in Instances list
Verify instance launch and configuration:
# Check instance status and details
openstack server show web-server-01 -f yaml
# Key fields to verify:
# - status: ACTIVE
# - flavor: m2.small
# - image: ubuntu-22.04
# - addresses: shows assigned IPs
# - security_groups: shows applied groups
# - key_name: shows associated key
# Check console output for boot issues
openstack console log show web-server-01 --lines 50
# Get VNC console URL for direct access (if needed)
openstack console url show web-server-01

Test instance connectivity:
# Get instance IP address
INSTANCE_IP=$(openstack server show web-server-01 -c addresses -f value | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1)
# Test SSH connectivity
ssh -i prod-admin-key.pem -o ConnectTimeout=10 ubuntu@$INSTANCE_IP echo "SSH successful"
# If using Floating IP (Section 8), test public access
# curl http://<floating-ip>
# Verify cloud-init executed (inside VM)
ssh ubuntu@$INSTANCE_IP "cloud-init status --long"
ssh ubuntu@$INSTANCE_IP "systemctl status nginx" # If installed via user-data

Start/Stop/Reboot instances:
# Graceful shutdown (OS level)
openstack server stop web-server-01
# Hard power off (use with caution)
openstack server stop --hard web-server-01
# Start instance
openstack server start web-server-01
# Reboot (graceful)
openstack server reboot web-server-01
# Reboot (hard)
openstack server reboot --hard web-server-01

Resize instance (change flavor):
# 1. Initiate resize
openstack server resize --flavor m4.medium web-server-01
# 2. Instance will be in VERIFY_RESIZE state
# 3. Test application on resized instance
# 4. Confirm resize (makes it permanent)
openstack server confirm-resize web-server-01
# If issues, revert:
openstack server revert-resize web-server-01

Create backup image from running instance:
# Create image snapshot
openstack server image create \
--name "web-server-01-backup-$(date +%Y%m%d)" \
web-server-01
# Wait for image to become active
openstack image list | grep backup
# Use this image to launch recovery instances if needed

Access instance logs and metrics:
# Get console log (boot messages, errors)
openstack console log show web-server-01 --lines 100
# Get instance diagnostics (resource usage)
openstack server show --diagnostics web-server-01
# For detailed metrics, use monitoring stack (Prometheus/Grafana)
# Example query: instance:node_cpu_utilisation:avg1m{instance="web-server-01"}

Instance lifecycle management:
# Tag instances for automation (using metadata)
openstack server set \
--property backup-daily=true \
--property retention-days=30 \
web-server-01
# List instances by tag for batch operations
openstack server list --long | grep "backup-daily=true"
# Automated cleanup script concept (run manually, not automated here):
# For each instance with retention-days:
# if age > retention-days and status != production:
# notify team, then delete

Troubleshooting instance issues:
Scenario: Instance stuck in BUILD state
# 1. Check detailed status
openstack server show web-server-01 -c fault -f yaml
# 2. Common causes:
# - No valid host (resource shortage): Check hypervisor stats
# - Image issue: Verify image status and properties
# - Network issue: Check port creation
# 3. Check Nova compute logs on target host
# SSH to compute node, find instance UUID:
openstack server show web-server-01 -c id -f value
# Then check logs:
grep <instance-uuid> /var/log/kolla/nova/nova-compute.log

Scenario: Instance running but no network connectivity
# 1. Verify port status
openstack port list --server web-server-01
# 2. Check security group rules (Section 6)
# 3. Verify instance OS network config (via console)
openstack console url show web-server-01
# Inside VM:
ip addr show
ip route show
cat /etc/resolv.conf
# 4. Check Neutron agent on compute node
docker exec -it neutron_openvswitch_agent ovs-vsctl show

Instance performance optimization:
# Enable QEMU guest agent for better metrics
# Ensure image has property: hw_qemu_guest_agent=yes
# Inside VM, install guest agent:
# Ubuntu: sudo apt install qemu-guest-agent
# Then restart instance
# Use dedicated CPU for high-performance workloads
# Flavor property: hw:cpu_policy=dedicated
# Requires compute node with CPU pinning enabled
# Optimize disk I/O with virtio-blk
# Ensure image property: hw_disk_bus=virtio

🛠️ Field Tip: I treat every instance like a pet with a name tag. Metadata is my tag system: environment=production, team=alpha, cost-center=1234. When finance asks "Who's using all the CPU?", I query by metadata. Also, I never delete instances immediately—I stop them first, wait 7 days, then delete. This saved me twice when someone accidentally deleted the wrong VM. Use metadata, use soft deletes, sleep better.
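The finance question usually reduces to grouping servers by a metadata key. A sketch of the grouping step; it assumes input lines of the form `<server-name> key=value,...`, e.g. assembled from `openstack server show -c properties` output:

```sh
# Count instances per value of a metadata key (e.g. team=alpha).
count_by_tag() {
  key="$1"
  grep -o "${key}=[^, ]*" | sort | uniq -c
}
# Example with canned input; real usage pipes server/properties listings in:
printf 'vm1 team=alpha\nvm2 team=alpha\nvm3 team=beta\n' | count_by_tag team
```

The output is a ready-made per-team count you can paste straight into a cost report.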
Your VM has a private IP (10.0.0.5), but users need to reach it from the internet. Floating IPs are your public phone numbers that forward to private extensions. Critical for web servers, APIs, and any public-facing service.
- Provider network configured as external (Section 2)
- Router with external gateway (Section 3)
- Available public IP addresses in provider subnet
- Security group allowing inbound traffic (Section 6)
Pre-configuration research:
# 1. Check Neutron floating IP documentation
# URL: https://docs.openstack.org/neutron/<version>/admin/config-floating-ip.html
# 2. Key concepts:
# - Floating IP vs fixed IP
# - Port association (1:1 NAT)
# - Security group application point
# 3. Verify external network has available IPs
openstack subnet show public-provider-subnet -c allocation_pools -f yaml
# 4. Check router external gateway
openstack router show project-router -c external_gateway_info -f yaml

Step 1: Allocate floating IP from provider network
# Allocate one floating IP
openstack floating ip create public-provider
# Allocate specific IP (if you need a particular address)
openstack floating ip create \
--floating-ip-address 192.168.10.150 \
public-provider
# List allocated floating IPs
openstack floating ip list

Step 2: Associate floating IP with instance
# Get instance port ID (on private network)
INSTANCE_PORT=$(openstack port list --server web-server-01 --network private-network -f value -c ID)
# Associate floating IP
openstack server add floating ip web-server-01 192.168.10.150
# Or associate directly to port (more control)
openstack floating ip set \
--port $INSTANCE_PORT \
192.168.10.150

Step 3: Verify connectivity
# Get floating IP details
openstack floating ip show 192.168.10.150 -f yaml
# Key fields:
# - fixed_ip_address: should show private IP (e.g., 10.0.0.5)
# - port_id: should match instance port
# - status: ACTIVE
# Test from external network (your laptop)
ping 192.168.10.150
curl http://192.168.10.150 # If web server running
ssh -i prod-admin-key.pem ubuntu@192.168.10.150 # If SSH allowed

Step-by-step via Horizon:
1. Navigate: Project → Network → Floating IPs
2. Click "Allocate IP To Project"
3. Pool: public-provider (your external network)
4. Click "Allocate IP"
5. New floating IP appears in list
6. Click "Associate" next to the IP
7. Association dialog:
- Port to be associated: Select instance port (e.g., web-server-01)
- IP Address: Auto-filled or select specific
- Click "Associate"
8. Verify: Floating IP list shows instance name and private IP
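Step 8's manual check can be scripted for monitoring. A sketch; the query command is passed in as arguments, so in production it would be the `openstack floating ip show ... -c fixed_ip_address -f value` call:

```sh
# Assert that a floating IP still maps to the expected private fixed IP.
check_fip_mapping() {
  expected="$1"; shift
  actual=$("$@")
  if [ "$actual" = "$expected" ]; then
    echo "mapping OK"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}
# Example (real usage would query OpenStack instead of echo):
# check_fip_mapping 10.0.0.5 openstack floating ip show 192.168.10.150 \
#   -c fixed_ip_address -f value
```

A non-zero exit makes it easy to wire into a cron job or monitoring probe.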
Verify floating IP setup:
# Check NAT mapping
openstack floating ip show 192.168.10.150 -c fixed_ip_address -c port_id -f yaml
# Verify router has NAT rules
# On network node (advanced):
sudo ip netns exec qrouter-<router-uuid> iptables -t nat -L -n -v | grep 192.168.10.150
# Test end-to-end connectivity
# From external client:
curl -I http://192.168.10.150 # Should return HTTP headers
traceroute 192.168.10.150 # Should show path through your network

Test failover scenario:
# 1. Disassociate floating IP
openstack server remove floating ip web-server-01 192.168.10.150
# 2. Verify connectivity lost
curl --connect-timeout 5 http://192.168.10.150 # Should timeout
# 3. Re-associate to different instance (for DR testing)
openstack server add floating ip backup-server-01 192.168.10.150
# 4. Verify traffic now goes to backup
curl http://192.168.10.150 # Should reach backup server

Manage multiple services with one floating IP (port forwarding):
# Note: OpenStack floating IPs are 1:1 NAT by default
# For port forwarding (one IP, multiple services), use:
# Option 1: Load balancer (Octavia) - recommended
# Option 2: Instance-based reverse proxy (Nginx/HAProxy)
# Example: Instance with Nginx reverse proxy
# Floating IP -> Instance (10.0.0.10) -> Nginx routes to:
# /api -> backend-server (10.0.0.20:8080)
# / -> frontend-server (10.0.0.30:80)

Floating IP lifecycle management:
# Reserve floating IPs for critical services
# (Prevent accidental deletion)
openstack floating ip set \
--description "Reserved for production web - DO NOT DELETE" \
192.168.10.150
# Document floating IP assignments
openstack floating ip list -f csv > floating-ip-inventory-$(date +%Y%m%d).csv
# Cleanup unused floating IPs
# Find floating IPs not associated with any port:
openstack floating ip list --status DOWN -f value -c ID | while read fip; do
echo "Unused floating IP: $fip"
# Review before deletion:
# openstack floating ip delete $fip
done

Monitor floating IP usage:
# Count floating IPs per project
openstack floating ip list --all-projects -f csv | \
  awk -F, 'NR>1 {print $NF}' | sort | uniq -c  # Project is the last CSV column
# Check for floating IP exhaustion
openstack subnet show public-provider-subnet \
-c allocation_pools -c ip_version -f yaml
# Alert when >90% of pool is used (external monitoring integration)

Troubleshooting floating IP issues:
Scenario: Floating IP allocated but no connectivity
# 1. Verify association
openstack floating ip show 192.168.10.150 -c port_id -c fixed_ip_address -f yaml
# 2. Check security group on instance port
PORT_ID=$(openstack floating ip show 192.168.10.150 -c port_id -f value)
openstack port show $PORT_ID -c security_groups -f yaml
# 3. Verify security group allows inbound traffic (Section 6)
# 4. Check router external gateway
openstack router show project-router -c external_gateway_info -f yaml
# Should show: {"network_id": "<public-net-id>", "enable_snat": true}
# 5. Test from network node (advanced)
# SSH to node running L3 agent:
sudo ip netns exec qrouter-<router-uuid> ping -c 2 <instance-private-ip>
# If this works but external ping fails, check physical firewall/NAT

Scenario: Cannot allocate more floating IPs
# 1. Check subnet allocation pool
openstack subnet show public-provider-subnet \
-c allocation_pools -c gateway_ip -f yaml
# 2. Count used vs available IPs
# Used: openstack floating ip list | grep ACTIVE | wc -l
# Available: Calculate from allocation_pools minus used
# 3. Solutions:
# a) Expand allocation pool (requires subnet update)
openstack subnet set public-provider-subnet \
--allocation-pool start=192.168.10.100,end=192.168.10.250
# b) Use a different provider network/VLAN
# c) Implement load balancer (Octavia) to reduce floating IP needs

Floating IP security best practices:
# Always pair floating IPs with strict security groups
# Example: Web server floating IP should only allow 80/443 from 0.0.0.0/0
# and SSH only from management network
# Monitor floating IP access logs
# On instance: Check /var/log/nginx/access.log or auth.log
# At network level: Enable Neutron logging (if configured)
# Rotate floating IPs for sensitive services (advanced)
# 1. Allocate new floating IP
# 2. Associate to instance
# 3. Update DNS
# 4. Wait for TTL expiry
# 5. Disassociate old floating IP

🛠️ Field Tip: Floating IPs are expensive (public IP space) and risky (direct internet exposure). I use a "floating IP request form" process: Team submits JIRA ticket with justification, security review, and expected lifespan. This reduced unused floating IPs by 70%. Also, always pair floating IPs with a load balancer for production services—never point a floating IP directly to a single VM. If that VM dies, your service dies. Load balancers provide health checks and failover.
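For the pool-exhaustion math in the troubleshooting scenario above, a small helper keeps it honest (a sketch; feed it the allocation pool bounds from the subnet and the current used count):

```sh
# Percentage of a floating IP pool in use, for the ">90% used" alert.
ip_to_int() {
  IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
  echo $(( o1*16777216 + o2*65536 + o3*256 + o4 ))
}
pool_used_pct() {
  start="$1" end="$2" used="$3"
  total=$(( $(ip_to_int "$end") - $(ip_to_int "$start") + 1 ))
  echo $(( used * 100 / total ))
}
pool_used_pct 192.168.10.100 192.168.10.250 136  # 151-address pool
```

Wire the result into whatever alerting you already run; integer percent output keeps threshold checks trivial.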
Your database needs persistent storage that survives VM reboots. Or your app needs to share data between multiple VMs. Cinder volumes are your network-attached disks. They're like USB drives you can plug into any VM in your cloud.
- Cinder service deployed and running (via Kolla-Ansible)
- Storage backend configured (LVM, Ceph, NFS, etc.)
- Sufficient storage quota in project
- Understanding of volume types and QoS
Pre-configuration research:
# 1. Check Cinder documentation for your version
# URL: https://docs.openstack.org/cinder/<version>/admin/
# 2. Key sections:
# - "Storage backends" (know your backend type)
# - "Volume types and extra specs"
# - "QoS specifications"
# 3. Verify Cinder services status
openstack volume service list
# 4. Check available volume types
openstack volume type list

Step 1: Create basic volume
# Create 50GB volume from default backend
openstack volume create \
--size 50 \
--description "Database data volume" \
db-data-vol-01
# Wait for available status
# Poll until 'available', aborting on 'error' so the loop cannot spin forever
while true; do
  status=$(openstack volume show db-data-vol-01 -c status -f value)
  [ "$status" = "available" ] && break
  [ "$status" = "error" ] && { echo "Volume creation failed"; break; }
  echo "Waiting for volume to be available..."
  sleep 5
done

Step 2: Create volume with specific type and QoS
# First, check available volume types
openstack volume type list
# Create volume with SSD backend (if configured)
openstack volume create \
--size 100 \
--volume-type ssd \
--description "High-performance app volume" \
app-fast-vol-01
# Add QoS specs (IOPS limits)
# Create QoS spec first (admin only)
openstack volume qos create \
--property read_iops_sec=5000 \
--property write_iops_sec=5000 \
high-iops-qos
# Associate QoS with volume type
openstack volume type set \
--property qos_supported=True \
ssd
openstack volume type associate qos high-iops-qos ssd
# Now volumes created with 'ssd' type get QoS limits

Step 3: Attach volume to instance
# Attach to running instance
openstack server add volume web-server-01 db-data-vol-01
# Verify attachment
openstack volume show db-data-vol-01 -c attachments -f yaml
# Inside instance, find and mount the new disk
# SSH to instance:
ssh ubuntu@<instance-ip>
# List new disks:
lsblk
# Should show new disk (e.g., /dev/vdb)
# Format and mount (example for ext4)
sudo mkfs.ext4 /dev/vdb
sudo mkdir /data
sudo mount /dev/vdb /data
# Make mount persistent (use the filesystem UUID; /dev/vdb can change across reboots):
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

Step-by-step via Horizon:
1. Navigate: Project → Volumes → Volumes
2. Click "Create Volume"
3. Volume Details:
- Volume Name: db-data-vol-01
- Description: Database data volume
- Size: 50 GB
- Volume Type: (select if multiple types available)
- Availability Zone: (usually auto)
4. Click "Create Volume"
5. After volume shows "Available", click "Edit Attachments"
6. Attach to Instance:
- Select instance: web-server-01
- Mountpoint: (leave blank for auto)
- Click "Attach Volume"
7. Verify: Volume status changes to "In-use"
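Whether you create volumes in Horizon or the CLI, scripts end up polling for status. A generic helper with a bounded number of attempts (a sketch; the status query is passed as arguments so any `openstack ... -c status -f value` call fits):

```sh
# Poll a status-query command until it prints the wanted value,
# giving up after a fixed number of attempts.
wait_for_status() {
  want="$1" tries="$2"; shift 2
  status=""
  while [ "$tries" -gt 0 ]; do
    status=$("$@")
    if [ "$status" = "$want" ]; then
      echo "$status"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "timeout (last status: $status)"
  return 1
}
# Example (real usage): wait_for_status in-use 60 \
#   openstack volume show db-data-vol-01 -c status -f value
```

The bounded retry count means automation fails loudly instead of hanging forever on a stuck volume.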
Verify volume creation and attachment:
# List volumes with details
openstack volume list --long
# Show specific volume details
openstack volume show db-data-vol-01 -f yaml
# Key fields to verify:
# - status: in-use (if attached) or available
# - size: 50
# - attachments: shows instance ID and device path
# - volume_type: default or ssd
# Verify from instance side
ssh ubuntu@<instance-ip> "lsblk | grep vdb"
ssh ubuntu@<instance-ip> "df -h /data" # If mounted

Test volume persistence:
# 1. Write test data to volume
ssh ubuntu@<instance-ip> "echo 'test-data-$(date)' | sudo tee /data/test.txt"
# 2. Detach volume
openstack server remove volume web-server-01 db-data-vol-01
# 3. Wait for volume to become "available"
openstack volume show db-data-vol-01 -c status -f value
# 4. Attach to different instance
openstack server add volume backup-server-01 db-data-vol-01
# 5. Verify data persists
ssh ubuntu@<backup-ip> "cat /data/test.txt" # Should show same data

Create volume from snapshot (backup/clone):
# Create snapshot of existing volume
openstack volume snapshot create \
--volume db-data-vol-01 \
--description "Pre-upgrade backup" \
db-data-snap-20240325
# Wait for snapshot to complete
openstack volume snapshot show db-data-snap-20240325 -c status -f value
# Create new volume from snapshot
openstack volume create \
--snapshot db-data-snap-20240325 \
--size 50 \
db-data-vol-02
# Attach to instance for recovery/testing
openstack server add volume test-server-01 db-data-vol-02

Resize volume (expand storage):
# Extend volume size (online resize supported in most backends)
openstack volume set --size 100 db-data-vol-01
# Inside instance, resize filesystem
# For ext4:
ssh ubuntu@<instance-ip> "sudo resize2fs /dev/vdb"
# For xfs:
ssh ubuntu@<instance-ip> "sudo xfs_growfs /data"
# Verify new size
ssh ubuntu@<instance-ip> "df -h /data"

Manage volume backups (to object storage):
# Create backup of volume (requires Swift/Ceph RGW backend)
openstack volume backup create \
--name db-data-backup-20240325 \
--description "Weekly backup" \
db-data-vol-01
# List backups
openstack volume backup list
# Restore backup to new volume
openstack volume backup restore \
db-data-backup-20240325 \
db-data-vol-restored

Monitor volume usage and performance:
# Check volume I/O stats (if monitoring stack enabled)
# Prometheus query example:
# cinder_volume_stats{volume_name="db-data-vol-01"}
# Check backend storage capacity (Ceph example)
ceph df | grep volumes
# Alert on volume usage >80% (external monitoring integration)

Troubleshooting volume issues:
Scenario: Volume stuck in "creating" or "attaching" state
# 1. Check volume status details
openstack volume show db-data-vol-01 -c status -f value
# 2. Check Cinder service logs
# On storage node:
docker exec -it cinder_volume tail -100 /var/log/kolla/cinder/volume.log
# 3. Common causes:
# - Backend storage full: Check Ceph/LVM capacity
# - Network issue to backend: Verify connectivity
# - Driver error: Check backend-specific logs
# 4. Force delete stuck volume (last resort)
openstack volume set --state error db-data-vol-01
openstack volume delete db-data-vol-01

Scenario: Volume attached but not visible in instance
# 1. Verify attachment at OpenStack level
openstack volume show db-data-vol-01 -c attachments -f yaml
# 2. Check instance OS for new disk
ssh ubuntu@<instance-ip> "dmesg | tail -20" # Look for new disk detection
ssh ubuntu@<instance-ip> "lsblk" # List all block devices
# 3. If using virtio, ensure image has hw_disk_bus=virtio property
# 4. Check Nova compute logs on target host
# Find instance UUID, then:
grep <instance-uuid> /var/log/kolla/nova/nova-compute.log
Volume lifecycle best practices:
# Tag volumes for automation
openstack volume set \
--property backup-enabled=true \
--property retention-days=90 \
db-data-vol-01
# Automated cleanup concept (manual execution):
# For each volume with retention-days:
# if age > retention-days and not attached:
# notify team, then delete
# Document volume purpose in description
openstack volume set \
--description "PostgreSQL data - cluster-prod-01 - JIRA-9876" \
db-data-vol-01
🛠️ Field Tip: Volumes are where data lives—treat them with respect. I follow the "3-2-1 rule" for critical volumes: 3 copies, 2 different media, 1 offsite. In OpenStack terms: primary volume, snapshot backup, and Cinder backup to object storage. Also, never delete a volume without checking attachments first. I once almost deleted a database volume because I forgot it was attached to a stopped instance. Always run openstack volume show <vol> -c attachments before deleting.
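The automated cleanup concept above (notify, then delete once age exceeds retention-days) comes down to date math. A runnable sketch of just the age decision, assuming GNU date; in production the two inputs would come from `openstack volume show -f value` instead of being hardcoded:

```shell
#!/usr/bin/env bash
# Decide whether a volume has outlived its retention-days property (sketch).
# created_at and retention_days are hardcoded here for illustration;
# production code would read them from `openstack volume show -f value`.
check_retention() {
  local created_at="$1" retention_days="$2"
  local created_epoch now_epoch age_days
  created_epoch=$(date -d "$created_at" +%s)   # GNU date assumed
  now_epoch=$(date +%s)
  age_days=$(( (now_epoch - created_epoch) / 86400 ))
  if [ "$age_days" -gt "$retention_days" ]; then
    echo "EXPIRED: ${age_days}d old, retention ${retention_days}d"
  else
    echo "KEEP: ${age_days}d old, retention ${retention_days}d"
  fi
}

check_retention "2020-01-01T00:00:00" 90   # long past retention -> EXPIRED
```

An EXPIRED result should trigger the notify step, never an immediate delete.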
It's 2 AM. Alert: "Instance web-server-01 not responding." You wake up, SSH to controller, and start diagnosing. This section is your 2 AM playbook: common issues, quick checks, and recovery steps. No theory—just commands that work.
- Admin access to OpenStack CLI
- SSH access to controller/compute nodes
- Basic understanding of OpenStack architecture
- Monitoring/alerting system in place (optional but recommended)
Pre-troubleshooting research:
# 1. Bookmark key troubleshooting docs
# Nova: https://docs.openstack.org/nova/<version>/admin/troubleshooting.html
# Neutron: https://docs.openstack.org/neutron/<version>/admin/troubleshooting.html
# Cinder: https://docs.openstack.org/cinder/<version>/admin/troubleshooting.html
# 2. Know your logging locations (Kolla-Ansible)
# All logs in containers: /var/log/kolla/<service>/
# Access via: docker exec -it <container> tail -f <logfile>
# 3. Prepare quick diagnostic commands (save in ~/openstack-debug.sh)
Scenario 1: Instance not booting
# Step 1: Check instance status
openstack server show web-server-01 -c status,fault -f yaml
# If status: ERROR, check fault message
# Common faults:
# "No valid host was found" -> Resource shortage
# "Image not found" -> Image deleted or permission issue
# "Network error" -> Neutron issue
# Step 2: Check console log for boot errors
openstack console log show web-server-01 --lines 100
# Look for:
# "Kernel panic" -> Image/kernel issue
# "VFS: Unable to mount root fs" -> Disk/image corruption
# "cloud-init: failed" -> User-data syntax error
# Step 3: Verify image and flavor
openstack server show web-server-01 -c image,flavor -f yaml
openstack image show <image-id> -c status -f value # Should be active
openstack flavor show <flavor-id> -c vcpus,ram -f yaml
# Step 4: Check compute node resources
openstack hypervisor show <compute-node> -f yaml
# Look for: free_ram_mb, free_disk_gb, running_vms
# Step 5: Check Nova compute logs
# SSH to compute node:
docker exec -it nova_compute tail -100 /var/log/kolla/nova/nova-compute.log | grep <instance-uuid>
Scenario 2: No network connectivity to instance
# Step 1: Verify instance has IP address
openstack server show web-server-01 -c addresses -f yaml
# Step 2: Check port status
openstack port list --server web-server-01 -f table
# Status should be: ACTIVE
# Step 3: Verify security group rules
openstack security group rule list --project $(openstack server show web-server-01 -c project_id -f value) -f table | grep -E "22|80|443"
# Step 4: Test from within OpenStack network
# Launch a test instance on same network, try to ping
openstack server create --name net-test --image cirros --flavor m1.tiny --network private-network
# Wait for active, then:
openstack console log show net-test --lines 20 # Get test instance IP
# SSH to test instance, ping target instance private IP
# Step 5: Check Neutron agents
openstack network agent list --host <compute-node> -f table
# All should show: Alive = :-)
# Step 6: Check OVS/Linux bridge on compute node
# For OVS:
docker exec -it neutron_openvswitch_agent ovs-vsctl show
# Look for: bridge br-int, ports for instance
# For Linux Bridge:
docker exec -it neutron_linuxbridge_agent brctl show
Scenario 3: Volume not attaching to instance
# Step 1: Check volume status
openstack volume show db-data-vol-01 -c status,attachments -f yaml
# Should be: status=available (before attach) or in-use (after)
# Step 2: Check Cinder services
openstack volume service list | grep -E "cinder-volume|cinder-scheduler"
# Step 3: Verify backend storage
# For Ceph backend:
ceph -s # Should show HEALTH_OK
ceph df # Check pool capacity
# For LVM backend:
# On storage node:
vgs # Check volume group free space
lvs # Check logical volumes
# Step 4: Check Cinder logs
docker exec -it cinder_volume tail -100 /var/log/kolla/cinder/volume.log | grep <volume-id>
# Step 5: Verify Nova-Cinder integration
# On compute node:
grep cinder /var/log/kolla/nova/nova-compute.log | tail -20
Scenario 4: OpenStack service down
# Step 1: Identify which service is down
openstack compute service list | grep DOWN
openstack network agent list | grep DOWN
openstack volume service list | grep DOWN
# Step 2: Check Kolla container status
# On affected node:
docker ps | grep <service-name>
# If not running:
docker ps -a | grep <service-name> # Check if exited
# Step 3: Check container logs
docker logs --tail 50 <container-name-or-id>
# Step 4: Common fixes:
# - Restart container:
docker restart <container-name>
# - Reconfigure service (if config changed):
kolla-ansible reconfigure -t <service-name>
# - Check disk space:
df -h /var/lib/docker # Or /var/lib/kolla
# Step 5: If persistent, check host resources
top # CPU/memory pressure
dmesg -T | tail -20 # Kernel errors
Morning routine (5-minute check):
#!/bin/bash
# Save as ~/openstack-morning-check.sh (run manually)
source /etc/kolla/admin-openrc.sh
echo "=== OpenStack Morning Check $(date) ==="
echo -e "\n[1] Service Status"
openstack compute service list | grep -i down || echo "  All compute services up"
openstack network agent list | grep XXX || echo "  All network agents alive"
openstack volume service list | grep -i down || echo "  All volume services up"
echo -e "\n[2] Resource Usage"
openstack hypervisor stats show | grep -E "vcpus|memory_mb|disk_gb"
openstack volume stats show # If supported
echo -e "\n[3] Recent Activity"
echo "Instances created last 24h:"
# Note: filtering by created_at needs -f json plus jq; simplified here
openstack server list --all-projects --limit 5 -c Name -c Status
echo -e "\nVolumes created last 24h:"
openstack volume list --all-projects --limit 5 -c Name -c Status
echo -e "\n[4] Error Alerts (check logs)"
# Quick log scan for errors (last 100 lines)
# Container names differ per service (nova_api, neutron_server, cinder_api)
for pair in nova:nova_api neutron:neutron_server cinder:cinder_api; do
  service=${pair%%:*}; container=${pair##*:}
  echo "Checking $service logs..."
  docker exec "$container" sh -c "grep -ih error /var/log/kolla/$service/*.log | tail -5" 2>/dev/null
done
echo -e "\n=== Check Complete ==="
Weekly maintenance tasks:
# 1. Clean up old snapshots
openstack volume snapshot list --all-projects -f value -c ID -c Name -c Status | \
while read -r id name status; do
  if [ "$status" = "available" ]; then
    # Check age (example: delete if >30 days old)
    # This requires parsing created_at - use an external script for production
    echo "Review snapshot: $name ($id)"
  fi
done
# 2. Audit security groups
openstack security group rule list --all-projects -f csv | grep "0.0.0.0/0" | grep -E "22|3306|5432"
# 3. Check for orphaned resources
# Ports without instances:
openstack port list --status DOWN --device-owner "" -c Name
# Volumes not attached:
openstack volume list --status available -c Name,Size
# 4. Backup critical configs
tar -czf openstack-config-backup-$(date +%Y%m%d).tar.gz \
/etc/kolla/ \
~/kolla-ansible/inventory/
Log management strategy:
# Kolla-Ansible uses rsyslog or fluentd for log aggregation
# Check log rotation config:
cat /etc/kolla/config/rsyslog.conf # If used
# Manual log cleanup (if needed):
# On each node, rotate large logs:
for log in /var/log/kolla/*/*.log; do
if [ $(stat -f%z "$log" 2>/dev/null || stat -c%s "$log") -gt 1073741824 ]; then
echo "Rotating large log: $log"
# Use logrotate or manual rotation
fi
done
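The size check in the loop above can be exercised safely against a temp file instead of real Kolla logs; here the threshold is shrunk to 1 KiB for the demo:

```shell
#!/usr/bin/env bash
# Demo of the log size threshold check, using a temp file.
THRESHOLD=1024                      # 1 KiB for the demo; the guide uses 1 GiB
tmp=$(mktemp)
head -c 2048 /dev/zero > "$tmp"     # 2 KiB dummy "log"
size=$(stat -c%s "$tmp" 2>/dev/null || stat -f%z "$tmp")   # GNU stat, then BSD
if [ "$size" -gt "$THRESHOLD" ]; then
  echo "Rotating large log: $tmp ($size bytes)"
fi
rm -f "$tmp"
```

The `stat` fallback keeps the check portable between GNU and BSD userlands.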
# Centralized logging recommendation:
# Deploy ELK stack or Loki for log aggregation
# Query example: "service:nova AND level:ERROR"
Performance monitoring essentials:
# Quick resource checks:
# CPU/Memory on controller:
docker stats --no-stream # Shows container resource usage
# Disk I/O on storage nodes:
iostat -x 2 5 # Run on storage node
# Network throughput:
iftop -i <external-interface> # Run on network node
# Integrate with Prometheus/Grafana for dashboards:
# Key metrics to monitor:
# - nova.compute.instance.count
# - neutron.agent.state
# - cinder.volume.status
# - hypervisor.cpu.usage
Emergency recovery procedures:
# Scenario: Controller node down
# 1. Verify HA setup (if deployed)
# 2. Promote standby controller (if applicable)
# 3. Restore from backup if needed:
# - Restore /etc/kolla from backup
# - Re-run kolla-ansible deploy with --limit controller
# Scenario: Database corruption (MariaDB Galera)
# 1. Check cluster status:
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep%';"
# 2. If one node out of sync:
# - Stop MariaDB on problematic node
# - Clear data directory (backup first!)
# - Start MariaDB to rejoin cluster
# 3. If all nodes down: Restore from backup
# Always have a runbook for critical scenarios
# Store in: /opt/openstack-runbooks/
🛠️ Field Tip: I keep a "2 AM cheat sheet" on my phone: 10 commands for top 10 issues. Example: "Instance not booting? 1) openstack console log show, 2) check image status, 3) verify flavor resources." Also, I schedule a monthly "chaos test": randomly stop a non-critical service and practice recovery. Muscle memory beats panic. Document every incident—your future self will copy-paste your solution.
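A cheat sheet like that can also live on the controller as a tiny dispatcher script. A sketch (the symptom keywords and command lists are illustrative, not exhaustive); it only prints commands, never runs them, so it is safe to keep anywhere:

```shell
#!/usr/bin/env bash
# ~/openstack-debug.sh (sketch): map a symptom keyword to the first
# commands to run. Prints the commands instead of executing them.
osdebug() {
  case "${1:-help}" in
    boot)
      printf '%s\n' \
        "openstack server show <name> -c status -c fault -f yaml" \
        "openstack console log show <name> --lines 100" ;;
    network)
      printf '%s\n' \
        "openstack port list --server <name>" \
        "openstack network agent list" ;;
    volume)
      printf '%s\n' \
        "openstack volume show <name> -c status -c attachments -f yaml" \
        "openstack volume service list" ;;
    *)
      echo "usage: osdebug {boot|network|volume}" ;;
  esac
}

osdebug boot
```

At 2 AM, `osdebug boot` is easier to remember than ten flags.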
Backups aren't sexy until you need them. This section covers practical backup strategies for OpenStack: what to backup, how often, and how to test restores. Plus routine maintenance to keep your cloud healthy.
- Backup storage (NFS, Ceph, object storage)
- Cron or systemd timer access for scheduling
- Understanding of RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
- Documentation of critical vs non-critical components
Pre-backup research:
# 1. Review OpenStack backup recommendations
# URL: https://docs.openstack.org/operations-guide/backup-restore.html
# 2. Key components to backup:
# - MariaDB/PostgreSQL databases (critical)
# - /etc/kolla configuration files (critical)
# - Cinder volumes (via snapshot/backup)
# - Glance images (if not in Ceph)
# - RabbitMQ queues (if not ephemeral)
# 3. Understand your storage backend:
# - Ceph: Use ceph backup commands
# - LVM: Use lvcreate snapshot
# - File: Use rsync/tar
# 4. Test restore procedure in non-production first
Step 1: Backup MariaDB database (critical)
# On controller node, backup all OpenStack databases
# Create backup directory
mkdir -p /backup/openstack/mysql/$(date +%Y%m%d)
# Dump all databases (requires MySQL root password)
mysqldump -u root -p --all-databases --single-transaction --quick \
> /backup/openstack/mysql/$(date +%Y%m%d)/all-databases.sql
# Compress backup
gzip /backup/openstack/mysql/$(date +%Y%m%d)/all-databases.sql
# Verify backup size and integrity
ls -lh /backup/openstack/mysql/$(date +%Y%m%d)/
gunzip -t /backup/openstack/mysql/$(date +%Y%m%d)/all-databases.sql.gz
# Copy to offsite storage (example: rsync to backup server)
rsync -avz /backup/openstack/mysql/$(date +%Y%m%d)/ \
backup-user@backup-server:/offsite-backups/openstack/mysql/
Step 2: Backup Kolla configuration files
# Backup critical config directories
tar -czf /backup/openstack/kolla-config-$(date +%Y%m%d).tar.gz \
/etc/kolla/ \
/etc/ansible/ \
~/kolla-ansible/inventory/
# Verify archive
tar -tzf /backup/openstack/kolla-config-$(date +%Y%m%d).tar.gz | head -20
# Copy to offsite
rsync -avz /backup/openstack/kolla-config-*.tar.gz \
backup-user@backup-server:/offsite-backups/openstack/configs/
Step 3: Backup Cinder volumes (via Cinder backup)
# Backup critical volumes to object storage (Swift/Ceph RGW)
# List volumes to backup
openstack volume list --project demo -c Name -c ID -c Status | grep in-use
# Backup each critical volume
for vol_id in $(openstack volume list --project demo -f value -c ID); do
vol_name=$(openstack volume show $vol_id -c name -f value)
if [[ "$vol_name" == *"prod"* ]]; then # Only backup production volumes
openstack volume backup create \
--name "${vol_name}-backup-$(date +%Y%m%d)" \
--description "Daily backup" \
$vol_id
fi
done
# Verify backups created
openstack volume backup list --project demo | grep $(date +%Y%m%d)
Step 4: Backup Glance images (if not using Ceph backend)
# If using file backend, backup image files
# Find Glance data directory (check /etc/kolla/glance-api/glance-api.conf)
# Example: /var/lib/kolla/glance/images/
tar -czf /backup/openstack/glance-images-$(date +%Y%m%d).tar.gz \
/var/lib/kolla/glance/images/
# For Ceph backend, no need to backup images separately
# Ceph replication handles durability
Schedule backups via Horizon (if backup plugin enabled):
Note: Horizon backup features are limited. CLI/automation is preferred.
However, for manual volume backups:
1. Navigate: Project → Volumes → Volumes
2. Click "Create Snapshot" for critical volumes
3. Or: Project → Volumes → Backups → "Create Backup"
4. Fill details:
- Name: db-data-backup-20240325
- Description: Daily backup
- Container: (if using Swift)
5. Click "Create Backup"
6. Verify in Backups list
Verify backup integrity:
# Test MySQL backup restore (in an isolated environment)
# NEVER test on the production database: an --all-databases dump
# recreates every schema it contains when replayed
# Example: restore into a throwaway MariaDB container
docker run -d --name mysql-test-restore -e MYSQL_ROOT_PASSWORD=test mariadb
# (wait a few seconds for MariaDB to initialize, then:)
gunzip -c /backup/openstack/mysql/20240325/all-databases.sql.gz | \
docker exec -i mysql-test-restore mysql -u root -ptest
# Spot-check a table, then clean up:
docker rm -f mysql-test-restore
# Verify Kolla config backup
tar -tzf /backup/openstack/kolla-config-20240325.tar.gz | grep -E "globals.yml|passwords.yml"
# Verify Cinder backup can be restored
# Create test volume from backup:
openstack volume backup restore \
db-data-backup-20240325 \
test-restore-volume
# Verify test-restore-volume appears and can be attached
Monitor backup success:
# Check recent backup jobs
openstack volume backup list --sort created_at:desc --limit 10
# Check backup storage usage
# For Ceph backend:
ceph df | grep backups
# For file backend:
du -sh /backup/openstack/*
# Alert on backup failures (integrate with monitoring)
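One way to catch silent backup failures is a file-age check over the backup directory, using `find -mmin`. A sketch; the demo uses a temp dir rather than the real /backup path:

```shell
#!/usr/bin/env bash
# Alert if no backup file newer than 24h exists (sketch, temp-dir demo).
backup_dir=$(mktemp -d)
touch "$backup_dir/20240325-all-databases.sql.gz"   # pretend today's backup
recent=$(find "$backup_dir" -name '*.sql.gz' -mmin -1440 | wc -l)
if [ "$recent" -eq 0 ]; then
  echo "ALERT: no backup created in the last 24h"
else
  echo "OK: $recent recent backup(s)"
fi
rm -rf "$backup_dir"
```

Point `backup_dir` at the real backup path and wire the ALERT branch into mail or your monitoring webhook.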
# Example: If no backup created in last 24h, send alert
Daily backup routine:
#!/bin/bash
# Save as /usr/local/bin/openstack-daily-backup.sh
BACKUP_DATE=$(date +%Y%m%d)
BACKUP_DIR="/backup/openstack"
RETENTION_DAYS=7
# 1. MySQL backup (supply credentials via ~/.my.cnf or
#    --defaults-extra-file; an interactive -p prompt hangs under cron)
mysqldump -u root --all-databases --single-transaction --quick | \
gzip > $BACKUP_DIR/mysql/${BACKUP_DATE}-all-databases.sql.gz
# 2. Kolla config backup
tar -czf $BACKUP_DIR/kolla-config-${BACKUP_DATE}.tar.gz \
/etc/kolla/ /etc/ansible/ ~/kolla-ansible/inventory/
# 3. Cinder backup for production volumes
openstack volume list -f value -c ID | while read vol_id; do
vol_name=$(openstack volume show $vol_id -c name -f value)
if [[ "$vol_name" == *"prod"* ]]; then
openstack volume backup create \
--name "${vol_name}-backup-${BACKUP_DATE}" \
$vol_id
fi
done
# 4. Cleanup old backups
find $BACKUP_DIR/mysql -name "*.gz" -mtime +$RETENTION_DAYS -delete
find $BACKUP_DIR -name "kolla-config-*.tar.gz" -mtime +$RETENTION_DAYS -delete
openstack volume backup list -f value -c ID | while read backup_id; do
backup_name=$(openstack volume backup show $backup_id -c name -f value)
if [[ "$backup_name" == *"backup"* ]] && [[ ! "$backup_name" =~ $BACKUP_DATE ]]; then
# Delete backups older than retention (add date parsing for production)
echo "Review old backup: $backup_name"
fi
done
# 5. Log completion
echo "$(date): Daily backup completed" >> /var/log/openstack-backup.log
Weekly maintenance tasks:
# 1. Test restore procedure (monthly recommended)
# - Restore MySQL backup to test instance
# - Verify OpenStack services start with restored config
# 2. Update OpenStack packages (if not using containers)
# For Kolla-Ansible, upgrade via:
# kolla-ansible upgrade --limit controller
# 3. Review and rotate logs
# Check log rotation config in /etc/logrotate.d/kolla
# 4. Audit user access
openstack user list --project admin -f csv
openstack role assignment list --project admin -f csv
# 5. Check for security updates
# For underlying OS:
apt list --upgradable # Ubuntu/Debian
# Or:
yum check-update # CentOS/RHEL
Backup storage management:
# Monitor backup storage usage
df -h /backup # For file-based backups
ceph df # For Ceph-based backups
# Set up alerts for storage >80% full
# Example cron job to check and alert:
#!/bin/bash
USAGE=$(df /backup | tail -1 | awk '{print $5}' | sed 's/%//')
if [ $USAGE -gt 80 ]; then
echo "Backup storage at ${USAGE}% - cleanup needed" | \
mail -s "ALERT: Backup Storage" admin@example.com
fi
# Implement backup rotation policy
# Example: Keep 7 daily, 4 weekly, 12 monthly
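The 7/4/12 policy can be decided from the date stamp in the backup name (e.g. kolla-config-YYYYMMDD.tar.gz). A sketch assuming GNU date: keep month-start backups as monthly, Sunday backups as weekly, the rest as daily:

```shell
#!/usr/bin/env bash
# Classify a YYYYMMDD backup date into a rotation bucket (sketch, GNU date).
classify_backup() {
  local stamp="$1" dow day
  dow=$(date -d "$stamp" +%u)   # 1=Mon .. 7=Sun
  day=$(date -d "$stamp" +%d)
  if [ "$day" = "01" ]; then
    echo monthly                # keep 12
  elif [ "$dow" = "7" ]; then
    echo weekly                 # keep 4
  else
    echo daily                  # keep 7
  fi
}

classify_backup 20240301   # first of month -> monthly
classify_backup 20240324   # a Sunday      -> weekly
classify_backup 20240325   # a Monday      -> daily
```

The bucket name maps directly onto a subdirectory, so cleanup becomes a per-bucket `find -mtime` with a per-bucket retention.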
# Use separate directories or tags for rotation
Disaster recovery testing:
# Quarterly DR test plan:
# 1. Document current state:
openstack server list -f csv > dr-test-start-state.csv
openstack volume list -f csv >> dr-test-start-state.csv
# 2. Simulate failure (non-production):
# - Stop critical service container
# - Or: Delete test instance and restore from backup
# 3. Execute recovery:
# - Restore MySQL from backup
# - Re-deploy services with kolla-ansible
# - Restore instances from backups
# 4. Verify functionality:
# - Launch test instance
# - Verify network connectivity
# - Check data integrity on restored volumes
# 5. Document lessons learned
# Update runbooks with any gaps found
Maintenance window procedures:
# Pre-maintenance checklist:
# [ ] Notify stakeholders of maintenance window
# [ ] Backup critical data (MySQL, configs)
# [ ] Document current service states
openstack compute service list > pre-maintenance-services.txt
openstack server list --all-projects > pre-maintenance-instances.txt
# During maintenance:
# [ ] Apply updates/upgrades in stages:
# 1. Controller nodes (one at a time if HA)
# 2. Network nodes
# 3. Compute nodes
# [ ] Verify each stage before proceeding
# [ ] Monitor logs for errors
# Post-maintenance:
# [ ] Verify all services active:
openstack compute service list | grep -i down # Should return nothing
openstack network agent list | grep -v ":-)"
# [ ] Test critical workflows:
# - Launch test instance
# - Create test volume
# - Assign floating IP
# [ ] Update documentation with changes made🛠️ Field Tip: I follow the "backup golden rule": A backup isn't real until you've restored from it. Every quarter, I pick one non-critical volume, delete it, and restore from backup. This caught a backup corruption issue before it mattered. Also, automate backup verification: after each backup, run a checksum and compare. And never store backups on the same physical hardware as production. I learned that the hard way when a RAID controller failed and took both production and local backups. Offsite or cloud storage for backups is non-negotiable.
You now have a practical, hands-on guide for post-deployment OpenStack operations. This isn't the end—it's your foundation.
Next steps to level up:
- Automate: Turn these manual commands into Ansible playbooks
- Monitor: Integrate Prometheus/Grafana for proactive alerting
- Scale: Learn about cell v2 architecture for large deployments
- Secure: Implement Barbican for secrets management
- Optimize: Explore placement API for intelligent scheduling
Remember the field tips:
- Metadata is your friend—tag everything
- Test restores, not just backups
- Document every incident—your future self will thank you
- Start small, validate each step, then scale
💬 One last story: When I first deployed OpenStack, I spent days troubleshooting a "simple" network issue. Turned out, a VLAN tag was off by one on the physical switch. Now, I keep a "physical network checklist" next to my OpenStack docs. Cloud is software-defined, but it still runs on metal. Respect both layers.
You've got this. Every expert was once a beginner who kept going.
Happy cloud building,
Sumon
DevOps & Cloud Infrastructure Engineer
Document Version: 2.0
Last Updated: March 2026
Feedback welcome: This guide evolves with real-world use. Share your improvements.