kdevops allows you to provide variability for Terraform using kconfig, and uses Ansible to deploy the files needed to get you going with a Terraform plan.
Terraform is used to deploy your development hosts on cloud virtual machines. Below are the list of clouds providers currently supported:
Traditional Cloud Providers:
- azure - Microsoft Azure
- aws - Amazon Web Service
- gce - Google Cloud Compute
- oci - Oracle Cloud Infrastructure
Neoclouds (GPU-optimized):
- datacrunch - DataCrunch GPU Cloud
- lambdalabs - Lambda Labs GPU Cloud
You configure which cloud provider you want to use, what feature from that cloud provider you want to use, and then you can use kdevops to select which workflows you want to enable on that configuration.
To configure which cloud provider you will use you will use the same mechanism to configure anything in kdevops:
make menuconfigUnder "Bring up methods" you will see the option for "Node bring up method (Vagrant for local virtualization (KVM / VirtualBox))". Click on that and then change the option to "Terraform for cloud environments". That should let you start configuring your cloud provider options. You can use the same main menu to configure specific workflows supported by kdevops, by defaults no workflows are enabled, and so all you get is the bringup.
To install the dependencies of everything which you just enabled just run:
makeYou can just use:
make bringupOr if you want to do this manually:
make deps
cd terraform/you_provider
make deps
# Make sure you then add the variables to let you log in to your cloud provider
terraform init
terraform plan
terraform applyWe provide support for updating your configured SSH configuration file
(typically ~/.ssh/config) automatically for you, however each cloud provider
requires support to be added in order for this to work. At the time of this
writing we support this for all cloud providers we support.
After make bringup you should have had your SSH configuration file updated
automatically with the provisioned hosts. The Terraform module
add-host-ssh-config is used to do the work of updating your SSH configuration,
a module is used to share the code with provisioning with vagrant.
The Terraform module on the registry:
The Terraform source code:
Because the same code is shared between the vagrant Ansible role and the Terraform module, a git subtree is used to maintain the shared code. The Terraform code downloads the module on its own, while the code for the Vagrant Ansible role has the code present on the kdevops tree as part of its local directories in under:
playbooks/roles/update_ssh_config_vagrant/update_ssh_config/
Patches for code for in update_ssh_config can go against
the playbooks/roles/update_ssh_config_vagrant/update_ssh_config/
directory, but should be made atomic so that these changes can
be pushed onto the standalone git tree for update_ssh_config on
a regular basis. For details on the development workflow for it,
read the documentation on:
Just run:
make destroyOr if you are doing things manually:
cd terraform/you_provider
terraform destroyWe run the devconfig Ansible role after we update your SSH configuration, as part of the bring up process. If can happen that this can fail due to connectivity issues. In such cases, you can run the Ansible role yourself manually:
ansible-playbook -l kdevops playbooks/devconfig.ymlNote that there a few configuration items you may have enabled, for things
which we are aware of that we need to pass in as extra arguments to
the roles we support we automatically build an extra_vars.yaml with all
known extra arguments. We do use this for one argument for the devconfig
role, and a series of these for the bootlinux role. The extra_args.yaml
file is read by all kdevops Ansible roles, it does this on each role with
a task, so that users do not have to specify the
--extra-args=@extra_args.yaml argument themselves. We however strive to
make inferences for sensible defaults for most things.
Before running Ansible make sure you can SSH into the hosts listed on ansible/hosts.
make unameThere is documentation about different workflows supported on the top level documentation.
To get set up with cloud providers with Terraform we provide some more references below which are specific to each cloud provider.
Read these pages:
https://www.terraform.io/docs/providers/azurerm/auth/service_principal_client_certificate.html https://github.com/terraform-providers/terraform-provider-azurerm.git https://docs.microsoft.com/en-us/azure/virtual-machines/linux/terraform-create-complete-vm https://wiki.debian.org/Cloud/MicrosoftAzure
But the skinny of it:
$ openssl req -newkey rsa:4096 -nodes -keyout "service-principal.key" -out "service-principal.csr"
$ openssl x509 -signkey "service-principal.key" -in "service-principal.csr" -req -days 365 -out "service-principal.crt"
$ openssl pkcs12 -export -out "service-principal.pfx" -inkey "service-principal.key" -in "service-principal.crt"
Use the documentation to get your subscription ID, tenant ID, and application id. Set these values via "make menuconfig" in the Terraform Providers submenu.
AWS is supported. For authentication, kdevops relies on a shared credentials file, separate from kdevops' .config:
~/.aws/credentials
This file is rather simple with a structure as follows:
[default]
aws_access_key_id = SOME_ACCESS_KEY
aws_secret_access_key = SECRET_KEY
This file may contain authentication secrets for more than one user. The TERRAFORM_AWS_PROFILE setting enables you to select which entry in this file kdevops will use to authenticate.
To read more about shared credentials refer to:
- https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
- https://docs.aws.amazon.com/powershell/latest/userguide/shared-credentials-in-aws-powershell.html
If you run kdevops on CodeBuild (or ECS), configure an IAM Task Role for the build container so that kdevops can create AWS resources. For more information, see:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html https://docs.aws.amazon.com/prescriptive-guidance/latest/terraform-aws-provider-best-practices/security.html
kdevops can provision Google Compute Engine resources. If you are new to GCE, review this article:
https://cloud.google.com/docs/overview
to understand the concepts and requirements for setting up and using GCE resources.
Once your project is created, the administrator must create a service account key and provide it to you in a file. The pathname to that file is specified in the "Identity & Access" submenu. See:
https://cloud.google.com/iam/docs/creating-managing-service-account-keys
Further, the service account must be granted access to allow it to create instances and other resources. See:
kdevops supports using the OCI Public Cloud (OCI).
You can find a generic tutorial guide at the following link. You'll need many (but not all) of these steps to bring up kdevops with OCI.
https://docs.oracle.com/en-us/iaas/Content/dev/terraform/tutorials/tf-provider.htm
This explains what an "OCID" is:
https://docs.oracle.com/en-us/iaas/Content/General/Concepts/identifiers.htm
To authenticate to the Oracle cloud, kdevops uses the API Key authentication method, described here:
https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm
To generate a signing key, follow these instructions:
By default, kdevops picks up the "DEFAULT" entry in the config file. You can add more entries and use Kconfig to have kdevops use one of the non-default entries in that file.
OCI pre-configures an admin ssh user on each instance. Under the CONFIG_TERRAFORM_SSH_CONFIG_USER option, you need to explicitly set kconfig's ssh login name depending on which OS image you have selected:
opcfor Oracle Linux.ubuntufor Ubuntu Linux
If your Ansible controller (where you run "make bringup") and your test instances operate inside the same subnet, you can disable the TERRAFORM_OCI_ASSIGN_PUBLIC_IP option for better network security.
A neocloud is a new type of specialized cloud provider that focuses on offering high-performance computing, particularly GPU-as-a-Service, to handle demanding AI and machine learning workloads. Unlike traditional, general-purpose cloud providers like AWS or Azure, neoclouds are purpose-built for AI with infrastructure optimized for raw speed, specialized hardware like dense GPU clusters, and tailored services like fast deployment and simplified pricing.
kdevops supports the following neocloud providers:
kdevops supports DataCrunch, a cloud provider specialized in GPU computing with competitive pricing for NVIDIA A100, H100, B200, and B300 instances.
DataCrunch requires API key authentication. Create your credentials file:
mkdir -p ~/.datacrunch
cat > ~/.datacrunch/credentials << EOF
[default]
datacrunch_client_id=your-client-id-here
datacrunch_client_secret=your-client-secret-here
EOF
chmod 600 ~/.datacrunch/credentialsGet your API credentials from: https://cloud.datacrunch.io/
kdevops provides several pre-configured defconfigs for DataCrunch:
Single GPU Instances:
make defconfig-datacrunch-a100 # Single A100 40GB SXM GPU
make defconfig-datacrunch-h100-pytorch # Single H100 80GB with PyTorch imageMulti-GPU Instances:
make defconfig-datacrunch-4x-h100-pytorch # 4x H100 80GB with PyTorch
make defconfig-datacrunch-4x-b200 # 4x B200 (Blackwell architecture)
make defconfig-datacrunch-4x-b300 # 4x B300 (Blackwell architecture)DataCrunch defconfigs can be combined with workflow CLI parameters. For example, to enable the knlp ML research workflow:
make defconfig-datacrunch-a100 KNLP=1
make bringupThis automatically configures a DataCrunch A100 instance with the knlp workflow enabled, setting up the ML research environment with kernel development methodologies.
DataCrunch offers various GPU instance types:
A100 Series (40GB SXM):
- 1A100.40S.22V - Single GPU, 22 vCPUs, 80GB RAM (~$1.39/hr)
- 2A100.40S.44V - Dual GPU, 44 vCPUs, 160GB RAM (~$2.78/hr)
- 4A100.40S.88V - Quad GPU, 88 vCPUs, 320GB RAM (~$5.56/hr)
- 8A100.40S.176V - 8x GPU, 176 vCPUs, 640GB RAM (~$11.12/hr)
H100 Series (80GB SXM):
- 1H100.80S.30V - Single GPU, 30 vCPUs (~$1.99/hr)
- 1H100.80S.32V - Single GPU, 32 vCPUs (~$1.99/hr)
- 2H100.80S.80V - Dual GPU, 80 vCPUs (~$3.98/hr)
- 4H100.80S.176V - Quad GPU, 176 vCPUs (~$7.96/hr)
- 8H100.80S.176V - 8x GPU, 176 vCPUs (~$15.92/hr)
Blackwell Series (Latest Architecture):
- 4B200.120V - 4x B200 GPUs, 120 vCPUs
- 4B300.120V - 4x B300 GPUs, 120 vCPUs
DataCrunch provides various OS images optimized for ML workloads:
- ubuntu-24.04-cuda-12.8-open-docker - Ubuntu 24.04 with CUDA 12.8 and Docker
- ubuntu-22.04-pytorch - Ubuntu 22.04 with PyTorch pre-installed
- ubuntu-22.04 - Ubuntu 22.04 base
- ubuntu-20.04 - Ubuntu 20.04 base
- debian-11 - Debian 11
- debian-12 - Debian 12
All images use root as the default SSH user.
kdevops automatically configures DataCrunch instances with:
- System updates (apt-get dist-upgrade)
- Development tools (git, make, flex, python3.12-venv, npm, etc.)
- Python virtual environment at
~/.venv - PyTorch installation
- NVIDIA kernel module reload
- Claude Code installation via npm
This happens automatically during make bringup via the datacrunch_ml_setup
Ansible role.
kdevops automatically checks instance availability before provisioning:
# Check capacity manually
./scripts/datacrunch_check_capacity.py
# Check specific instance type
./scripts/datacrunch_check_capacity.py --instance-type 1H100.80S.32VThe bringup process automatically selects an available location for your chosen instance type.
DataCrunch instances are automatically added to your SSH config with
checksum-based filenames (e.g., ~/.ssh/config_kdevops_2df337e6).
This allows multiple kdevops directories to coexist without conflicts.
The SSH config is automatically created during make bringup and removed
during make destroy.
# Configure for DataCrunch with knlp workflow
make defconfig-datacrunch-a100 KNLP=1
# Review configuration (optional)
make menuconfig
# Provision instance and setup environment
make bringup
# The instance is now ready with:
# - Python venv with PyTorch and knlp dependencies
# - Claude Code installed
# - NVIDIA drivers loaded
# - SSH configured
# SSH into instance
ssh demo-knlp # hostname from your configuration
# Destroy when done
make destroyAPI Authentication Issues: Check your credentials file exists and has correct permissions:
ls -l ~/.datacrunch/credentials
# Should show: -rw------- (600 permissions)Instance Type Not Available: If your desired instance type isn't available, kdevops will show available alternatives. Use the capacity checker to see current availability:
./scripts/datacrunch_check_capacity.pyProvider Installation:
If using a locally built DataCrunch provider (for development), ensure
you have the dev_overrides configured in ~/.terraformrc:
provider_installation {
dev_overrides {
"squat/datacrunch" = "/home/user/go/bin"
}
direct {}
}For more information, visit: https://datacrunch.io/
kdevops supports Lambda Labs, a cloud provider focused on GPU instances for machine learning workloads with competitive pricing.
For detailed documentation on Lambda Labs integration, including tier-based GPU selection, smart instance selection, and dynamic Kconfig generation, see:
- Lambda Labs Dynamic Cloud Kconfig - Dynamic configuration generation for Lambda Labs
- Lambda Labs CLI Reference - Man page for the lambda-cli tool
Lambda Labs offers various GPU instance types including A10, A100, and H100 configurations. kdevops provides smart selection features that automatically choose the cheapest available instance type and region.
For more information, visit: https://lambdalabs.com/