Industry executives and experts share their predictions for 2025. Read them in this 17th annual VMblog.com series exclusive.
By Sylvain Kalache, tech entrepreneur, software engineer, PR consultant, and co-host of The Landscape
While AI dominates as the most talked-about and invested-in industry, the infrastructure powering these innovations is proving itself invaluable. Since the inception of PaaS, engineering organizations have focused on simplifying application development, deployment, and scaling. Companies now aim to run any workload seamlessly; cloud-native technologies have stepped up to enable this shift, and AI is no different.
Cloud-native tools designed for machine learning have long
supported AI workloads, but the rise of large language models (LLMs) has
elevated the industry to an entirely new level. These demanding workloads are
quickly becoming a central focus for innovation. In 2025, I anticipate software
makers will concentrate on three critical areas: elasticity, scheduling, and
observability.
Cloud-native Elasticity and Scalability
The most drastic elasticity and scalability use cases are driven by human behavior, most notably the use of generative AI tools. Running LLMs requires immense amounts of resources, and demand peaks during working hours. A recent survey found that only 36% of respondents were using AI tools daily, a number that is certain to rise rapidly. Tools like Kubeflow - an ecosystem of Kubernetes-based components supporting each stage of the AI/ML lifecycle with best-in-class open-source tools and frameworks - will continue to grow. Generative AI is a giant money pit, and the need to optimize infrastructure costs will come sooner or later.
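To make this concrete, here is a minimal sketch of elasticity on Kubernetes: attaching a HorizontalPodAutoscaler to an LLM inference deployment with the official Kubernetes Python client so replicas follow daytime peaks. The deployment name, namespace, and CPU-utilization target are hypothetical; real GPU-bound serving workloads typically scale on custom metrics such as queue depth or tokens per second rather than CPU.

```python
# Minimal sketch: autoscale a (hypothetical) LLM inference deployment so replicas
# follow working-hours demand. Requires the `kubernetes` pip package and a kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="llm-inference-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="llm-inference"  # placeholder name
        ),
        min_replicas=1,   # scale down overnight when traffic is low
        max_replicas=10,  # cap spend during working-hours peaks
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```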
Cloud-native Scheduling and Orchestration
Scheduling and orchestration are much needed for powering predictive AI and model training. Predictive AI workloads are generally aimed at predicting and analyzing existing patterns or outcomes - they are not real-time workloads and can be scheduled. Similarly, generative AI models need to be trained, and training does not face real-time constraints. Jobs can be orchestrated to optimize for costs and available resources. Scheduling support is evolving in Kubernetes through efforts such as YuniKorn, Volcano, and Kueue, the latter two addressing batch scheduling, which is particularly valuable for efficient AI/ML training. Who knows, we might soon see orchestration tools requesting a few additional nuclear fuel rods to be put into production to get enough energy to power their workloads.
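As an illustration of how little it takes to hand a training job to a batch scheduler, the sketch below submits a plain Kubernetes Job that Kueue can admit when quota allows: the job is created suspended and carries Kueue's queue-name label. The queue name, image, and resource requests are placeholders, and a real setup also needs Kueue installed with a matching LocalQueue and ClusterQueue.

```python
# Minimal sketch: submit a (hypothetical) fine-tuning job to a Kueue-managed queue.
# The job starts suspended; Kueue unsuspends it once quota in the queue is available.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="llm-finetune",
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},  # placeholder queue name
    ),
    spec=client.V1JobSpec(
        suspend=True,  # let Kueue decide when the job actually starts
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/llm-trainer:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```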
Cloud-native Observability
Finally,
observability will experience a significant boost in interest as AI remains
largely a black box. Debugging or improving models fundamentally differs from
regular code because engineers can only observe model behavior and adjust
training accordingly. Projects like OpenLLMetry, building
on (OTel), already provide instrumentation for LLM observability, offering
insights that can guide optimization and debugging efforts.
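To give a sense of what LLM observability looks like in practice, here is a minimal hand-rolled sketch using the OpenTelemetry Python SDK to wrap an LLM call in a span and record token metadata as attributes. The attribute names are illustrative, loosely modeled on OTel's generative-AI semantic conventions, and call_llm is a stand-in for whatever client library you actually use; projects like OpenLLMetry package this kind of instrumentation so you don't have to write it yourself.

```python
# Minimal sketch: manual OpenTelemetry tracing around a (hypothetical) LLM call.
# Requires the `opentelemetry-sdk` pip package; a real exporter is omitted for brevity.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-observability-demo")

def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call (OpenAI, Bedrock, a self-hosted model, ...).
    return {"text": "stub response", "prompt_tokens": 12, "completion_tokens": 5}

def generate(prompt: str) -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        # Illustrative attributes, loosely following OTel gen-ai conventions.
        span.set_attribute("gen_ai.request.model", "example-model")
        response = call_llm(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", response["prompt_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["completion_tokens"])
        return response["text"]

print(generate("Summarize today's cluster events."))
```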
However,
this trend might be shifting soon, and companies should take note. Anthropic
recently announced significant progress in addressing
this issue, marking a potential turning point for the industry. Additionally,
the EU AI Act regulatory framework mandates that high-risk AI systems - those posing significant risks to health, safety, or fundamental rights - must ensure transparency (much as regular code does) to gain authorization. Companies that can meet these stringent requirements will gain a competitive advantage in bringing their products to market in Europe.
LLMs Driving New Cloud-Native Features
LLMs will also enable new features within the cloud-native
ecosystem, and I predict they will be very powerful when embedded in
observability tools or used to analyze logs. The open-source CNCF project K8sGPT leverages LLM backends such as Amazon Bedrock and Cohere to assist Kubernetes operators in troubleshooting and managing their workloads more effectively.
Start building your cloud-native AI infrastructure
While I believe those three areas will receive the most attention, there are plenty of other parts of the MLOps workflow that companies need to pay attention to. The Linux Foundation compiled a list of cloud-native projects that can be used to manage AI workloads as part of its Cloud Native Artificial Intelligence (CNAI) efforts. It's a good place to start to find tools for your stack or just to wrap your head around what's needed for your 2025 AI projects.
ABOUT THE AUTHOR
Sylvain Kalache is a tech entrepreneur, software engineer, and PR
consultant. He co-hosts The Landscape, a podcast discussing everything related
to the CNCF landscape and beyond. As the co-founder of Holberton, he built a
global school that trained thousands of software engineers hired by leading
companies like Google and Tesla. Currently leading Big O PR, Sylvain helps
devtool and cloud-native startups engage with technical audiences. A former SRE
at LinkedIn and SlideShare, he has extensive expertise in scaling web
infrastructure.