Industry executives and experts share their predictions for 2025. Read them in this 17th annual VMblog.com series exclusive.
By Sylvain Kalache, tech entrepreneur, software engineer, PR consultant, and co-host of The Landscape
While AI dominates as the most talked-about and invested-in industry, the infrastructure powering these innovations is proving itself invaluable. Since the inception of PaaS, engineering organizations have focused on simplifying application development, deployment, and scaling. Companies now aim to run any workload seamlessly; cloud-native technologies have stepped up to enable this shift, and AI is no different.
Cloud-native tools designed for machine learning have long
supported AI workloads, but the rise of large language models (LLMs) has
elevated the industry to an entirely new level. These demanding workloads are
quickly becoming a central focus for innovation. In 2025, I anticipate software
makers will concentrate on three critical areas: elasticity, scheduling, and
observability.
Cloud-native Elasticity and Scalability
The most drastic elasticity and scalability use cases are driven by human behavior, most notably the use of generative AI tools. Running LLMs requires immense amounts of resources, and demand peaks during working hours. A recent survey found that only 36% of respondents were using AI tools daily, a number that is certain to rise rapidly. Tools like Kubeflow - an ecosystem of Kubernetes-based components supporting each stage of the AI/ML lifecycle with best-in-class open-source tools and frameworks - will continue to grow. Generative AI is a giant money pit, and the need to optimize infrastructure costs will come sooner or later.
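To make this concrete, here is a minimal sketch of elasticity on Kubernetes: attaching a HorizontalPodAutoscaler to an LLM inference deployment with the official Kubernetes Python client so replicas follow daytime peaks. The deployment name, namespace, and CPU-utilization target are hypothetical; real GPU-bound serving workloads typically scale on custom metrics such as queue depth or tokens per second rather than CPU.

```python
# Minimal sketch: autoscale a (hypothetical) LLM inference deployment so replicas
# follow working-hours demand. Requires the `kubernetes` pip package and a kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="llm-inference-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="llm-inference"  # placeholder name
        ),
        min_replicas=1,   # scale down overnight when traffic is low
        max_replicas=10,  # cap spend during working-hours peaks
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```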
Cloud-native Scheduling and Orchestration
Scheduling and orchestration are much needed for powering predictive AI and model training. Predictive AI workloads are generally aimed at predicting and analyzing existing patterns or outcomes - they are not real-time workloads and can be scheduled. Similarly, generative AI models need to be trained, and training does not face real-time constraints. Jobs can be orchestrated to optimize for costs and available resources. Scheduling support is evolving in Kubernetes through efforts such as YuniKorn, Volcano, and Kueue, the latter two addressing batch scheduling, which is particularly valuable for efficient AI/ML training. Who knows, we might soon see orchestration tools requesting a few additional nuclear fuel rods to be put into production to get enough energy to power their workloads.
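As an illustration of how little it takes to hand a training job to a batch scheduler, the sketch below submits a plain Kubernetes Job that Kueue can admit when quota allows: the job is created suspended and carries Kueue's queue-name label. The queue name, image, and resource requests are placeholders, and a real setup also needs Kueue installed with a matching LocalQueue and ClusterQueue.

```python
# Minimal sketch: submit a (hypothetical) fine-tuning job to a Kueue-managed queue.
# The job starts suspended; Kueue unsuspends it once quota in the queue is available.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="llm-finetune",
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},  # placeholder queue name
    ),
    spec=client.V1JobSpec(
        suspend=True,  # let Kueue decide when the job actually starts
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/llm-trainer:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```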
Cloud-native Observability
Finally,
observability will experience a significant boost in interest as AI remains
largely a black box. Debugging or improving models fundamentally differs from
regular code because engineers can only observe model behavior and adjust
training accordingly. Projects like OpenLLMetry, building
on (OTel), already provide instrumentation for LLM observability, offering
insights that can guide optimization and debugging efforts.
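To give a sense of what LLM observability looks like in practice, here is a minimal hand-rolled sketch using the OpenTelemetry Python SDK to wrap an LLM call in a span and record token metadata as attributes. The attribute names are illustrative, loosely modeled on OTel's generative-AI semantic conventions, and call_llm is a stand-in for whatever client library you actually use; projects like OpenLLMetry package this kind of instrumentation so you don't have to write it yourself.

```python
# Minimal sketch: manual OpenTelemetry tracing around a (hypothetical) LLM call.
# Requires the `opentelemetry-sdk` pip package; a real exporter is omitted for brevity.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-observability-demo")

def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call (OpenAI, Bedrock, a self-hosted model, ...).
    return {"text": "stub response", "prompt_tokens": 12, "completion_tokens": 5}

def generate(prompt: str) -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        # Illustrative attributes, loosely following OTel gen-ai conventions.
        span.set_attribute("gen_ai.request.model", "example-model")
        response = call_llm(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", response["prompt_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["completion_tokens"])
        return response["text"]

print(generate("Summarize today's cluster events."))
```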
However,
this trend might be shifting soon, and companies should take note. Anthropic
recently announced significant progress in addressing
this issue, marking a potential turning point for the industry. Additionally,
the EU AI Act regulatory framework mandates that high-risk AI systems - those posing significant risks to health, safety, or fundamental rights - must ensure transparency (much as regular code does) to gain authorization. Companies that can meet these stringent requirements will gain a competitive advantage in bringing their products to market in Europe.
LLMs Driving New Cloud-Native Features
LLMs will also enable new features within the cloud-native
ecosystem, and I predict they will be very powerful when embedded in
observability tools or used to analyze logs. The open-source CNCF project K8sGPT leverages LLM backends such as Amazon Bedrock and Cohere to assist Kubernetes operators in troubleshooting and managing their workloads more effectively.
Start building your cloud-native AI infrastructure
While I believe those three areas will receive the most attention, there are plenty of other parts of the MLOps workflow that companies need to pay attention to. The Linux Foundation compiled a list of cloud-native projects that can be used to manage AI workloads as part of its Cloud Native Artificial Intelligence (CNAI) efforts. It's a good place to start to find tools for your stack or just to wrap your head around what's needed for your 2025 AI projects.
ABOUT THE AUTHOR
Sylvain Kalache is a tech entrepreneur, software engineer, and PR
consultant. He co-hosts The Landscape, a podcast discussing everything related
to the CNCF landscape and beyond. As the co-founder of Holberton, he built a
global school that trained thousands of software engineers hired by leading
companies like Google and Tesla. Currently leading Big O PR, Sylvain helps
devtool and cloud-native startups engage with technical audiences. A former SRE
at LinkedIn and SlideShare, he has extensive expertise in scaling web
infrastructure.