Skip to content

LLM Hosting and Data Centers banner with server racks and cloud icon.

Where LLMs Live: Data Centers, Hosting, and Sovereignty

Large language models are often described as if they live in an abstract place called "the cloud." But the cloud is not a place. It is a collection of buildings, power contracts, cooling systems, network cables, storage arrays, GPUs, software stacks, and governance decisions. Every model that answers a prompt is running somewhere specific. It may be running on a laptop, on a lab server, in a university data center, on an NSF-funded research cloud, inside a national laboratory, or in a hyperscale commercial data center owned by a major technology company.

That physical location matters. It determines who controls the model, who pays for the electricity, what data can safely move through the system, how reproducible the workflow can be, and whether the benefits of the infrastructure flow back to the community that hosts it. For science, these are not side issues. They shape whether AI becomes a shared public infrastructure or a private dependency.

This chapter explains the major ways LLMs are hosted, what resources they require, where those resources are located, and how communities evaluate whether the infrastructure is welcome, reasonable, or extractive.

What hosting means

Hosting is the process of making a model available for use. At a minimum, hosting requires hardware that can load the model, software that can run inference, storage for model files and logs, networking so users can reach it, and some form of access control. A hosted model may be exposed through a chat interface, an API endpoint, a Jupyter notebook, a classroom tool, or a workflow system.

For small experiments, hosting can be informal. A researcher can download a model and run it locally through a tool like Ollama or llama.cpp. For shared scientific use, hosting becomes infrastructure. The model needs uptime, authentication, monitoring, versioning, storage, documentation, cost control, and governance. The difference between "I ran a model" and "we host a model" is the difference between a demonstration and a service.

The central challenge is that open models are easier to download than they are to operate. Open weights make a model legally and technically available. Hosting makes it practically usable.

The pieces of an LLM hosting system

LLM hosting is a coupled system. The model is only one part of it.

Component What it does Typical constraint
GPU or CPU compute Executes inference or training FLOPS, GPU count, accelerator type
GPU memory Holds model weights, key-value cache, and context Often the main inference bottleneck
System memory Supports loading, preprocessing, and non-GPU work Can bottleneck large models or RAG systems
Storage Holds model weights, embeddings, documents, logs, and outputs Capacity and I/O speed
Networking Connects users, APIs, storage, and compute Latency, bandwidth, firewall policy
Serving software Runs the model behind an API Throughput, batching, scheduling
Orchestration Starts, stops, scales, and monitors services Operational complexity
Authentication Controls who can use the system Identity integration
Governance Controls quotas, budgets, models, data policy, and logging Institutional maturity
Power and cooling Keeps hardware online and stable Energy, heat, water, facility limits

Inference is usually constrained by memory. The model weights must fit into GPU memory, and long conversations require additional memory for the key-value cache. Training and large-scale fine-tuning are constrained by compute, GPU interconnect, storage throughput, and total runtime. Retrieval-augmented generation adds another layer because documents, embeddings, and vector databases must also be hosted and governed.

Three main hosting models

Most scientific LLM hosting falls into three broad categories: self-hosting, institutional or open-science hosting, and commercial API hosting. These are not mutually exclusive. Mature systems often combine all three.

Self-hosting

Self-hosting means running the model on hardware that an individual, lab, department, university, Tribal organization, agency, or project controls directly. This can be a laptop, a workstation, a rack-mounted GPU server, or a private cloud instance.

Self-hosting provides the most control. The user can choose the model, inspect the software stack, decide where logs are stored, restrict access, and keep sensitive data inside a trusted environment. This is especially important when the data are unpublished, regulated, proprietary, Indigenous-governed, human-subjects related, or otherwise sensitive.

The tradeoff is responsibility. The group must manage hardware, drivers, model serving, user access, security, backups, monitoring, upgrades, and costs. A single researcher can often make a local model work. A lab can often maintain a small server. A university-wide service requires a professional platform.

Institutional and open-science hosting

Institutional hosting sits between individual self-hosting and commercial hyperscale AI. Examples include campus research clouds, ACCESS-CI resources, CyVerse services, Jetstream2, Delta, and VERDE-like platforms. These systems are designed to support many users while keeping infrastructure aligned with research and education.

This layer is especially important for open models. It gives communities somewhere to run them without asking every lab to become a data center operator. It can also provide governance: identity management, quotas, allocations, model catalogs, shared documentation, cost controls, and data policies.

CyVerse VERDE is an example of this service-layer idea. Instead of requiring every user to know which GPU is running which model, VERDE provides a more unified access point for models, APIs, and retrieval workflows. ACCESS-CI resources provide the underlying research cyberinfrastructure that can support cloud-like services, batch workloads, training, fine-tuning, and scientific AI workflows.

Commercial API hosting

Commercial API hosting means sending requests to external providers such as OpenAI, Anthropic, Google, Microsoft, Amazon, Hugging Face, or other hosted inference providers. This is the fastest and simplest path. Users do not need to buy GPUs, operate servers, or manage model deployments.

The tradeoff is dependency. The provider controls the model, pricing, retention policies, regional hosting options, uptime, and terms of service. Commercial APIs are often excellent for rapid prototyping, low-risk tasks, and general-purpose workloads. They are less ideal when data sovereignty, auditability, reproducibility, or long-term cost stability are central requirements.

A practical comparison

Question Self-hosting Institutional / ACCESS / VERDE Commercial APIs
Who controls the hardware? User, lab, or institution Institution or research cyberinfrastructure provider Vendor
Who controls the model? User or institution Shared governance Vendor
Setup burden High Medium Low
Operating burden High Shared Low for user, high dependency
Best for Sensitive data, experimentation, custom systems Shared science, teaching, governed access Rapid prototyping, general tasks
Data sovereignty Strongest Strong if designed well Weak to moderate, depending on provider and contract
Cost pattern Capital cost plus operations Allocation or institutional cost Usage-based tokens
Reproducibility Strong if versioned Strong if platform logs versions Limited by model changes and provider access
Scaling Hard without staff Designed for communities Easy but externally controlled
Main risk Underused hardware and admin burden Governance complexity Lock-in, cost growth, data exposure

The important point is that no one layer solves every problem. A good scientific AI strategy usually uses commercial APIs for low-risk convenience, institutional systems for shared and governed research use, and self-hosted systems for sensitive data or specialized methods.

How much infrastructure are we talking about?

The scale of LLM hosting ranges from a single GPU to infrastructure that affects regional power planning. A small quantized model can run on a laptop or workstation. A useful shared service for a lab or course may require one to four data-center GPUs. A campus service may require tens to hundreds of GPUs plus staff. National open-science systems may operate hundreds to tens of thousands of GPUs. Commercial frontier-model infrastructure is larger still.

A useful rule of thumb is that inference is mostly about memory and training is mostly about compute. A 7B or 8B parameter model can often run on a single consumer GPU if quantized. A 70B parameter model is already a different class. In full precision, it may require roughly 140 GB just for weights. In 8-bit form it may require roughly 70 to 85 GB. In 4-bit form it may fit in roughly 40 to 50 GB, not including all serving overheads. Exact requirements vary by architecture, quantization method, context length, batch size, and serving engine. (source needed)

That is why open weights do not automatically mean local usability. A model can be open and still require data-center-class hardware to serve well.

Hosting tier Typical hardware What it can realistically support Main limitation
Laptop CPU, Apple Silicon, or small GPU Small local models, demos, private experiments Slow, limited concurrency
Workstation 1 to 2 consumer GPUs 7B to 14B models; larger models with quantization Heat, noise, VRAM, uptime
Lab server 1 to 4 data-center GPUs Shared lab chatbot, RAG, coding assistant, moderate open models Administration, security, scheduling
Campus service Tens to hundreds of GPUs Courses, research groups, shared APIs Governance, funding, staffing
ACCESS or national resource Hundreds to thousands of GPUs Training, fine-tuning, large-scale inference, open science workflows Allocation and service design
Hyperscale AI data center Thousands to hundreds of thousands of accelerators across sites Frontier model training and global API serving Grid, water, land, community impact

Workloads are not all the same

Different AI workloads stress infrastructure in different ways. A campus chatbot, a fine-tuning job, a document-grounded RAG system, and an autonomous agent workflow may all use an LLM, but they behave very differently as infrastructure.

Inference

Inference is the act of running a trained model to generate outputs. Chatbots, coding assistants, summarizers, classification pipelines, and most API calls are inference workloads. Inference needs enough GPU memory to load the model and enough compute to generate tokens quickly. It also needs low latency, uptime, user authentication, and monitoring.

A small model can support many users on one GPU. A large model may support only a few concurrent users at acceptable latency. Throughput depends on model size, context length, batching, quantization, serving software, GPU type, and user behavior.

Training and fine-tuning

Training and fine-tuning are much more compute-intensive. They require repeated passes through data, high GPU utilization, fast storage, and often high-bandwidth interconnects between GPUs. These workloads are often better suited to HPC systems like Delta or leadership-class systems than to a small persistent service.

Fine-tuning small models can be realistic for a lab or campus. Training frontier-scale models is not. It belongs to national labs, large companies, or very large consortia.

Retrieval-augmented generation

RAG systems add documents to the hosting problem. The system must store source documents, create embeddings, maintain a vector database, retrieve relevant chunks, and pass them into the model. This shifts some of the bottleneck from GPU memory to storage, indexing, permissions, and governance.

For science, RAG is often more important than fine-tuning. It allows an existing model to answer using project documents, data dictionaries, protocols, publications, or course materials without changing the model weights. But RAG also increases the sovereignty problem because private documents may be sent into prompts or embedding systems.

Agents and workflow systems

Agent systems make many model calls as they reason, search, write code, call tools, and revise outputs. They can be bursty and difficult to budget. A single user may trigger dozens or hundreds of calls. This makes governance, quotas, logging, and cost accounting more important than in simple chat.

Throughput and cost intuition

Exact performance numbers change quickly, but rough orders of magnitude help users understand the hosting problem.

Model size Typical serving hardware Rough behavior
7B to 8B One consumer or data-center GPU Good for teaching, demos, lightweight tools
13B to 14B One higher-memory GPU Better quality, lower throughput
30B to 34B One large GPU or multiple GPUs Useful but less convenient
70B One very high-memory GPU with quantization or multiple GPUs Stronger model, much lower concurrency
100B+ Multi-GPU serving Institutional or commercial scale

Self-hosting becomes economically attractive when utilization is high and predictable. Commercial APIs are often cheaper and easier when usage is low, bursty, or experimental. A GPU server that sits idle most of the day has a high cost per useful token. A commercial API may seem expensive per token but avoids idle hardware, maintenance, and staffing costs.

This is one of the most important lessons for institutions: do not compare only the price of a GPU with the price of tokens. Compare the full cost of ownership, including staff time, electricity, cooling, networking, downtime, security, and replacement cycles.

Energy and power

Modern AI hardware is power dense. A single high-end data-center GPU may draw hundreds of watts. A server with eight high-end GPUs can draw several kilowatts before facility overhead. A rack of liquid-cooled AI servers can draw tens of kilowatts. A large AI data center can draw tens to hundreds of megawatts.

System Approximate IT load
One GPU workstation Hundreds of watts to around 1 kW
One 4-GPU server Roughly 2 to 4 kW
One 8-GPU high-end server Roughly 6 to 10+ kW
Dense AI rack Tens of kW
Small institutional AI room Hundreds of kW to around 1 MW
Large commercial AI data center Tens to hundreds of MW

Facility efficiency matters. Data centers are often described using PUE, or power usage effectiveness. PUE is total facility power divided by IT equipment power. A PUE of 2.0 means that for every watt used by computing equipment, another watt is used for cooling, power conversion, lighting, and facility overhead. A PUE close to 1.0 is much more efficient.

This means local hosting is not automatically greener. A GPU server in a poorly cooled room may have low total scale but poor efficiency. A professional data center may have much higher total demand but better energy use per computation because it can cool efficiently, maintain high utilization, and reuse waste heat.

The sustainability question is therefore not simply local versus data center. It is utilization, facility efficiency, grid carbon intensity, water use, heat reuse, hardware lifetime, and whether the work being done is worth the resource cost.

Place matters: where these systems actually are

AI infrastructure is geographic. It is built in specific places with specific power grids, water systems, land-use politics, labor markets, universities, national labs, and communities. The same GPU cluster can be welcomed in one place and resisted in another.

Facility or region Location Type Approximate scale Community reception Why it matters
Jetstream2 Indiana University and Texas Advanced Computing Center ACCESS-CI open science cloud About 360 NVIDIA A100 GPUs (verify) Generally welcomed Framed as research and education infrastructure
Delta NCSA, University of Illinois Urbana-Champaign ACCESS-CI GPU/HPC resource Hundreds of A100, A40, H200, and related GPUs (verify) Generally welcomed Campus and national research mission
Frontier Oak Ridge National Laboratory, Tennessee DOE leadership supercomputer Tens of thousands of GPUs (verify) Welcomed as national scientific instrument Public mission, peer-reviewed science access
Aurora Argonne National Laboratory, Illinois DOE leadership supercomputer Tens of thousands of GPUs (verify) Welcomed as national scientific instrument Science, simulation, AI, and data-intensive research
NREL HPC Data Center Golden, Colorado Energy-optimized scientific data center MW-scale (verify) Welcomed Known for efficiency, warm-water cooling, and waste-heat reuse
Northern Virginia Data Center Alley Loudoun County and surrounding region, Virginia Commercial hyperscale corridor Hundreds of facilities (verify) Increasingly contested Grid constraints, land use, transmission, local saturation
Phoenix metro Arizona Commercial hyperscale growth region Rapid expansion (verify) Mixed to contested Water scarcity, heat, energy demand
Georgia and Atlanta region Georgia Commercial hyperscale growth region Rapid expansion (verify) Increasing scrutiny Grid planning and local benefit questions
Rural Pennsylvania proposals Pennsylvania Proposed AI/data-center campuses Multi-site proposals (verify) Often resisted Concerns over land, utility costs, noise, and extractive development

The pattern is not random. AI infrastructure is more likely to be welcomed when it is legible as public infrastructure: research, education, national science, workforce development, or local institutional benefit. It is more likely to be resisted when it appears extractive: local power, land, water, and noise in exchange for distant profits and few local benefits.

Welcomed versus resisted infrastructure

Communities tend to welcome AI infrastructure when it is tied to a mission they understand and value. A university research cloud is easier to defend than an anonymous hyperscale facility. A national lab supercomputer is framed as a scientific instrument. An ACCESS-CI resource is allocated to research and education. The benefit is public, or at least institutionally accountable.

Communities tend to resist AI infrastructure when it arrives as a private industrial load on local resources. The concerns are usually concrete: electricity demand, transmission lines, water use, backup diesel generators, noise, land clearing, tax incentives, and whether local residents will see meaningful benefits.

A useful framing question is:

Is this community hosting infrastructure for its own future, or absorbing impacts for someone else's model?

That question often determines whether AI infrastructure is seen as innovation or extraction.

Data sovereignty

Data sovereignty is the principle that communities, institutions, nations, or rights holders should control how their data are stored, accessed, moved, interpreted, and reused. For LLM hosting, sovereignty becomes central because prompts are data, documents passed into RAG systems are data, embeddings are data, logs are data, and model outputs may reveal information about inputs.

Self-hosting or institutional hosting is especially important when data cannot leave a jurisdiction, when Indigenous data governance applies, when human-subjects or health data are involved, when research is unpublished or sensitive, when government or agency rules restrict external processing, or when communities require control over interpretation and access.

In these cases, commercial APIs may still be useful for low-risk tasks, but they should not become the default path for sensitive workflows. The safest architecture is often layered:

Layer Best use
Local or community-controlled hosting Sensitive data, sovereignty, experimentation
Institutional hosting Shared research, courses, governed access
ACCESS or national research infrastructure Open science compute, scaling, reproducibility
Commercial APIs Low-risk tasks, rapid prototyping, overflow capacity

Sovereignty does not always require that every model run under a desk. It requires that hosting decisions match the governance needs of the data and the community.

What makes VERDE and ACCESS important

The open-source AI world has made model weights more available, but it has not made hosting equally available. ACCESS-CI, Jetstream2, Delta, CyVerse, and VERDE help solve the next problem: where open models can run for science.

ACCESS resources provide national-scale cyberinfrastructure for research and education. Jetstream2 is especially relevant for persistent, cloud-like services and interactive environments. Delta is more appropriate for large batch workloads, training, fine-tuning, and GPU-heavy scientific computing. VERDE-like services add the user-facing layer: model catalogs, APIs, routing, retrieval, access control, and policy.

This division of labor matters. A GPU cluster alone is not an AI service. A service needs identity, quotas, model management, documentation, monitoring, storage, and governance. VERDE is valuable because it points toward an institutional control plane for AI: a layer that can connect users to models without exposing every user to the full complexity of the hardware.

Common failure modes

Most LLM hosting failures are predictable.

VRAM exhaustion happens when the model, context window, and batch size exceed available GPU memory. This is common when users try to host larger open-weight models than their hardware can comfortably support.

Underutilization happens when expensive GPUs sit idle. This makes self-hosting look cheap on paper but expensive in practice. Utilization is one of the main reasons shared infrastructure can be more efficient than isolated lab servers.

Overload happens when too many users hit a service at once. Latency spikes, queues grow, and users lose trust. This is common when a prototype becomes a shared tool without a scaling plan.

RAG bottlenecks happen when document retrieval, vector search, storage, or permissions are not designed carefully. The model may be fast, but the system feels slow or unreliable because the retrieval layer is weak.

Governance failure happens when there are no quotas, no model policy, no logging policy, and no cost controls. This is how experimental systems become expensive, insecure, or impossible to maintain.

A decision framework

Use a local machine when the goal is experimentation, privacy, or learning, and when small models are sufficient.

Use a lab server when a small group needs shared access, when data should stay local, and when someone can maintain the system.

Use campus, CyVerse, VERDE, or ACCESS-style infrastructure when the goal is shared scientific use, teaching, reproducibility, or governed access across many users.

Use commercial APIs when speed matters, the data are low risk, the workload is bursty, and the group does not want to operate infrastructure.

Use national HPC systems when the workload is large-scale training, fine-tuning, model evaluation, simulation, or batch scientific analysis.

Avoid pretending that one layer solves every problem. The best architecture is usually hybrid.

The main lesson

Open models lowered the barrier to entry, but infrastructure determines participation. The next phase of open AI will depend less on whether a model can be downloaded and more on whether communities have somewhere reasonable to run it.

That means hosting is not just a technical decision. It is a decision about energy, geography, sovereignty, governance, and public value.

The question is not only which model is best.

The question is where that model lives, who controls it, who pays for it, and whether the community hosting it wants it there.