Commit 7f9c878 ("Update docs")
Parent: 2332202

File tree: 7 files changed (+23 −24 lines)
Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+# Sample http route for GKE Gateway to route traffic to sglang InferencePool
```
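The new manifest's comment describes an HTTPRoute that sends GKE Gateway traffic to an SGLang InferencePool. As orientation only, such a route generally looks like the sketch below; the route name, pool name, and the InferencePool API group shown here are assumptions, not contents of this commit (the `inference-gateway` parent name appears later in this diff):

```yaml
# Hypothetical sketch of an HTTPRoute targeting an InferencePool backend.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route-sglang        # assumed name
spec:
  parentRefs:
  - name: inference-gateway     # Gateway created earlier in the guide
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io   # assumed API group; check your CRD version
      kind: InferencePool
      name: sglang-pool                      # assumed pool name
```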

site-src/_includes/model-server-cpu.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-=== "CPU-Based Model Server"
+=== "CPU-Based vLLM deployment"
 
     ???+ warning
```

site-src/_includes/model-server-gpu.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-=== "GPU-Based Model Server"
+=== "GPU-Based vLLM deployment"
 
     For this setup, you will need 3 GPUs to run the sample model server. Adjust the number of replicas as needed.
     Create a Hugging Face secret to download the model [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
```

site-src/_includes/model-server-sim.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-=== "vLLM Simulator Model Server"
+=== "vLLM Simulator deployment"
 
     This option uses the [vLLM simulator](https://github.com/llm-d/llm-d-inference-sim/tree/main) to simulate a backend model server.
     This setup uses the least amount of compute resources, does not require GPU's, and is ideal for test/dev environments.
```

site-src/_includes/model-server.md

Lines changed: 0 additions & 19 deletions
This file was deleted.

site-src/_includes/sglang-gpu.md

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+=== "GPU-Based SGLang deployment"
+
+    For this setup, you will need 3 GPUs to run the sample model server. Adjust the number of replicas as needed.
+    Create a Hugging Face secret to download the model [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
+    Ensure that the token grants access to this model.
+
+    Deploy a sample SGLang deployment with the proper protocol to work with the LLM Instance Gateway.
```
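The new include tells the reader to create a Hugging Face secret before deploying. A minimal sketch of such a Secret follows; the name `hf-token` and key `token` are assumptions, so match whatever the sample deployment's `secretKeyRef` actually expects:

```yaml
# Hypothetical sketch: Secret holding the Hugging Face access token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token          # assumed name; must match the deployment's secretKeyRef
type: Opaque
stringData:
  token: <YOUR_HF_TOKEN>  # must grant access to meta-llama/Llama-3.1-8B-Instruct
```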

site-src/guides/index.md

Lines changed: 12 additions & 2 deletions

````diff
@@ -42,6 +42,12 @@ IGW_LATEST_RELEASE=$(curl -s https://api.github.com/repos/kubernetes-sigs/gatewa
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/vllm/sim-deployment.yaml
 ```
 
+--8<-- "site-src/_includes/sglang-gpu.md"
+
+```bash
+kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/sglang/gpu-deployment.yaml
+```
+
 ### Install the Inference Extension CRDs
 
 ```bash
@@ -153,11 +159,15 @@ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extens
 inference-gateway inference-gateway <MY_ADDRESS> True 22s
 ```
 1. Deploy the HTTPRoute:
-
+
+   For vllm deployment:
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/gateway/gke/httproute.yaml
 ```
-
+   For sglang deployment:
+   ```bash
+   kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/gateway/gke/httproute-sglang.yaml
+   ```
 1. Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True`:
 
 ```bash
````
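The guide's confirmation step checks the HTTPRoute's status conditions. Once a Gateway controller has programmed the route, its status typically contains conditions shaped like the illustrative fragment below (field values and condition ordering will differ per controller):

```yaml
# Illustrative status fragment of an accepted HTTPRoute; not output from this commit.
status:
  parents:
  - parentRef:
      name: inference-gateway
    conditions:
    - type: Accepted
      status: "True"
    - type: ResolvedRefs
      status: "True"
```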
