Azure AI Foundry SDK – Azure AI Inference with Python FastAPI and Streaming

Learn how to use the Azure AI Foundry SDK for AI inference with Python and FastAPI. Build, deploy, and serve AI models seamlessly using Azure AI services.

Prerequisites

  1. Follow the previous part: Getting Started with Azure AI Foundry in Python | Connect DeepSeek, Llama, Phi, OpenAI.
  2. Azure AI Foundry Project

Create Python API for Azure AI Inference

Create a new file named app.py in the existing project from the last blog post. Install the fastapi and uvicorn packages.

pip install fastapi
pip install uvicorn
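
The azure-ai-projects, azure-identity, and python-dotenv packages should already be present from the previous part; if not, install them as well:

pip install azure-ai-projects azure-identity python-dotenv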

Import the following packages in app.py.

from fastapi import FastAPI
from pydantic import BaseModel
from fastapi.responses import StreamingResponse  # Import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware  # Import CORSMiddleware

Add the following code to read the configuration from the .env file.

## Get config from .env file
import os
from dotenv import load_dotenv
load_dotenv()

PROJECT_CONNECTION_STRING = os.getenv("PROJECT_CONNECTION_STRING")
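
For reference, the .env file holds the project connection string copied from your project’s overview page in the Azure AI Foundry portal. A minimal example (all values below are placeholders):

PROJECT_CONNECTION_STRING="<region>.api.azureml.ms;<subscription-id>;<resource-group>;<project-name>"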

Add the import statements for the Azure AI Foundry SDK to create a project client and an Azure AI Inference client. These imports will be used to call the AI models for inference.

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

Initialize FastAPI with the statement below.

app = FastAPI()

Let’s create the AI Project Client with the connection string, authenticating with Azure Identity. Locally, DefaultAzureCredential typically resolves to your Azure CLI sign-in, so run az login first if you haven’t already.

# Create a client using the connection string
project = AIProjectClient.from_connection_string(PROJECT_CONNECTION_STRING, credential=DefaultAzureCredential())

Create a new class ChatRequest, which will be used to parse the incoming JSON request body. It exposes most of the inference parameters so the endpoint stays flexible (a sample request body follows the class definition):

  1. question: The user’s question goes here.
  2. system_prompt: System message that gives instructions to the AI model.
  3. stream: Enables or disables streaming. Set it to true or false accordingly.
  4. temperature: Controls the “creativity” or randomness of the text generated by AI models. A higher temperature (e.g., 0.7) produces more diverse and creative output, while a lower temperature (e.g., 0.2) makes the output more deterministic and focused.
  5. topp: An alternative to temperature. Instead of considering all possible tokens, the model samples only from the subset of tokens (the nucleus) whose cumulative probability mass reaches the top_p threshold. Adjust either temperature or topp, not both.
  6. max_tokens: The maximum number of tokens allowed in the response.
  7. deployment_name: The deployment name given to the model deployed in the Azure AI Foundry portal.

class ChatRequest(BaseModel):
    question: str
    system_prompt: str
    stream: bool
    temperature: float
    topp: float
    max_tokens: int
    deployment_name: str
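
For example, a request body for this endpoint might look like the following (the values mirror the defaults used in the test page later in this post; replace DeepSeek-V3 with your own deployment name):

{
  "question": "What is Azure AI Foundry?",
  "system_prompt": "You are a helpful assistant.",
  "stream": false,
  "temperature": 0.7,
  "topp": 0.9,
  "max_tokens": 100,
  "deployment_name": "DeepSeek-V3"
}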

Create a POST endpoint /chat.

@app.post("/chat")
def get_chat_response(body: ChatRequest):
    chat_inference = project.inference.get_chat_completions_client()
    response = chat_inference.complete(
       model=body.deployment_name,
       messages=[
          {"role": "system", "content": body.system_prompt},
          {"role": "user", "content": body.question}
       ],
       temperature=body.temperature,
       top_p=body.topp,
       max_tokens=body.max_tokens,
       stream=body.stream
    )
    chat_inference.close()
    return {"status": "ok", "message": response.choices[0].message.content}

Add the streaming capability to the /chat endpoint. Insert the block below inside get_chat_response, right after the complete() call and before the chat_inference.close() line; the streaming branch must return before the client is closed, so the generator closes the client itself once the stream is exhausted.

# If streaming is enabled, return a StreamingResponse
    if body.stream:
        def stream_generator():
            for update in response:
                if update.choices and update.choices[0].delta:
                    yield update.choices[0].delta.content or ""
            chat_inference.close()

        return StreamingResponse(stream_generator(), media_type="text/plain")
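
Putting the two snippets together, the finished /chat endpoint looks like this:

@app.post("/chat")
def get_chat_response(body: ChatRequest):
    chat_inference = project.inference.get_chat_completions_client()
    response = chat_inference.complete(
       model=body.deployment_name,
       messages=[
          {"role": "system", "content": body.system_prompt},
          {"role": "user", "content": body.question}
       ],
       temperature=body.temperature,
       top_p=body.topp,
       max_tokens=body.max_tokens,
       stream=body.stream
    )

    # If streaming is enabled, return a StreamingResponse
    if body.stream:
        def stream_generator():
            for update in response:
                if update.choices and update.choices[0].delta:
                    yield update.choices[0].delta.content or ""
            chat_inference.close()

        return StreamingResponse(stream_generator(), media_type="text/plain")

    # Otherwise, return the full completion as a single JSON payload
    chat_inference.close()
    return {"status": "ok", "message": response.choices[0].message.content}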

Create the Run configuration in VS Code

Go to Run in VS Code and click on Add configuration.

Select Python Debugger.

Choose FastAPI.

Update the file name in the configuration from main to app. Finally, click Run -> Run Without Debugging. Once the project is running, the console will print the URL the server is listening on.
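
If you prefer the terminal over a VS Code launch configuration, you can start the same server directly with uvicorn (assuming the file is named app.py and the FastAPI instance is named app, as above):

uvicorn app:app --reload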

Open the URL mentioned in the console and append /docs to open the Swagger UI.

Your FastAPI app is now running; go ahead and test the endpoint from the Swagger UI. You can also test it from Postman.

Testing in Postman or the Swagger UI will not show the streaming capability, since both display the response only after it completes. Therefore, we need to create a small web application to test this feature. We will create a simple HTML page with JavaScript interactivity.
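
If you’d rather verify the streaming behavior from Python first, here is a minimal sketch using the requests library (assumptions: the server is running locally on port 8000, requests is installed, and DeepSeek-V3 is replaced with your own deployment name):

import requests

# Request body matching the ChatRequest model, with streaming enabled
payload = {
    "question": "Write a short poem about the cloud.",
    "system_prompt": "You are a helpful assistant.",
    "stream": True,
    "temperature": 0.7,
    "topp": 0.9,
    "max_tokens": 100,
    "deployment_name": "DeepSeek-V3",
}

# stream=True tells requests not to buffer the whole body before returning
with requests.post("http://127.0.0.1:8000/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    # Print each chunk as soon as it arrives
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)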

Create a Web Application to Test Streaming Capabilities of Azure AI Inference

Create a new file named index.html. Add the following code to it that shows a form to fill out all the parameters and call the API.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Chat Streaming</title>
  </head>
  <body>
    <h1>Chat with AI</h1>
    <form id="chatForm">
      <label for="question">Your Question:</label><br />
      <textarea id="question" name="question" rows="4" cols="50"></textarea
      ><br /><br />

      <label for="temperature">Temperature:</label>
      <input
        type="number"
        id="temperature"
        name="temperature"
        step="0.1"
        value="0.7"
      /><br /><br />

      <label for="topp">Top P:</label>
      <input
        type="number"
        id="topp"
        name="topp"
        step="0.1"
        value="0.9"
      /><br /><br />

      <label for="maxTokens">Max Tokens:</label>
      <input
        type="number"
        id="maxTokens"
        name="maxTokens"
        value="100"
      /><br /><br />

      <label for="deploymentName">Deployment Name:</label>
      <input
        type="text"
        id="deploymentName"
        name="deploymentName"
        value="DeepSeek-V3"
      /><br /><br />

      <button type="submit">Submit</button>
    </form>

    <h2>Response:</h2>
    <div id="response"></div>
    <div id="responseTime"></div>
    <!-- Added to display start and end time -->

    <script>
      document
        .getElementById("chatForm")
        .addEventListener("submit", async function (event) {
          event.preventDefault();

          const systemPrompt = "you are a helpful assistant";
          const question = document.getElementById("question").value;
          const temperature = parseFloat(
            document.getElementById("temperature").value
          );
          const topp = parseFloat(document.getElementById("topp").value);
          const maxTokens = parseInt(
            document.getElementById("maxTokens").value,
            10
          );
          const deploymentName =
            document.getElementById("deploymentName").value;

          const responseDiv = document.getElementById("response");
          const responseTimeDiv = document.getElementById("responseTime");
          responseDiv.innerHTML = ""; // Clear previous response
          responseTimeDiv.innerHTML = ""; // Clear previous time

          const startTime = new Date();
          responseTimeDiv.innerText = `Stream started at: ${startTime.toLocaleTimeString()}`;

          const response = await fetch("http://127.0.0.1:8000/chat", {
            method: "POST",
            headers: {
              "Content-Type": "application/json",
            },
            body: JSON.stringify({
              question,
              system_prompt: systemPrompt,
              stream: true,
              temperature,
              topp,
              max_tokens: maxTokens,
              deployment_name: deploymentName,
            }),
          });

          if (!response.body) {
            responseDiv.innerText = "No response body found.";
            return;
          }

          const reader = response.body.getReader();
          const decoder = new TextDecoder("utf-8");
          let done = false;

          while (!done) {
            const { value, done: readerDone } = await reader.read();
            done = readerDone;
            if (value) {
              const chunk = decoder.decode(value, { stream: true });
              responseDiv.innerText += chunk; // Append streamed content
            }
          }

          const endTime = new Date();
          const duration = ((endTime - startTime) / 1000).toFixed(2); // Calculate duration in seconds
          responseTimeDiv.innerText += `\nStream ended at: ${endTime.toLocaleTimeString()}`;
          responseTimeDiv.innerText += `\nDuration: ${duration} seconds`;
        });
    </script>
  </body>
</html>

Launch this HTML page on a local server and test your AI model by filling out the necessary values.
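
If you don’t have a local server handy, Python’s built-in HTTP server is one option; the VS Code Live Server extension (which serves on port 5500 by default) works just as well. Port 5500 matches the CORS origin configured below:

python -m http.server 5500

Then open http://127.0.0.1:5500/index.html in your browser.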

You might not receive a response due to a CORS error, because the page and the API are served from different origins. Add the code below to your app.py.

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://127.0.0.1:5500"],  # Replace with your frontend's origin
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
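
Note that http://localhost:5500 and http://127.0.0.1:5500 are different origins to the browser, so if you open the page via localhost, add that origin to the list as well:

allow_origins=["http://127.0.0.1:5500", "http://localhost:5500"],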

Source Code

The complete source code for this implementation is available on the Dewiride Shop (JD Bots Repository) website.

Thank you all! I hope you find this useful.

