Learn how to use the Azure AI Foundry SDK for AI inference with Python and FastAPI. Build, deploy, and serve AI models seamlessly using Azure AI services.
Prerequisites
- Follow the previous part: Getting Started with Azure AI Foundry in Python | Connect Deepseek, llama, Phi, OpenAI.
- Azure AI Foundry Project
Detailed Video Tutorial
Create Python API for Azure AI Inference
Create a new file named app.py in the existing project from the last blog. Install the fastapi and uvicorn packages.
pip install fastapi
pip install uvicorn
Import the following packages in app.py.
from fastapi import FastAPI
from pydantic import BaseModel
from fastapi.responses import StreamingResponse # Import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware # Import CORSMiddleware
Add the following code to read the configuration from the .env file.
## Get config from .env file
import os
from dotenv import load_dotenv
load_dotenv()
PROJECT_CONNECTION_STRING = os.getenv("PROJECT_CONNECTION_STRING")
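For reference, the .env file from the previous part holds the project connection string copied from your Azure AI Foundry project; the value below is only a placeholder:

# .env - replace the placeholder with the connection string from your Azure AI Foundry project
PROJECT_CONNECTION_STRING="<your-project-connection-string>"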
Add the import statements for the Azure AI Foundry SDK to create a project client and an Azure AI Inference client. These imports will be used to call the AI models for inference.
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
Initialize FastAPI with the statement below.
app = FastAPI()
Let’s create the AI Project Client using the connection string, authenticating with Azure Identity (DefaultAzureCredential).
# Create a client using the connection string
project = AIProjectClient.from_connection_string(PROJECT_CONNECTION_STRING, credential=DefaultAzureCredential())
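DefaultAzureCredential tries several credential sources in turn; when running locally, signing in with the Azure CLI is usually the simplest way to satisfy it:

az login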
Create a new class ChatRequest, which will be used to parse the incoming JSON request body. We have added the following properties to keep the endpoint dynamic; an example request body is shown after the class definition.
- question: The user’s question goes here.
- system_prompt: The system message that gives instructions to the AI model.
- stream: Enables or disables streaming. Set it to true or false accordingly.
- temperature: Controls the “creativity” or randomness of the text generated by the AI model. A higher temperature (e.g., 0.7) produces more diverse and creative output, while a lower temperature (e.g., 0.2) makes the output more deterministic and focused.
- topp: An alternative to temperature. Instead of considering all possible tokens, the model considers only a subset of tokens (the nucleus) whose cumulative probability mass adds up to a certain threshold (top_p). Adjust either temperature or topp, not both.
- max_tokens: The maximum number of tokens to generate in the response.
- deployment_name: The deployment name given to the model deployed in the Azure AI Foundry portal.
class ChatRequest(BaseModel):
    question: str
    system_prompt: str
    stream: bool
    temperature: float
    topp: float
    max_tokens: int
    deployment_name: str
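For reference, a request body matching this model might look like the following (the deployment name is only an example; use the name you gave your model in the Azure AI Foundry portal):

{
  "question": "What is Azure AI Foundry?",
  "system_prompt": "You are a helpful assistant.",
  "stream": false,
  "temperature": 0.7,
  "topp": 0.9,
  "max_tokens": 100,
  "deployment_name": "DeepSeek-V3"
}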
Create a POST endpoint /chat.
@app.post("/chat")
def get_chat_response(body: ChatRequest):
chat_inference = project.inference.get_chat_completions_client()
response = chat_inference.complete(
model=body.deployment_name,
messages=[
{"role": "system", "content": body.system_prompt},
{"role": "user", "content": body.question}
],
temperature=body.temperature,
top_p=body.topp,
max_tokens=body.max_tokens,
stream=body.stream
)
chat_inference.close()
return {"status": "ok", "message": response.choices[0].message.content}
Add the streaming capability to the /chat endpoint. Place the block below inside get_chat_response, right after the chat_inference.complete() call and before the existing chat_inference.close() line; the fully assembled endpoint is shown after this snippet.
    # If streaming is enabled, return a StreamingResponse
    if body.stream:
        def stream_generator():
            for update in response:
                if update.choices and update.choices[0].delta:
                    yield update.choices[0].delta.content or ""
            chat_inference.close()
        return StreamingResponse(stream_generator(), media_type="text/plain")
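Putting the two snippets together, the complete endpoint looks like this: the streaming check sits between the complete() call and the non-streaming return, and the client is closed inside the generator once streaming finishes.

@app.post("/chat")
def get_chat_response(body: ChatRequest):
    chat_inference = project.inference.get_chat_completions_client()
    response = chat_inference.complete(
        model=body.deployment_name,
        messages=[
            {"role": "system", "content": body.system_prompt},
            {"role": "user", "content": body.question}
        ],
        temperature=body.temperature,
        top_p=body.topp,
        max_tokens=body.max_tokens,
        stream=body.stream
    )

    # If streaming is enabled, return a StreamingResponse
    if body.stream:
        def stream_generator():
            for update in response:
                if update.choices and update.choices[0].delta:
                    yield update.choices[0].delta.content or ""
            chat_inference.close()
        return StreamingResponse(stream_generator(), media_type="text/plain")

    chat_inference.close()
    return {"status": "ok", "message": response.choices[0].message.content}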
Create the Run Configuration in VS Code
Go to Run in VS Code and click on Add configuration.

Select Python Debugger.

Choose FastAPI.

Update the file name in the configuration from main to app (i.e., change main:app to app:app). Finally, click Run -> Run Without Debugging. Once the project is running, it will show output similar to the following.
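Depending on your VS Code version, the updated entry in launch.json should look roughly like the sketch below (the exact name and type fields vary between versions of the Python extension):

{
    "version": "0.2.0",
    "configurations": [
        {
            // Generated by the FastAPI debug configuration; "type" may be
            // "python" in older versions of the VS Code Python extension.
            "name": "Python Debugger: FastAPI",
            "type": "debugpy",
            "request": "launch",
            "module": "uvicorn",
            "args": ["app:app", "--reload"],
            "jinja": true
        }
    ]
}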

Open the URL mentioned in the console and append /docs to open the Swagger UI.

Your FastAPI app is running now; go ahead and test the endpoint from the Swagger UI. You can also test it from Postman.
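For a quick non-streaming check from the command line, a request could look like the following (assuming uvicorn is listening on its default port 8000 and the deployment name matches one of your Azure AI Foundry deployments):

curl -X POST http://127.0.0.1:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Azure AI Foundry?", "system_prompt": "You are a helpful assistant.", "stream": false, "temperature": 0.7, "topp": 0.9, "max_tokens": 100, "deployment_name": "DeepSeek-V3"}'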

Testing in Postman or the Swagger UI will not show the streaming capability. Therefore, we need to create a small web application to test this feature: a simple HTML page with JavaScript interactivity.
Create a Web Application to Test Streaming Capabilities of Azure AI Inference
Create a new file named index.html and add the following code to it. It renders a form to fill out all the parameters and calls the API.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Chat Streaming</title>
  </head>
  <body>
    <h1>Chat with AI</h1>
    <form id="chatForm">
      <label for="question">Your Question:</label><br />
      <textarea id="question" name="question" rows="4" cols="50"></textarea><br /><br />

      <label for="temperature">Temperature:</label>
      <input type="number" id="temperature" name="temperature" step="0.1" value="0.7" /><br /><br />

      <label for="topp">Top P:</label>
      <input type="number" id="topp" name="topp" step="0.1" value="0.9" /><br /><br />

      <label for="maxTokens">Max Tokens:</label>
      <input type="number" id="maxTokens" name="maxTokens" value="100" /><br /><br />

      <label for="deploymentName">Deployment Name:</label>
      <input type="text" id="deploymentName" name="deploymentName" value="DeepSeek-V3" /><br /><br />

      <button type="submit">Submit</button>
    </form>

    <h2>Response:</h2>
    <div id="response"></div>
    <div id="responseTime"></div>
    <!-- Added to display start and end time -->

    <script>
      document
        .getElementById("chatForm")
        .addEventListener("submit", async function (event) {
          event.preventDefault();

          const systemPrompt = "you are a helpful assistant";
          const question = document.getElementById("question").value;
          const temperature = parseFloat(document.getElementById("temperature").value);
          const topp = parseFloat(document.getElementById("topp").value);
          const maxTokens = parseInt(document.getElementById("maxTokens").value, 10);
          const deploymentName = document.getElementById("deploymentName").value;

          const responseDiv = document.getElementById("response");
          const responseTimeDiv = document.getElementById("responseTime");
          responseDiv.innerHTML = ""; // Clear previous response
          responseTimeDiv.innerHTML = ""; // Clear previous time

          const startTime = new Date();
          responseTimeDiv.innerText = `Stream started at: ${startTime.toLocaleTimeString()}`;

          const response = await fetch("http://127.0.0.1:8000/chat", {
            method: "POST",
            headers: {
              "Content-Type": "application/json",
            },
            body: JSON.stringify({
              question,
              system_prompt: systemPrompt,
              stream: true,
              temperature,
              topp,
              max_tokens: maxTokens,
              deployment_name: deploymentName,
            }),
          });

          if (!response.body) {
            responseDiv.innerText = "No response body found.";
            return;
          }

          const reader = response.body.getReader();
          const decoder = new TextDecoder("utf-8");
          let done = false;
          while (!done) {
            const { value, done: readerDone } = await reader.read();
            done = readerDone;
            if (value) {
              const chunk = decoder.decode(value, { stream: true });
              responseDiv.innerText += chunk; // Append streamed content
            }
          }

          const endTime = new Date();
          const duration = ((endTime - startTime) / 1000).toFixed(2); // Calculate duration in seconds
          responseTimeDiv.innerText += `\nStream ended at: ${endTime.toLocaleTimeString()}`;
          responseTimeDiv.innerText += `\nDuration: ${duration} seconds`;
        });
    </script>
  </body>
</html>
Launch this HTML page on a local server and test your AI model by filling out the necessary values.
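Any static file server will do; for example (this command is just one option, not part of the original setup), Python's built-in server can host the page on port 5500, which matches the CORS origin configured in the next step:

# Run from the folder containing index.html, then open http://127.0.0.1:5500/index.html
python -m http.server 5500 --bind 127.0.0.1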

You might not receive the response due to a CORS error. Add the code below to your app.py, after the app = FastAPI() line.
# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://127.0.0.1:5500"],  # Replace with your frontend's origin
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
Source Code
The complete source code of this implementation is available on the Dewiride Shop (JD Bots Repository) website.

