Gwyneth Peña-Siguenza · Python Developer Advocate, Microsoft
slides → madebygps.github.io/pycon26
Scan for slides
madebygps.azurewebsites.net
SLMs are the answer.
For this talk, an SLM is a small language model: small enough to run locally, no custom hardware.
Families worth knowing right now: Qwen, Gemma, Phi, Llama, Mistral
Lesson 1
Setups should make using and swapping models seamless.
Step 1
ollama pull gemma4:e4b → serves on localhost:11434
No accounts, no billing, no keys.
Step 2
API_HOST=ollama
OLLAMA_MODEL=gemma4:e4b
OLLAMA_ENDPOINT=http://localhost:11434/v1
OLLAMA_API_KEY=no-key-needed
Every repo ships a sample env file with the model and endpoint baked in. Learners copy it, rename it to .env, and they are configured.
Ollama exposes an OpenAI-compatible endpoint, so the local SLM looks just like a hosted model.
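Because the endpoint speaks the OpenAI wire format, a plain HTTP request is enough to see what each framework sends underneath. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not taken from the repos; the environment variable names and defaults are the ones from Step 2.

```python
import json
import os
import urllib.request


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    base_url = os.environ.get("OLLAMA_ENDPOINT", "http://localhost:11434/v1")
    model = os.environ.get("OLLAMA_MODEL", "gemma4:e4b")
    api_key = os.environ.get("OLLAMA_API_KEY", "no-key-needed")

    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI-style clients always send one.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With Ollama running, `send_chat_request(build_chat_request("Say hello."))` would return the model's reply. Pointing `OLLAMA_ENDPOINT` at another OpenAI-compatible server changes the URL and nothing else.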
Step 3
Provide samples across frameworks so learners have ample working code to ground themselves with.
examples/
├── agentframework_basic.py
├── agentframework_tools.py
├── langchainv1_basic.py
├── langchainv1_tools.py
├── langgraph_agent.py
├── llamaindex.py
├── openai_agents_basic.py
├── pydanticai_basic.py
└── … (~25 more)
agents/
├── agentframework_http.py
├── agentframework_learn.py
└── langchainv1_github.py
servers/
├── basic_mcp_http.py
├── basic_mcp_stdio.py
└── auth_entra_mcp.py
Two of seven repos. Full list and 150+ samples in the README at github.com/madebygps/pycon26.
# Microsoft Agent Framework
load_dotenv(override=True)
client = OpenAIChatClient(
    base_url=os.environ.get(
        "OLLAMA_ENDPOINT",
        "http://localhost:11434/v1"),
    api_key="none",
    model=os.environ.get(
        "OLLAMA_MODEL", "gemma4:e4b"),
)

# LangChain
load_dotenv(override=True)
model = ChatOpenAI(
    model=os.environ.get(
        "OLLAMA_MODEL", "gemma4:e4b"),
    base_url=os.environ.get(
        "OLLAMA_ENDPOINT",
        "http://localhost:11434/v1"),
    api_key="none",
)

# Pydantic AI
load_dotenv(override=True)
client = AsyncOpenAI(
    base_url=os.environ.get(
        "OLLAMA_ENDPOINT",
        "http://localhost:11434/v1"),
    api_key="none",
)
model = OpenAIChatModel(
    os.environ["OLLAMA_MODEL"],
    provider=OpenAIProvider(openai_client=client),
)

# LlamaIndex
load_dotenv(override=True)
Settings.llm = OpenAILike(
    model=os.environ.get(
        "OLLAMA_MODEL", "gemma4:e4b"),
    api_base=os.environ.get(
        "OLLAMA_ENDPOINT",
        "http://localhost:11434/v1"),
    api_key="none",
    is_chat_model=True,
)
Same env, same shape. Different framework.
# Microsoft Agent Framework
agent = ChatAgent(
    chat_client=client,
    instructions=system_prompt,
    tools=[get_current_date, get_weather,
           get_activities],
)
result = await agent.run(
    "hii what can I do this "
    "weekend in San Francisco?"
)

# LangChain
agent = create_agent(
    model=model,
    system_prompt=system_prompt,
    tools=[get_current_date, get_weather,
           get_activities],
)
result = agent.invoke({"messages": [
    {"role": "user", "content":
        "hii what can I do this "
        "weekend in San Francisco?"}
]})

# Pydantic AI
agent = Agent(
    model,
    system_prompt=system_prompt,
    tools=[get_current_date, get_weather,
           get_activities],
)
result = await agent.run(
    "hii what can I do this "
    "weekend in San Francisco?"
)

# LlamaIndex
agent = ReActAgent(
    name="planner",
    llm=Settings.llm,
    tools=[get_current_date, get_weather,
           get_activities],
    system_prompt=system_prompt,
)
result = await agent.run(
    "hii what can I do this "
    "weekend in San Francisco?"
)
A pattern learners recognize from one sample to the next.
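The three tools passed to every agent are not shown on the slides. A minimal sketch of what they might look like follows, as plain Python functions with canned data so the sample runs offline. The signatures, return shapes, and data are assumptions for illustration, not the repos' actual implementations, and some frameworks may need the functions wrapped or decorated before use.

```python
from datetime import date


def get_current_date() -> str:
    """Return today's date as an ISO string (YYYY-MM-DD)."""
    return date.today().isoformat()


def get_weather(city: str, day: str) -> dict:
    """Return a forecast for a city on an ISO date.

    Hypothetical canned data: every day is rainy and 60°F.
    """
    return {
        "city": city,
        "date": day,
        "conditions": "rainy",
        "temperature_f": 60,
    }


def get_activities(city: str, day: str) -> list[dict]:
    """Return candidate activities for a city on an ISO date.

    Hypothetical canned data: the same three activities every day.
    """
    return [
        {"name": "Museum", "setting": "indoor"},
        {"name": "Hiking", "setting": "outdoor"},
        {"name": "Beach", "setting": "outdoor"},
    ]
```

Docstrings and type hints matter here: frameworks commonly derive the tool description the model sees from them, and a small model has little else to go on.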
# Microsoft Agent Framework
async with (
    MCPStreamableHTTPTool(
        name="Microsoft Learn MCP",
        url="https://learn.microsoft.com/api/mcp",
    ) as mcp_server,
    Agent(client=client, tools=[mcp_server]) as agent,
):
    result = await agent.run(query)

# LangChain (MCP adapters)
mcp_client = MultiServerMCPClient({
    "github": {
        "url": "https://api.githubcopilot.com/mcp/",
        "transport": "streamable_http",
        "headers": {"Authorization": f"Bearer {token}"},
    }
})
tools = await mcp_client.get_tools()

# Pydantic AI
server = MCPServerStreamableHTTP(
    url="http://localhost:8000/mcp",
)
agent = Agent(
    model,
    system_prompt=system_prompt,
    toolsets=[server],
)
result = await agent.run(query)

# OpenAI Agents SDK
mcp_server = MCPServerStreamableHttp(
    name="weather",
    params={"url": "http://localhost:8000/mcp/"},
)
await mcp_server.connect()
agent = Agent(
    name="Assistant",
    instructions=system_prompt,
    mcp_servers=[mcp_server],
    model=OpenAIResponsesModel(
        model=MODEL_NAME, openai_client=client),
)
result = await Runner.run(agent, query)
For learners to actually try SLMs across all these frameworks, the path has to be seamless.
git clone <repo>
install ollama
ollama pull gemma4:e4b
cp .env-sample .env
uv sync
uv run examples/weekend_planner.py
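When a learner's first run fails, the usual culprits are a server that is not running or a model that was never pulled. A small preflight check can separate the two. The sketch below relies on Ollama's native `/api/tags` route, which lists locally available models; the helper names are hypothetical and not part of the repos.

```python
import json
import urllib.request


def local_model_names(tags_payload: dict) -> set[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return {entry["name"] for entry in tags_payload.get("models", [])}


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Report whether the given model tag is available locally."""
    return model in local_model_names(tags_payload)


def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(host + "/api/tags", timeout=5) as response:
        return json.load(response)
```

A sample could call `model_is_pulled(fetch_tags(), "gemma4:e4b")` at startup and print the matching `ollama pull` command when it returns `False`.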
// .devcontainer/ollama/devcontainer.json
"features": {
    "ghcr.io/prulloac/devcontainer-features/ollama:1": {
        "pull": "gemma4:e4b"
    }
},
"postCreateCommand":
    "uv sync && cp .env.sample.ollama .env"
From the code's point of view, the local SLM looks just like a hosted model. Switch the env, keep the code.
API_HOST=azure
AZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o
Use any OpenAI-compatible hosted model.
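One way to keep the switch in a single place is a small resolver that reads `API_HOST` and returns the values every framework constructor needs. This is a sketch under stated assumptions, not the repos' code: the function name is hypothetical, the `/openai/v1` path appended to the Azure endpoint is an assumption about the resource's OpenAI-compatible route, and `AZURE_OPENAI_API_KEY` is a hypothetical variable name (a deployment may use a different credential mechanism).

```python
import os
from typing import Mapping


def resolve_model_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Map API_HOST to the base URL, model name, and key for an OpenAI-style client."""
    host = env.get("API_HOST", "ollama")

    if host == "ollama":
        return {
            "base_url": env.get("OLLAMA_ENDPOINT", "http://localhost:11434/v1"),
            "model": env.get("OLLAMA_MODEL", "gemma4:e4b"),
            "api_key": env.get("OLLAMA_API_KEY", "no-key-needed"),
        }

    if host == "azure":
        endpoint = env["AZURE_OPENAI_ENDPOINT"].rstrip("/")
        return {
            # Assumption: the OpenAI-compatible route lives under /openai/v1.
            "base_url": endpoint + "/openai/v1",
            "model": env["AZURE_OPENAI_CHAT_DEPLOYMENT"],
            # Assumption: hypothetical variable name for the credential.
            "api_key": env.get("AZURE_OPENAI_API_KEY", ""),
        }

    raise ValueError(f"Unsupported API_HOST: {host!r}")
```

Each sample can then pass `settings["base_url"]`, `settings["model"]`, and `settings["api_key"]` to its framework's client without any host-specific branching of its own.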
Lesson 2
New models drop constantly. Re-evaluate constantly.
Two open models worth using today.
| | qwen3.5:9b | gemma4:e4b |
|---|---|---|
| Disk / RAM (Q4) | 6.6 GB | 9.6 GB on disk, ~5 GB at runtime |
| Total params | 9.65B | 8B (4.5B effective) |
| Modalities in | text, image | text, image, audio |
| Context window | 256K | 128K |
| Capabilities | tools, thinking, vision | tools, thinking, vision, audio |
Zoomed out, on the family:
"The Qwen models are pretty good at things like function calling." Ross Taylor, Interconnects
"Gemma 4: byte for byte, the most capable open models." Simon Willison
[Chart: results for qwen3.5:9b and gemma4:e4b]
Full results, file mappings, and run notes: github.com/madebygps/pycon26/field-reports.md
Categorize your samples into one of three:
works
Learners can simply run it. The SLM is capable of executing the task end to end.
doesn't
You experimented with the sample and tried everything your experience told you to try. The SLM just can't do this one yet.
needs help
Could work with the SLM, but needs context engineering, prompt engineering, a small pipeline tweak. Something a learner can grow into.
The needs-help bucket is where the teaching lives. Two examples next.
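Keeping the triage as data makes it easy to redo when a new model drops. A minimal sketch of one possible record shape follows; the class and function names are hypothetical, and the two entries in the usage note are the examples covered next in this talk.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    WORKS = "works"
    NEEDS_HELP = "needs help"
    DOESNT = "doesn't"


@dataclass(frozen=True)
class FieldReport:
    sample: str   # path of the sample that was run
    model: str    # model tag it was run against
    bucket: Bucket
    note: str = ""


def group_by_bucket(reports: list[FieldReport]) -> dict[Bucket, list[str]]:
    """Group sample paths by bucket, preserving input order."""
    grouped: dict[Bucket, list[str]] = defaultdict(list)
    for report in reports:
        grouped[report.bucket].append(report.sample)
    return dict(grouped)
```

For instance, `FieldReport("examples/langchainv1_tools.py", "qwen3.5:4b", Bucket.NEEDS_HELP)` and `FieldReport("agents/agentframework_learn.py", "qwen3.5:9b", Bucket.NEEDS_HELP)` would both land under `Bucket.NEEDS_HELP`.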
Example 1
uv run examples/langchainv1_tools.py
qwen3.5:4b
Tools: get_current_date, get_weather, get_activities
Prompt rule: "if an activity would be unpleasant in the weather, don't suggest it."
Query: "hii what can I do this weekend in San Francisco?"
This weekend in San Francisco (January 17-18, 2026) is expected to be rainy with temperatures around 60°F. Given the weather conditions, here are some recommendations:
Saturday, January 17, 2026 (Rainy, 60°F):
- Museum visits: Indoor museums like the San Francisco Museum of Modern Art or the de Young Museum would be perfect for staying dry while exploring exhibits
- Beach visits: While technically outdoor, the beaches are still enjoyable in mild rain conditions
- Hiking: Light rain may be acceptable for some outdoor activities
[Sunday, January 18 looks similar…]
hallucinated · ignored instruction
system_prompt=(
    "You help users plan weekends using only information returned by the tools. "
    "Always call get_current_date first. Interpret 'this weekend' relative to that date. "
    "If today is Saturday or Sunday, include today as part of this weekend. "
    "Use only exact dates returned or derived from the current date, and make sure weekdays match dates. "
    "Call get_weather and get_activities only for the weekend dates you plan to discuss. "
    "Do not invent specific venues, restaurants, museums, neighborhoods, or activities unless a tool returned them. "
    "If an activity would be unpleasant in the weather, explain briefly and do not recommend it. "
    "Return a concise plain-text recommendation."
)
This Weekend in San Francisco (January 17-18, 2026)
Saturday January 17: Rainy, 60°F. Visit the Museum.
Sunday January 18: Rainy, 60°F. Visit the Museum.
Both days feature Museum activities, which is ideal given the rainy
conditions. Hiking and Beach activities are not recommended due to the rain.
grounded · follows instruction
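Several rules in that prompt ask the model to do date arithmetic. A pipeline tweak a learner could try next is moving that arithmetic into code, so the model only has to read the result. The sketch below is one possible helper, not something shown in the talk; the function names are hypothetical, and it encodes the prompt's rule that Saturday or Sunday counts as part of this weekend.

```python
from datetime import date, timedelta


def this_weekend(today: date) -> tuple[date, date]:
    """Return (Saturday, Sunday) for the weekend that 'this weekend' refers to.

    If today is Saturday or Sunday, today is part of the returned weekend.
    Otherwise the upcoming Saturday and Sunday are returned.
    """
    weekday = today.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    if weekday == 6:
        saturday = today - timedelta(days=1)
    else:
        saturday = today + timedelta(days=5 - weekday)
    return saturday, saturday + timedelta(days=1)


def describe(day: date) -> str:
    """Format a date with its weekday so the two can never disagree."""
    return f"{day.strftime('%A')} {day.isoformat()}"
```

Called with `date(2026, 1, 14)`, a Wednesday, `this_weekend` returns January 17 and 18, 2026, the same dates that appear in the grounded answer above. A tool could return these dates directly instead of leaving the model to derive them.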
Example 2
uv run agents/agentframework_learn.py
qwen3.5:9b
An agent connected to the Microsoft Learn MCP server:
MCPStreamableHTTPTool(
    name="Microsoft Learn MCP",
    url="https://learn.microsoft.com/api/mcp",
)
System prompt:
You help with Microsoft documentation questions.
Query: "What are the available hosting options for a Python web app on Azure? Compare them briefly."
This is a crucial distinction to understand when working with official
Microsoft documentation tools.
Here is a detailed comparison of `microsoft_docs_search` and
`microsoft_docs_fetch`:
`microsoft_docs_search` is your primary tool for discovery...
`microsoft_docs_fetch` is used for depth...
Summary Workflow:
1. Start with `microsoft_docs_search`
2. Identify the Best URL
3. Follow up with `microsoft_docs_fetch`
...
The model dropped the question and described the MCP tools instead.
The trace showed huge MCP responses: too much text for a small model to keep its focus on the question.
off task
New system prompt:
You are a helpful assistant that answers questions using Microsoft documentation.
Use only the supplied evidence packet.
Keep your answer focused on what the user asked. Do not add unrelated topics.
Be concise and cite sources when possible.
Student implementation: pass the question plus an evidence packet to the agent.
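The slides do not show how the evidence packet is built. One minimal way a learner might assemble it is sketched below: cap the number of documents, cap the length of each, and label every excerpt with its source so the model can cite it. The `Doc` type, the function names, and the size limits are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    url: str
    text: str


def build_evidence_packet(
    docs: list[Doc],
    max_docs: int = 3,
    max_chars_per_doc: int = 1500,
) -> str:
    """Trim retrieved documents into a compact, numbered, citable packet."""
    sections = []
    for index, doc in enumerate(docs[:max_docs], start=1):
        excerpt = doc.text[:max_chars_per_doc]
        sections.append(f"[{index}] {doc.title}\nSource: {doc.url}\n{excerpt}")
    return "\n\n".join(sections)


def build_user_message(question: str, docs: list[Doc]) -> str:
    """Combine the question and the evidence packet into one user message."""
    packet = build_evidence_packet(docs)
    return f"Question: {question}\n\nEvidence packet:\n{packet}"
```

The design choice is that retrieval and trimming happen in ordinary code, so the model receives a bounded amount of text instead of raw tool responses.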
Azure offers several hosting options, with the best choice depending on
your required level of control, complexity, and scaling needs. The main
options are Azure Container Apps, Azure Kubernetes Service, Azure App
Service, and Azure Functions.
| Service | Best For | Complexity | Control |
|--------------------------|-----------------------------------------|--------------|---------|
| Azure Container Apps | Microservices, Web APIs, containers | Low–Medium | High |
| Azure Kubernetes Service | Large-scale enterprise apps | High | Highest |
| Azure App Service | Traditional web apps | Low | Medium |
| Azure Functions | Event-driven tasks | Lowest | Low |
grounded · on task
All because of the constraints.
The same skills production AI work demands.
SLMs make you practice them. Bigger models let you skip them.
Gwyneth Peña-Siguenza
@madebygps · madebygps.github.io/pycon26