How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI

MarkTechPost

Sana Hassan · 2026-04-29 · via MarkTechPost

In this tutorial, we build a complete, production-style LLM workflow using Promptflow in a Colab environment. We begin by setting up a file-based keyring backend to avoid OS dependency issues and then configure our OpenAI connection. From there, we establish a clean workspace and define a structured Prompty file that acts as the core LLM component of the pipeline. We then design a class-based flex flow that combines deterministic preprocessing with LLM reasoning, allowing us to inject computed hints into the prompt. We also enable tracing to monitor each execution step, run both single queries and a batch job, and return outputs in a structured format. Finally, we extend the system with an evaluation pipeline that uses an LLM-as-a-judge to score responses against expected answers.

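!pip install -q keyrings.alt "promptflow>=1.13.0"


import keyring
from keyrings.alt.file import PlaintextKeyring
# Fall back to a file-based keyring backend (note: it stores secrets unencrypted).
keyring.set_keyring(PlaintextKeyring())


import os, getpass
from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection


# Ask for the key up front so the connection below can be created without a KeyError.
if "OPENAI_API_KEY" not in os.environ:
   os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")


pf = PFClient()
CONN = "open_ai_connection"
try:
   pf.connections.get(name=CONN)
   print(f"Using existing connection '{CONN}'")
except Exception:
   pf.connections.create_or_update(
       OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
   )
   print(f"Created connection '{CONN}'")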

We begin by installing a fallback keyring backend to avoid dependency issues in environments like Colab. We then initialize the Promptflow client and check if an OpenAI connection already exists. If not, we create one using the API key from the environment, ensuring a reusable and consistent connection setup.

!pip install -q "promptflow>=1.13.0" "promptflow-tracing" "promptflow-tools" openai


import os, sys, json, getpass, textwrap, importlib
from pathlib import Path


if "OPENAI_API_KEY" not in os.environ:
   os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")


WORK_DIR = Path("/content/pf_demo"); WORK_DIR.mkdir(exist_ok=True, parents=True)
os.chdir(WORK_DIR); sys.path.insert(0, str(WORK_DIR))


from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection
from promptflow.tracing import start_trace


pf = PFClient()
CONN = "open_ai_connection"
try:
   pf.connections.get(name=CONN); print(f"Using existing connection '{CONN}'")
except Exception:
   pf.connections.create_or_update(OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"]))
   print(f"Created connection '{CONN}'")

We install the required Promptflow libraries and set up the project's working directory. We read the OpenAI API key with getpass if it is not already set, so it is not echoed into the notebook, and store it in the environment. We then reinitialize the Promptflow client and make sure the connection exists for downstream use.
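The key-capture step can be factored into a small helper. The sketch below is illustrative only: `ensure_api_key` and `mask` are hypothetical names, not part of Promptflow, and the helper assumes the key lives in the `OPENAI_API_KEY` environment variable as in the cell above.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    env: MutableMapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key stored under `name`, prompting once if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        key = prompt(f"Paste your {name}: ").strip()
        if not key:
            raise ValueError(f"{name} was not provided")
        env[name] = key
    return key


def mask(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so the full key never lands in cell output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Passing `env` and `prompt` as parameters keeps the helper testable without touching the real environment or blocking on input; in a notebook, `print(mask(ensure_api_key()))` would confirm that a key is present without revealing it.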

(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
 api: chat
 configuration:
   type: openai
   connection: open_ai_connection
   model: gpt-4o-mini
 parameters:
   temperature: 0.2
   max_tokens: 350
inputs:
 question: {type: string}
 hint:     {type: string, default: ""}
sample:
 question: "What is the speed of light in vacuum?"
 hint: ""
---
system:
You are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.


user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")


(WORK_DIR / "flow.py").write_text(textwrap.dedent('''
   from pathlib import Path
   from promptflow.tracing import trace
   from promptflow.core import Prompty


   BASE = Path(__file__).parent


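   @trace
   def safe_calc(expression: str) -> str:
       """A tiny deterministic 'tool' the assistant can lean on."""
       # Whitelist characters, and reject "**" so a huge power cannot stall the run.
       if not set(expression) <= set("0123456789+-*/(). ") or "**" in expression:
           return "unsafe"
       try:
           # Builtins are removed; only the whitelisted arithmetic can be evaluated.
           return str(eval(expression, {"__builtins__": {}}, {}))
       except Exception as e:
           return f"error:{e}"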


   class ResearchAssistant:
       """Class-based flex flow. __init__ args become flow init parameters."""
       def __init__(self, model: str = "gpt-4o-mini"):
           self.model = model
           self.llm = Prompty.load(source=BASE / "researcher.prompty")


       @trace
       def __call__(self, question: str) -> dict:
           hint = ""
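           if "*" in question or "+" in question:
               # Keep operator tokens too; dropping them would turn "21 * 19" into "2119".
               ops = {"+", "-", "*", "/"}
               tokens = [t for t in question.replace("?","").split() if t in ops or any(c.isdigit() for c in t)]
               expr = "".join(tokens)
               if expr:
                   hint = f"computed: {expr} = {safe_calc(expr)}"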


           answer = self.llm(question=question, hint=hint)


           return {"question": question, "answer": str(answer).strip(), "hint_used": hint}
'''))


(WORK_DIR / "flow.flex.yaml").write_text(
   "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
   "entry: flow:ResearchAssistant\n"
)

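We define a Prompty file that specifies how the LLM should behave as a concise research assistant. We then create a class-based flow that combines a deterministic calculation tool with an LLM call, so arithmetic in a question is computed in Python and passed to the model as a hint. Note that the model argument of ResearchAssistant is stored on the instance but not forwarded to the Prompty, so the model named in researcher.prompty is the one that actually runs. Finally, we declare the flow's entry point in flow.flex.yaml, making it executable within the Promptflow framework.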
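The calculation tool in the flow places a character whitelist in front of `eval`. A stricter alternative, sketched here under our own assumptions, walks the parsed syntax tree and accepts only numeric literals and the four basic operators. `ast_calc` and `extract_expression` are hypothetical names; they are not part of the tutorial's flow or of Promptflow.

```python
import ast
import operator
import re

# Only these operators are allowed; exponentiation is deliberately absent.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> float:
    """Recursively evaluate a node, rejecting anything that is not plain arithmetic."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def ast_calc(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals; return 'unsafe' or 'error:...' otherwise."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_eval_node(tree.body))
    except ValueError:
        return "unsafe"
    except (SyntaxError, ZeroDivisionError) as exc:
        return f"error:{exc}"


def extract_expression(question: str) -> str:
    """Return the first digit-and-operator run in free text, or '' if none has an operator."""
    match = re.search(r"\d[\d\s.+\-*/()]*[\d)]", question)
    if not match:
        return ""
    expr = re.sub(r"\s+", "", match.group())
    return expr if re.search(r"[+\-*/]", expr) else ""


print(ast_calc(extract_expression("What is 21 * 19 ?")))
```

Because function calls, names, and attribute access never match an allowed node type, they are rejected before anything is evaluated, which is the main advantage over filtering characters.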

try: start_trace()
except Exception as e: print("trace ui unavailable on Colab — traces still recorded:", e)


import flow as _flow; importlib.reload(_flow)
agent = _flow.ResearchAssistant(model="gpt-4o-mini")


print("\n=== Single call ===")
print(json.dumps(agent(question="In one sentence, what is photosynthesis?"), indent=2))
print(json.dumps(agent(question="What is 21 * 19 ?"), indent=2))


data = [
   {"question": "What is the capital of France?",          "expected": "Paris"},
   {"question": "Chemical symbol for gold?",               "expected": "Au"},
   {"question": "Who wrote the play Hamlet?",              "expected": "Shakespeare"},
   {"question": "What is 12 * 11 ?",                       "expected": "132"},
   {"question": "Boiling point of water at sea level (C)?","expected": "100"},
   {"question": "Largest planet in our solar system?",     "expected": "Jupiter"},
]
data_path = WORK_DIR / "data.jsonl"
data_path.write_text("\n".join(json.dumps(r) for r in data))


print("\n=== Batch run ===")
base_run = pf.run(
   flow=str(WORK_DIR / "flow.flex.yaml"),
   data=str(data_path),
   column_mapping={"question": "${data.question}"},
   stream=True,
)
print(pf.get_details(base_run))

We enable tracing to capture execution details and instantiate our research assistant flow. We test the system with individual queries to verify both natural language and arithmetic handling. We then prepare a dataset and run a batch job in Promptflow, collecting structured outputs for further evaluation.
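Before submitting a batch run, it can help to confirm that the JSONL file parses and that every row carries the columns referenced in the column mapping. The check below is a stdlib-only sketch; `validate_jsonl` is a hypothetical helper rather than a Promptflow API, and the sample rows mirror two entries from the dataset above.

```python
import json
import tempfile
from pathlib import Path


def validate_jsonl(path: Path, required: set[str]) -> list[dict]:
    """Parse a JSONL file and fail early on malformed lines or missing columns."""
    rows = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(row, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = required - set(row)
        if missing:
            raise ValueError(f"line {lineno}: missing columns {sorted(missing)}")
        rows.append(row)
    return rows


sample = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 12 * 11 ?", "expected": "132"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data.jsonl"
    demo_path.write_text("\n".join(json.dumps(r) for r in sample))
    rows = validate_jsonl(demo_path, required={"question", "expected"})
    print(f"{len(rows)} rows OK")
```

Running a check like this against data.jsonl before the batch run surfaces a missing column as a line-numbered error instead of a failure partway through the run.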

(WORK_DIR / "judge.prompty").write_text("""---
name: Judge
model:
 api: chat
 configuration:
   type: openai
   connection: open_ai_connection
   model: gpt-4o-mini
 parameters:
   temperature: 0
   max_tokens: 150
   response_format: {type: json_object}
inputs:
 question: {type: string}
 answer:   {type: string}
 expected: {type: string}
---
system:
You are an exacting grader. Decide whether the assistant's answer contains the expected fact (case-insensitive, allowing reasonable phrasing/synonyms). Reply ONLY as JSON: {"score": 0 or 1, "reason": "..."}.


user:
Question: {{question}}
Expected: {{expected}}
Answer:   {{answer}}
""")


(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
   import json
   from pathlib import Path
   from promptflow.tracing import trace
   from promptflow.core import Prompty


   BASE = Path(__file__).parent


   class Evaluator:
       def __init__(self):
           self.judge = Prompty.load(source=BASE / "judge.prompty")


       @trace
       def __call__(self, question: str, answer: str, expected: str) -> dict:
           raw = self.judge(question=question, answer=answer, expected=expected)
           if isinstance(raw, str):
               try: raw = json.loads(raw)
               except Exception: raw = {"score": 0, "reason": f"unparseable:{raw[:80]}"}
           return {"score": int(raw.get("score", 0)), "reason": str(raw.get("reason",""))}


       def __aggregate__(self, line_results):
           """Run-level aggregation. Whatever this returns shows up in pf.get_metrics()."""
           scores = [r["score"] for r in line_results if r]
           return {
               "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
               "passed":   sum(scores),
               "total":    len(scores),
           }
'''))


(WORK_DIR / "eval.flex.yaml").write_text(
   "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
   "entry: eval_flow:Evaluator\n"
)


print("\n=== Evaluation run ===")
eval_run = pf.run(
   flow=str(WORK_DIR / "eval.flex.yaml"),
   data=str(data_path),
   run=base_run,
   column_mapping={
       "question": "${data.question}",
       "expected": "${data.expected}",
       "answer":   "${run.outputs.answer}",
   },
   stream=True,
)


eval_details = pf.get_details(eval_run)
print(eval_details)


print("\n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))


import pandas as pd
if "outputs.score" in eval_details.columns:
   s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
   print(f"Manual accuracy: {s.mean():.2%}  ({int(s.sum())}/{len(s)})")

We create a judging Prompty that evaluates model outputs against expected answers and replies in structured JSON. We implement an evaluator class that parses the judge's reply into a score and a reason, and defines an aggregation method for run-level metrics. Finally, we run the evaluation pipeline, link it to the base run, and compute accuracy both through Promptflow metrics and a manual fallback.
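An LLM judge can itself be wrong, so a deterministic baseline is a useful cross-check. The sketch below scores an answer as correct when it contains the expected string, case-insensitively, and then aggregates in the same accuracy/passed/total shape the evaluator returns. `contains_expected` and `aggregate` are hypothetical helpers, and the answers are made-up sample strings, not actual model output.

```python
def contains_expected(answer: str, expected: str) -> int:
    """Return 1 if `expected` appears in `answer` (case-insensitive), else 0."""
    return int(expected.strip().lower() in answer.lower())


def aggregate(scores: list[int]) -> dict:
    """Mirror the accuracy/passed/total shape used by the evaluator's __aggregate__."""
    return {
        "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
        "passed": sum(scores),
        "total": len(scores),
    }


# Made-up sample answers for illustration only.
samples = [
    {"answer": "The capital of France is Paris.", "expected": "Paris"},
    {"answer": "Gold's chemical symbol is Au.", "expected": "Au"},
    {"answer": "12 * 11 equals 131.", "expected": "132"},
]
scores = [contains_expected(s["answer"], s["expected"]) for s in samples]
print(aggregate(scores))
```

Substring matching is crude: a short expected value such as "Au" also matches inside unrelated words like "because", so this baseline is best read alongside the judge's scores rather than as a replacement. Rows where the two disagree are the ones worth inspecting by hand.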

In conclusion, we built a modular LLM pipeline that extends beyond basic prompt-response interactions. We integrated a deterministic tool, structured prompting, and reusable flow components, so each step can be inspected on its own. Through batch execution and linked evaluation runs, we established a feedback loop that measures performance with accuracy metrics and per-answer reasons from the judge. Tracing and the aggregation function give us what we need to debug, monitor, and improve the system. Overall, this workflow shows how to design end-to-end LLM applications with solid foundations in structure, evaluation, and reproducibility.


Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
