The Cisco AI Readiness Index shows that most organizations are already seeing tangible value from investment in artificial intelligence (AI). However, early adopters quickly encountered limitations when attempting to generate long-form, technical content. For instance, when given raw notes and asked to create technical reports, large language models (LLMs) such as ChatGPT, Claude, and Gemini generated polished-looking results that often contained significant inaccuracies, unusual conclusions, and inconsistent writing styles.
The Cisco Talos Incident Response (Talos IR) AI Tiger Team set out to identify the root causes of these output problems, which we collectively refer to as “inconsistencies.” After defining these issues, we experimented with various solutions through prompt engineering. In the following sections, we share our findings on the consistency problem and our control methods based on a specific case study, drafting an experimental AI-assisted Tabletop Exercise (TTX) report.
In a nutshell, a TTX involves cybersecurity stakeholders gathering in a virtual or physical conference room and talking through a fictitious, tailored scenario involving a cybersecurity incident. Facilitators guide them through a discussion of incident resolution, asking probing questions to highlight areas of strength and potential gaps in the organization’s incident response processes. While this case study focuses on a TTX report, the methodology could be adapted to any cybersecurity reporting use case with standardized inputs and predictable outputs.
As an important note, the Talos IR AI Tiger Team experiments with and publishes these findings from a strictly research-oriented perspective.
Defining the inconsistency problem in AI reporting
Various types of inconsistencies in AI output frequently diminish the efficiency gains that AI reporting processes promise to deliver. At their core, most inconsistencies stem from the probability-driven nature of LLMs. These models generate output by predicting the next token, typically a word or sub-word, in a sequence, based on model weights and training data. In essence, this means that no two LLM outputs will be identical, even when provided with the exact same prompt multiple times.
Talos IR identified four ways this probabilistic nature manifests itself during report content generation, detailed in the following list:
- Inconsistency in research and sourcing: LLMs utilize various data sources, ranging from static training sets to real-time internet access. Because a model may pull from different websites during separate runs, the underlying data often shifts. This variability in source material directly leads to inconsistent results, making it difficult to rely on an LLM for repeatable, standardized research outcomes.
- Inconsistency in conclusions: Even with identical data, LLMs may produce different conclusions. For example, in a data breach scenario, a model might suggest a full organization-wide password reset in one instance and a targeted reset in another. Without the nuance to evaluate specific context, the model often defaults to whichever recommendation it generates first. This lack of consistency complicates decision-making, as the model may fail to provide the most appropriate solution for the specific circumstances at hand.
- Inconsistency in output format: Because LLMs generate content token-by-token, document structure and formatting can fluctuate between runs. This unpredictability is problematic for professional environments where standardized layouts, such as consistent executive summaries or recommendation sections, are essential for quality control. Achieving a predictable, uniform output remains a significant challenge when using LLMs for formal report generation.
- Inconsistency due to context drift and pollution: LLMs use a “context window” to track conversation history, but this creates two primary issues. First, when the window hits its limit, the model discards older information, potentially losing critical initial instructions. Second, performing multiple unrelated tasks in one session leads to “context pollution,” where conflicting data causes the model to produce unpredictable or blended results. As a session grows, these factors degrade performance, as the model struggles to maintain focus on the original task requirements.
Methods to control inconsistencies
The Talos IR AI Tiger Team developed and tested various prompt engineering methods to control each type of inconsistency. While none of these methods are particularly groundbreaking individually, they collectively produced the highly accurate report described in the “Case Study” section. The four following inconsistency control methods are described and discussed to help others on their own prompt-writing journey.
- Prompt specialization: Prompt specialization mitigates context drift and pollution by replacing large, unified prompts with granular, single-task instructions. By focusing each prompt on a specific, small portion of the report, the risk of hallucination or cross-contamination between sections is significantly reduced. This modular approach allows for greater transparency and easier optimization of individual components.
- Specified source constraints: Specified source constraints address inconsistencies in research and conclusions by mandating exactly where the LLM should retrieve information. By providing explicit instructions on data provenance, users limit the model’s ability to pull from unreliable or conflicting sources. This control ensures that the final output remains grounded in authoritative data, preventing the generation of inaccurate or speculative content. Defining these boundaries within the prompt is essential for maintaining integrity and ensuring that the model’s conclusions align strictly with the provided source material.
- Output format specification: Output format specification ensures consistency by providing the LLM with rigid parameters regarding length, tone, content, and structure. Without these instructions, models often produce excessive or overly creative content that deviates from professional standards. By explicitly defining the target audience, preferred writing style, and necessary content elements, users can force the model to adhere to a predictable structure. This level of guidance is critical for quality control, ensuring that the generated report meets professional requirements and remains free of unnecessary or redundant information.
- Template-guided prompting: Template-guided prompting is a method for strictly enforcing structural consistency. By embedding a rigid template directly into the prompt, users can control exactly how the final output is laid out. Clear instructions are provided to the model to distinguish between static text that must remain unchanged and dynamic placeholders that require replacement. This approach eliminates formatting variability, ensuring that every document follows a uniform, professional structure. By combining these templates with clear delimiter instructions, users achieve highly predictable, repeatable output that requires minimal post-processing or manual formatting.
Case study: TTX report
We selected the TTX report as an ideal case study candidate for two key reasons. First, its content is largely a reorganization of notes captured during a TTX event, meaning the LLM’s role is focused on restructuring existing data rather than generating new content creatively. Second, unlike a forensics report, which contains timestamps, file paths, and other technical elements that are difficult to manually verify, a TTX report is straightforward enough for the human author to review at a glance. This makes it significantly less likely that a hallucination would go undetected during research and testing.
As mentioned earlier, during our research the team created three TTX reporting prompts named the “Discussion Organizer,” the “Recommendation Polisher,” and the “Executive Summarizer.” One of these, the “Executive Summarizer,” is shown in full below to assist other researchers in their work. It is designed to write an accurate, concise executive summary given the rest of the report as input.
The benefits
There were many clear benefits to AI-generated reporting during our testing:
- Efficiency: As noted at the start of this post, case study test results predicted a 50% reduction in total report drafting time. This included the time spent manually writing the 10% of content that could not be efficiently AI-generated and manually editing the AI-generated content.
- Better content: The “Recommendation Polisher” prompt was effective in suggesting corollaries of recommendations that the TTX participants and facilitators may not have explicitly identified during the discussion. Our testing resulted in more robust lists of recommendations.
- Consistent quality: A blind test of the sample report in our quality assurance process showed no noticeable drop in overall writing quality. The peer reviewer, professional editor, and management reviewer all made complimentary comments about the report while unaware that it was AI-generated. The peer reviewer commented that the incidence of typos and grammatical errors was far lower than in the average report.
Cautions
There were also some drawbacks and considerations that would need to be closely managed in a production environment:
- Data management: First, proper AI tool selection is critical to protect sensitive data. Uploading organizational data into a publicly hosted AI tool would often constitute a policy violation and significant data privacy incident. Talos IR carefully adheres to Cisco’s Responsible AI principles and urges other organizations and individuals to exercise extreme caution in data handling.
- Model selection: Testing confirmed that model selection is critical for output quality. As of late 2025, Claude Sonnet 4.5 emerged as the most effective model, delivering high-quality, consistent prose. Its ability to proactively identify and flag internal conflicts in source notes significantly reduced the need for manual corrections.
- Input quality control: Unsurprisingly, we found that input quality determines output quality. To quote a coding aphorism, “Garbage in, garbage out.” The primary area where this can be problematic is the recommendations. While the model can and does identify missed recommendations, it cannot be relied upon to do so.
- LLM over-reliance: Perhaps the most obvious consideration is that report authors retain accountability for the quality of the final product. That being the case, they must edit, understand, and take ownership of every word of the final report. While testing, we found that the LLMs generated recommendations that were duplicative, irrelevant, or not actionable. If this were used in a production environment without manual checks, it could result in poor-quality recommendations in a final report.
Technology limitations
The Talos IR AI Tiger Team found during testing that editing multiple sample reports within a single session resulted in cross-contamination of content from one report’s source material to another, even if the notes used to generate the first report were deleted from the project’s reference documents. We determined that it was critical to run each prompt in a new session or project to ensure the integrity of the output.
Separately, we developed and tested a fourth prompt intended to edit a full report for errors in grammar, spelling, etc. While the process was highly effective in identifying misspellings, multiple iterations hallucinated numerous grammar issues (false positives) and failed to identify actual issues (false negatives), with a success rate below 50%. The most concerning aspect was that multiple runs with the same model, prompt, and draft report input would behave inconsistently, sometimes catching issues and sometimes overlooking them. While our team will continue to test this use case as models improve, it is currently unsuitable for production use.
What’s next
Cisco has invested considerable resources in the responsible adoption and development of AI. The primary goal of the Talos IR AI Tiger Team is to take that broad mandate and convert it into actionable applications within the fields of incident response and forensics. With that in mind, we continuously test, develop, and publish new capabilities in accordance with Cisco’s Responsible AI principles. Again, the Talos IR AI Tiger Team experiments with and publishes these findings from a strictly research-oriented perspective.
If you’re interested in learning more about Cisco Talos Incident Response and how our services could benefit your organization, we’d love to talk with you further. You can read more about us and contact us via our website.
Disclaimer: Some of the individuals posting to this site, including the moderators, work for Cisco. Opinions expressed here and in any corresponding comments are the personal opinions of the original authors, not those of Cisco.
We’d love to hear what you think! Ask a question and stay connected with Cisco Security on social media.
Cisco Security Social Media
LinkedIn
Facebook
Instagram