{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Programming Assignment 2: Strategic Interaction Among LLM Agents\n",
    "\n",
    "## CSCE 631 \u2014 Summer 2026\n",
    "\n",
    "**Due Date:** Saturday, June 27, 2026\n",
    "\n",
    "**Student Name:**  \n",
    "**Student ID:**\n",
    "\n",
    "---\n",
    "\n",
    "## Overview\n",
    "\n",
    "In this assignment, you will use the **TAMU API** to build a **multi-agent debate system** with real large language models and analyze the results through a **game-theoretic lens**. You will formalize debate as an extensive-form game with imperfect information, implement LLM-powered debate agents, run controlled experiments, and connect your findings to Nash equilibrium concepts from Weeks 1-4.\n",
    "\n",
    "> **Before starting:** Complete the TAMU API setup described in **TAMU-API-Guide.pdf**. You must be able to make a successful API call before proceeding.\n",
    "\n",
    "> **Thinking Mode:** The default model (`protected.Claude Sonnet 4.5`) has extended thinking enabled on the TAMU gateway. You must use `temperature=1` and `max_tokens=16384` (or higher). For high-volume experiments, consider using `protected.Claude-Haiku-4.5` or `protected.gpt-5-mini`, which accept standard parameters and are cheaper. See the API Guide for the full model table.\n",
    "\n",
    "### Learning Objectives\n",
    "\n",
    "- Use the TAMU API to orchestrate multi-agent LLM interactions\n",
    "- Formalize multi-agent debate using the extensive-form game framework from Week 3\n",
    "- Empirically investigate debate convergence, accuracy, and failure modes with real LLMs\n",
    "- Analyze LLM debate behavior through the lens of Nash equilibrium and information sets\n",
    "- Connect findings to recent results on LLM strategic behavior (Lekeas & Stamatopoulos, 2026)\n",
    "\n",
    "### Budget\n",
    "\n",
    "You have a **$5/day** API budget. The experiments in this assignment are designed to fit within that limit using `protected.Claude Sonnet 4.5` as the default model. For high-volume experiments (many trials), consider `protected.Claude-Haiku-4.5` or `protected.gpt-5-mini` \u2014 they are cheaper and don't require thinking-mode constraints. See the API guide for details.\n",
    "\n",
    "### Connection to Prior Weeks\n",
    "\n",
    "- **Week 1:** Normal-form games, Nash equilibrium, dominated strategies\n",
    "- **Week 3:** Extensive-form games, information sets, behavioral strategies, CFR\n",
    "- **Week 5:** LLM agent architectures, multi-agent debate, strategic behavior in classical games"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Conventions\n",
    "\n",
    "- Answers are encoded as strings: `\"A\"` or `\"B\"`.\n",
    "- Agents are indexed `0, 1, ..., n-1`.\n",
    "- A **debate history** is a list of `(agent_id, argument_text)` tuples.\n",
    "- Each agent receives a **private signal** \u2014 a suggested answer with a brief justification, delivered via its system prompt.\n",
    "- The **judge** is a deterministic mechanism (majority vote over extracted answers), not a strategic player.\n",
    "- All API calls go through the TAMU API (`chat.tamu.ai`). See `TAMU-API-Guide.pdf` for setup.\n",
    "- Helper functions `_ok()` are provided after each task for self-testing. These use **mock responses** and do not make real API calls.\n",
    "- Set `LIVE_MODE = True` only when you are ready to spend API budget on real experiments."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import time\n",
    "import json\n",
    "import re\n",
    "import os\n",
    "from typing import Dict, List, Tuple, Optional\n",
    "from collections import Counter\n",
    "from unittest.mock import MagicMock, patch\n",
    "\n",
    "try:\n",
    "    import openai\n",
    "    print(\"openai library loaded\")\n",
    "except ImportError:\n",
    "    raise ImportError(\"Run: pip install openai\")\n",
    "\n",
    "try:\n",
    "    import matplotlib.pyplot as plt\n",
    "    import numpy as np\n",
    "except ImportError:\n",
    "    raise ImportError(\"Run: pip install matplotlib numpy\")\n",
    "\n",
    "# \u2500\u2500 TAMU API Configuration \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n",
    "TAMU_API_KEY = \"sk-0183ed7c1c8e47c0a8f4d1e75161aa6d\"\n",
    "TAMU_BASE_URL = \"https://chat.tamu.ai/api\"\n",
    "DEFAULT_MODEL = \"protected.Claude Sonnet 4.5\"\n",
    "\n",
    "# TODO: Paste your CF_Authorization cookie value below.\n",
    "#       Log into chat.tamu.ai, open DevTools (F12) -> Application -> Cookies,\n",
    "#       and copy the CF_Authorization value (starts with eyJ...).\n",
    "CF_COOKIE = \"CF_Authorization=eyJ...\"  # TODO: paste your cookie here\n",
    "\n",
    "# \u2500\u2500 Thinking Mode Note \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n",
    "# Claude Sonnet/Opus models on the TAMU gateway have 'thinking mode'\n",
    "# enabled server-side. This means:\n",
    "#   - temperature MUST be exactly 1 (any other value errors)\n",
    "#   - max_tokens MUST be >= 16384 (model reserves tokens for thinking)\n",
    "# Non-thinking models (Claude-Haiku-4.5, gpt-5-mini, gemini-2.5-flash)\n",
    "# work with standard parameters. See TAMU-API-Guide.pdf for details.\n",
    "\n",
    "# Toggle live API calls vs. mock mode for development\n",
    "LIVE_MODE = False  # Set to True when ready to run real experiments\n",
    "\n",
    "print(\"Setup complete!\")\n",
    "if CF_COOKIE == \"CF_Authorization=eyJ...\":\n",
    "    print(\"WARNING: You have not pasted your CF_Authorization cookie yet.\")\n",
    "    print(\"         See TAMU-API-Guide.pdf for instructions.\")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Multi-Agent Debate as an Extensive-Form Game\n",
    "\n",
    "We formalize a structured debate among $n$ LLM agents about a **factual question** with a known correct answer.\n",
    "\n",
    "### Extensive-Form Game Tuple\n",
    "\n",
    "$$\\Gamma = \\langle N, H, Z, A, \\tau, \\mathcal{I}, \\sigma_0, u \\rangle$$\n",
    "\n",
    "| Component | Definition in our debate |\n",
    "|-----------|------------------------|\n",
    "| $N = \\{1, \\ldots, n\\}$ | LLM debater agents |\n",
    "| $H$ | Set of all debate histories (sequences of arguments) |\n",
    "| $Z \\subset H$ | Terminal histories (after $T$ rounds of debate) |\n",
    "| $A(h)$ | At each turn, the active agent produces a natural-language argument containing a claim |\n",
    "| $\\tau: H \\to N$ | Player function: round-robin assignment $\\tau(h) = |h| \\bmod n$ |\n",
    "| $\\mathcal{I}_i$ | Information partition for agent $i$ |\n",
    "| $\\sigma_0$ | Nature's move: select question, assign private signals |\n",
    "| $u_i: Z \\to \\mathbb{R}$ | Payoff to agent $i$ at terminal node |\n",
    "\n",
    "### Nature's Move\n",
    "\n",
    "Nature acts first:\n",
    "\n",
    "1. Select a factual question $q$ with ground-truth answer $\\theta \\in \\{A, B\\}$.\n",
    "2. For each agent $i$, construct a **private signal** $s_i$: a system prompt that suggests answer $\\theta$ with probability $p$ (the signal accuracy) and the wrong answer with probability $1 - p$.\n",
    "\n",
    "### Information Sets\n",
    "\n",
    "Agent $i$'s information set at history $h$ is:\n",
    "\n",
    "$$\\mathcal{I}_i(h) = (\\text{system\\_prompt}_i,\\; h)$$\n",
    "\n",
    "Each agent observes:\n",
    "- Its **system prompt** containing the private signal (but not other agents' system prompts)\n",
    "- The **full public debate history** $h$ (all previous arguments and who made them)\n",
    "- It does **not** observe $\\theta$ directly or other agents' private signals\n",
    "\n",
    "### Payoff Structure\n",
    "\n",
    "We use a **cooperative (truth-seeking)** payoff regime:\n",
    "\n",
    "$$u_i(\\theta, d) = \\mathbf{1}[J(h_T) = \\theta]$$\n",
    "\n",
    "All agents are rewarded when the judge's decision $J(h_T)$ matches the ground truth. The judge uses **majority vote** over the answers extracted from each agent's final-round argument.\n",
    "\n",
    "### Key Question\n",
    "\n",
    "Does multi-round LLM debate **improve accuracy** over single-round answers? Under what conditions does debate converge, and when does it fail?"
   ]
  },
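  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reference point for the key question above, it may help to know how often a majority vote over the private signals alone would be correct, before any debate takes place. The sketch below is illustrative and not part of the graded tasks. It assumes the $n$ signals are independent and each matches $\\theta$ with probability $p$; the function name `signal_majority_accuracy` and the `tie_correct` flag are hypothetical. Because the judge breaks ties in favor of A, a tie counts as correct only when $\\theta = A$.\n",
    "\n",
    "```python\n",
    "from math import comb\n",
    "\n",
    "\n",
    "def signal_majority_accuracy(n: int, p: float, tie_correct: bool) -> float:\n",
    "    # Probability that a majority of n independent signals equals theta.\n",
    "    # tie_correct: whether an exact tie resolves to the true answer\n",
    "    # (True when theta is A, since ties go to A).\n",
    "    total = 0.0\n",
    "    for k in range(n + 1):  # k = number of agents whose signal is correct\n",
    "        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)\n",
    "        if 2 * k > n or (2 * k == n and tie_correct):\n",
    "            total += prob_k\n",
    "    return total\n",
    "\n",
    "\n",
    "for n in (2, 3, 4):\n",
    "    no_tie = signal_majority_accuracy(n, 0.7, tie_correct=False)\n",
    "    tie = signal_majority_accuracy(n, 0.7, tie_correct=True)\n",
    "    print(f'n={n}: {no_tie:.3f} if ties resolve wrongly, {tie:.3f} if ties resolve correctly')\n",
    "```\n",
    "\n",
    "For $n = 3$ and $p = 0.7$ this baseline is 0.784. One natural comparison is whether the judge's accuracy after debate ends up above or below the value for the same $n$ and $p$."
   ]
  },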
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Question Bank\n",
    "\n",
    "The following factual questions are used for the debate experiments. Each has a binary answer (A or B) and a ground-truth label. The questions span general knowledge, science, and reasoning."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "QUESTION_BANK = [\n",
    "    {\n",
    "        \"question\": \"Which planet in our solar system has the most moons?\",\n",
    "        \"A\": \"Jupiter\",\n",
    "        \"B\": \"Saturn\",\n",
    "        \"answer\": \"B\",\n",
    "        \"justification_correct\": \"As of recent counts, Saturn has over 140 confirmed moons, surpassing Jupiter's ~95.\",\n",
    "        \"justification_wrong\": \"Jupiter is the largest planet and was long thought to have the most moons.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Which element has the highest melting point?\",\n",
    "        \"A\": \"Tungsten\",\n",
    "        \"B\": \"Carbon\",\n",
    "        \"answer\": \"B\",\n",
    "        \"justification_correct\": \"Carbon (as diamond/graphite at high pressure) has a melting point around 3550C, higher than tungsten's 3422C.\",\n",
    "        \"justification_wrong\": \"Tungsten is famously used in light bulb filaments for its extremely high melting point.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Who published the general theory of relativity first?\",\n",
    "        \"A\": \"Albert Einstein\",\n",
    "        \"B\": \"David Hilbert\",\n",
    "        \"answer\": \"A\",\n",
    "        \"justification_correct\": \"Einstein submitted his field equations on November 25, 1915, and the theory is universally attributed to him.\",\n",
    "        \"justification_wrong\": \"Hilbert submitted a paper containing the correct field equations on November 20, 1915, five days before Einstein.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Is the set of rational numbers countable or uncountable?\",\n",
    "        \"A\": \"Countable\",\n",
    "        \"B\": \"Uncountable\",\n",
    "        \"answer\": \"A\",\n",
    "        \"justification_correct\": \"The rationals can be put in bijection with the natural numbers via a diagonal enumeration argument.\",\n",
    "        \"justification_wrong\": \"The rationals are dense in the reals, and between any two reals there are infinitely many rationals, suggesting uncountability.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"In a Nash equilibrium of the Prisoner's Dilemma, do both players cooperate or defect?\",\n",
    "        \"A\": \"Cooperate\",\n",
    "        \"B\": \"Defect\",\n",
    "        \"answer\": \"B\",\n",
    "        \"justification_correct\": \"Defection is a dominant strategy for both players, making mutual defection the unique Nash equilibrium.\",\n",
    "        \"justification_wrong\": \"Cooperation yields a higher joint payoff, so rational players should coordinate on the Pareto-optimal outcome.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Which sorting algorithm has the best worst-case time complexity?\",\n",
    "        \"A\": \"Merge sort (O(n log n))\",\n",
    "        \"B\": \"Quick sort (O(n log n))\",\n",
    "        \"answer\": \"A\",\n",
    "        \"justification_correct\": \"Merge sort guarantees O(n log n) worst-case. Quick sort's worst case is O(n^2).\",\n",
    "        \"justification_wrong\": \"Quick sort is generally faster in practice due to cache efficiency and lower constant factors.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Does P = NP?\",\n",
    "        \"A\": \"The consensus is that P != NP\",\n",
    "        \"B\": \"The consensus is that P = NP\",\n",
    "        \"answer\": \"A\",\n",
    "        \"justification_correct\": \"The overwhelming majority of complexity theorists believe P != NP, though it remains unproven.\",\n",
    "        \"justification_wrong\": \"Many practical NP-complete problems have efficient heuristic solutions, suggesting the classes may be equal.\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"Which has greater cardinality: the set of real numbers or the power set of natural numbers?\",\n",
    "        \"A\": \"They have the same cardinality\",\n",
    "        \"B\": \"The power set of natural numbers is strictly larger\",\n",
    "        \"answer\": \"A\",\n",
    "        \"justification_correct\": \"Both have cardinality 2^(aleph_0) = c (the continuum). The reals are equinumerous with P(N).\",\n",
    "        \"justification_wrong\": \"The power set operation always produces a strictly larger set by Cantor's theorem.\",\n",
    "    },\n",
    "]\n",
    "\n",
    "print(f\"Question bank loaded: {len(QUESTION_BANK)} questions\")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Task 1: LLM Agent Implementation (25 points)\n",
    "\n",
    "Implement an `LLMAgent` class that wraps the TAMU API, and a `DebateAgent` subclass that builds debate-specific prompts.\n",
    "\n",
    "### `LLMAgent` (base class)\n",
    "\n",
    "- Constructor takes `model`, `system_prompt`, `temperature`, `cookie`, and `max_tokens`\n",
    "- `query(user_prompt: str) -> str` sends a chat completion request and returns the response text\n",
    "- Includes retry logic for rate limiting (429 errors)\n",
    "- Raises a clear error on 401/403 (expired cookie)\n",
    "- Tracks cumulative token usage in `self.total_tokens`\n",
    "\n",
    "### `DebateAgent` (subclass)\n",
    "\n",
    "- Adds `agent_id` and `signal` (the private signal letter, \"A\" or \"B\")\n",
    "- `debate_turn(history, round_num, total_rounds)` builds a debate prompt from the history and private signal, calls `self.query()`, and returns the full argument text\n",
    "\n",
    "### `format_debate_prompt` (helper)\n",
    "\n",
    "- Takes the debate history, signal, round number, and total rounds\n",
    "- Returns a user-prompt string that presents the question, the agent's private signal, the debate history so far, and instructions to argue for one answer\n",
    "\n",
    "### `extract_answer` (helper)\n",
    "\n",
    "- Takes an argument string and extracts the claimed answer (\"A\" or \"B\")\n",
    "- Looks for patterns like \"Answer: A\", \"I argue for B\", \"my answer is A\", or a final standalone A/B"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "class LLMAgent:\n",
    "    \"\"\"Base class wrapping the TAMU API.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        model: str = DEFAULT_MODEL,\n",
    "        system_prompt: str = \"You are a helpful assistant.\",\n",
    "        temperature: float = 1,\n",
    "        cookie: str = CF_COOKIE,\n",
    "        max_tokens: int = 16384,\n",
    "    ):\n",
    "        # TODO: Store parameters as instance variables.\n",
    "        #       Create an openai.OpenAI client with:\n",
    "        #         base_url = TAMU_BASE_URL\n",
    "        #         api_key  = TAMU_API_KEY\n",
    "        #         default_headers = {\"Cookie\": cookie}\n",
    "        #       Initialize self.total_tokens = 0\n",
    "        pass\n",
    "\n",
    "    def query(self, user_prompt: str) -> str:\n",
    "        \"\"\"Send a chat completion request. Returns the response text.\"\"\"\n",
    "        # TODO:\n",
    "        # 1. Build the messages list:\n",
    "        #    [{\"role\": \"system\", \"content\": self.system_prompt},\n",
    "        #     {\"role\": \"user\",   \"content\": user_prompt}]\n",
    "        # 2. Call self.client.chat.completions.create(\n",
    "        #        model=self.model,\n",
    "        #        messages=messages,\n",
    "        #        temperature=self.temperature,\n",
    "        #        max_tokens=self.max_tokens,\n",
    "        #    )\n",
    "        # 3. Update self.total_tokens += response.usage.total_tokens\n",
    "        #    (guard with getattr in case usage is None)\n",
    "        # 4. Return response.choices[0].message.content\n",
    "        #\n",
    "        # Error handling:\n",
    "        # - On openai.RateLimitError (429): sleep 2 seconds and retry (up to 3 times)\n",
    "        # - On openai.AuthenticationError (401/403): raise with message\n",
    "        #   \"Cookie expired \u2014 log into chat.tamu.ai and update CF_COOKIE\"\n",
    "        pass\n",
    "\n",
    "\n",
    "class DebateAgent(LLMAgent):\n",
    "    \"\"\"An LLM agent configured for multi-agent debate.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        agent_id: int,\n",
    "        signal: str,\n",
    "        question_data: dict,\n",
    "        model: str = DEFAULT_MODEL,\n",
    "        temperature: float = 1,\n",
    "        cookie: str = CF_COOKIE,\n",
    "    ):\n",
    "        # TODO:\n",
    "        # 1. Build a system_prompt that tells the agent:\n",
    "        #    - It is debater #{agent_id} in a structured debate\n",
    "        #    - The question being debated\n",
    "        #    - Its private signal: \"Your analysis suggests the answer is {signal}\"\n",
    "        #    - It should argue for its position but update if presented with\n",
    "        #      compelling evidence from other debaters\n",
    "        #    - It must state its final answer clearly as \"Answer: A\" or \"Answer: B\"\n",
    "        # 2. Call super().__init__(model=model, system_prompt=system_prompt,\n",
    "        #                         temperature=temperature, cookie=cookie)\n",
    "        # 3. Store agent_id, signal, and question_data as instance variables\n",
    "        pass\n",
    "\n",
    "    def debate_turn(\n",
    "        self,\n",
    "        history: List[Tuple[int, str]],\n",
    "        round_num: int,\n",
    "        total_rounds: int,\n",
    "    ) -> str:\n",
    "        \"\"\"Produce a debate argument given the history so far.\"\"\"\n",
    "        # TODO:\n",
    "        # 1. Call format_debate_prompt(history, self.signal, round_num,\n",
    "        #        total_rounds, self.question_data)\n",
    "        # 2. Call self.query(prompt) to get the response\n",
    "        # 3. Return the response text\n",
    "        pass\n",
    "\n",
    "\n",
    "def format_debate_prompt(\n",
    "    history: List[Tuple[int, str]],\n",
    "    signal: str,\n",
    "    round_num: int,\n",
    "    total_rounds: int,\n",
    "    question_data: dict,\n",
    ") -> str:\n",
    "    \"\"\"Build the user prompt for a debate turn.\"\"\"\n",
    "    # TODO:\n",
    "    # Construct a prompt that includes:\n",
    "    # 1. \"This is round {round_num} of {total_rounds}.\"\n",
    "    # 2. The question: question_data[\"question\"]\n",
    "    # 3. The options: \"A: {question_data['A']}\" and \"B: {question_data['B']}\"\n",
    "    # 4. If history is non-empty, format each prior argument:\n",
    "    #    \"Debater {agent_id} argued: {argument_text}\"\n",
    "    # 5. Instructions: \"Present your argument for this round. You must end\n",
    "    #    your response with 'Answer: A' or 'Answer: B'.\"\n",
    "    # Return the assembled string.\n",
    "    pass\n",
    "\n",
    "\n",
    "def extract_answer(argument: str) -> Optional[str]:\n",
    "    \"\"\"Extract the answer (A or B) from an argument string.\"\"\"\n",
    "    # TODO:\n",
    "    # 1. Search for patterns like \"Answer: A\", \"Answer: B\" (case-insensitive)\n",
    "    # 2. If no explicit pattern, look for a standalone A or B at the end\n",
    "    # 3. Return \"A\", \"B\", or None if no answer can be extracted\n",
    "    pass"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Public Tests for Task 1\n",
    "\n",
    "These tests use **mock API responses** \u2014 no real API calls are made."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "def task1_ok():\n",
    "    errors = []\n",
    "\n",
    "    # Test extract_answer\n",
    "    if extract_answer(\"I believe the correct response is B. Answer: B\") != \"B\":\n",
    "        errors.append(\"extract_answer should find 'Answer: B'\")\n",
    "    if extract_answer(\"After careful analysis, Answer: A\") != \"A\":\n",
    "        errors.append(\"extract_answer should find 'Answer: A'\")\n",
    "    if extract_answer(\"No clear answer here\") is not None:\n",
    "        errors.append(\"extract_answer should return None when no answer found\")\n",
    "\n",
    "    # Test format_debate_prompt\n",
    "    q = QUESTION_BANK[0]\n",
    "    prompt = format_debate_prompt([], \"A\", 1, 3, q)\n",
    "    if prompt is None:\n",
    "        errors.append(\"format_debate_prompt returned None\")\n",
    "    elif \"round 1\" not in prompt.lower():\n",
    "        errors.append(\"format_debate_prompt should mention the round number\")\n",
    "\n",
    "    prompt_with_history = format_debate_prompt(\n",
    "        [(0, \"I think Jupiter because it is large. Answer: A\")],\n",
    "        \"B\", 2, 3, q\n",
    "    )\n",
    "    if prompt_with_history is None:\n",
    "        errors.append(\"format_debate_prompt with history returned None\")\n",
    "    elif \"Debater 0\" not in prompt_with_history and \"debater 0\" not in prompt_with_history.lower():\n",
    "        errors.append(\"format_debate_prompt should reference prior debaters\")\n",
    "\n",
    "    # Test LLMAgent with mock (no real API call)\n",
    "    mock_response = MagicMock()\n",
    "    mock_response.choices = [MagicMock()]\n",
    "    mock_response.choices[0].message.content = \"Test response\"\n",
    "    mock_response.usage = MagicMock()\n",
    "    mock_response.usage.total_tokens = 42\n",
    "\n",
    "    with patch(\"openai.OpenAI\") as MockClient:\n",
    "        instance = MockClient.return_value\n",
    "        instance.chat.completions.create.return_value = mock_response\n",
    "\n",
    "        agent = LLMAgent(\n",
    "            model=\"test-model\",\n",
    "            system_prompt=\"You are a test.\",\n",
    "            temperature=1,\n",
    "            cookie=\"CF_Authorization=fake\",\n",
    "        )\n",
    "        result = agent.query(\"Hello\")\n",
    "        if result != \"Test response\":\n",
    "            errors.append(f\"LLMAgent.query should return response text, got: {result}\")\n",
    "        if agent.total_tokens != 42:\n",
    "            errors.append(f\"LLMAgent.total_tokens should be 42, got: {agent.total_tokens}\")\n",
    "\n",
    "    if errors:\n",
    "        for e in errors:\n",
    "            print(f\"  FAIL: {e}\")\n",
    "    else:\n",
    "        print(\"  Task 1: All tests passed\")\n",
    "\n",
    "task1_ok()"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Task 2: Debate Protocol (25 points)\n",
    "\n",
    "Implement the debate game engine. The protocol proceeds as follows:\n",
    "\n",
    "1. **Nature's move:** Select a question, assign private signals to each agent.\n",
    "2. **Agent creation:** Instantiate `DebateAgent` for each player with their signal.\n",
    "3. **Debate rounds:** For $t = 1, \\ldots, T$, agent $\\tau(t) = (t-1) \\bmod n$ observes the public history and produces an argument via the TAMU API.\n",
    "4. **Judge decision:** Majority vote over the answers extracted from all arguments in the final round (or all arguments if agents speak once per round).\n",
    "5. **Evaluation:** Compare the judge's decision to the ground truth.\n",
    "\n",
    "You will implement:\n",
    "\n",
    "- `DebateGame.__init__`: Set up game parameters\n",
    "- `DebateGame.generate_signals`: Nature's move \u2014 assign correct/incorrect signals\n",
    "- `DebateGame.run_debate`: Execute the T-round protocol with real LLM agents\n",
    "- `DebateGame.judge_majority`: Extract answers and take majority vote\n",
    "- `DebateGame.play_once`: Orchestrate a single complete game"
   ]
  },
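  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how the round-robin player function plays out, the sketch below lists which agent speaks in each round and how many arguments each agent contributes. It is illustrative only and not part of the graded tasks: `turn_order` and `turns_per_agent` are hypothetical names, and it assumes one speaker per round, as in step 3 above.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def turn_order(n_agents: int, n_rounds: int) -> List[int]:\n",
    "    # Agent index active in round t = 1..T, using tau(t) = (t - 1) mod n.\n",
    "    return [(t - 1) % n_agents for t in range(1, n_rounds + 1)]\n",
    "\n",
    "\n",
    "def turns_per_agent(n_agents: int, n_rounds: int) -> Dict[int, int]:\n",
    "    # Number of arguments each agent contributes (0 if it never speaks).\n",
    "    counts = Counter(turn_order(n_agents, n_rounds))\n",
    "    return {i: counts.get(i, 0) for i in range(n_agents)}\n",
    "\n",
    "\n",
    "print(turn_order(2, 4))        # [0, 1, 0, 1]\n",
    "print(turns_per_agent(4, 3))   # {0: 1, 1: 1, 2: 1, 3: 0}\n",
    "```\n",
    "\n",
    "When $T < n$, some agents never speak. With $n = 4$ and $T = 3$, for example, agent 3 contributes no argument, which is worth keeping in mind when interpreting the effect of agent count."
   ]
  },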
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "class DebateGame:\n",
    "    \"\"\"Multi-agent debate as an extensive-form game with LLM agents.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        n_agents: int = 2,\n",
    "        n_rounds: int = 3,\n",
    "        signal_accuracy: float = 0.7,\n",
    "        model: str = DEFAULT_MODEL,\n",
    "        temperature: float = 1,\n",
    "        delay: float = 0.5,\n",
    "    ):\n",
    "        self.n_agents = n_agents\n",
    "        self.n_rounds = n_rounds\n",
    "        self.signal_accuracy = signal_accuracy\n",
    "        self.model = model\n",
    "        self.temperature = temperature\n",
    "        self.delay = delay  # seconds between API calls\n",
    "\n",
    "    def generate_signals(\n",
    "        self, question_data: dict\n",
    "    ) -> Tuple[str, List[str]]:\n",
    "        \"\"\"Nature's move: assign private signals to each agent.\n",
    "\n",
    "        Returns (ground_truth, signals) where ground_truth is 'A' or 'B'\n",
    "        and signals is a list of length n_agents, each 'A' or 'B'.\n",
    "        \"\"\"\n",
    "        # TODO:\n",
    "        # 1. Get the ground truth from question_data[\"answer\"]\n",
    "        # 2. The wrong answer is \"B\" if ground_truth == \"A\", else \"A\"\n",
    "        # 3. For each agent, with probability self.signal_accuracy,\n",
    "        #    assign signal = ground_truth; otherwise signal = wrong_answer.\n",
    "        #    Use np.random.random() < self.signal_accuracy for the check.\n",
    "        # 4. Return (ground_truth, signals)\n",
    "        pass\n",
    "\n",
    "    def run_debate(\n",
    "        self,\n",
    "        question_data: dict,\n",
    "        signals: List[str],\n",
    "    ) -> List[Tuple[int, str]]:\n",
    "        \"\"\"Execute the debate protocol with LLM agents.\n",
    "\n",
    "        Returns history: list of (agent_id, argument_text) tuples.\n",
    "        \"\"\"\n",
    "        # TODO:\n",
    "        # 1. Create DebateAgent instances: for each agent i,\n",
    "        #    DebateAgent(agent_id=i, signal=signals[i],\n",
    "        #                question_data=question_data,\n",
    "        #                model=self.model, temperature=self.temperature,\n",
    "        #                cookie=CF_COOKIE)\n",
    "        # 2. Initialize empty history list.\n",
    "        # 3. For each round t = 0, ..., n_rounds - 1:\n",
    "        #    a. Determine active agent: agent_id = t % n_agents\n",
    "        #    b. Call agents[agent_id].debate_turn(history, t+1, n_rounds)\n",
    "        #    c. Append (agent_id, argument_text) to history\n",
    "        #    d. time.sleep(self.delay) to avoid rate limiting\n",
    "        # 4. Return the complete history.\n",
    "        pass\n",
    "\n",
    "    def judge_majority(\n",
    "        self, history: List[Tuple[int, str]]\n",
    "    ) -> Optional[str]:\n",
    "        \"\"\"Majority-vote judge: extract answers and return the majority.\n",
    "\n",
    "        Considers all arguments in the history. Ties go to 'A'.\n",
    "        Returns 'A', 'B', or None if no answers could be extracted.\n",
    "        \"\"\"\n",
    "        # TODO:\n",
    "        # 1. For each (agent_id, argument) in history, call\n",
    "        #    extract_answer(argument) to get the claimed answer.\n",
    "        # 2. Count votes for \"A\" and \"B\" (skip None).\n",
    "        # 3. Return the answer with more votes. Ties go to \"A\".\n",
    "        # 4. If no votes at all, return None.\n",
    "        pass\n",
    "\n",
    "    def play_once(\n",
    "        self, question_data: dict\n",
    "    ) -> dict:\n",
    "        \"\"\"Run a single complete debate game.\n",
    "\n",
    "        Returns dict with keys: 'question', 'ground_truth', 'signals',\n",
    "        'history', 'decision', 'correct', 'total_tokens'.\n",
    "        \"\"\"\n",
    "        # TODO:\n",
    "        # 1. Call self.generate_signals(question_data)\n",
    "        # 2. If LIVE_MODE is True, call self.run_debate(...)\n",
    "        #    If LIVE_MODE is False, create a mock history with\n",
    "        #    signal-based arguments (for testing without API calls).\n",
    "        # 3. Call self.judge_majority(history)\n",
    "        # 4. Compute total_tokens across all agents (if live mode)\n",
    "        # 5. Return dict with all results, including\n",
    "        #    'correct': decision == ground_truth\n",
    "        pass"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Public Tests for Task 2\n",
    "\n",
    "These tests use mock mode (`LIVE_MODE = False`) \u2014 no API calls."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "def task2_ok():\n",
    "    errors = []\n",
    "\n",
    "    game = DebateGame(n_agents=2, n_rounds=4, signal_accuracy=0.8)\n",
    "    q = QUESTION_BANK[0]\n",
    "\n",
    "    # Test generate_signals\n",
    "    np.random.seed(0)\n",
    "    truth, signals = game.generate_signals(q)\n",
    "    if truth not in (\"A\", \"B\"):\n",
    "        errors.append(f\"ground_truth should be 'A' or 'B', got {truth}\")\n",
    "    if len(signals) != 2:\n",
    "        errors.append(f\"Expected 2 signals, got {len(signals)}\")\n",
    "    if not all(s in (\"A\", \"B\") for s in signals):\n",
    "        errors.append(\"All signals should be 'A' or 'B'\")\n",
    "\n",
    "    # Statistical test: signal accuracy\n",
    "    np.random.seed(99)\n",
    "    correct_count = 0\n",
    "    n_trials = 1000\n",
    "    for _ in range(n_trials):\n",
    "        t, sigs = game.generate_signals(q)\n",
    "        correct_count += sum(1 for s in sigs if s == t)\n",
    "    empirical_acc = correct_count / (n_trials * 2)\n",
    "    if not (0.7 < empirical_acc < 0.9):\n",
    "        errors.append(f\"Signal accuracy {empirical_acc:.3f} outside expected range for p=0.8\")\n",
    "\n",
    "    np.random.seed(42)\n",
    "\n",
    "    # Test judge_majority\n",
    "    decision = game.judge_majority([\n",
    "        (0, \"I think the answer is B. Answer: B\"),\n",
    "        (1, \"I disagree, the answer is A. Answer: A\"),\n",
    "        (0, \"After reflection, still B. Answer: B\"),\n",
    "        (1, \"Changing my mind. Answer: B\"),\n",
    "    ])\n",
    "    if decision != \"B\":\n",
    "        errors.append(f\"Majority of [B, A, B, B] should be B, got {decision}\")\n",
    "\n",
    "    tie_decision = game.judge_majority([\n",
    "        (0, \"Answer: A\"),\n",
    "        (1, \"Answer: B\"),\n",
    "    ])\n",
    "    if tie_decision != \"A\":\n",
    "        errors.append(f\"Tie should break to A, got {tie_decision}\")\n",
    "\n",
    "    none_decision = game.judge_majority([\n",
    "        (0, \"I have no idea what the answer is.\"),\n",
    "    ])\n",
    "    if none_decision is not None and none_decision not in (\"A\", \"B\"):\n",
    "        errors.append(f\"No extractable answer should return None, got {none_decision}\")\n",
    "\n",
    "    if errors:\n",
    "        for e in errors:\n",
    "            print(f\"  FAIL: {e}\")\n",
    "    else:\n",
    "        print(\"  Task 2: All tests passed\")\n",
    "\n",
    "task2_ok()"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Task 3: Experiments and Analysis (25 points)\n",
    "\n",
    "Run three experiments to explore how LLM debate outcomes depend on agent count, round count, and model choice. For each experiment, produce a plot and write 2-3 sentences interpreting the results.\n",
    "\n",
    "**Important:** Set `LIVE_MODE = True` before running experiments. Monitor your token usage. Each experiment should use a subset of the question bank (e.g., 4-6 questions) with 3+ trials per configuration to account for LLM nondeterminism.\n",
    "\n",
    "### Experiment 3.1: Effect of Agent Count\n",
    "\n",
    "Fix $T = 3$ rounds, signal accuracy $p = 0.7$. Vary the number of agents: $n \\in \\{2, 3, 4\\}$. For each configuration, run 3 trials on each of 4 questions from the question bank. Record:\n",
    "\n",
    "- **Judge accuracy** (fraction of games where the judge's decision matches ground truth)\n",
    "- **Total tokens used**\n",
    "\n",
    "**Question:** Does adding more agents improve accuracy? What is the cost tradeoff?\n",
    "\n",
    "### Experiment 3.2: Effect of Round Count\n",
    "\n",
    "Fix $n = 2$ agents, signal accuracy $p = 0.7$. Vary the number of debate rounds: $T \\in \\{1, 3, 5\\}$. For each, run 3 trials on each of 4 questions.\n",
    "\n",
    "**Question:** Does multi-round debate improve accuracy over single-round answers? At what point do additional rounds stop helping? Connect your answer to the information content of each round.\n",
    "\n",
    "### Experiment 3.3: Model Comparison\n",
    "\n",
    "Fix $n = 2$ agents, $T = 3$ rounds, $p = 0.7$. Compare two models: `protected.Claude Sonnet 4.5` and one other model from the API guide (e.g., `protected.gpt-5.2`). Run 3 trials on each of 4 questions per model.\n",
    "\n",
    "**Question:** Do different models behave differently in debate? Compare accuracy, argument style, willingness to change position, and token usage."
   ]
  },
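  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Worked example: sampling noise in judge accuracy.** The following is a rough sketch, not part of the required analysis. It assumes each game is an independent Bernoulli trial, which is only an approximation because trials on the same question are likely correlated. With $N$ games per configuration and observed accuracy $\\hat{a}$, the binomial standard error is\n",
    "\n",
    "$$\\mathrm{SE}(\\hat{a}) = \\sqrt{\\frac{\\hat{a}\\,(1 - \\hat{a})}{N}}$$\n",
    "\n",
    "For the default setup of 3 trials on each of 4 questions, $N = 12$. If a configuration scores $\\hat{a} = 0.75$ (9 of 12 games correct), then\n",
    "\n",
    "$$\\mathrm{SE} = \\sqrt{\\frac{0.75 \\times 0.25}{12}} = \\sqrt{0.015625} = 0.125$$\n",
    "\n",
    "At this sample size, a gap of about 0.1 between two configurations is within one standard error, so treat small differences with caution."
   ]
  },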
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "def experiment_agent_count(\n",
    "    questions: List[dict] = QUESTION_BANK[:4],\n",
    "    n_rounds: int = 3,\n",
    "    signal_accuracy: float = 0.7,\n",
    "    n_trials: int = 3,\n",
    "    agent_counts: List[int] = [2, 3, 4],\n",
    ") -> dict:\n",
    "    \"\"\"Experiment 3.1: Vary the number of debate agents.\n",
    "\n",
    "    Returns dict with keys:\n",
    "      'agent_counts': list of n values\n",
    "      'accuracies': list of accuracy values (one per n)\n",
    "      'token_usage': list of total tokens used (one per n)\n",
    "      'raw_results': list of list of result dicts\n",
    "    \"\"\"\n",
    "    # TODO:\n",
    "    # For each n in agent_counts:\n",
    "    #   1. Create a DebateGame with n_agents=n, n_rounds=n_rounds,\n",
    "    #      signal_accuracy=signal_accuracy.\n",
    "    #   2. For each question in questions, for each trial in n_trials:\n",
    "    #      a. Set np.random.seed(trial * 100 + question_index) for reproducibility\n",
    "    #      b. Call game.play_once(question_data)\n",
    "    #      c. Record the result\n",
    "    #   3. Compute accuracy = fraction of correct results\n",
    "    #   4. Compute total tokens used\n",
    "    # Return the results dict.\n",
    "    pass\n",
    "\n",
    "\n",
    "def experiment_round_count(\n",
    "    questions: List[dict] = QUESTION_BANK[:4],\n",
    "    n_agents: int = 2,\n",
    "    signal_accuracy: float = 0.7,\n",
    "    n_trials: int = 3,\n",
    "    round_counts: List[int] = [1, 3, 5],\n",
    ") -> dict:\n",
    "    \"\"\"Experiment 3.2: Vary the number of debate rounds.\n",
    "\n",
    "    Returns dict with keys:\n",
    "      'round_counts': list of T values\n",
    "      'accuracies': list of accuracy values (one per T)\n",
    "      'token_usage': list of total tokens used (one per T)\n",
    "      'raw_results': list of list of result dicts\n",
    "    \"\"\"\n",
    "    # TODO:\n",
    "    # Similar structure to experiment_agent_count, but vary n_rounds.\n",
    "    pass\n",
    "\n",
    "\n",
    "def experiment_model_comparison(\n",
    "    questions: List[dict] = QUESTION_BANK[:4],\n",
    "    n_agents: int = 2,\n",
    "    n_rounds: int = 3,\n",
    "    signal_accuracy: float = 0.7,\n",
    "    n_trials: int = 3,\n",
    "    models: List[str] = [DEFAULT_MODEL, \"protected.gpt-5.2\"],\n",
    ") -> dict:\n",
    "    \"\"\"Experiment 3.3: Compare different LLM models in debate.\n",
    "\n",
    "    Returns dict with keys:\n",
    "      'models': list of model names\n",
    "      'accuracies': list of accuracy values (one per model)\n",
    "      'token_usage': list of total tokens used (one per model)\n",
    "      'raw_results': list of list of result dicts\n",
    "    \"\"\"\n",
    "    # TODO:\n",
    "    # Similar structure, but vary the model parameter of DebateGame.\n",
    "    pass"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run Experiments and Plot Results\n",
    "\n",
    "**Set `LIVE_MODE = True` in the setup cell before running this cell.** Each experiment makes real API calls and costs tokens."
   ]
  },
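  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Worked example: estimating the number of API calls.** This back-of-the-envelope count may help with budgeting. It assumes that each agent makes exactly one API call per round (round-robin debate) and that the majority-vote judge makes none. Neither assumption is guaranteed by your implementation, and retries add to the total. Under these assumptions a single game costs $n \\cdot T$ calls, so an experiment with $Q$ questions and $R$ trials per configuration costs\n",
    "\n",
    "$$C = Q \\cdot R \\sum_{\\text{configs}} n \\cdot T$$\n",
    "\n",
    "With the default $Q = 4$ and $R = 3$:\n",
    "\n",
    "$$\\begin{aligned}\n",
    "C_{3.1} &= 12 \\cdot 3 \\cdot (2 + 3 + 4) = 324 \\\\\n",
    "C_{3.2} &= 12 \\cdot 2 \\cdot (1 + 3 + 5) = 216 \\\\\n",
    "C_{3.3} &= 12 \\cdot 2 \\cdot (2 \\cdot 3) = 144\n",
    "\\end{aligned}$$\n",
    "\n",
    "That is roughly 684 calls in total. Tokens per call will likely grow with the round index if each prompt includes the full debate history, so token cost may scale faster than the call count."
   ]
  },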
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Uncomment the line below when ready to run live experiments:\n",
    "# LIVE_MODE = True\n",
    "\n",
    "if not LIVE_MODE:\n",
    "    print(\"LIVE_MODE is False. Set LIVE_MODE = True in the setup cell to run real experiments.\")\n",
    "    print(\"Skipping experiments.\")\n",
    "else:\n",
    "    # --- Experiment 3.1: Agent Count ---\n",
    "    print(\"Running Experiment 3.1: Effect of Agent Count\")\n",
    "    print(\"=\" * 50)\n",
    "    np.random.seed(42)\n",
    "    results_agents = experiment_agent_count()\n",
    "\n",
    "    plt.figure(figsize=(8, 4))\n",
    "    plt.bar(\n",
    "        [str(n) for n in results_agents[\"agent_counts\"]],\n",
    "        results_agents[\"accuracies\"],\n",
    "        color=\"steelblue\", edgecolor=\"black\",\n",
    "    )\n",
    "    plt.axhline(y=0.5, color=\"red\", linestyle=\"--\", label=\"Chance (50%)\")\n",
    "    plt.xlabel(\"Number of Debate Agents\")\n",
    "    plt.ylabel(\"Judge Accuracy\")\n",
    "    plt.title(\"Experiment 3.1: Effect of Agent Count on Debate Accuracy\")\n",
    "    plt.legend()\n",
    "    plt.ylim(0, 1.05)\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    print(f\"Total tokens used: {sum(results_agents['token_usage'])}\")\n",
    "\n",
    "    # --- Experiment 3.2: Round Count ---\n",
    "    print(\"\\nRunning Experiment 3.2: Effect of Round Count\")\n",
    "    print(\"=\" * 50)\n",
    "    np.random.seed(42)\n",
    "    results_rounds = experiment_round_count()\n",
    "\n",
    "    plt.figure(figsize=(8, 4))\n",
    "    plt.plot(\n",
    "        results_rounds[\"round_counts\"],\n",
    "        results_rounds[\"accuracies\"],\n",
    "        \"o-\", color=\"darkorange\", linewidth=2,\n",
    "    )\n",
    "    plt.axhline(y=0.5, color=\"red\", linestyle=\"--\", label=\"Chance\")\n",
    "    plt.xlabel(\"Number of Debate Rounds\")\n",
    "    plt.ylabel(\"Judge Accuracy\")\n",
    "    plt.title(\"Experiment 3.2: Debate Accuracy vs. Round Count\")\n",
    "    plt.legend()\n",
    "    plt.ylim(0, 1.05)\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    print(f\"Total tokens used: {sum(results_rounds['token_usage'])}\")\n",
    "\n",
    "    # --- Experiment 3.3: Model Comparison ---\n",
    "    print(\"\\nRunning Experiment 3.3: Model Comparison\")\n",
    "    print(\"=\" * 50)\n",
    "    np.random.seed(42)\n",
    "    results_models = experiment_model_comparison()\n",
    "\n",
    "    plt.figure(figsize=(8, 4))\n",
    "    plt.bar(\n",
    "        results_models[\"models\"],\n",
    "        results_models[\"accuracies\"],\n",
    "        color=[\"steelblue\", \"coral\"],\n",
    "        edgecolor=\"black\",\n",
    "    )\n",
    "    plt.axhline(y=0.5, color=\"red\", linestyle=\"--\", label=\"Chance\")\n",
    "    plt.ylabel(\"Judge Accuracy\")\n",
    "    plt.title(\"Experiment 3.3: Model Comparison in Debate\")\n",
    "    plt.legend()\n",
    "    plt.ylim(0, 1.05)\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    print(f\"Total tokens used: {sum(results_models['token_usage'])}\")\n",
    "\n",
    "    print(\"\\nAll experiments complete.\")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Experiment Interpretations\n",
    "\n",
    "**Experiment 3.1 interpretation:**\n",
    "\n",
    "[TODO: Write 2-3 sentences. Does adding more agents improve or hurt accuracy? What is the cost tradeoff (tokens per additional agent)? How does this relate to the Condorcet jury theorem?]\n",
    "\n",
    "**Experiment 3.2 interpretation:**\n",
    "\n",
    "[TODO: Write 2-3 sentences. Does multi-round debate improve accuracy over single-round answers? At what round count do returns diminish? What new information, if any, does each additional round provide?]\n",
    "\n",
    "**Experiment 3.3 interpretation:**\n",
    "\n",
    "[TODO: Write 2-3 sentences. Do different models exhibit different debate behavior \u2014 in accuracy, argument quality, willingness to update, or token efficiency? What does this suggest about the model's \"strategy\" in the game-theoretic sense?]"
   ]
  },
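  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reference baseline for the Condorcet question.** The calculation below is an idealized sketch, not a prediction of what your debates will do. It assumes each agent simply votes its private signal, signals are independent and correct with probability $p$, and ties are resolved correctly half the time. The actual judge breaks ties toward A, which matches this assumption only if A and B are equally likely to be the ground truth. Under these assumptions the majority is correct with probability\n",
    "\n",
    "$$P_{\\text{maj}}(n, p) = \\sum_{k > n/2} \\binom{n}{k} p^{k} (1 - p)^{n - k} + \\tfrac{1}{2}\\,\\mathbb{1}[n \\text{ even}] \\binom{n}{n/2} p^{n/2} (1 - p)^{n/2}$$\n",
    "\n",
    "For $p = 0.7$:\n",
    "\n",
    "$$\\begin{aligned}\n",
    "P_{\\text{maj}}(2, 0.7) &= 0.49 + \\tfrac{1}{2}(0.42) = 0.70 \\\\\n",
    "P_{\\text{maj}}(3, 0.7) &= 0.343 + 0.441 = 0.784 \\\\\n",
    "P_{\\text{maj}}(4, 0.7) &= 0.2401 + 0.4116 + \\tfrac{1}{2}(0.2646) = 0.784\n",
    "\\end{aligned}$$\n",
    "\n",
    "In this idealized model a third agent helps and a fourth adds nothing. Real debates can deviate in either direction, because agents read each other's arguments and their final votes are no longer independent."
   ]
  },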
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Reflection Questions (15 points)\n",
    "\n",
    "Answer each question in 3-6 sentences, drawing on your experimental results and the game-theoretic concepts from Weeks 1-5."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1: Nash Equilibrium and LLM Debate (5 points)\n",
    "\n",
    "How does the LLM debate behavior you observed compare to what Nash equilibrium predicts? In the cooperative payoff regime, truthful play (always arguing for one's private signal) is a Nash equilibrium of the simplified game. Did the LLM agents converge to truthful play, or did you observe strategic deviations \u2014 such as agents changing their answer to match the majority, or agents producing sophisticated but incorrect arguments?\n",
    "\n",
    "Connect your answer to the formal game structure: given the payoffs and information sets, is the behavior you observed consistent with equilibrium play?\n",
    "\n",
    "**Your Answer:**\n",
    "\n",
    "[TODO: Write your answer here]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2: Information Sets \u2014 LLM Debate vs. Kuhn Poker (5 points)\n",
    "\n",
    "Compare the information-set structure of the LLM debate game to **Kuhn Poker** from Lecture 13.\n",
    "\n",
    "- In Kuhn Poker, what does a player know at each information set? What is hidden?\n",
    "- In the LLM debate, what does an agent know? What is hidden?\n",
    "- The LLM agent's \"information set\" includes its entire system prompt, the debate history, and its internal knowledge from pretraining. How does this richness compare to the minimal information sets in Kuhn Poker?\n",
    "- What consequences does this have for computing or even *defining* equilibrium strategies?\n",
    "\n",
    "**Your Answer:**\n",
    "\n",
    "[TODO: Write your answer here]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 3: Nash Suppression and Debate Convergence (5 points)\n",
    "\n",
    "Lekeas & Stamatopoulos (2026) provide mechanistic evidence that large language models **internally compute Nash equilibria** but **suppress** this computation in their outputs, favoring prosocial (cooperative) responses instead \u2014 with the suppression concentrated in the model's final transformer layers.\n",
    "\n",
    "Based on your experiments:\n",
    "\n",
    "- Did you observe any behavior consistent with this finding? For example: agents appearing to \"know\" the correct answer but deferring to the majority, or agents coordinating on a consensus even when their private signal disagreed?\n",
    "- Could the prosocial bias *help* debate accuracy (by suppressing adversarial play and encouraging convergence to truth) or *hurt* it (by suppressing informative disagreement that could correct errors)?\n",
    "- How might an adversary exploit this bias in a debate protocol?\n",
    "\n",
    "**Your Answer:**\n",
    "\n",
    "[TODO: Write your answer here]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Extra Credit: Meta-Judge Analysis (+10 points)\n",
    "\n",
    "Implement a **meta-judge** \u2014 an LLM agent that reads the full debate transcript and produces a final answer with a confidence score. Compare its accuracy to the majority-vote judge.\n",
    "\n",
    "1. Create a `MetaJudge` class (subclass of `LLMAgent`) with a method `judge_debate(question_data, history) -> Tuple[str, float]` that returns `(answer, confidence)`.\n",
    "2. The meta-judge's system prompt should instruct it to:\n",
    "   - Read the full debate transcript\n",
    "   - Evaluate the quality of arguments (not just count votes)\n",
    "   - Produce a final answer with a confidence score between 0 and 1\n",
    "3. Run the meta-judge on all debate transcripts from your experiments.\n",
    "4. Compare: Is the meta-judge more accurate than majority vote? Is it more susceptible to persuasion effects (sophisticated but wrong arguments)?\n",
    "5. Analyze: In game-theoretic terms, the meta-judge changes the payoff structure. How does replacing a mechanical majority-vote judge with an LLM judge alter the incentives for debaters?"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "class MetaJudge(LLMAgent):\n",
    "    \"\"\"An LLM judge that evaluates debate transcripts.\"\"\"\n",
    "\n",
    "    def __init__(self, model: str = DEFAULT_MODEL, cookie: str = CF_COOKIE):\n",
    "        system_prompt = (\n",
    "            \"You are a judge evaluating a structured debate. \"\n",
    "            \"Read the full debate transcript carefully. Evaluate the \"\n",
    "            \"quality and reasoning of each argument, not just the \"\n",
    "            \"number of agents supporting each answer. \"\n",
    "            \"Respond with your final answer as 'Answer: A' or 'Answer: B' \"\n",
    "            \"followed by 'Confidence: X.XX' where X.XX is between 0.00 and 1.00.\"\n",
    "        )\n",
    "        super().__init__(model=model, system_prompt=system_prompt, cookie=cookie)\n",
    "\n",
    "    def judge_debate(\n",
    "        self, question_data: dict, history: List[Tuple[int, str]]\n",
    "    ) -> Tuple[Optional[str], float]:\n",
    "        \"\"\"Judge the debate and return (answer, confidence).\"\"\"\n",
    "        # TODO:\n",
    "        # 1. Build a prompt that presents:\n",
    "        #    - The question and options\n",
    "        #    - The full debate transcript (all arguments)\n",
    "        #    - Instructions to give a final answer and confidence\n",
    "        # 2. Call self.query(prompt)\n",
    "        # 3. Extract the answer using extract_answer()\n",
    "        # 4. Extract the confidence (look for \"Confidence: X.XX\")\n",
    "        # 5. Return (answer, confidence)\n",
    "        pass\n",
    "\n",
    "\n",
    "# Run extra credit analysis\n",
    "if LIVE_MODE:\n",
    "    try:\n",
    "        print(\"Running meta-judge analysis...\")\n",
    "        meta = MetaJudge()\n",
    "\n",
    "        # TODO: Run the meta-judge on your experiment transcripts.\n",
    "        #       Compare accuracy to majority vote.\n",
    "        #       Discuss findings.\n",
    "\n",
    "        print(\"Meta-judge analysis complete.\")\n",
    "    except (NotImplementedError, TypeError) as e:\n",
    "        print(f\"Extra credit not attempted: {e}\")\n",
    "else:\n",
    "    print(\"Set LIVE_MODE = True to run meta-judge analysis.\")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving Experiment Results\n",
    "\n",
    "Run the cell below after your experiments to save all results to a JSON file. This makes your results reproducible and available for grading even if the API is unavailable."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "def save_results(filename=\"pa2_results.json\"):\n",
    "    \"\"\"Save experiment results to a JSON file for reproducibility.\"\"\"\n",
    "    results = {}\n",
    "\n",
    "    # Serialize debate histories (convert tuples to lists for JSON)\n",
    "    def serialize_history(hist):\n",
    "        return [[aid, text] for aid, text in hist]\n",
    "\n",
    "    if LIVE_MODE:\n",
    "        try:\n",
    "            results[\"experiment_3_1\"] = {\n",
    "                \"agent_counts\": results_agents[\"agent_counts\"],\n",
    "                \"accuracies\": results_agents[\"accuracies\"],\n",
    "                \"token_usage\": results_agents[\"token_usage\"],\n",
    "            }\n",
    "        except NameError:\n",
    "            pass\n",
    "        try:\n",
    "            results[\"experiment_3_2\"] = {\n",
    "                \"round_counts\": results_rounds[\"round_counts\"],\n",
    "                \"accuracies\": results_rounds[\"accuracies\"],\n",
    "                \"token_usage\": results_rounds[\"token_usage\"],\n",
    "            }\n",
    "        except NameError:\n",
    "            pass\n",
    "        try:\n",
    "            results[\"experiment_3_3\"] = {\n",
    "                \"models\": results_models[\"models\"],\n",
    "                \"accuracies\": results_models[\"accuracies\"],\n",
    "                \"token_usage\": results_models[\"token_usage\"],\n",
    "            }\n",
    "        except NameError:\n",
    "            pass\n",
    "\n",
    "    if results:\n",
    "        with open(filename, \"w\") as f:\n",
    "            json.dump(results, f, indent=2)\n",
    "        print(f\"Results saved to {filename}\")\n",
    "    else:\n",
    "        print(\"No results to save. Run experiments first (LIVE_MODE = True).\")\n",
    "\n",
    "save_results()"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Submission Checklist\n",
    "\n",
    "Before submitting, verify:\n",
    "\n",
    "- [ ] TAMU API setup complete (TAMU-API-Guide.pdf steps 1-4 working)\n",
    "- [ ] All `TODO` cells have been completed\n",
    "- [ ] All code cells run without errors (Kernel > Restart & Run All with `LIVE_MODE = True`)\n",
    "- [ ] Public tests pass for Tasks 1 and 2\n",
    "- [ ] Experiments 3.1, 3.2, 3.3 produce plots\n",
    "- [ ] Experiment interpretations are filled in (2-3 sentences each)\n",
    "- [ ] Reflection questions are answered (3-6 sentences each)\n",
    "- [ ] `pa2_results.json` is generated and submitted alongside the notebook\n",
    "- [ ] Your name and student ID are filled in at the top\n",
    "\n",
    "**Submit:** Upload this `.ipynb` file **and** `pa2_results.json` via Canvas.\n",
    "\n",
    "---\n",
    "\n",
    "## Grading Rubric\n",
    "\n",
    "| Component | Points |\n",
    "|-----------|--------|\n",
    "| **Task 1:** LLM agent implementation (`LLMAgent`, `DebateAgent`, helpers) | 25 |\n",
    "| **Task 2:** Debate protocol (`DebateGame` with API integration) | 25 |\n",
    "| **Task 3:** Experiments and analysis (3 experiments with plots and interpretations) | 25 |\n",
    "| **Reflection questions** (3 questions) | 15 |\n",
    "| **Code quality and documentation** | 10 |\n",
    "| **Total** | **100** |\n",
    "| **Extra credit:** Meta-judge analysis | +10 |\n",
    "\n",
    "### Detailed Breakdown\n",
    "\n",
    "**Task 1 (25 pts):**\n",
    "\n",
    "- `LLMAgent.__init__` and `query` with retry logic (10 pts)\n",
    "- `DebateAgent` with signal-aware system prompt (5 pts)\n",
    "- `format_debate_prompt` correctly assembles history and instructions (5 pts)\n",
    "- `extract_answer` robustly extracts A/B from LLM text (5 pts)\n",
    "\n",
    "**Task 2 (25 pts):**\n",
    "\n",
    "- `generate_signals` with correct probability distribution (5 pts)\n",
    "- `run_debate` with correct round-robin, API calls, and rate limiting (8 pts)\n",
    "- `judge_majority` with correct vote counting and tie-breaking (5 pts)\n",
    "- `play_once` orchestration with token tracking (7 pts)\n",
    "\n",
    "**Task 3 (25 pts):**\n",
    "\n",
    "- Experiment 3.1: agent count sweep + plot + interpretation (7 pts)\n",
    "- Experiment 3.2: round count sweep + plot + interpretation (7 pts)\n",
    "- Experiment 3.3: model comparison + plot + interpretation (7 pts)\n",
    "- Variance handling (multiple trials, error bars or reported variance) (4 pts)\n",
    "\n",
    "**Reflection (15 pts):**\n",
    "\n",
    "- Q1: Nash equilibrium analysis of observed behavior (5 pts)\n",
    "- Q2: Information set comparison with Kuhn Poker (5 pts)\n",
    "- Q3: Nash suppression implications (5 pts)\n",
    "\n",
    "**Code quality (10 pts):**\n",
    "\n",
    "- Clean, readable code with meaningful variable names (4 pts)\n",
    "- Reproducible results (seeds, saved results JSON) (3 pts)\n",
    "- Budget-conscious API usage (tracking tokens, not wasteful) (3 pts)"
   ]
  }
 ]
}