Fundamentals for AI Agents — Beginner-Friendly Learning Map

Scene 01

The AI Landscape — What Are All These Buzzwords?

▶ Watch this section ⏱ starts at 00:00:30 Pi Step 1, Pi Step 2

Altitude: The very top — this is the "why are we here?" moment before any details

😩 The confusion

You've heard words like "LLM", "RAG", "vector DB", "agents", "LangChain" everywhere but have no idea what any of them mean or how they connect

💡 The relief

All these buzzwords are parts of the same machine — by the end of this video you'll know what role each piece plays

📚 Every technical term, explained (11 terms)

AI (Artificial Intelligence): a computer system designed to do things that normally require human thinking, like answering questions, writing text, or recognizing images
LLM (Large Language Model): the specific type of AI that reads and writes text; "large" refers to its scale (billions of internal parameters, trained on a huge amount of text); "language model" means it predicts and generates language
RAG (Retrieval-Augmented Generation): a system where the AI first looks up relevant information from a database, then uses that information to write its answer; like an open-book test instead of relying only on memory
Vector DB (Vector Database): a special kind of database that stores the *meaning* of text as numbers, so you can search by meaning rather than exact words
MCP (Model Context Protocol): a standard "plug" that lets an AI agent connect to any external tool or database without custom wiring each time
Agent: an AI that doesn't just answer questions but can decide what steps to take, use tools, and work through multi-step tasks on its own
LangChain: a pre-built toolkit (a "toolbox") that makes it much easier to connect all the AI pieces together without writing thousands of lines of code from scratch
LangGraph: an extension of LangChain that handles more complex multi-step workflows where the AI needs to make decisions and loop back
Prompt engineering: the skill of writing your question or instruction to an AI in a way that gets the best possible answer
Token: the unit an AI uses to count text; roughly 3/4 of a word; used to measure how much the AI can read at once and how much an API call costs
Embedding: converting a piece of text into a list of numbers that captures its meaning; allows computers to compare meanings mathematically

🌉 Concrete analogy

Think of building an AI-powered company assistant as building a smart librarian robot. The LLM is the robot's brain. The vector DB is its filing system. RAG is how it looks things up before answering. LangChain is the robot's body/skeleton. LangGraph is its decision-making flowchart. MCP is the set of USB ports on the robot that let it plug into different databases and tools. Prompt engineering is how you train the robot to understand your instructions.

Where the metaphor breaks: The robot analogy breaks down at "training" — the LLM brain was trained by reading billions of documents, not by a single person teaching it; and unlike a robot, it has no physical body or memory between conversations unless you explicitly build that in.

⚙️ What happens, step by step

The speaker lists a stream of buzzwords (prompt engineering, context windows, tokens, embeddings, RAG, vector DB, MCP, agents, LangChain, LangGraph, Claude, Gemini) — all in the first 10 seconds
The speaker says: "If you felt left out, this is the only video you'll need to watch to catch up" — setting the promise
The tutorial introduces TechCorp, a fictional company with 500 GB of company documents, as the running real-world example
All the concepts covered in the video will be demonstrated by building a working AI assistant for TechCorp
This scene ends at ~00:00:40 as the speaker begins explaining LLMs specifically

🧪 Try it / see it

Click any concept to see what it is. They're all parts of the same machine.


“If you felt left out, this is the only video you'll need to watch to catch up.” — speaker, at 00:00:30


Scene 02

What Is an LLM? The Brain Behind the Chatbot

▶ Watch this section ⏱ starts at 00:01:08 Pi Step 2, Pi Step 3, Pi Step 4

Altitude: Foundation layer — you must understand what an LLM is before anything else makes sense

😩 The confusion

You hear "AI model" and "ChatGPT" and "Claude" but have no mental picture of what these things actually ARE or how they work

💡 The relief

An LLM is a program trained on billions of words that learned to predict and generate text — like an extremely well-read text-autocomplete, but powerful enough to reason and converse

📚 Every technical term, explained (10 terms)

Large Language Model (LLM): a type of computer program trained on enormous amounts of text (billions of web pages, books, articles) so it can understand and generate human language; ChatGPT, Claude, and Gemini are all LLMs
Transformer model: the specific mathematical architecture (design blueprint) used to build modern LLMs; all the popular LLMs use this design, which is why they're all good at understanding context across long pieces of text
Training: the process of feeding an LLM billions of examples of text so it learns patterns in language; like a student reading every book in every library, except the student is a computer doing it in weeks or months instead of lifetimes
Training tokens: the pieces of text used during training; roughly 1 token = 3/4 of an English word; big models train on tens of trillions of tokens
Training data: the giant collection of text the LLM learned from; includes websites, books, scientific papers, code, and more; what's NOT in the training data is what the AI won't know about
Anthropic: the company that makes Claude (the AI this video is partly about)
OpenAI: the company that makes ChatGPT and the GPT family of models
Google: makes the Gemini family of AI models
Static brain: a phrase from the video meaning an LLM on its own only knows what it was trained on; it cannot look things up, take actions, or update its knowledge unless you build those features around it
Domain: a field of knowledge (e.g., law, medicine, coding); LLMs trained on many domains can answer questions across many topics

🌉 Concrete analogy

An LLM is like an incredibly well-read scholar who has memorized millions of books but is locked in a room with no internet. They can answer almost anything based on what they've already read — but they cannot go look up today's news, access your company's private files, or take actions in the real world. That's why we need all the other pieces (RAG, agents, MCP) to extend what the LLM can do.

Where the metaphor breaks: The scholar analogy breaks when you think about "hallucination": the LLM doesn't "know" it doesn't know something the way a human scholar would; it will confidently generate wrong text if it doesn't have the right data. Also, unlike a scholar who reads word by word, the LLM processes all the text in its input in parallel (the attention mechanism).

⚙️ What happens, step by step

User sends a question to an AI like ChatGPT or Claude
The question goes to an LLM — a transformer model trained on enormous datasets
The LLM processes the question using patterns it learned during training
It generates a response, word by word (token by token), predicting what comes next
The response comes back to the user as readable text
The LLM has no memory of this conversation after it ends — each session starts fresh unless memory tools are added
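
The "predict what comes next" step can be illustrated with a toy model. The sketch below is a deliberately tiny stand-in, not a transformer: it only counts which word followed which in a made-up training text, then generates one token at a time by picking the most frequent follower. The training text and function names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "training data"; a real LLM trains on trillions of tokens.
TRAINING_TEXT = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the dog sat on the rug ."
)


def train_bigram_model(text):
    """Count, for every word, which words followed it in the training text."""
    follow_counts = defaultdict(Counter)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def generate(model, start_word, max_tokens=4):
    """Generate text one token at a time, always picking the likeliest next word."""
    output = [start_word]
    for _ in range(max_tokens):
        candidates = model.get(output[-1])
        if not candidates:
            break  # this word never had a successor in the training data
        next_word, _count = candidates.most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


model = train_bigram_model(TRAINING_TEXT)
print(generate(model, "dog"))  # prints: dog sat on the cat
```

The output is fluent-looking but not something the training text ever said, which is a small-scale picture of how a model can generate confident text from patterns alone.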

🧪 Try it / see it

Type a question, watch the LLM "predict the next token" one at a time.

“Popular LLMs like OpenAI's GPT, Anthropic's Claude, and Google's Gemini are all transformer models that are trained on large sets of data.” — speaker, at 00:01:08


Scene 03

The Context Window — How Much Can the AI Remember?

▶ Watch this section ⏱ starts at 00:02:03 Pi Step 2, Pi Step 3, Pi Step 7

Altitude: Core constraint — understanding the context window is why RAG and vector databases exist

😩 The confusion

You wonder why you can't just paste all your company's documents into the AI chat — and the answer is the context window has a size limit

💡 The relief

The context window is like the AI's "working desk" — it can only hold so many documents at once; this limit (measured in tokens) is why we need smarter systems to search through large knowledge bases

📚 Every technical term, explained (11 terms)

Context window: the total amount of text an AI model can "see" at one time during a conversation; includes your messages, the AI's replies, and any documents you paste in; measured in tokens
Token: roughly 3/4 of an English word; the unit LLMs use to count text; "Hello, world!" is about 3–4 tokens; 1 million tokens is roughly 750,000 words or about 50,000 lines of code
Short-term memory: in this context, the conversation history held in the context window during one session; the AI "remembers" everything in this window but forgets it when the session ends
Context window size: the maximum number of tokens the model can process in one go; varies by model; small/cheap models might have 4,000 tokens (~3,000 words); large models like Gemini 2.5 Pro can handle 1 million tokens (~750,000 words)
GPT-4.1: OpenAI's powerful model with up to 1 million token context window
Gemini 2.5 Pro: Google's model with 1 million token context window
Claude Opus 4: Anthropic's most capable model with 200,000 token context window
xAI Grok 4: another AI model with 256,000 token context window
Latency: how long you have to wait for a response; small/fast models have low latency (quick responses); large models can be slower
Flash/nano/mini models: smaller, cheaper, faster AI models with smaller context windows, good for simple tasks where speed matters more than capability
Lost-in-the-middle problem: the real research finding that even when information IS inside the context window, the LLM pays more attention to things at the beginning and end; relevant facts buried in the middle can be "lost"

🌉 Concrete analogy

The context window is like a whiteboard in a meeting room. Everything written on that whiteboard is what the AI can "see" and think about. The size of the whiteboard is the context window. If you have 500 GB of company documents, they obviously can't all fit on one whiteboard at once — you need a filing system to bring only the relevant pages to the whiteboard when needed. That filing system is what RAG and vector databases provide.

Where the metaphor breaks: The whiteboard analogy implies everything on it is equally visible — but research shows LLMs actually pay less attention to things in the middle of a long context ("lost in the middle"). Also, it doesn't capture that each new token in the context window adds cost to the API call.

⚙️ What happens, step by step

When you start a conversation with an AI, it opens a "context window" — an active memory space
Every message you send, and every reply the AI gives, gets added to this context window
If you paste a document into the chat, it also goes into the context window
When the total text in the conversation reaches the model's limit (e.g., 200,000 tokens for Claude), it can't take in more
The model uses everything in its context window to generate the next reply
When the session ends, the context window is cleared — the AI "forgets" everything unless memory tools save it elsewhere
TechCorp's 500 GB of documents far exceeds even the largest context windows — hence the need for retrieval systems (RAG)
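
The steps above can be sketched as a small simulation. This is a rough illustration under stated assumptions: `ContextWindow` is a hypothetical helper (not part of any provider's SDK), and the token estimate uses only the "1 token ≈ 3/4 of a word" rule of thumb rather than a real tokenizer.

```python
def estimate_tokens(text):
    """Rough estimate using the rule of thumb that 1 token is about 3/4 of a word."""
    word_count = len(text.split())
    return round(word_count / 0.75)


class ContextWindow:
    """Hypothetical helper that tracks how full a model's context window is."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = []
        self.used_tokens = 0

    def add(self, role, text):
        """Add a message if it fits; return False when the limit would be exceeded."""
        cost = estimate_tokens(text)
        if self.used_tokens + cost > self.max_tokens:
            return False
        self.messages.append({"role": role, "content": text})
        self.used_tokens += cost
        return True


window = ContextWindow(max_tokens=20)  # tiny limit so the effect is easy to see
print(window.add("user", "What is our vacation policy?"))                 # True: fits
print(window.add("assistant", "Employees get paid time off each year."))  # True: fits
print(window.add("user", "word " * 50))                                   # False: too large
print(window.used_tokens, "of", window.max_tokens, "tokens used")
```

Real applications usually respond to a full window by trimming or summarizing older messages rather than refusing new ones; the refusal here just makes the limit visible.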

🧪 Try it / see it

Move the slider to change the AI's context window size. Watch when the documents stop fitting.

Context window size: 200,000 tokens (slider positions: 4k, 200k, 1M)

Equivalent: 150,000 words ≈ 600 pages

TechCorp documents: 500 GB ≈ 100 BILLION tokens — never fits, even at 1M context. That's WHY we need RAG.
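
The 100-billion-token figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the 500 GB is plain text and that one token takes about 5 bytes; both are simplifying assumptions chosen for illustration, and real corpora (PDFs, images, other file formats) would give different numbers.

```python
BYTES_PER_GB = 1_000_000_000
ASSUMED_BYTES_PER_TOKEN = 5        # assumption: roughly 4 characters plus a space
CONTEXT_WINDOW_TOKENS = 1_000_000  # largest window mentioned in this section

corpus_bytes = 500 * BYTES_PER_GB
corpus_tokens = corpus_bytes // ASSUMED_BYTES_PER_TOKEN
windows_needed = corpus_tokens // CONTEXT_WINDOW_TOKENS

print(f"Estimated corpus size: {corpus_tokens:,} tokens")
print(f"Full 1M-token windows needed to hold it: {windows_needed:,}")
```

Under these assumptions the corpus is about 100,000 times larger than a 1M-token window, which is why only the relevant pieces can be brought into the window at any one time.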

“The context window is typically limited in size and the upper limit varies depending on the model.” — speaker, at 00:02:03


Scene 04

Embeddings — How AI Understands Meaning, Not Just Words

▶ Watch this section ⏱ starts at 00:05:07 Pi Step 4, Pi Step 5

Altitude: Key transformation — this concept is the "secret ingredient" that makes semantic search possible; without understanding embeddings, vector databases and RAG seem like magic

😩 The confusion

You wonder how a computer can understand that "vacation" and "time off" mean the same thing, or that "can I wear jeans?" is related to "dress code policy" — computers normally just match exact text

💡 The relief

Embeddings convert words and sentences into lists of numbers (vectors) that capture their *meaning*; similar meanings produce similar number patterns, so the computer can find related concepts even when the wording is completely different

📚 Every technical term, explained (8 terms)

Embedding: a way of converting text into a list of numbers (a vector) that represents the *meaning* of that text; like giving each sentence a GPS coordinate in "meaning-space" — sentences with similar meanings have coordinates that are close together
Vector: a list of numbers; for example [0.12, -0.85, 0.44, 0.67, ...]; each embedding model outputs vectors of a fixed length (1536 numbers for OpenAI's text-embedding-ada-002; other models use other lengths); the exact numbers encode meaning mathematically
Embedding model: a separate AI model (different from the LLM that answers questions) whose only job is to convert text into vectors; OpenAI's text-embedding-ada-002 is a common example
Semantic similarity: how close two pieces of text are in *meaning*, regardless of the exact words used; "holiday" and "vacation" have high semantic similarity
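Cosine similarity: the mathematical formula most commonly used to measure how "close" two vectors are; produces a score between -1 and 1, where 1 means the vectors point in the same direction (nearly identical meaning), 0 means unrelated, and negative values mean the vectors point in opposing directions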
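1536 dimensions: the number of numbers in each embedding vector for some popular models, such as text-embedding-ada-002; together the dimensions encode aspects of meaning (loosely, things like formality, topic, sentiment), although a single dimension rarely maps to one human-readable concept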
Semantic search: searching a database by meaning rather than by exact keyword matching; the query is converted to an embedding and compared against stored embeddings to find the closest matches
Vector space: the abstract mathematical "space" where embeddings live; you can think of it as a 3D map where each word or sentence is a dot, and dots with similar meanings are clustered together

🌉 Concrete analogy

Imagine a giant map of meaning where every sentence is placed as a dot. Sentences about "dogs and puppies" are clustered in one area. Sentences about "finance and money" are in another area far away. "Employee vacation policy" and "staff time off guidelines" are dots placed very close to each other even though they use different words. An embedding model is the system that assigns each sentence its location on this map. When you search by embedding, you're saying "find me all the dots near this location on the map."

Where the metaphor breaks: The map analogy suggests a flat 2D surface — but embedding space has 1536 dimensions, which means two sentences can be "close" in many different ways simultaneously. Also, the analogy doesn't explain why an embedding of "apple" might be near both "fruit" embeddings AND "iPhone" embeddings depending on context.

⚙️ What happens, step by step

A piece of text (e.g., "employee vacation policy") is fed into an embedding model
The embedding model outputs a list of 1536 numbers — the embedding vector
This vector is stored in a database alongside the original text
Later, when a user asks a question (e.g., "can I take time off?"), that question is also converted to an embedding vector
The system compares the question's vector to all stored vectors using cosine similarity (a math formula for "closeness")
The most similar stored vectors are retrieved — these are the most semantically relevant documents
Those relevant documents are passed to the LLM to generate the final answer
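
The compare-and-retrieve steps above can be sketched in a few lines. The vectors here are hand-made 3-number stand-ins (hypothetical values); a real system would get 1536-number vectors from an embedding model and store them in a vector database. `cosine_similarity` and `search` are names invented for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Pretend embeddings for three stored documents (made-up numbers).
STORED = {
    "employee vacation policy": [0.9, 0.1, 0.0],
    "dress code policy": [0.1, 0.9, 0.1],
    "quarterly finance report": [0.0, 0.1, 0.9],
}


def search(query_vector, store, top_k=1):
    """Return the texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _score, text in scored[:top_k]]


# Pretend embedding of the question "can I take time off?"
query_vector = [0.8, 0.2, 0.1]
print(search(query_vector, STORED))  # ['employee vacation policy']
```

Comparing against every stored vector, as done here, is fine for a handful of documents; vector databases exist largely to make this lookup fast when there are millions of them.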

🧪 Try it / see it

Sentences with similar meaning land close together in "meaning space" (shown here in 2D — real embeddings are 1,536-D).

"vacation policy" never says the word "PTO" but they're neighbors. Searching by meaning finds them all.

“Embeddings capture that semantic similarity.” — speaker, at 00:05:07


Scene 05

LLM vs. Agent — What Makes an Agent Different?

▶ Watch this section ⏱ starts at 00:07:04 Pi Step 4, Pi Step 3

Altitude: Critical conceptual divide — the difference between a passive LLM and an active agent is the central idea of the entire tutorial

😩 The confusion

"Agent" sounds like a fancy word for "AI" — you don't understand what's actually different about it or why it matters

💡 The relief

A plain LLM is a static brain — it only answers questions based on what it was trained on. An agent wraps that brain with autonomy (it decides what steps to take), memory (it remembers the conversation), and tools (it can actually *do* things like search databases or call APIs)

📚 Every technical term, explained (8 terms)

Agent: an AI system that can autonomously decide what steps to take to fulfill a request; it has access to tools (like search, databases, code execution) and can call them in whatever order it determines is needed
Autonomy: the ability to decide things independently without being explicitly told each step; an agent reads your request and figures out the path to the answer on its own
Memory (in agents): the ability to remember previous messages in the conversation and use that context in future replies; built by storing conversation history in a database
Tools (in agents): functions the agent can call to interact with the outside world; examples: search the web, query a database, send an email, check a calendar, run a calculation
Static brain: the speaker's term for a plain LLM with no tools or memory — it can only answer from its training data and has no way to look things up or take actions
Traditional software: programs written with explicit if-then rules coded by a human developer; the developer has to anticipate every possible scenario and write code for it; contrasted with agents that can reason through new scenarios
Conditional statement: a programmed rule like "IF the customer asks about refunds THEN do X" — traditional software is full of these; agents handle many scenarios without needing explicit rules
Orchestration: the process of coordinating multiple tools, memory systems, and LLM calls in the right order to complete a complex task

🌉 Concrete analogy

Compare a plain LLM to a brilliant professor who will answer any question you bring to their office. They know a tremendous amount but they sit still in their office: you have to bring the question to them, and they answer from their knowledge. An agent is that same professor but given a phone, a computer, a filing cabinet, and the freedom to walk around and do research before answering. The agent can decide "I need to look this up first" and go look it up; the office-bound professor cannot.

Where the metaphor breaks: This analogy implies the agent is always better — but agents are more expensive (more API calls), slower, and can make mistakes at each tool-calling step. A plain LLM is often better for simple questions where you already know all the information it needs.

⚙️ What happens, step by step

User sends a request: "What's your company's policy on refunding a damaged product?"
Plain LLM: reads the question, generates an answer from its training data — if TechCorp's refund policy wasn't in its training data, it makes something up or says it doesn't know
Agent: reads the question, then autonomously decides: "I need to search TechCorp's policy documents"
Agent calls a search tool (the vector database), retrieves relevant policy chunks
Agent reads the retrieved chunks, may call additional tools if needed (e.g., check customer's order status)
Agent generates a final answer grounded in TechCorp's actual documents
Agent stores this exchange in memory so future messages in the same conversation can reference it
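
The agent flow above can be sketched as a loop of decide, call a tool, answer, and remember. Everything in this sketch is a hypothetical stand-in so that it runs offline: the decision is a keyword check instead of an LLM call, the "vector database" is a dictionary, and the policy text is a placeholder rather than real TechCorp content.

```python
# Placeholder documents; a real agent would query a vector database instead.
POLICY_DOCS = {
    "refund": "[placeholder refund policy text]",
    "dress code": "[placeholder dress code text]",
}


def search_policy_docs(query):
    """Stand-in search tool: returns documents whose topic appears in the query."""
    query_lower = query.lower()
    return [text for topic, text in POLICY_DOCS.items() if topic in query_lower]


def decide_next_step(question):
    """Stand-in for the LLM's decision; a real agent would ask the model."""
    if "policy" in question.lower():
        return "search_policy_docs"
    return "answer_directly"


def run_agent(question, memory):
    """Decide on a step, optionally call the tool, answer, and save the exchange."""
    steps = [decide_next_step(question)]
    if steps[0] == "search_policy_docs":
        chunks = search_policy_docs(question)
        answer = " ".join(chunks) if chunks else "No matching policy found."
    else:
        answer = "(answer from the model's training data)"
    memory.append({"question": question, "answer": answer})
    return answer, steps


memory = []
answer, steps = run_agent(
    "What's your company's policy on refunding a damaged product?", memory
)
print(steps)   # ['search_policy_docs']
print(answer)  # [placeholder refund policy text]
```

The important difference from traditional software is hidden inside `decide_next_step`: in a real agent that choice is made by the model at run time, not by a hard-coded rule like the one used here.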

🧪 Try it / see it

Same question. Watch how a plain LLM and an agent handle it differently.

Plain LLM

👤 "What's TechCorp's refund policy?"
🧠 LLM thinks: "I don't have TechCorp's data in my training..."
⚠️ Either says "I don't know" OR hallucinates a generic policy

Agent

👤 "What's TechCorp's refund policy?"
🤖 Decides: "I should search the docs first"
🔎 Calls vector_db_search tool
📄 Reads the retrieved chunks
✅ Generates an answer grounded in real data
💾 Saves exchange to memory for follow-ups

“An agent on the other hand has autonomy, memory, and tools to perform whatever task it thinks is necessary to complete your request.” — speaker, at 00:07:04


Scene 06

LangChain — The Toolbox That Connects Everything

▶ Watch this section ⏱ starts at 00:07:38 Pi Step 5, Pi Step 4

Altitude: Infrastructure layer — LangChain is the framework that makes building agents practical; without it you'd have to write thousands of lines of low-level code yourself

😩 The confusion

After understanding what an agent should do, you face the terrifying question: "How do I actually BUILD all of this?" Memory management, database connections, switching between AI providers, tool routing — it sounds like months of work

💡 The relief

LangChain is a pre-built library (a collection of ready-to-use code components) that handles all the plumbing — you just configure which pieces you need and connect them, like using pre-made LEGO bricks instead of sculpting each brick yourself

📚 Every technical term, explained (13 terms)

LangChain: an open-source Python library (a collection of pre-written code you can use in your projects) that provides building blocks for AI agent applications; it handles provider connections, memory, tools, and output formatting
Library / Package: a collection of pre-written code that someone else built and published; you "import" it into your project to use its features without writing them from scratch
SDK (Software Development Kit): a package of tools that makes it easier to talk to a specific service; e.g., the OpenAI SDK is code that handles all the technical details of sending a request to OpenAI's servers; LangChain is like a "universal SDK" that wraps multiple providers
Abstraction layer: a layer of code that hides complex implementation details behind a simple interface; LangChain abstracts away "how do I talk to GPT-4 vs. Claude" so your code says the same thing regardless of which model you use
Chat model: in LangChain, a class (a reusable code template) that represents a connection to an LLM provider; you create one and call `.invoke()` on it to get a response
Provider: the company offering an AI model (OpenAI, Anthropic, Google); LangChain supports switching providers by changing one line of code
MemorySaver: a checkpointer component (shipped in the companion LangGraph package and used with LangChain agents) that automatically stores and retrieves conversation history so the agent remembers previous messages
Vector store: in LangChain, a connection to a vector database (like ChromaDB or Pinecone); provides a standard interface for storing and searching embeddings
Output parser: a LangChain component that converts the AI's free-text response into a structured format like a Python list or dictionary (a labeled data structure), so your program can use the data programmatically
Tool: in LangChain/agents, a Python function that the agent can choose to call; examples: search_company_db, send_email, check_inventory; the agent decides when to call which tool
Boilerplate code: repetitive code that must be written the same way every time; LangChain eliminates most boilerplate for AI applications
API (Application Programming Interface): a set of rules that lets your program communicate with another program or service (like OpenAI's servers); an API key is the password that grants access
Component: an individual, reusable piece in LangChain (e.g., a chat model, a memory store, a tool); you assemble components into a complete application
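
To make the "output parser" idea concrete, here is a hand-rolled stand-in, not LangChain's own parser class. It assumes the model was asked to reply with a comma-separated list; the function name and the sample reply are invented for this illustration.

```python
def parse_comma_separated_list(raw_reply):
    """Turn a free-text reply like 'a, b ,c,' into a clean Python list."""
    items = [item.strip() for item in raw_reply.split(",")]
    return [item for item in items if item]  # drop empty leftovers


raw_reply = "laptops, monitors , keyboards,"
print(parse_comma_separated_list(raw_reply))  # ['laptops', 'monitors', 'keyboards']
```

Once the reply is a list, the rest of the program can loop over it, count it, or store it, none of which is practical with a free-text sentence.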

🌉 Concrete analogy

Building an AI agent without LangChain is like building a car from raw metal and rubber — possible, but it takes enormous effort. LangChain is like buying a car kit with pre-made engine, transmission, wheels, and dashboard — you still assemble and configure it, but you're not forging each bolt yourself. Each LangChain component (memory, vector store, tools) is a pre-made car part.

Where the metaphor breaks: The car kit analogy implies the parts always fit together perfectly — but LangChain versions update frequently and sometimes components break compatibility with each other. Also, if you need highly custom behavior, LangChain's abstractions can get in the way rather than help.

⚙️ What happens, step by step

Without LangChain: you write custom code for each AI provider's API, build your own memory database schema, write your own vector search logic, and handle tool routing manually; the complexity grows quickly with every provider and tool you add
With LangChain: you import the components you need (e.g., ChatOpenAI, ChatAnthropic, MemorySaver, ChromaDB, custom tools)
To switch from OpenAI to Anthropic, you change ONE LINE (plus its import): `llm = ChatOpenAI(...)` becomes `llm = ChatAnthropic(...)`
Memory is handled by passing a MemorySaver to the agent — it automatically stores and retrieves conversation history
Vector database connections use a standard interface — whether you use Pinecone or ChromaDB, the code looks nearly identical
Tools are defined as Python functions and registered with the agent — the agent decides when to call them
The agent orchestrates all these components based on the user's message, deciding which tools to use and in what order
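
The one-line provider switch works because every chat model exposes the same method. The sketch below imitates that idea with two hypothetical classes that make no network calls; they are not LangChain's `ChatOpenAI` or `ChatAnthropic`, only stand-ins that share an `.invoke()` method, and the model names are placeholders.

```python
class FakeOpenAIChat:
    """Hypothetical stand-in for an OpenAI-backed chat model (no network calls)."""

    def __init__(self, model):
        self.model = model

    def invoke(self, prompt):
        return f"[{self.model} via OpenAI-style API] reply to: {prompt}"


class FakeAnthropicChat:
    """Hypothetical stand-in for an Anthropic-backed chat model (no network calls)."""

    def __init__(self, model):
        self.model = model

    def invoke(self, prompt):
        return f"[{self.model} via Anthropic-style API] reply to: {prompt}"


def answer_question(llm, question):
    """Application code: it relies only on the shared .invoke() method."""
    return llm.invoke(question)


llm = FakeOpenAIChat(model="model-a")       # swapping providers means changing this line
# llm = FakeAnthropicChat(model="model-b")  # ...to this one; nothing else changes
print(answer_question(llm, "What is the return policy?"))
```

Because `answer_question` never mentions a provider, the rest of the application is untouched by the swap; that separation is what the term "abstraction layer" refers to.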

🧪 Try it / see it

Switch providers in one click — LangChain hides all the wiring.


“LangChain is an abstraction layer that helps you build AI agents with minimal code.” — speaker, at 00:07:38


Scene 07

Your First API Call — Talking to an AI With Code

▶ Watch this section ⏱ starts at 00:10:49 Pi Step 5, Pi Step 6

Altitude: Hands-on layer — this is where abstract concepts meet actual code; seeing the real structure of an API call demystifies how "your program talks to the AI"

😩 The confusion

You've heard "make an API call" but have no idea what that means in practice — what does the code look like? What goes in? What comes back? How do you handle it?

💡 The relief

An API call to an AI is just sending a structured message (a JSON package with a role and content) to a web address, and getting back a structured reply — it's like a very formal text message with labeled fields

📚 Every technical term, explained (14 terms)

API call: sending a request from your program to another service over the internet and receiving a response; when you call the OpenAI API, your code sends a question to OpenAI's servers and gets back an AI-generated answer
API key: a secret password string (like "sk-abc123...") that identifies your account; every API call must include it so the provider knows who to charge; keep it private like a bank PIN
Base URL: the web address of the server your API calls are sent to; for OpenAI it's "https://api.openai.com/v1"; like a phone number for the AI service
Client: in code, an object (a data structure in your program) that manages the connection to an API; you create one client and use it to make all your calls; it handles authentication, retry logic, etc.
Environment variable: a named value stored in your computer's system settings (not in your code file) so you don't accidentally share secrets like API keys; accessed in Python with `os.getenv("OPENAI_API_KEY")`
Virtual environment: a self-contained Python installation for a specific project; keeps each project's libraries separate so they don't conflict with each other; like having separate toolboxes for each job
Import: a Python statement that loads a library into your program so you can use its functions; `import openai` loads the OpenAI library
Chat completions: OpenAI's API endpoint for conversational AI; you send a list of messages and get back the AI's next message
Message roles: the three roles in a conversation sent to the API: "system" (instructions that set the AI's behavior), "user" (your question), "assistant" (the AI's previous replies); together they form the conversation history
Response object: the structured data package the API sends back; contains the AI's reply text, how many tokens were used, timestamps, and other metadata; you have to navigate into it to extract what you need (e.g., `response.choices[0].message.content`)
Prompt tokens: tokens in the message you sent (your question and any documents); you pay for these
Completion tokens: tokens in the AI's reply; typically more expensive per token than prompt tokens
Total tokens: prompt tokens + completion tokens; used to calculate the cost of each API call
JSON (JavaScript Object Notation): a way to structure data as labeled key:value pairs, like a form; `{"role": "user", "content": "Hello"}` is JSON; APIs send and receive data in JSON format
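
Two of the terms above, JSON and token counts, can be tried without any API access. The sketch parses the JSON message from the definition and estimates the cost of a call from its token counts. The prices and the token numbers are made-up placeholders for illustration; check your provider's pricing page for real figures.

```python
import json

# The JSON message from the definition above, parsed into a Python dict.
message = json.loads('{"role": "user", "content": "Hello"}')
print(message["role"], "->", message["content"])

# Hypothetical prices in USD per 1 million tokens (placeholders, not real rates).
PRICE_PER_MILLION_PROMPT = 1.00
PRICE_PER_MILLION_COMPLETION = 4.00


def estimate_cost(prompt_tokens, completion_tokens):
    """Cost of one call: each token type is billed at its own rate."""
    prompt_cost = prompt_tokens / 1_000_000 * PRICE_PER_MILLION_PROMPT
    completion_cost = completion_tokens / 1_000_000 * PRICE_PER_MILLION_COMPLETION
    return prompt_cost + completion_cost


# Made-up usage numbers shaped like the usage fields of a response object.
usage = {"prompt_tokens": 1200, "completion_tokens": 300}
usage["total_tokens"] = usage["prompt_tokens"] + usage["completion_tokens"]

cost = estimate_cost(usage["prompt_tokens"], usage["completion_tokens"])
print(f"{usage['total_tokens']} tokens, estimated cost ${cost:.4f}")
```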

🌉 Concrete analogy

An API call is like sending a formal letter to a very smart pen pal. Your letter follows a strict format (the API specification): you include the pen pal's mailing address (the base URL), a password (the API key), instructions for how the pen pal should behave (system role), what you're asking (user role), and your conversation so far (message history). The pen pal reads your letter, writes a reply in the same formal format, and mails it back. You then open the envelope and look in the specific pocket (response.choices[0].message.content) where the actual reply is stored.

Where the metaphor breaks: The letter analogy implies slow delivery — real API calls typically take 0.5–5 seconds. Also, letters don't have "token costs" — the API analogy is closer to a pay-per-word telegraph service.

⚙️ What happens, step by step

Set up the environment: install Python, install the openai library (`pip install openai`), store your API key as an environment variable
Import the openai library and os library in your Python script (`import openai`, `import os`)
Create an API client: `client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url="https://...")`
Build your message list: a list of dicts with role ("system", "user") and content fields
Make the API call: `response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[...])`
Extract the reply from the response object: `text = response.choices[0].message.content`
Check token usage for cost: `response.usage.prompt_tokens`, `response.usage.completion_tokens`, `response.usage.total_tokens`

🧪 Try it / see it

Anatomy of an API call — every field has a job.

import os

import openai

# Create one client and reuse it for every call.
client = openai.OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),   # secret password, read from the environment
    base_url="https://api.openai.com/v1",  # where to send it
)

# Send the conversation so far and get back the model's next message.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are TechCorp support."},
        {"role": "user", "content": "What is the return policy?"},
    ],
)

text = response.choices[0].message.content  # the AI's reply
tokens = response.usage.total_tokens        # for cost tracking

system = sets the AI's persona/rules
user = the human's question
assistant = the AI's past replies; your code (or a framework such as LangChain) appends them to the message list so the model has memory of the conversation
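
One way the three roles build up over a multi-turn conversation is sketched below. To keep it runnable offline, `fake_chat_completion` is a hypothetical stand-in for the real API call; with a real client, that is the place where `client.chat.completions.create(...)` would go.

```python
def fake_chat_completion(messages):
    """Stand-in for the real API call so this sketch runs without a key."""
    last_user_message = messages[-1]["content"]
    return f"(reply to: {last_user_message})"


# The conversation starts with the system message that sets the persona.
messages = [{"role": "system", "content": "You are TechCorp support."}]


def ask(question):
    """Send a question along with all earlier messages, then record the reply."""
    messages.append({"role": "user", "content": question})
    reply = fake_chat_completion(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


ask("What is the return policy?")
ask("Does that apply to damaged items?")
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user', 'assistant']
```

Because the whole list is sent on every call, the second question can say "that" and still be understood; it is also why long conversations consume more prompt tokens with each turn.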

“The API key works like a password that identifies us and grants access.” — speaker, at 00:10:49


Scene 08

Prompt Engineering — How to Talk to AI Effectively

▶ Watch this section ⏱ starts at 00:18:12 Pi Step 3, Pi Step 4, Pi Step 5

Altitude: Communication layer — even with a perfect agent setup, bad prompts produce bad results; this is the skill that multiplies the value of everything else

😩 The confusion

You send a message to the AI and get a generic, vague, or completely wrong answer — and you don't know why or how to fix it

💡 The relief

The quality of the AI's output is directly shaped by the quality of your input; prompt engineering is the discipline of crafting instructions that guide the AI toward the exact behavior you want

📚 Every technical term, explained (10 terms)

Prompt: the text you send to an AI model as input; it can be a question, an instruction, a request, a description of a task, or a combination of all of these
Prompt engineering: the practice of deliberately designing prompts to get better, more accurate, more structured, or more consistent responses from an AI
Zero-shot prompting: asking the AI to do something without providing any examples; you're relying entirely on the AI's existing training; "zero shots" = zero examples given
One-shot prompting: providing exactly one example of the desired output format before making your actual request; "here's one example, now do it for this case"
Few-shot prompting: providing multiple examples (typically 3–6) of the desired format/style/tone before the actual request; the AI learns the pattern from the examples
Chain-of-thought prompting: instructing the AI to reason step by step through a problem before giving the final answer; produces more reliable results on complex tasks because the AI "shows its work"
System prompt: the initial instruction given to the AI that sets its role, behavior, and constraints for the entire conversation; usually invisible to the end user but powerfully shapes responses
Role definition: telling the AI what character or professional role to play (e.g., "You are a TechCorp customer support expert"); helps the AI adopt the appropriate tone, knowledge domain, and response style
Specificity: how precise and detailed your prompt is; vague prompts ("write a policy") produce vague outputs; specific prompts ("write a 200-word GDPR-compliant privacy policy for European customers with a 30-day retention period") produce focused outputs
Constraints: limitations you add to a prompt to control the output (e.g., "in 200 words", "in bullet points", "cite sources", "do not include legal disclaimers")

🌉 Concrete analogy

Prompting an AI is like giving instructions to a very literal but extremely capable assistant. If you say "write me a report," they might write a 50-page novel or a 3-word sentence — technically both are "a report." But if you say "write me a 1-page executive summary of Q3 sales, using bullet points, highlighting the top 3 risks, and keeping the tone professional" — they'll nail it. The assistant has the capability; your instructions determine whether that capability is aimed correctly.

Where the metaphor breaks: The analogy implies the AI will always follow instructions perfectly — but AI models can still "hallucinate" (make things up), ignore constraints, or misinterpret ambiguous instructions even with well-crafted prompts. Prompt engineering reduces errors but does not eliminate them.

⚙️ What happens, step by step

Identify what you want the AI to produce (format, length, tone, content constraints)
Write a system prompt that sets the AI's role and persistent constraints (e.g., "You are a TechCorp support expert. Always respond in bullet points.")
For the user message, be specific: include the topic, any constraints, the desired format, and relevant context
Test the prompt and observe where the output falls short
Iterate: add examples if the format is wrong (one-shot or few-shot), add reasoning steps if the logic is wrong (chain-of-thought), add constraints if the output is too long/short/off-tone
Compare techniques side-by-side to find the right balance for your use case
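
The first three steps above can be sketched as a small helper that assembles the system prompt and a specific user message. This is a minimal illustration, not a standard API: `build_messages` and its parameters are hypothetical names, and the constraint wording is made up for the example.

```python
def build_messages(role, rules, task, constraints):
    """Assemble a system prompt (role + persistent rules) and a specific user message."""
    system = " ".join([f"You are {role}."] + list(rules))
    user_lines = [task] + [f"- {c}" for c in constraints]
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n".join(user_lines)},
    ]

messages = build_messages(
    role="a TechCorp support expert",
    rules=["Always respond in bullet points."],
    task="Write a refund policy that covers the points below.",
    constraints=[
        "about 200 words",
        "30-day refund window",
        "refund to the original payment method",
    ],
)

print(messages[0]["content"])
print(messages[1]["content"])
```

Keeping constraints in a list makes the "iterate" step cheap: when the output is too long or off-tone, you add or change one entry and rerun.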

🧪 Try it / see it

Same task. Watch quality climb as the prompt gets more specific.

Prompt	Likely output	Quality
"Write a policy"	5-page rambling generic doc	⭐
"Write a refund policy"	Generic refund policy in random format	⭐⭐
"Write a refund policy for TechCorp"	Reasonably company-flavored but inconsistent	⭐⭐⭐
"You are TechCorp support. Write a 200-word refund policy. Use bullet points. Include: 30-day window, original payment refund, damaged items free return."	Exact, on-brand, ready to ship	⭐⭐⭐⭐⭐

“The quality of your prompt directly impacts the quality of responses you receive.” — speaker, at 00:18:12

Scene 09

Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought — The Four Techniques

▶ Watch this section ⏱ starts at 00:19:24 Pi Step 5Pi Step 6Pi Step 8

Altitude: Skill layer — these four techniques are the practical toolkit; understanding when to use each one is the difference between getting mediocre vs. excellent AI outputs

😩 The confusion

You know you should write "better prompts" but don't know what that means concretely — what should you add? What's the difference between giving an example vs. giving reasoning steps?

💡 The relief

The four techniques are distinct tools for distinct problems: zero-shot for speed, one-shot for format consistency, few-shot for tone/style mimicry, chain-of-thought for complex reasoning — choosing the right tool dramatically improves quality

📚 Every technical term, explained (9 terms)

Zero-shot: no examples given; just your instruction; works when the AI knows the task type from training; "write a data privacy policy for our European customers" is zero-shot; fast but can be inconsistent in format
One-shot: one example given; you show the AI exactly one output you like, then ask it to produce something in the same style; "here's an example refund policy formatted in 5 sections — now write a remote work policy in the same format"
Few-shot: multiple examples (typically 3–6) given; more examples help the AI learn subtleties like tone, vocabulary, and level of formality; especially powerful for customer service scripts where consistent tone matters
Chain-of-thought (CoT): you provide a series of reasoning steps the AI should follow, rather than asking for a direct answer; "Step 1: Review GDPR requirements. Step 2: Analyze our current policy for gaps. Step 3: Research best practices. Step 4: Draft recommendations. Now fix our policy." — the explicit steps guide the AI to think methodically
In-context learning: the AI's ability to learn from examples PROVIDED in the prompt (not from retraining); one-shot and few-shot prompting exploit in-context learning
Format: the structural layout of the output (bullet points, numbered list, table, paragraph, JSON, etc.); providing an example is the most reliable way to enforce a specific format
Tone: the emotional register of writing (formal, casual, empathetic, authoritative); few-shot examples teach tone more reliably than describing it in words
Consistency: getting the AI to produce similar-quality, similarly-structured outputs across many different inputs; few-shot prompting improves consistency for high-volume tasks like customer service replies
Reasoning: the process of working through a problem logically, step by step; chain-of-thought prompting explicitly asks the AI to do this rather than jump straight to an answer

🌉 Concrete analogy

Think of the four techniques as different ways to explain a task to a new employee. Zero-shot: "Write me a refund policy." (You trust their judgment.) One-shot: "Here's how we wrote the last refund policy — write one for returns." (You give one template.) Few-shot: "Here are three examples of how our company writes all policies — now write this new one." (You show them the company style guide.) Chain-of-thought: "To write a good policy, first review our legal obligations, then check competitor policies, then draft a proposal in our standard format, then review it against our brand voice — now go." (You give them a step-by-step process.)

Where the metaphor breaks: Chain-of-thought can backfire if the reasoning steps you provide are wrong or lead the AI down an incorrect logical path. Also, few-shot examples can mislead the AI if the examples accidentally share a quirk (e.g., all end with "Thank you") that the AI then copies inappropriately.

⚙️ What happens, step by step

Start with zero-shot to see baseline quality — if it's good enough, stop here (cheapest and fastest)
If the FORMAT is wrong (wrong structure, wrong length) → add one-shot: provide one example output
If the TONE or STYLE is inconsistent → upgrade to few-shot: provide 3–6 examples that demonstrate the right style
If the REASONING or LOGIC is flawed → use chain-of-thought: break the task into explicit numbered steps
For most real business tasks, few-shot + chain-of-thought combined produces the most reliable outputs
Compare all four techniques side-by-side on the same task to calibrate your choice

🧪 Try it / see it

Zero-shot prompt shape: just the instruction, no examples.

Write a customer email apologizing for a late shipment.

Use when: simple task, you trust the AI's defaults.
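
For comparison, here is a rough sketch of how the same task could be shaped under each of the four techniques. The helper names, the example emails, and the reasoning steps are all hypothetical placeholders; in practice the examples would be real replies you were happy with.

```python
TASK = "Write a customer email apologizing for a late shipment."

# Hypothetical example emails, invented for illustration.
EXAMPLES = [
    "Hi Sam, we're sorry your order arrived late. We have refunded the shipping fee.",
    "Hi Ana, apologies for the delay on your order. We've upgraded you to express delivery.",
    "Hi Lee, we're sorry for the wait. Your package ships today with a 10% credit.",
]

# Hypothetical reasoning steps for the chain-of-thought variant.
STEPS = [
    "Acknowledge the delay.",
    "Explain what happened.",
    "State what we are doing to fix it.",
]

def zero_shot(task):
    return task

def one_shot(task, example):
    return f"Example:\n{example}\n\nNow write this in the same format:\n{task}"

def few_shot(task, examples):
    shown = "\n\n".join(f"Example {i}:\n{e}" for i, e in enumerate(examples, start=1))
    return f"{shown}\n\nNow, in the same style:\n{task}"

def chain_of_thought(task, steps):
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    return f"{numbered}\n\nFollow the steps in order, then complete the task:\n{task}"

prompts = {
    "zero-shot": zero_shot(TASK),
    "one-shot": one_shot(TASK, EXAMPLES[0]),
    "few-shot": few_shot(TASK, EXAMPLES),
    "chain-of-thought": chain_of_thought(TASK, STEPS),
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

The task sentence is identical in all four prompts; only the material placed before it changes, which is what makes a side-by-side comparison fair.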

“Choosing the right technique can dramatically improve results depending on the task.” — speaker, at 00:19:24

Scene 10

Vector Databases — Storing Meaning, Not Just Words

▶ Watch this section ⏱ starts at 00:26:44 Pi Step 3Pi Step 4Pi Step 5

Altitude: Storage layer — vector databases are what make semantic search possible at scale; this is the "library filing system" that enables RAG

😩 The confusion

Traditional keyword search is brittle: it only finds documents containing the EXACT words searched, so employees who search "vacation policy" miss "time off guidelines". It isn't obvious why keyword search fails or how vector databases fix it

💡 The relief

Vector databases store the *meaning* of text as numbers (embeddings), so searching for "vacation" automatically finds documents about "time off", "annual leave", "PTO" — because they all live close together in meaning-space

📚 Every technical term, explained (10 terms)

SQL database: the traditional type of database that stores data in rows and columns (like a spreadsheet); searching is done with exact keyword matching using SQL queries (e.g., SELECT * FROM documents WHERE content LIKE '%vacation%'); fast and reliable but only finds exact word matches
Keyword search: searching by matching the literal characters in the search term to characters in the document; fails when the user uses different words than the document uses
Vector database: a database designed specifically to store and search embedding vectors (lists of numbers representing meaning); instead of comparing text characters, it compares number patterns to find semantic matches; examples: Pinecone, ChromaDB, Weaviate, Qdrant
Pinecone: a popular cloud-based vector database service; managed (someone else runs the servers for you); good for production applications
ChromaDB: an open-source vector database you can run locally on your computer; popular for development and prototyping; free to use
Semantic search: searching by meaning; the query is converted to an embedding vector and compared to all stored vectors to find the closest semantic matches; what vector databases enable
Similarity score: a number (typically 0–1) indicating how semantically similar a stored document chunk is to the query; you set a threshold (e.g., 0.7) below which matches are ignored
Similarity threshold: the minimum similarity score required for a document to be returned as a match; setting it too high misses relevant results; too low returns irrelevant noise
Chunk: a fragment of a larger document that gets stored as a single entry in the vector database; instead of storing the entire 100-page employee handbook as one embedding, you split it into hundreds of smaller chunks and embed each one separately
Chunk overlap: when splitting a document into chunks, you intentionally let adjacent chunks share some text (e.g., the last 100 characters of chunk 1 are also the first 100 characters of chunk 2); prevents important context from being cut in half at a boundary

🌉 Concrete analogy

Imagine two filing systems for a library. The old keyword filing system puts books on shelves alphabetically by title — "Time Off Policy" and "Vacation Guidelines" end up on completely different shelves. The new vector filing system puts books on shelves by topic — all books about "taking leave from work" are in the same area, whether they're titled "vacation", "PTO", "annual leave", or "time off". When you walk in and ask for books about taking a break from work, the vector system finds all of them; the keyword system only finds books with those exact words in the title.

Where the metaphor breaks: The library analogy implies a clear physical separation between topics — but in vector space, concepts can be "close" to each other across multiple dimensions simultaneously (e.g., "apple" is close to both "fruit" and "iPhone" in different dimensions). Also, the library analogy doesn't capture the complexity of choosing chunk size and overlap — cut the book into pages or sentences or paragraphs? Each choice affects search quality differently.

⚙️ What happens, step by step

You take TechCorp's 500 GB of documents and prepare them for an embedding model (e.g., OpenAI's text-embedding model)
Each document is split into chunks (e.g., 500-character pieces with 100-character overlap between adjacent chunks)
Each chunk is converted to an embedding vector (a list of 1536 numbers)
Each chunk's text + its vector is stored in the vector database (e.g., ChromaDB)
When an employee asks a question, the question is also converted to an embedding vector
The vector database compares the question's vector to all stored chunk vectors using cosine similarity
Chunks with similarity above the threshold are returned as the most relevant results
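
The difference between the two search styles can be sketched in a few lines. Everything numeric here is an assumption made for illustration: the three-number vectors are written by hand, whereas a real embedding model would produce them (with hundreds or thousands of numbers each), and `keyword_search` and `semantic_search` are hypothetical helper names.

```python
import math

# Hypothetical 3-number "embeddings", hand-written for illustration only.
DOCS = {
    "Time off guidelines": [0.9, 0.1, 0.0],
    "Annual leave and PTO rules": [0.8, 0.2, 0.1],
    "Laptop security checklist": [0.0, 0.1, 0.9],
}
QUERY_TEXT = "vacation policy"
QUERY_VECTOR = [0.85, 0.15, 0.05]  # hand-written stand-in for the embedded query

def keyword_search(query, docs):
    """Return titles that literally contain every word of the query."""
    words = query.lower().split()
    return [title for title in docs if all(w in title.lower() for w in words)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vector, docs, threshold=0.7):
    """Return (title, score) pairs at or above the threshold, best match first."""
    scored = [(title, cosine(query_vector, vec)) for title, vec in docs.items()]
    hits = [(t, s) for t, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

keyword_hits = keyword_search(QUERY_TEXT, DOCS)
semantic_hits = [title for title, _ in semantic_search(QUERY_VECTOR, DOCS)]

print("keyword:", keyword_hits)    # empty: no title contains "vacation"
print("semantic:", semantic_hits)  # both leave-related titles
```

A vector database does the same comparison, but uses indexes so it does not have to score every stored vector one by one.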

🧪 Try it / see it

SQL keyword search (exact word match) vs. vector semantic search (meaning match): the same query gets different results.

“Instead of searching by value, we can now search by meaning.” — speaker, at 00:26:44

Scene 11

Embeddings in the Database — Chunking, Dimensions, and Scoring

▶ Watch this section ⏱ starts at 00:28:52 Pi Step 5Pi Step 7

Altitude: Implementation detail layer — once you decide to use a vector database, these three concepts (chunking, dimensions, scoring) are the tuning knobs that determine how well it actually works

😩 The confusion

You've accepted that vector databases are good, but now face the practical question: how do you actually set one up? Why doesn't it just work automatically? Why is there "tuning" involved?

💡 The relief

Three concrete parameters control vector DB quality: chunk size (how big each piece is), dimensionality (how many numbers capture the meaning), and similarity threshold (how strict "similar enough" is) — understanding each gives you control over accuracy

📚 Every technical term, explained (10 terms)

Dimensionality: the number of numbers in each embedding vector; more dimensions = more nuance captured, but larger storage and slower search; 1536 is the standard for OpenAI's text-embedding-ada-002; think of it as how many different "angles" of meaning are captured
Chunk size: the maximum number of characters (or tokens) in each piece of text that gets stored as a single vector; too large = each chunk contains too many different topics and the embedding becomes blurry; too small = chunks lose context and adjacent information is separated
Chunk overlap: the number of characters shared between adjacent chunks; prevents a sentence from being split awkwardly at a boundary; e.g., if chunk 1 ends at character 500 and you have 100-character overlap, chunk 2 starts at character 400
Scoring threshold: the minimum cosine similarity score a chunk must have to be returned as a search result; e.g., 0.75 means only chunks scoring 0.75 or higher against the query are returned; prevents irrelevant chunks from polluting the results
Cosine similarity: the mathematical measure of how similar two vectors are; 1.0 = identical, 0.0 = unrelated, -1.0 = opposite meaning; the standard metric for vector database retrieval
Semantic drift: when a chunk's embedding doesn't accurately represent its content because it contains too many different topics mixed together (caused by too-large chunk size)
Context preservation: ensuring that the meaning of text isn't lost when it's split into chunks; overlapping chunks and sentence-aware splitting both help with this
Retrieval: the act of querying the vector database to get back relevant chunks; one of the three steps in RAG (retrieval, augmentation, generation)
False positive: a chunk returned by the search that is NOT actually relevant to the query; caused by a similarity threshold that is too low
False negative: a relevant chunk NOT returned because its similarity score fell below the threshold; caused by a threshold that is too high
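
The trade-off between false positives and false negatives is easy to see with a handful of made-up scores. In this sketch the chunk names, similarity scores, and relevance labels are all hypothetical; in a real system the scores come from the vector database and relevance is judged by a person during testing.

```python
# (chunk description, similarity score, is it actually relevant?) -- all invented.
SCORED_CHUNKS = [
    ("Remote work guidelines, section 2", 0.91, True),
    ("Work-from-home equipment stipend", 0.78, True),
    ("Office parking rules", 0.52, False),
    ("Holiday party schedule", 0.31, False),
]

def evaluate(threshold, scored):
    """Return (chunks returned, false positives, false negatives) for a threshold."""
    returned = [relevant for _, score, relevant in scored if score >= threshold]
    false_positives = sum(1 for relevant in returned if not relevant)
    false_negatives = sum(
        1 for _, score, relevant in scored if relevant and score < threshold
    )
    return len(returned), false_positives, false_negatives

for threshold in (0.95, 0.75, 0.40):
    print(threshold, evaluate(threshold, SCORED_CHUNKS))
```

With these numbers, 0.95 returns nothing and misses both relevant chunks, 0.40 lets one irrelevant chunk through, and 0.75 returns exactly the two relevant ones. The right value for real data has to be found by testing.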

🌉 Concrete analogy

Imagine you're cutting a large book into index cards to file them in a topic-based drawer system. Chunking is deciding how much text goes on each card — too much and the card covers too many topics (hard to file correctly); too little and you lose context (a card that says "must not request" without the surrounding sentence is meaningless). Dimensionality is how many different attributes you write on each card to describe its meaning (just "topic" vs. "topic + tone + formality + time period"). The scoring threshold is how strict the librarian is when matching your query to a card — very strict means fewer but more relevant matches; loose means more matches but some irrelevant ones.

Where the metaphor breaks: The index card analogy breaks down for very long documents where the optimal chunk size depends on document structure (legal docs need larger chunks to preserve clause context; customer support transcripts work better with sentence-level chunks). Also, the analogy doesn't capture that chunk overlap can cause duplicate information to appear in search results.

⚙️ What happens, step by step

Choose chunk size based on document type (e.g., 500 characters for business policies; larger for legal documents)
Choose chunk overlap (e.g., 100 characters) to prevent context from being cut at boundaries
Split all documents into overlapping chunks using a text splitter (LangChain's RecursiveCharacterTextSplitter is a common tool)
Choose embedding model and its dimensionality (e.g., OpenAI text-embedding → 1536 dimensions)
Convert each chunk to an embedding vector using the embedding model
Store each (chunk text + embedding vector) pair in the vector database
Set the similarity threshold for retrieval (e.g., 0.75) based on testing — too high misses results, too low adds noise
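
To make chunk size and overlap concrete, here is a minimal character-based splitter. It is a simplified stand-in, not LangChain's RecursiveCharacterTextSplitter (which also tries to break at separators such as paragraphs and sentences); `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into fixed-size character chunks whose neighbors share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

# A 1200-character dummy document made of repeating digits.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=500, overlap=100)

print(len(chunks))                            # 3
print(chunks[0][-100:] == chunks[1][:100])    # True: the shared overlap
```

With a 500-character chunk size and 100-character overlap, the second chunk starts at character 400, matching the example in the chunk overlap definition.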

🧪 Try it / see it

Adjust chunk size & overlap. See how the same paragraph gets split.

Chunk size: 80 chars Overlap: 20 chars

“Setting up a score threshold based on the question can help you limit those low similarities to count as a match.” — speaker, at 00:28:52

Scene 12

RAG — Retrieval-Augmented Generation

▶ Watch this section ⏱ starts at 00:35:15 Pi Step 4Pi Step 5Pi Step 3

Altitude: System architecture layer — RAG is the core pattern that enables AI to answer questions about private, large, or up-to-date data that wasn't in the LLM's training; this is the breakthrough that makes enterprise AI assistants possible

😩 The confusion

You understand LLMs and vector databases separately, but not how they connect — how does the AI actually "know" about TechCorp's 500 GB of documents when the documents were never part of its training?

💡 The relief

RAG is the three-step bridge: (1) Retrieve relevant chunks from the vector DB, (2) Augment the LLM's prompt by injecting those chunks as context, (3) Generate an answer using the LLM with that fresh context — so the LLM "knows" about your documents at answer time without being retrained

📚 Every technical term, explained (11 terms)

RAG (Retrieval-Augmented Generation): a design pattern for AI systems where the AI first retrieves relevant information from a database, injects it into the prompt, and then generates an answer using that retrieved information; combines the search power of vector databases with the language generation power of LLMs
Retrieval: the R in RAG; the step where the user's question is converted to an embedding and used to search the vector database for relevant document chunks
Augmentation: the A in RAG; the step where the retrieved document chunks are inserted into the LLM's prompt as additional context; "augmenting" the prompt with fresh, relevant information
Generation: the G in RAG; the step where the LLM reads the augmented prompt (question + retrieved context) and generates the final answer; the LLM uses the provided context rather than relying solely on its training
Context injection: another name for augmentation; literally copying relevant text chunks into the prompt so the LLM can read and reference them when formulating its answer
Pre-training knowledge: what the LLM already knows from its training (static, can become outdated); RAG provides a way to add current, private knowledge without retraining
Fine-tuning: an alternative to RAG where you actually modify the LLM's weights (internal parameters) by training it on your custom data; more expensive and time-consuming than RAG, and the knowledge can still become outdated
Hallucination: when an LLM confidently generates false information; RAG reduces hallucination by grounding the answer in retrieved documents; you can also instruct the LLM: "if the answer isn't in the context, say you don't know"
Source attribution: showing the user which document(s) the AI used to answer their question; a key feature of production RAG systems for trust and verification
Grounded answer: an answer that is explicitly based on provided source documents, not just the LLM's training; RAG produces grounded answers
Q&A engine: a system that accepts questions and produces answers from a specific knowledge base; what a complete RAG system implements

🌉 Concrete analogy

RAG is like an open-book exam. Without RAG, the AI is a student taking a closed-book exam — it can only answer from what it memorized during training, and it may confidently write wrong answers when it didn't have the material. With RAG, the student is allowed to open a specific textbook at exam time. The question says "What's TechCorp's refund policy?" — the student (AI) opens the textbook (vector database), finds the relevant pages (retrieved chunks), reads them, and writes an answer based on what's on those specific pages rather than guessing.

Where the metaphor breaks: The open-book exam analogy breaks when the "book" (vector database) has poor chunking or low-quality embeddings — even with the book open, if the right page is hard to find or the text is garbled, the student can't answer well. Also, RAG doesn't help when the question requires reasoning across many different documents simultaneously — the retrieval step might only return a few chunks, missing crucial connections.

⚙️ What happens, step by step

User asks: "What's our remote work policy for international employees?"
RETRIEVAL: The question is converted to an embedding vector
RETRIEVAL: The vector is compared to all stored document chunks in the vector database
RETRIEVAL: The top N most similar chunks are retrieved (e.g., top 3–5 chunks from policy documents)
AUGMENTATION: The retrieved chunks are injected into the LLM prompt as context: "Use ONLY the following documents to answer the question: [chunk 1] [chunk 2] [chunk 3]. Question: What's our remote work policy for international employees?"
GENERATION: The LLM reads the augmented prompt and generates a specific, grounded answer based on the retrieved documents
Source attribution: The answer is annotated with which document(s) it came from
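
The augmentation and source-attribution steps above can be sketched without any external service. This sketch assumes retrieval has already happened: `retrieved` is a hand-written list, and the file names and policy text in it are invented for illustration. `build_rag_messages` and `sources_used` are hypothetical helper names.

```python
def build_rag_messages(question, retrieved):
    """Augmentation step: inject retrieved chunks into the prompt as context.

    `retrieved` is a list of (source_name, chunk_text) pairs from the retrieval step.
    """
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(retrieved, start=1)
    )
    system = (
        "Use ONLY the following documents to answer the question. "
        "If the answer is not in them, say you don't know."
    )
    user = f"Documents:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def sources_used(retrieved):
    """Source attribution: unique document names, in retrieval order."""
    seen = []
    for source, _ in retrieved:
        if source not in seen:
            seen.append(source)
    return seen

# Hypothetical retrieval results; names and text are invented.
retrieved = [
    ("remote_work_policy.md", "International employees may work remotely with approval."),
    ("remote_work_policy.md", "Requests need sign-off from a manager."),
    ("hr_handbook.md", "Tax residency rules apply to long stays abroad."),
]

messages = build_rag_messages(
    "What's our remote work policy for international employees?", retrieved
)
print(messages[1]["content"])
print("Sources:", ", ".join(sources_used(retrieved)))
```

The generation step would pass `messages` to the chat API, and the source list would be appended to whatever answer comes back.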

🧪 Try it / see it

The 3-step RAG pipeline:

1. RETRIEVE: Question → embedding → vector DB → top chunks
2. AUGMENT: Inject chunks into prompt as context
3. GENERATE: LLM answers using ONLY the retrieved context

“RAG is a very powerful system that can instantly improve the depth of knowledge beyond its training data.” — speaker, at 00:35:15

Scene 13

Building the Full RAG Pipeline — Code Demo

▶ Watch this section ⏱ starts at 00:38:42 Pi Step 5Pi Step 6

Altitude: Implementation layer — this is where the RAG concept becomes concrete code you can actually run; connects abstract ideas to real Python + LangChain steps

😩 The confusion

After understanding RAG conceptually, you face the gap between "I get the idea" and "I could actually build this" — what does each step look like in code?

💡 The relief

The five-task lab walk-through shows exactly which code component handles each RAG step: ChromaDB for storage, embedding models for vectorization, prompt templates for augmentation, and LLM for generation — with the key guardrail of "only answer from retrieved context"

📚 Every technical term, explained (10 terms)

ChromaDB client: the Python object that manages the connection to a ChromaDB database; you create it with `chromadb.Client()` and use it to create/access collections
Collection: a named group of embeddings inside ChromaDB, similar to a table in SQL; e.g., "techcorp_rag" is a collection holding all of TechCorp's document chunks
Temperature: a parameter that controls how creative/random the LLM's responses are; 0.0 = very predictable and consistent (good for factual Q&A); 1.0 = more varied and creative (good for brainstorming); for RAG you typically want low temperature (0.0–0.3)
Max tokens: the maximum number of tokens the LLM is allowed to generate in its response; setting this prevents extremely long responses; e.g., 500 max tokens limits answers to ~375 words
Top P (nucleus sampling): another parameter controlling LLM randomness; works with temperature; typically left at default (1.0) for RAG applications
System prompt for RAG: the instruction given to the LLM that says "you must ONLY use the provided context to answer; if the answer isn't in the context, say so" — this is the critical guardrail preventing hallucination
Paragraph-based chunking: splitting documents at natural paragraph boundaries rather than at a fixed character count; preserves complete thoughts; recommended for RAG over fixed-size chunking
Source attribution: adding to the AI's response information about which document(s) it drew from; increases user trust and allows verification
Hallucination prevention: the practice of instructing the LLM to say "I don't have that information in the provided documents" when the context doesn't contain the answer, rather than making something up
Production-ready: a system that is stable, reliable, and scalable enough to be used by real users in a real business (vs. a prototype only used for testing)
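
Paragraph-based chunking can be sketched as follows. This is one simple way to do it, assuming paragraphs are separated by blank lines; it leaves out the overlap handling used in the lab, and `paragraph_chunks` is a hypothetical name.

```python
def paragraph_chunks(text, max_chars=500):
    """Group whole paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Dummy document: three paragraphs of 300, 300, and 100 characters.
doc = "A" * 300 + "\n\n" + "B" * 300 + "\n\n" + "C" * 100
chunks = paragraph_chunks(doc, max_chars=500)

print([len(c) for c in chunks])  # [300, 402]
```

The first two paragraphs do not fit together in 500 characters, so the first becomes its own chunk; the second and third fit together and are stored as one.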

🌉 Concrete analogy

Building the RAG pipeline is like setting up an automated reference librarian. Task 1 (vector store) is setting up the filing cabinet. Task 2 (chunking) is cutting documents into well-sized index cards and filing them. Task 3 (LLM integration) is hiring the librarian (connecting the AI). Task 4 (prompt template) is giving the librarian their rule book: "Only answer from the files in our cabinet — if it's not in our files, say so." Task 5 (full pipeline) is opening the library to the public: customer walks in with a question → librarian searches the files → reads the relevant cards → writes an answer → cites which file it came from.

Where the metaphor breaks: The RAG pipeline breaks when documents aren't chunked appropriately for the question type: a question requiring information from three different policy sections might only retrieve chunks from one section. Also, retrieval quality depends on the embedding model rather than the LLM: if the embedding model handles your domain's vocabulary poorly, or if queries are embedded with a different model than the stored chunks, retrieval accuracy can degrade.

⚙️ What happens, step by step

Task 1 — Set up vector store: initialize ChromaDB client, create a collection called "techcorp_rag", configure the embedding model (all-MiniLM-L6-v2)
Task 2 — Document processing: split TechCorp documents into paragraph-based chunks with smart overlap; store each chunk + its embedding in ChromaDB
Task 3 — LLM integration: connect OpenAI GPT-4.1 Mini with specific parameters (temperature=0.1 for consistency, max_tokens=500); test simple generation before adding retrieval
Task 4 — RAG prompt template: build a prompt that always injects retrieved context and includes the guardrail instruction "only use the provided documents; if not found, say so"
Task 5 — Full pipeline: user query → embed query → search ChromaDB → retrieve top 3 chunks → inject into prompt → LLM generates answer → append source attribution
Validation: test with queries like "work from home policy" and verify the system surfaces "remote work guidelines" correctly

🧪 Try it / see it

The 5 lab tasks that build a working RAG system.

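Task 1: Set up vector store
import chromadb
client = chromadb.Client()
col = client.create_collection("techcorp_rag")
Task 2: Process documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer  # assumes the sentence-transformers package
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(document_text)  # document_text: raw text of one document
emb_model = SentenceTransformer("all-MiniLM-L6-v2")
col.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks, embeddings=emb_model.encode(chunks).tolist())
Task 3: Connect LLM
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1, max_tokens=500)
Task 4: RAG prompt template (the guardrail!)
"Use ONLY this context to answer. If not present, say 'I don't have that information.'"
Task 5: Full pipeline
query → embed → search → inject → LLM → answer + sources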

“Every answer points back to the document it was derived from — this transforms the system into a full production-ready Q&A engine.” — speaker, at 00:38:42

Scene 14

LangGraph — Multi-Step Workflows and Decision Trees

▶ Watch this section ⏱ starts at 00:42:27 Pi Step 3Pi Step 4Pi Step 5

Altitude: Orchestration layer — LangGraph adds the ability to build AI workflows that branch, loop, and pass state between steps; it's what you need when a single question-answer interaction isn't enough

😩 The confusion

You've built a RAG chatbot but realize it can only do one thing at a time — for complex multi-step tasks (like "analyze our EU data privacy compliance, check all relevant regulations, identify gaps, and generate a report") a simple Q&A flow completely breaks down

💡 The relief

LangGraph models workflows as a graph of nodes (individual processing steps) connected by edges (the rules that control which step runs next); shared state passes information between steps; conditional edges enable branching and looping — making complex multi-step AI workflows possible

📚 Every technical term, explained (12 terms)

LangGraph: an extension of LangChain specifically for building stateful, multi-step AI workflows; models the workflow as a directed graph where each step is a node and connections between steps are edges
Graph: a structure consisting of nodes (points) connected by edges (lines); LangGraph uses this metaphor to model AI workflows
Node: an individual processing step in a LangGraph workflow; implemented as a Python function; takes the current state as input and returns an updated state as output; examples: "search documents", "extract content", "evaluate compliance", "generate report"
Edge: a connection between two nodes that defines which node runs next; in LangGraph, edges can be fixed ("always go from A to B") or conditional ("go to B if X, go to C if Y")
Conditional edge: an edge whose destination depends on the current state; the router function inspects the state and returns the name of the next node to run; this is what enables branching logic (if/then behavior) in a workflow
State: the shared memory that all nodes in a LangGraph workflow can read from and write to; implemented as a Python dictionary (typed with TypedDict); as each node runs, it updates the state with its results which are then available to later nodes
StateGraph: the LangGraph class that holds the entire workflow definition (all nodes, edges, and state schema); you build the graph by calling methods on this object
Shared state: the concept of a single data structure that persists throughout the entire workflow and is accessible and modifiable by every node; like a shared whiteboard that each team member can update as they complete their step
Workflow: a sequence (possibly branching and looping) of processing steps that together accomplish a complex task
Iteration: running the same step multiple times in a loop until a condition is met; LangGraph supports this via conditional edges that route back to an earlier node
Multi-step reasoning: breaking a complex request into a series of sub-tasks, each handled by a different node; LangGraph orchestrates the sequence and manages data flow between sub-tasks
GDPR: General Data Protection Regulation; the European Union's main data privacy law; requires companies handling EU citizens' data to follow strict rules about how data is collected, stored, and processed

🌉 Concrete analogy

LangGraph is like a flowchart that actually executes. Imagine a compliance review flowchart on a whiteboard: "Box 1: Gather all relevant policy docs → Box 2: Extract key content → Box 3: Check GDPR compliance (score it) → Diamond: Is score above 75%? YES → Box 5: Write report. NO → Box 1: Gather more docs (loop back)." LangGraph turns that whiteboard flowchart into working code, where each box is a Python function (node), each arrow is an edge, and the diamond is a conditional edge. The state (the shared whiteboard) carries information — documents found, compliance score, identified gaps — from one box to the next.

Where the metaphor breaks: The flowchart analogy implies a rigid, pre-defined path — but LangGraph workflows can get complex with multiple parallel paths, nested graphs, and human-in-the-loop checkpoints that don't fit neatly on a whiteboard. Also, flowcharts don't capture the fact that LangGraph nodes can call LLMs internally, making each node potentially expensive and slow.

⚙️ What happens, step by step

Define the state schema: a Python TypedDict listing all the data fields that will be shared between nodes (e.g., topic, documents, current_document, compliance_score, gaps, recommendations)
Write each node as a Python function that takes state as input and returns a partial state update (e.g., node 1 populates the "documents" field; node 3 populates "compliance_score")
Create a StateGraph and add all nodes to it
Add edges to connect nodes in order
Add conditional edges at decision points: write a router function that reads the state and returns the name of the next node
Compile and run the graph with an initial state (e.g., {"topic": "EU data privacy policy"})
The graph executes: each node runs, updates state, and the next node is determined by edges until the END node is reached

🧪 Try it / see it

A LangGraph workflow = nodes (functions) + edges (rules for what runs next) + shared state.

The state dict (e.g. {"topic":..., "documents":..., "score":...}) is passed to every node, and each node can update it.

“With LangGraph, this becomes a graph where each node handles a specific responsibility.” — speaker, at 00:42:27

Scene 15

Building a LangGraph Research Agent — Code Demo

▶ Watch this section ⏱ starts at 00:45:31 · Pi Step 5 · Pi Step 6

Altitude: Implementation layer — the LangGraph lab builds a research agent step by step, from a single node to a full multi-tool intelligent agent; the most complex code demo in the video

😩 The confusion

LangGraph sounds powerful but also abstract — you need to see each component built from scratch to understand how nodes, edges, state, and tools actually come together in real code

💡 The relief

The seven progressive lab tasks build up from "hello world node" to a complete research agent with conditional routing, calculator tool, and web search — each task adds exactly one new concept so no single step is overwhelming

📚 Every technical term, explained (10 terms)

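TypedDict: a class from Python's typing module that declares which keys a dictionary has and what type each value should be; used to define the structure of the LangGraph state; e.g., `class AgentState(TypedDict): messages: list[str]` says the state has a "messages" field that holds a list of strings; Python itself does not enforce these hints at runtime, but editors, type checkers, and LangGraph read them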
END: a special LangGraph constant that marks the terminal node of the graph; when a conditional edge routes to END, the workflow stops
Greeting node / Enhancement node: the first two example nodes in the lab; the greeting node adds "Hello!" to state, the enhancement node takes that and adds "Welcome to our service!" — demonstrates how state accumulates through nodes
Multi-step flow: adding more nodes (draft, review) that each add a layer of processing; simulates real pipelines like "outline → draft → review → publish"
Conditional routing: using a router function that inspects the message content and returns the next node's name; e.g., "if the message contains a math operation, route to calculator; otherwise route to text handler"
Calculator tool: a simple Python function registered as a tool that can evaluate mathematical expressions; demonstrates tool integration in LangGraph
DuckDuckGo web search: a free web search API that LangGraph can use as a tool; the agent calls it when the query requires current information from the internet
create_react_agent: a prebuilt LangGraph function (imported from langgraph.prebuilt) that automatically creates an agent capable of choosing which tools to call based on the query; "ReAct" stands for Reasoning + Acting (a specific pattern of alternating between thinking and tool use)
Tool routing: the agent's ability to look at a query and decide which tool(s) to call; e.g., "42 * 17 = ?" → calculator tool; "What's the weather in Paris?" → web search tool; "Tell me about AI agents" → LLM knowledge
Dynamic tool orchestration: the agent deciding at runtime which combination of tools to use; not pre-programmed paths but genuine decision-making by the LLM

🌉 Concrete analogy

Building the LangGraph research agent is like assembling a team of specialists and teaching them to cooperate. Task 1 (imports) = hiring the team. Task 2 (first nodes) = assigning the first two people their jobs. Task 3 (edges) = drawing the org chart showing who hands work to whom. Task 4 (multi-step) = adding more team members for more complex tasks. Task 5 (conditional routing) = adding a dispatcher who decides which specialist handles each incoming request. Task 6 (calculator tool) = giving the math specialist a calculator. Task 7 (web search) = giving the research specialist internet access. The final result is a team that receives any question and automatically decides who handles it.

Where the metaphor breaks: The team analogy implies specialists always know their domain perfectly, but routing is decided by an LLM, which can get it wrong (sending a simple calculation to web search, or vice versa). Also, adding too many tools can confuse the agent about which to choose.

⚙️ What happens, step by step

Task 1: Import StateGraph, END, TypedDict; define a simple state schema with a "messages" field
Task 2: Write two node functions (greeting and enhancement); each takes state, appends to messages, returns updated state
Task 3: Create StateGraph, add both nodes, add edges connecting them, compile and invoke with initial state
Task 4: Add draft and review nodes; chain them all together; run and observe state accumulation
Task 5: Write a router function that inspects message content; use add_conditional_edges to create branching; test with different input types
Task 6: Define a calculator function, register it as a tool; add tool node; router detects math queries and routes to calculator
Task 7: Add DuckDuckGo search tool; use create_react_agent with both calculator and search tools; run queries of different types and observe the agent routing them correctly
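Tasks 5 and 6 can be illustrated without any framework. The sketch below uses only the standard library; the names `calculate`, `route`, and `calculator_node`, and the regular expression used to spot a math expression, are assumptions for this example rather than the lab's actual code. In a LangGraph graph, a function like `route` is what you would pass to `add_conditional_edges`.

```python
import ast
import operator
import re

# Arithmetic operators the calculator is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Hypothetical detector: "number operator number", e.g. "42 * 17".
_MATH = re.compile(r"\d+(?:\.\d+)?\s*[-+*/]\s*\d+(?:\.\d+)?")


def calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def route(state: dict) -> str:
    """Router: return the name of the node that should handle the last message."""
    last_message = state["messages"][-1]
    return "calculator" if _MATH.search(last_message) else "text_handler"


def calculator_node(state: dict) -> dict:
    """Node: pull the expression out of the last message and append the result."""
    expression = _MATH.search(state["messages"][-1]).group(0)
    answer = f"{expression} = {calculate(expression)}"
    return {"messages": state["messages"] + [answer]}


print(route({"messages": ["What is 42 * 17?"]}))
print(route({"messages": ["Tell me about AI agents"]}))
print(calculator_node({"messages": ["What is 42 * 17?"]})["messages"][-1])
```

Parsing with `ast` instead of calling `eval()` is a deliberate choice here: the calculator receives text that ultimately comes from a user or an LLM, so it should only ever execute arithmetic.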

🧪 Try it / see it

7 progressive tasks. Each adds exactly one new concept.

Task 1 — imports + state schema (TypedDict)
Task 2 — 2 node functions (greeting, enhancement)
Task 3 — connect them with edges, compile, invoke
Task 4 — add draft/review nodes (multi-step chain)
Task 5 — conditional edge with a router function
Task 6 — register a calculator tool
Task 7 — add DuckDuckGo search + create_react_agent() — the agent picks tools itself

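from typing import TypedDict

from langgraph.graph import StateGraph, END

# Shared state schema: every node reads from and writes to this dict.
class AgentState(TypedDict):
    messages: list[str]

# A node takes the current state and returns the fields it wants to update.
def greeting(state: AgentState) -> dict:
    # Build a new list rather than mutating the incoming state in place.
    return {"messages": state["messages"] + ["Hello!"]}

graph = StateGraph(AgentState)
graph.add_node("greet", greeting)
graph.set_entry_point("greet")   # the first node to run
graph.add_edge("greet", END)     # stop after the greeting node
app = graph.compile()
print(app.invoke({"messages": []}))  # {'messages': ['Hello!']}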

“This is dynamic tool orchestration, the foundation of modern AI agents.” — speaker, at 00:45:31

Scene 16

MCP — The Universal Plug for AI Tools

▶ Watch this section ⏱ starts at 00:49:31 · Pi Step 4 · Pi Step 5 · Pi Step 7

Altitude: Extension layer — MCP is what lets your AI agent reach beyond its built-in tools into ANY external system with minimal custom coding; this is the "plug it in and it just works" layer

😩 The confusion

You've built a LangGraph agent but now need it to query TechCorp's customer database, check inventory, look up support tickets — each one requires custom integration code; writing a new integration for every external tool is unsustainable

💡 The relief

MCP (Model Context Protocol) is a standard that lets you describe a tool in one place (a "server") and have any MCP-compatible AI agent (a "client") automatically discover and use it — like how USB allows any device to plug into any computer without writing custom drivers

📚 Every technical term, explained (11 terms)

MCP (Model Context Protocol): an open standard published by Anthropic (the makers of Claude) in November 2024; defines a common language for AI agents to discover and call external tools, databases, and APIs; think of it as "USB for AI tools"
MCP server: a small program you run that exposes one or more tools in the MCP format; it describes what each tool does, what inputs it needs, and what output to expect; once running, any MCP-compatible AI agent can automatically use its tools
MCP client: the AI agent side of an MCP connection; the agent connects to one or more MCP servers, discovers what tools are available, and can call them as needed
Tool decorator: in code, `@mcp.tool()` is placed above a Python function to label it as an MCP tool; this is how you tell the MCP server "this function is available as a tool for AI agents to call"
Self-describing interface: when the MCP server not only exposes the tool but also tells the AI agent what the tool does, what parameters it accepts, and what types of values it returns; the AI agent doesn't need a human to explain the tool to it
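STDIO transport: the mechanism used to send data between MCP server and client in the lab; STDIO stands for "standard input/output"; the client typically starts the server as a child process and the two exchange messages over that process's input and output streams, which makes it the simplest option when both run on the same machine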
FastMCP: a Python library that makes building MCP servers very easy; you just write regular Python functions and decorate them with @mcp.tool(); the library handles all the protocol details
Traditional API: a web endpoint that requires the developer to write specific code to call it, understand its exact parameters, and handle its specific response format; switching to a different API means rewriting integration code
Community MCP servers: MCP servers written by other developers and shared publicly; there are already community-built MCP servers for GitHub, Slack, databases, web search, and more — you can plug these into your agent without writing any server code yourself
Self-determining: the AI agent's ability to read the MCP tool's description and figure out how and when to call it, without the developer explicitly programming "call tool X when Y happens"
Human-in-the-loop: a design pattern where the AI agent pauses at certain points and waits for a human to approve or provide input before continuing; MCP can support this in advanced workflows

🌉 Concrete analogy

MCP works much like USB. Before USB, every device (mouse, keyboard, printer, camera) had its own unique connector and required custom drivers; it was chaos. USB created one standard port and protocol: plug in any USB device and your computer can figure out what it is and how to use it. MCP does the same for AI tools. Before MCP, every tool integration required custom code. With MCP, any tool that follows the standard can be plugged into any MCP-compatible AI agent. The MCP server is the USB device. The MCP client (your agent) is the USB port. The tool descriptions are the USB protocol that lets them understand each other automatically.

Where the metaphor breaks: The USB analogy breaks for performance-critical integrations — MCP adds a layer of abstraction that can introduce latency compared to direct API calls. Also, the self-describing nature works well for simple tools but for very complex APIs (like a full ERP system with hundreds of endpoints) the tool descriptions can become confusing for the AI agent.

⚙️ What happens, step by step

You write an MCP server using FastMCP: create a server object (`mcp = FastMCP("customer-db")`), then write tool functions decorated with `@mcp.tool()`
Each tool function has a clear description (docstring), typed input parameters, and a return type — this is how the MCP client (AI agent) learns what the tool does
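You make the MCP server available: with the STDIO transport used in the lab, the client typically launches the server script itself as a child process; with a network transport the server runs separately in the background
Your LangGraph agent is configured as an MCP client that connects to that server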
When the agent receives a query, it fetches the list of available tools from the MCP server
The agent decides autonomously which tool to call based on the query (e.g., order status query → call get_order_status tool)
The MCP server executes the function and returns the result to the agent, which uses it to generate the final response
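To make the "self-describing" idea in these steps concrete, here is a toy registry built with the standard library only. It is not the MCP SDK and does not speak the MCP protocol: the `tool` decorator, `describe_tools`, `call_tool`, and the fake order data are all hypothetical names invented for this sketch. It only shows how a docstring plus type hints can be turned into a description that a client could read before deciding what to call.

```python
import inspect
from typing import Callable

# Toy stand-in for a server's tool registry (hypothetical, not MCP itself).
TOOLS: dict[str, Callable] = {}


def tool(fn: Callable) -> Callable:
    """Register a function as a tool, similar in spirit to @mcp.tool()."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> str:
    """Return the shipping status for an order ID."""
    fake_orders = {"A100": "shipped", "A101": "processing"}  # made-up data
    return fake_orders.get(order_id, "unknown order")


def describe_tools() -> list[dict]:
    """What a client would 'discover': name, description, inputs, output."""
    descriptions = []
    for name, fn in TOOLS.items():
        signature = inspect.signature(fn)
        descriptions.append(
            {
                "name": name,
                "description": inspect.getdoc(fn),
                "parameters": {
                    p.name: p.annotation.__name__
                    for p in signature.parameters.values()
                },
                "returns": signature.return_annotation.__name__,
            }
        )
    return descriptions


def call_tool(name: str, **arguments):
    """What happens when the agent decides to use a tool."""
    return TOOLS[name](**arguments)


print(describe_tools())
print(call_tool("get_order_status", order_id="A100"))
```

In a real MCP setup the library generates and transmits this description for you; the point of the sketch is that a clear docstring and typed parameters are what the agent ends up reading.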

🧪 Try it / see it

MCP = USB for AI tools. Plug any server into any agent — no custom wiring.

📦 MCP Server

@mcp.tool() get_order(id)

@mcp.tool() check_inventory(sku)

@mcp.tool() create_ticket(...)

⟵ standard MCP protocol ⟶

🤖 Your Agent (MCP Client)

Auto-discovers tools.
Auto-reads their descriptions.
Auto-decides when to call them.

Without MCP: write a custom integration for each tool. With MCP: list of tools + descriptions, and the agent figures it out.

“MCP functions like an API, but with crucial differences that make it perfect for AI agents.” — speaker, at 00:49:31

Scene 17

Putting It All Together — The Complete AI Agent System

▶ Watch this section ⏱ starts at 00:55:19 · Pi Step 8 · Pi Step 1

Altitude: Summit — you now see how all the pieces form a complete, real-world AI system; this is the "full picture" after building each individual component

😩 The confusion

After learning 6–7 different technologies, you may feel like you have a bag of parts but not a clear picture of the assembled machine — how do all of these actually work together in TechCorp's final product?

💡 The relief

The final system combines every concept into a coherent flow: LLM brain + context window + embeddings + vector DB + RAG + LangChain + LangGraph + MCP + prompt engineering = an AI agent that can search 500 GB of documents in under 30 seconds, with 24/7 availability, context-aware responses, and memory of the conversation

📚 Every technical term, explained (10 terms)

System integration: connecting multiple separate components (LLM, vector DB, memory, tools) into a single working application
24/7 availability: the AI agent runs continuously as long as the application is running; no sick days, no time zones, no office hours
End-to-end pipeline: the complete sequence from raw user input all the way to final answer, with all intermediate steps handled automatically
Latency reduction: how much faster the new system is compared to the old approach; the video cites a reduction from 30 minutes (manual search) to under 30 seconds (AI-powered search)
Accuracy improvement: higher quality and more relevant answers thanks to context-aware semantic search (RAG) vs. keyword search or guessing
Chat history / conversation memory: the persistent record of the conversation that allows the agent to maintain context across multiple messages in the same session; built with LangGraph's MemorySaver checkpointer, which stores the state in memory under a conversation (thread) ID
Predictive analytics: using historical patterns in data to forecast future events; the next frontier beyond the system built in this video
Proactive compliance agent: an AI agent that monitors company behavior continuously and flags potential compliance issues before they become problems, rather than answering compliance questions reactively
Workflow automation: automating multi-step business processes (like document review, compliance checking, customer onboarding) using AI agent workflows (LangGraph)
Living intelligent system: the speaker's phrase for a system that actively reasons about data and takes proactive action, as opposed to a static document repository that just stores information

🌉 Concrete analogy

The complete TechCorp AI agent is like upgrading from a paper filing room with one part-time librarian to a fully automated digital research center open 24/7. Before: employees spent 30 minutes manually searching filing cabinets, often couldn't find what they needed, and the librarian went home at 5pm. After: any employee types a question in natural language, the system searches 500 GB of documents in under 30 seconds using meaning-based search, returns a grounded answer with source citations, and remembers the conversation for follow-up questions — at any hour, any day. Each technology you learned is one part of this research center: LLM = the analyst brain, vector DB = the semantic filing system, RAG = the "look it up before answering" protocol, LangChain = the software framework that connects everything, LangGraph = the complex workflow manager, MCP = the universal connector to external systems, prompt engineering = the instructions given to the analyst about how to respond.

Where the metaphor breaks: This "everything works together perfectly" summary glosses over real-world challenges: maintaining and updating the vector database as company documents change (document ingestion pipelines), managing API costs at scale, handling edge cases where the agent gives wrong answers confidently, security and access control (not all employees should see all documents), and latency at high traffic volumes.

⚙️ What happens, step by step

An employee types a question in the chat interface (e.g., "What's the remote work policy for international employees?")
The LangChain agent receives the message and adds it to conversation memory (the running history that is fed back into the LLM's context window on each turn)
The agent's LangGraph workflow kicks in: it determines this is a document search query
RAG pipeline: the question is embedded and compared against TechCorp's 500 GB vector database; top 3 relevant chunks are retrieved
The chunks are injected into the LLM's prompt along with the question and conversation history
If the question requires accessing live data (e.g., customer order status), the agent calls the appropriate MCP server tool and adds the result to the prompt
The LLM (Claude or GPT-4) generates a grounded, cited answer based on the retrieved documents and any tool results
The answer is returned to the employee in seconds, with source documents cited
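The retrieve-then-prompt part of this flow can be sketched in plain Python. Everything here is a toy stand-in: word counts replace a real embedding model, a four-item list replaces the 500 GB vector database, and the document names and texts are invented. The sketch stops at building the prompt, which is the point where a real system would call the LLM.

```python
import math
import re
from collections import Counter

# Made-up document chunks standing in for the vector database.
CHUNKS = [
    ("hr-policy.md", "Remote work policy: international employees may work remotely up to 90 days per year."),
    ("it-security.md", "All laptops must use full disk encryption and a password manager."),
    ("travel.md", "International travel requires manager approval two weeks in advance."),
    ("benefits.md", "Employees receive 25 days of paid vacation per year."),
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag of lowercase words (a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, k: int = 3) -> list[tuple[str, str]]:
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(CHUNKS, key=lambda chunk: cosine(query, embed(chunk[1])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, history: list[str], k: int = 3) -> str:
    """Inject retrieved chunks and conversation history into one prompt."""
    sources = "\n".join(f"[{name}] {text}" for name, text in retrieve(question, k))
    past = "\n".join(history)
    return (
        "Answer using only the sources below and cite them.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Conversation so far:\n{past}\n\n"
        f"Question: {question}"
    )


question = "What's the remote work policy for international employees?"
prompt = build_prompt(question, history=["User: Hi", "Assistant: Hello, how can I help?"])
print(prompt)
```

Labeling each chunk with its source name inside the prompt is what makes it possible for the model to cite documents in its answer.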

🧪 Try it / see it

The full TechCorp agent — all 16 concepts working together.

👤 Employee types question

↓

🧠 LangChain receives → adds to conversation MEMORY

↓

📊 LangGraph router decides: search? calculate? tool call?

↓

🔎 RAG: embed query → vector DB search → top chunks

↓

📝 Inject chunks into LLM prompt (with system guardrails)

↓

🔌 If live data needed: call MCP tools (order DB, calendar, ...)

↓

✅ LLM generates grounded answer + source citations

↓

⚡ Returned in <30s vs 30 min manual search

“The shift from static documents to living intelligent systems marks a turning point not just for Tech Corp, but for how every other business can unlock the full value of its knowledge.” — speaker, at 00:55:19

🧭 How to use this page (read first if you're new)
