# F1: GenAI Foundations
This module covers the core mechanics of Generative AI. Understanding these fundamentals is essential before diving into RAG, agents, or any FrootAI solution play.
## What Is a Large Language Model?
An LLM is a statistical next-token predictor. Given a sequence of tokens, it calculates probability distributions over the entire vocabulary and samples the next token. Repeat this thousands of times and you get coherent text, code, or structured output.
```
Input:  "The capital of France is"
Output: " Paris"  (probability: 0.97)
```
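Conceptually, each generation step samples from that probability distribution. A minimal, self-contained sketch with a toy three-entry vocabulary and made-up logits (a real model emits one logit per vocabulary entry):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick a token index from raw logits via temperature-scaled softmax sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]      # softmax probabilities
    return random.choices(range(len(logits)), weights=weights)[0]

vocab = [" Paris", " London", " Rome"]
logits = [5.0, 1.0, 0.5]                     # toy scores, not from a real model
print(vocab[sample_next_token(logits, temperature=0)])  # " Paris"
```

Appending the sampled token to the input and repeating this step is the whole autoregressive loop.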
:::info Key Insight
LLMs don't "understand" — they learn statistical patterns from trillions of tokens of training data. This is why grounding (connecting to real data) and guardrails (constraining outputs) are critical. See the AI Glossary for formal definitions.
:::
## Tokens — The Currency of AI
Tokens are sub-word units produced by Byte-Pair Encoding (BPE). They are how models read, think, and charge.
| Text | Token Count | Ratio |
|---|---|---|
| `"Hello, world!"` | 4 tokens | 1 token ≈ 0.5 words |
| `"Antidisestablishmentarianism"` | 6 tokens | 1 token ≈ 0.17 words |
| `{"name": "Alice"}` | 7 tokens | JSON is token-expensive |
| Average English prose | ~100 tokens | 1 token ≈ 0.75 words |
Cost formula: `cost = (input_tokens × input_price) + (output_tokens × output_price)`
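The formula is easy to wire into a helper. A sketch, with illustrative per-million-token prices (the dollar figures are assumptions; check your provider's current price sheet):

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

# Example: 10K input + 2K output tokens at $2.50 / $10.00 per 1M tokens (illustrative)
print(f"${request_cost(10_000, 2_000, 2.50, 10.00):.4f}")  # $0.0450
```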
:::tip Token Budget
Always set `max_tokens` in production. An unbounded response can burn through your budget on a single runaway generation. FrootAI solution plays configure this in `config/openai.json`.
:::
## Key Generation Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| `temperature` | 0–2 | 1.0 | Controls randomness: 0 = deterministic, 1 = balanced, 2 = creative chaos |
| `top_p` | 0–1 | 1.0 | Nucleus sampling: considers only tokens within cumulative probability *p* |
| `max_tokens` | 1–128K | Model limit | Hard cap on output length |
| `seed` | int | None | Enables reproducible outputs (same seed + temperature 0 = same result) |
| `frequency_penalty` | -2–2 | 0 | Reduces repetition of already-used tokens |
Never set both `temperature` and `top_p` to non-default values simultaneously — they interact unpredictably. Pick one.
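To see why temperature changes behavior, note that it rescales the logits before the softmax: a low temperature sharpens the distribution toward the top token, a high one flattens it. A small illustration with toy logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]
sharp = softmax_with_temperature(logits, 0.5)  # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # probabilities spread out
print(f"T=0.5 top prob: {sharp[0]:.2f}, T=2.0 top prob: {flat[0]:.2f}")
# T=0.5 top prob: 0.98, T=2.0 top prob: 0.63
```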
## Context Windows — Model Memory
The context window is the maximum number of tokens a model can process in a single request (input + output combined).
| Model | Context Window | ~Pages of Text |
|---|---|---|
| GPT-4o | 128K | ~200 pages |
| GPT-4o-mini | 128K | ~200 pages |
| GPT-4.1 | 1M | ~1,500 pages |
| Claude Sonnet 4 | 200K | ~300 pages |
| Llama 3.1 405B | 128K | ~200 pages |
| Gemini 2.5 Pro | 1M | ~1,500 pages |
Exceeding the context window causes truncation — the model silently drops older tokens. RAG (see F2) solves this by retrieving only relevant chunks.
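A pre-flight check that input plus reserved output fits is cheap insurance. A rough sketch using the ~0.75 words-per-token rule of thumb from the tokens section (for exact counts, use the model's real tokenizer, e.g. `tiktoken`):

```python
import math

def fits_in_context(prompt_words, max_output_tokens, context_window,
                    words_per_token=0.75):
    """Rough pre-flight check: estimated input tokens plus the reserved
    output budget must fit inside the context window."""
    est_input_tokens = math.ceil(prompt_words / words_per_token)
    return est_input_tokens + max_output_tokens <= context_window

print(fits_in_context(90_000, 4_096, 128_000))   # True  (~120K + 4K fits)
print(fits_in_context(100_000, 4_096, 128_000))  # False (~133K input alone is too big)
```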
## Model Parameters & VRAM
When someone says "a 7B model," they mean 7 billion trainable weights. More parameters generally mean better reasoning but higher infrastructure cost.
VRAM formula: `VRAM ≈ params × bytes_per_param × 1.2 (overhead)`
| Model Size | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| 405B | 1.6 TB | 810 GB | 405 GB | 203 GB |
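The table lists raw weight memory (params × bytes per param); the formula's 1.2 factor adds roughly 20% serving overhead for KV cache and activations. A minimal sketch of the estimate:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billions, precision, overhead=1.2):
    """Estimate serving VRAM in GB: params x bytes/param x overhead factor."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead

print(f"7B @ FP16, weights only:   {vram_gb(7, 'fp16', overhead=1.0):.1f} GB")  # 14.0 GB
print(f"7B @ FP16, with overhead:  {vram_gb(7, 'fp16'):.1f} GB")                # 16.8 GB
```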
## Quantization — Shrinking Models
Quantization reduces precision of model weights to lower VRAM and increase throughput:
- FP32 — Full precision, baseline quality, 4 bytes/param
- FP16/BF16 — Half precision, negligible quality loss, 2 bytes/param (production standard)
- INT8 — 8-bit integers, ~1% quality loss, 1 byte/param
- INT4 (GPTQ/AWQ) — Aggressive compression, noticeable quality loss on complex reasoning, 0.5 bytes/param
For self-hosted models, start with INT8 quantization โ it offers the best quality-to-cost ratio. Only go to INT4 if VRAM is severely constrained. See FrootAI Play 12 for AKS model serving patterns.
## Embeddings — Semantic Vectors
Embeddings convert text into dense vectors (e.g., 1536 or 3072 dimensions) where semantic similarity = vector proximity.
`embed("king") - embed("man") + embed("woman") ≈ embed("queen")`
Used for: semantic search, RAG retrieval, clustering, anomaly detection, recommendation. See cosine similarity in the glossary.
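"Vector proximity" is usually measured with cosine similarity, the normalized dot product. A toy sketch with 3-D vectors standing in for real 1536-plus-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc = [0.2, 0.8, 0.1]         # toy "document" embedding
query = [0.25, 0.75, 0.05]    # semantically close query
unrelated = [0.9, 0.05, 0.4]  # semantically distant text
print(cosine_similarity(doc, query) > cosine_similarity(doc, unrelated))  # True
```

Scoring every chunk against the query and keeping the top matches is the core step behind semantic search and RAG retrieval.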
## Training vs Inference
| Aspect | Training | Inference |
|---|---|---|
| Goal | Learn weights from data | Generate outputs from learned weights |
| Compute | Thousands of GPUs, weeks/months | Single GPU or API call, milliseconds |
| Cost | $2M–$100M+ per frontier model | $0.15–$60 per 1M tokens |
| Who does it | OpenAI, Meta, Google, Anthropic | You, via API or self-hosted |
99% of FrootAI solution plays use inference only — calling pre-trained models via API. Plays 13 (Fine-Tuning) and 12 (Model Serving) cover the exceptions.
## Practical Example — Azure OpenAI Chat Completion
```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Uses DefaultAzureCredential (e.g. via AZURE_CLIENT_ID) -- never hardcode keys
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_version="2024-12-01-preview",
    azure_ad_token_provider=token_provider,
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure deployment name
    temperature=0.7,
    max_tokens=500,
    seed=42,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in 3 sentences."},
    ],
)
print(response.choices[0].message.content)

# Token usage for cost tracking:
# response.usage.prompt_tokens, response.usage.completion_tokens
```
## Key Takeaways
- Tokens are the universal unit — understand them for cost, latency, and context management
- Temperature 0 + seed gives deterministic outputs for reproducible pipelines
- Context window ≠ quality — more context doesn't mean better answers (noise hurts)
- Quantization makes self-hosting viable — INT8 is the sweet spot
- Always set `max_tokens` — unbounded generation is a cost and safety risk
Next: F2: LLM Landscape →