JSON Complete Guide — RFC 8259, Data Types & AI Usage

What is JSON?

JSON stands for JavaScript Object Notation. It is a lightweight, text-based, language-independent data interchange format — designed to be easy for both humans to read and write, and for machines to parse and generate.

Despite having "JavaScript" in its name, JSON is completely language-agnostic. It is natively supported in Python, Go, Rust, Java, Ruby, PHP, Swift, Kotlin, and dozens of other languages. It has become the dominant format for data exchange on the web, replacing XML in most modern applications.

RFC 8259

Current Standard

ECMA-404

Grammar Spec

6

Data Types

2001

Year Created

💡 Core Design Goals

RFC 8259 states that JSON's design goals were for it to be minimal, portable, textual, and a subset of JavaScript. ECMA-404 adds that JSON shares a small subset of ECMAScript's textual representations with all other programming languages, which is what lets it serve as a common format for structured data.

History & Official Standards

JSON was created by Douglas Crockford at State Software in 2001. It was derived from the object literal syntax of JavaScript (ECMAScript), but was designed to be usable from any language. The first public JSON website, json.org, went online in 2002.

2001

JSON Invented

Douglas Crockford and colleagues at State Software coin the term "JSON" and begin using it as a lightweight alternative to XML for browser-to-server communication.

2002

json.org Published

Crockford publishes json.org with the first specification and syntax diagrams. The format spreads quickly among web developers tired of verbose XML.

2006

RFC 4627 — First IETF Standard

JSON receives its first IETF specification as RFC 4627, an Informational RFC authored by Crockford that also registered the application/json media type. This legitimized JSON in enterprise and government contexts.

2013

ECMA-404 Published

Ecma International publishes ECMA-404, providing a clean, grammar-only specification of JSON syntax — free of the opinionated guidance in the IETF RFC.

2017

RFC 8259 — Current Standard

The IETF publishes RFC 8259, making UTF-8 mandatory for JSON transmitted over a network and resolving ambiguities in earlier versions. This is the definitive standard today.

2017

ISO/IEC 21778:2017

JSON is also published as an ISO/IEC international standard, completing its journey from a single website to a format ratified by multiple standards bodies.

JSON Format Specification Sources: RFC 8259 · ECMA-404 · ISO/IEC 21778

Item	Value	Notes
Current IETF Standard	RFC 8259 (Dec 2017)	Mandatory UTF-8 for networked JSON
ECMA Standard	ECMA-404 (2nd ed. 2017)	Grammar-only, no semantic restrictions
ISO Standard	ISO/IEC 21778:2017	International ratification
MIME Type	application/json	Registered Internet Media Type
File Extension	.json	Universally recognized
Encoding	UTF-8 (required for network)	UTF-16 and UTF-32 were permitted by earlier RFCs; RFC 8259 dropped them for open interchange

The 6 JSON Data Types

JSON defines exactly six value types: four primitives (string, number, boolean, null) and two structured types (object, array). This small set, and nothing more, is what makes JSON both powerful and interoperable. No functions, no dates, no binary data, no comments. Just these six types.

String

"Hello, world"

"jsonl.ai"

"" (empty ok)

Unicode text in double quotes. Supports escape sequences like \n \t \\ \"

Number

-7.5

1.6e10

Integer or float. No distinction. No NaN, no Infinity. Stored as IEEE 754 double in most implementations.

Boolean

true

false

Must be lowercase. True and TRUE are invalid JSON, and "true" in quotes is a string, not a boolean.

Null

null

Represents the intentional absence of any value. Must be lowercase. NULL, Null are invalid.

Object

{

"key": "value"

}

Unordered set of key-value pairs. Keys must be strings. Values can be any JSON type, including nested objects.

Array

[ 1, "a", true ]

[] (empty ok)

Ordered sequence of values. Values can be mixed types. Arrays can be nested inside objects and vice versa.

⚠ What JSON Does NOT Support

JSON has no date type (use ISO 8601 strings like "2025-03-21T10:00:00Z"), no comments (despite popular demand), no undefined, no binary, no NaN or Infinity, and no trailing commas. These are the most common gotchas for developers coming from JavaScript.
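Because JSON has no date or binary type, applications usually convert such values to strings before serializing. The following is a minimal Python sketch of that idea, assuming ISO 8601 for dates and base64 for bytes; the `to_jsonable` helper and the field names are illustrative and not part of any standard.

```python
import base64
import json
from datetime import datetime, timezone

def to_jsonable(value):
    """Fallback for json.dumps: convert types JSON cannot represent."""
    if isinstance(value, datetime):
        return value.isoformat()                      # ISO 8601 string
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    raise TypeError(f"Cannot serialize {type(value).__name__}")

# Hypothetical record containing a timestamp and a few raw bytes.
record = {
    "created": datetime(2025, 3, 21, 10, 0, 0, tzinfo=timezone.utc),
    "thumbnail": b"\x89PNG",
}

text = json.dumps(record, default=to_jsonable)
print(text)

# The receiver gets plain strings and must convert them back itself.
decoded = json.loads(text)
created = datetime.fromisoformat(decoded["created"])
thumbnail = base64.b64decode(decoded["thumbnail"])
```

Note that `isoformat()` writes the UTC offset as `+00:00` rather than `Z`; both forms are valid ISO 8601, so whichever you choose should be agreed with the consumer.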

Syntax Rules & Anatomy

JSON's grammar is intentionally minimal. Whitespace (spaces, tabs, line feeds, carriage returns) between tokens is insignificant and used purely for readability. The entire format is built from six structural characters: { } [ ] : ,

anatomy.json
JSON
{                                       ← object opens
  "model": "gpt-4o",                  ← string value
  "context_window": 128000,            ← number value
  "multimodal": true,                  ← boolean value
  "fine_tune_id": null,               ← null value
  "capabilities": [                   ← array value
    "text",
    "vision",
    "code"                             ← no trailing comma on last item
  ],
  "pricing": {                         ← nested object
    "input_per_1m": 5.00,
    "output_per_1m": 15.00
  }                                     ← no comma after last member
}                                       ← object closes

Objects

A JSON object is an unordered collection of name/value pairs wrapped in curly braces {}. Each name must be a string (in double quotes), followed by a colon :, then the value. Pairs are separated by commas. The order of members is not significant per the spec — parsers may return them in any order.

📌 Key Rule: No Duplicate Keys

RFC 8259 recommends against duplicate keys within a single object. The behavior of implementations that encounter duplicate names is "unpredictable" per the spec. For interoperability, always use unique keys in your JSON objects.
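To make the duplicate-key problem concrete, here is a small Python sketch. It relies on the standard library's documented behaviour (the last value for a repeated name is kept) and on the `object_pairs_hook` parameter of `json.loads`; the `reject_duplicates` helper is a hypothetical name introduced for illustration.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises if an object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"Duplicate key: {key!r}")
        seen[key] = value
    return seen

doc = '{"model": "a", "model": "b"}'

# Default behaviour of Python's json module: the last value wins silently.
print(json.loads(doc))   # → {'model': 'b'}

# Strict behaviour: fail loudly instead of silently picking one value.
try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)           # → Duplicate key: 'model'
```

Other parsers may keep the first value or report an error, which is why unique keys are the only portable choice.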

Arrays

A JSON array is an ordered sequence of values wrapped in square brackets []. Values are separated by commas. Arrays are zero-indexed and can contain values of mixed types — including other objects and arrays (enabling arbitrarily deep nesting).

arrays-example.json
JSON
// Homogeneous array (all strings)
["user", "system", "assistant"]

// Heterogeneous array (mixed types — valid!)
[1, "hello", true, null, {"key": "val"}]

// Array of objects (the most common pattern in APIs)
[
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi! How can I help?"}
]

Strings & Escape Sequences

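JSON strings must use double quotes (single quotes are invalid). Almost any Unicode character can appear unescaped in a string; the exceptions are the double quote, the backslash, and the control characters U+0000 through U+001F, which must be escaped with a backslash \.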

Escape Sequence	Character	Notes
`\"`	Double quote	Required — unescaped `"` ends the string
`\\`	Backslash	Required — single `\` starts an escape
`\/`	Forward slash	Optional — useful in HTML contexts
`\n`	Newline (LF)	Most common whitespace escape
`\r`	Carriage return	Used with `\n` for CRLF
`\t`	Tab	Horizontal tab character
`\b`	Backspace	Rarely used
`\f`	Form feed	Rarely used
`\uXXXX`	Unicode escape (four hex digits)	e.g. `\u00e9` = é; characters outside the Basic Multilingual Plane use a surrogate pair
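In practice you rarely write these escapes by hand, because a serializer applies them for you. The sketch below uses Python's standard `json` module to show the escapes from the table being produced and reversed; the sample string is made up for the demonstration.

```python
import json

# Contains a quote, a newline, a tab, a backslash, and non-ASCII text.
text = 'She said "hi"\n\tC:\\temp é 😀'

escaped = json.dumps(text)                        # ASCII-only output (default)
readable = json.dumps(text, ensure_ascii=False)   # keep non-ASCII as-is

print(escaped)
print(readable)

# Both forms decode back to the same Python string.
assert json.loads(escaped) == json.loads(readable) == text

# A character outside the Basic Multilingual Plane is written as a
# UTF-16 surrogate pair when escaped.
print(json.dumps("😀"))   # → "\ud83d\ude00"
```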

Numbers

JSON makes no distinction between integers and floating-point numbers — there is only "number." Numbers may be positive or negative, integer or decimal, with optional scientific notation. JSON does not allow NaN, Infinity, or -Infinity. Leading zeros (like 007) are prohibited except for 0 itself.

numbers.json
JSON
{
  "integer":     42,       ✓ valid
  "negative":    -17,      ✓ valid
  "float":       3.14159,  ✓ valid
  "scientific":  1.6e-19,  ✓ valid (roughly the elementary charge in coulombs)
  "zero":        0         ✓ valid (last member, so no trailing comma)
  // "leading_zero": 007      ✗ INVALID — leading zeros banned
  // "nan": NaN               ✗ INVALID — NaN not in JSON spec
  // "inf": Infinity          ✗ INVALID — Infinity not in JSON spec
}
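Since the grammar does not fix a numeric precision, what happens to a number depends on the parser. The following Python sketch illustrates two common consequences, assuming a consumer that stores every number as an IEEE 754 double; the variable names are illustrative.

```python
import json
from decimal import Decimal

big = 2**53 + 1   # 9007199254740993, one past the exact-integer range of a double

# Python keeps integers exact through a round trip...
assert json.loads(json.dumps(big)) == big

# ...but a consumer that stores numbers as doubles silently loses the last digit.
print(int(float(big)))   # → 9007199254740992

# Fractional values become binary floats by default; parse_float lets
# you keep the decimal digits exactly as they were written.
price = json.loads('{"amount": 0.1}', parse_float=Decimal)
print(price["amount"] + Decimal("0.2"))   # → 0.3
```

A common workaround for large identifiers is to transmit them as strings, so that no parser is tempted to round them.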

JSON in AI & LLMs

JSON is the invisible infrastructure of most AI systems. It is the format for API requests and responses, model configurations, evaluation datasets, agent tool calls, and structured output modes. Understanding JSON well goes a long way toward understanding how AI systems are built and communicate.

The OpenAI Chat Completions API

Calls to the OpenAI, Anthropic, and Google AI HTTP APIs send and receive JSON. Here is a representative Chat Completions request and an abbreviated response:

api-request.json
JSON
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is JSON?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}

api-response.json
JSON
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "JSON is a..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 96
  }
}
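Once the response arrives, reading it is ordinary dictionary and list access. The sketch below parses the abbreviated response shown above with Python's `json` module; it makes no network call, and the variable names are illustrative.

```python
import json

# Abbreviated response body, copied from the example above.
response_body = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [{
    "message": {"role": "assistant", "content": "JSON is a..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 28, "completion_tokens": 96}
}
"""

response = json.loads(response_body)

# "choices" is an array of objects; take the first one.
first_choice = response["choices"][0]
reply = first_choice["message"]["content"]

usage = response["usage"]
total_tokens = usage["prompt_tokens"] + usage["completion_tokens"]

print(reply)                          # → JSON is a...
print(first_choice["finish_reason"])  # → stop
print(total_tokens)                   # → 124
```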

JSON Mode & Structured Output

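Modern LLMs support a JSON mode: a setting that constrains the model to emit syntactically valid JSON. It does not by itself guarantee a particular schema, and output can still be cut off if the token limit is reached, so consumers should validate what they receive. OpenAI's JSON mode also expects the prompt itself to ask for JSON. This is critical for building AI agents that need to interface with other systems, databases, or APIs.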

json-mode-request.json
JSON
// Request structured JSON output from GPT-4o
{
  "model": "gpt-4o",
  "response_format": { "type": "json_object" },
  "messages": [{
    "role": "user",
    "content": "Extract name, age, and email as JSON from: 'Hi I am Alice, 32, [email protected]'"
  }]
}

// Typical response (valid JSON, provided generation is not cut off):
// { "name": "Alice", "age": 32, "email": "[email protected]" }
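Even with JSON mode enabled, it is prudent to treat model output as untrusted input. The following is a minimal sketch of a defensive parser for the extraction task above, using only the standard library; `parse_extraction`, `REQUIRED_FIELDS`, and the placeholder address `alice@example.com` are assumptions made for illustration.

```python
import json

# Expected fields and their Python types (hypothetical schema).
REQUIRED_FIELDS = {"name": str, "age": int, "email": str}

def parse_extraction(model_output: str) -> dict:
    """Parse model output and check it has the expected fields and types."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")

    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(
                f"Field {field!r} is missing or not {expected_type.__name__}"
            )
    return data

good = '{"name": "Alice", "age": 32, "email": "alice@example.com"}'
truncated = '{"name": "Alice", "age": 32, "em'   # e.g. cut off by a token limit

print(parse_extraction(good)["age"])   # → 32
try:
    parse_extraction(truncated)
except ValueError as exc:
    print("rejected:", exc)
```

A hand-rolled check like this is enough for a handful of fields; for larger schemas a validation library is usually the better choice.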

Hugging Face config.json

Models on the Hugging Face Hub that are built with the Transformers library store their configuration in a config.json file: a JSON document that defines the model's architecture, vocabulary size, layer counts, and attention parameters.

config.json (Llama-style)
JSON
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "hidden_size": 4096,
  "intermediate_size": 11008,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "vocab_size": 32000,
  "max_position_embeddings": 4096,
  "torch_dtype": "float16"
}
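Because the configuration is plain JSON, any tool can read it without loading the model. As a rough illustration, the sketch below parses the config shown above and derives two quantities from it, assuming the usual convention that the hidden dimension is split evenly across attention heads.

```python
import json

# The Llama-style config from the example above.
config_text = """
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "hidden_size": 4096,
  "intermediate_size": 11008,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "vocab_size": 32000,
  "max_position_embeddings": 4096,
  "torch_dtype": "float16"
}
"""

config = json.loads(config_text)

# Per-head dimension, assuming hidden_size is divided evenly across heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Size of the token embedding matrix (vocab_size x hidden_size).
embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim)           # → 128
print(embedding_params)   # → 131072000
```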

Python: Parse, Build & Validate

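Python's built-in json module handles all standard JSON operations. For large-scale or high-performance scenarios, the third-party orjson library is considerably faster; its own benchmarks report roughly 10× faster serialization and about 2× faster parsing than the standard library, though results vary with the data.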

Parsing JSON (Reading)

parse_json.py
Python
import json

# ── From a string ─────────────────────────────────────────────
json_str = '{"model": "gpt-4o", "temperature": 0.7, "active": true}'
data = json.loads(json_str)       # loads = load from String
print(data["model"])               # → gpt-4o
print(type(data["temperature"]))   # → <class 'float'>
print(type(data["active"]))         # → <class 'bool'>

# ── From a file ───────────────────────────────────────────────
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)           # load = load from File

# ── Type mapping: JSON → Python ───────────────────────────────
# JSON object    → dict
# JSON array     → list
# JSON string    → str
# JSON number    → int (no fraction or exponent) or float (otherwise)
# JSON true      → True
# JSON false     → False
# JSON null      → None

Serializing JSON (Writing)

serialize_json.py
Python
import json

data = {
    "model": "gpt-4o",
    "temperature": 0.7,
    "messages": [{"role": "user", "content": "Hello"}],
    "active": True,
    "notes": None
}

# ── To a string ───────────────────────────────────────────────
compact = json.dumps(data)
# → '{"model": "gpt-4o", "temperature": 0.7, ...}'

pretty = json.dumps(data, indent=2, ensure_ascii=False)
# → nicely indented, Unicode preserved

sorted_keys = json.dumps(data, sort_keys=True)
# → keys in alphabetical order (good for diffs)

# ── To a file ─────────────────────────────────────────────────
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# ── Python → JSON type mapping ────────────────────────────────
# dict       → JSON object
# list/tuple → JSON array
# str        → JSON string
# int/float  → JSON number
# True       → true
# False      → false
# None       → null

JSON Schema Validation with Pydantic

validate_with_pydantic.py
Python
from pydantic import BaseModel, ValidationError
from typing import Optional
import json

# Define expected schema as a Pydantic model
class LLMConfig(BaseModel):
    model: str
    temperature: float = 0.7
    max_tokens: int
    stream: bool = False
    system_prompt: Optional[str] = None

# Validate JSON from an API or file
raw_json = '{"model": "gpt-4o", "max_tokens": 1024}'

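try:
    config = LLMConfig(**json.loads(raw_json))
    print(config.model)            # → gpt-4o
    print(config.temperature)       # → 0.7 (default)
except json.JSONDecodeError as e:
    # The text was not JSON at all
    print(f"Invalid JSON: {e}")
except ValidationError as e:
    # Valid JSON, but it does not match the LLMConfig schema
    print(f"Schema validation failed: {e}")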

# This pattern is the foundation of structured LLM output —
# force the model to output JSON, then validate with Pydantic.

JSON vs XML vs JSONL

Property	JSON	XML	JSONL
Human readable	✓ Very readable	⚠ Verbose	✓ Line-by-line
Verbosity	Minimal	High (opening + closing tags)	Minimal
Comments	✗ Not supported	✓ Supported	✗ Not supported
Streaming	⚠ Usually parsed whole (streaming parsers exist)	✓ SAX parser	✓ Line-by-line
Schema standard	JSON Schema (draft)	XSD (W3C standard)	Per-line JSON Schema
Namespaces	✗ Not supported	✓ Full namespace support	✗ Not supported
AI / LLM usage	✓ APIs, configs, output	✗ Legacy, rarely used	✓ Training datasets
File size	Small	Large (2–3× JSON)	Small
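The "line-by-line" entries for JSONL in the table can be illustrated in a few lines of Python. This sketch writes and reads JSONL using an in-memory buffer in place of a real file; the records are sample data.

```python
import io
import json

records = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

# Write: one compact JSON document per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read: each line is parsed independently of the others. Here the results
# are collected into a list; a real pipeline could instead process one
# record at a time and never hold the whole file in memory.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

assert loaded == records
print(len(loaded))   # → 2
```

This works because `json.dumps` never emits a raw newline inside a string (it writes `\n` instead), so each record is guaranteed to occupy exactly one line.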

Valid vs Invalid JSON Examples

✓ Valid JSON

{
  "name": "Alice",
  "age": 30,
  "active": true,
  "score": null
}

Well-formed object with 4 different value types

✗ Invalid — Single Quotes

{
  'name': 'Alice',
  'age': 30
}

JSON requires double quotes for all strings and keys

✓ Valid — Nested Structures

{
  "user": {
    "id": 1,
    "tags": ["admin", "user"]
  }
}

The grammar allows objects and arrays to nest to any depth, though parsers may impose their own limits

✗ Invalid — Trailing Comma

{
  "name": "Alice",
  "age": 30,
}

The trailing comma after the last member is invalid in JSON

✓ Valid — Primitive Root

"just a string"

42

true

null

Per RFC 8259, any JSON value is a valid JSON text — not just objects

✗ Invalid — Comments

{
  // This is a comment
  "name": "Alice",
  /* block comment */
  "age": 30
}

JSON has no comment syntax. Use JSON5 or JSONC if you need comments.
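The examples above can be checked mechanically. Below is a small sketch that runs similar inputs through Python's standard `json` parser; `is_valid_json` is a hypothetical helper, and the samples are condensed versions of the cards above.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if text parses with Python's json module."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

samples = {
    "object":         '{"name": "Alice", "age": 30, "active": true, "score": null}',
    "single quotes":  "{'name': 'Alice', 'age': 30}",
    "trailing comma": '{"name": "Alice", "age": 30,}',
    "comment":        '{ // note\n "name": "Alice" }',
    "primitive root": "42",
}

results = {label: is_valid_json(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

One caveat: Python's parser is slightly more lenient than the specification, since it accepts the bare tokens `NaN` and `Infinity` by default, so a `True` result here is not a full conformance check.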

Common Mistakes

Single quotes instead of double quotes — JSON requires double quotes for all strings and all object keys. Single quotes are JavaScript syntax, not JSON syntax.
Trailing commas — A comma after the last element of an object or array is valid JavaScript but invalid JSON. This trips up many developers copying JS code into a JSON file.
Comments — JSON has no comment syntax. Using // comment or /* comment */ makes a file invalid JSON. Use a README or external documentation instead.
Unquoted keys — JavaScript allows { name: "Alice" } but JSON requires { "name": "Alice" }. All keys must be quoted strings.
Using NaN or Infinity — These JavaScript number values have no representation in JSON. Use null as a sentinel value or handle them before serialization.
Dates as raw Date objects — JSON has no date type. Always serialize dates as ISO 8601 strings: "2025-03-21T10:00:00Z".
Forgetting ensure_ascii=False — Python's json.dumps() escapes non-ASCII characters by default. Add ensure_ascii=False to preserve Unicode characters as-is.
Assuming key order is preserved — The JSON spec says object member ordering is not significant. Most modern parsers do preserve insertion order, but you should never rely on it.
Using JSON for binary data — JSON is a text format. Binary data (images, audio) must be base64-encoded before embedding in JSON, which increases size by ~33%. Consider a separate binary channel instead.
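Two of the items above, NaN handling and `ensure_ascii`, are easy to demonstrate with the standard library. The sketch below uses made-up metric values; replacing non-finite numbers with null is one possible policy, not the only one.

```python
import json
import math

metrics = {"loss": float("nan"), "label": "café"}

# Default: Python writes the bare token NaN, which is not valid JSON.
print(json.dumps(metrics))   # → {"loss": NaN, "label": "caf\u00e9"}

# allow_nan=False makes the problem visible at serialization time.
try:
    json.dumps(metrics, allow_nan=False)
except ValueError as exc:
    print("refused:", exc)

# One option: replace non-finite floats with null before serializing.
cleaned = {
    key: None if isinstance(value, float) and not math.isfinite(value) else value
    for key, value in metrics.items()
}
print(json.dumps(cleaned, ensure_ascii=False))   # → {"loss": null, "label": "café"}
```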

Frequently Asked Questions

Does JSON support comments?

No. Comments were deliberately excluded from JSON by Douglas Crockford. He later explained that he removed them because people were using comments to hold parsing directives, a practice that would have destroyed interoperability. If you need comments in config files, use JSONC (JSON with Comments, used by VS Code) or JSON5. For everything else, keep documentation in a separate file.

Is JSON the same as a JavaScript object literal?

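No. JSON is a restricted subset of JavaScript literal syntax. Key differences: JSON requires double quotes on all keys and strings (JS allows unquoted keys and single quotes), JSON forbids trailing commas (JS allows them), JSON forbids comments (JS allows them), and JSON forbids undefined as a value. Since ES2019, any valid JSON text is also a valid JavaScript expression; before that, unescaped U+2028 and U+2029 inside strings were the exception. The reverse does not hold: many JavaScript object literals are not valid JSON.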

What's the difference between RFC 8259 and ECMA-404?

Both are authoritative standards for JSON, and they describe the same grammar. The key difference is scope: ECMA-404 is a pure grammar specification — it defines only what is syntactically valid JSON. RFC 8259 adds interoperability guidance on top: it mandates UTF-8 for networked JSON, recommends against duplicate keys, and addresses security concerns. For building real systems, follow RFC 8259.

How do I handle large JSON files in Python without running out of memory?

For large JSON files, use a streaming parser like ijson — it lets you parse incrementally without loading the whole file into memory. Alternatively, consider whether your data should be in JSONL format instead, which is natively streamable line-by-line.

Why is JSON preferred over XML in modern APIs?

JSON is lighter (no opening/closing tags), maps directly to data structures in most programming languages, is generally faster to parse, and is more readable at a glance. XML remains useful for documents with mixed content (text and tags), namespace requirements, or rich schema validation, but for data-only APIs JSON has become the default choice.
