The AI Data Formats Reference

The Three Languages AI Speaks

JSON feeds data. JSONL trains models. Markdown communicates results. Master all three and you'll understand the complete lifecycle of modern AI.

Layer 1 · Data: JSON, the universal data structure
Layer 2 · Training: JSONL, the fine-tuning format
Layer 3 · Communication: Markdown, the LLM output language

JSON standard: RFC 8259
JSONL spec: 3 rules
Markdown born: 2004
Token savings with Markdown: up to 70% versus HTML

How all three formats work together

Most LLM workflows use all three formats at different stages. Here's the complete picture.

JSON (raw data) ──▶ JSONL (training set) ──▶ GPU (fine-tuning) ──▶ LLM (model) ──▶ Markdown (output)

JSON: The Data Layer
Most API responses, config files, and datasets are exchanged as JSON. It is the common format for structured data across programming languages, standardized as RFC 8259 and ECMA-404.
JSONL: The Training Layer
When you fine-tune GPT, Mistral, or Llama models, you typically provide a .jsonl file with one JSON object per line. The fine-tuning services from OpenAI and Google both accept training data in this format.
Markdown: The Output Layer
ChatGPT, Claude, and Gemini all default to Markdown output. It uses 50–70% fewer tokens than equivalent HTML, and it stays readable as plain text even when it is not rendered.

Syntax at a glance

JSON: Data Types
"hello"         String
42              Number
true / false    Boolean
null            Null
{ "k": v }      Object
[ 1, 2, 3 ]     Array

JSONL: Format Rules
UTF-8           Encoding required
1 obj / line    One JSON value per line
\n              Line terminator
.jsonl          File extension
no [ ]          No array wrapper
streamable      Line-by-line processing

Markdown: Key Syntax
# H1 ## H2      Headings
**bold**        Bold text
*italic*        Italic text
`code`          Inline code
- item          List item
[text](url)     Link

JavaScript Object Notation

JSON is a lightweight, text-based, language-independent data interchange format, standardized as RFC 8259 (IETF) and ECMA-404. Douglas Crockford specified and popularized it in the early 2000s.

Complete JSON Example

model-config.json
{
  "model": "gpt-4o",
  "version": 2024,
  "active": true,
  "temperature": 0.7,
  "system_prompt": null,
  "capabilities": [
    "text", "vision", "code"
  ],
  "limits": {
    "max_tokens": 128000,
    "rate_limit": 10000
  }
}

The 6 Data Types

String: "hello AI"
Number: 42 / 3.14
Boolean: true / false
Null: null
Object: { "k": v }
Array: [ 1, 2, 3 ]
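
As a rough illustration of how these six types surface in a program, the sketch below parses a small document with Python's standard `json` module and records the Python type each value becomes. The sample document and the `python_type_names` helper are made up for this example.

```python
import json

# Hypothetical sample document that touches all six JSON types
# (Number appears twice: once as an integer, once as a decimal).
SAMPLE = (
    '{"s": "hello AI", "n": 42, "f": 3.14, "b": true, '
    '"z": null, "o": {"k": 1}, "a": [1, 2, 3]}'
)


def python_type_names(text: str) -> dict[str, str]:
    """Parse a JSON object and map each key to the name of its Python type."""
    parsed = json.loads(text)
    return {key: type(value).__name__ for key, value in parsed.items()}


if __name__ == "__main__":
    for key, name in python_type_names(SAMPLE).items():
        print(f"{key}: {name}")
```

Note that JSON has a single Number type, while Python splits it into `int` and `float` depending on how the value is written.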

Standards

IETF RFC 8259: Current Internet Standard
ECMA-404: Grammar Specification
ISO/IEC 21778: International Standard (2017)

JSON in AI & Machine Learning

API Responses
The OpenAI, Anthropic, and Google AI APIs all accept and return JSON. Requests to and responses from the ChatGPT and Claude APIs are JSON documents.
Model Config
Model architecture, hyperparameters, training configuration — all stored as JSON files (config.json in HuggingFace).
Structured Output
JSON mode and related structured-output features in GPT-4o, Claude, and Gemini constrain the model to return valid JSON, which is critical for building agents and pipelines.
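
Even with structured output enabled, it is usually worth checking a reply before passing it downstream. The following is a minimal sketch of that check using only the standard library; it makes no API calls, and the `REQUIRED_KEYS` schema and the `parse_model_reply` helper are assumptions invented for illustration, not part of any vendor SDK.

```python
import json

# Hypothetical schema for this sketch: the keys we expect in every reply.
REQUIRED_KEYS = {"answer", "confidence"}


def parse_model_reply(raw: str) -> dict:
    """Parse a model reply as JSON and check that the expected keys exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data


if __name__ == "__main__":
    print(parse_model_reply('{"answer": "42", "confidence": 0.9}'))
```

In a real pipeline a schema library could replace the hand-written key check; the point here is only that valid JSON and correct JSON are separate questions.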

JSON Lines: The AI Training Format

One JSON object per line. No wrappers. Pure streaming. This is the format that powers fine-tuning for GPT, Llama, Mistral, and virtually every major LLM.

The 3 Official Rules (jsonlines.org)

  • 01 UTF-8 Encoding: All JSONL files must be UTF-8 encoded. As with JSON (RFC 8259), a byte order mark (BOM) must not be included.
  • 02 Each Line is a Valid JSON Value: Most commonly objects or arrays, though any JSON value (including null) is valid. A blank line is not.
  • 03 Line Terminator is \n: \r\n is also supported, because surrounding whitespace is ignored when parsing JSON values. A final newline is recommended but not required.
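
The three rules above can be sketched as a small checker. This is an illustrative approximation rather than a reference validator: the `validate_jsonl` name is made up, and Python's `json` parser is slightly more lenient than RFC 8259 (it accepts `NaN` and `Infinity`, for instance).

```python
import json


def validate_jsonl(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the input passed."""
    problems = []

    # Rule 1: UTF-8 encoding, with no byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        data = data[3:]
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8: {exc}"]

    # Rule 3: lines end with \n. A trailing \r (from \r\n) is plain
    # whitespace to the JSON parser, so it needs no special handling.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final newline is allowed and adds no extra line

    # Rule 2: every line must be exactly one valid JSON value.
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number}: blank line")
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: {exc.msg}")
    return problems


if __name__ == "__main__":
    print(validate_jsonl(b'{"a": 1}\n\n{"b": 2}\n'))
```

Taking bytes rather than text is deliberate: the encoding and BOM checks cannot be done once a file has already been decoded.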

OpenAI Fine-Tuning Format

training-data.jsonl
{"messages":[{"role":"system","content":"You are a helpful AI."},{"role":"user","content":"What is JSON?"},{"role":"assistant","content":"JSON is a lightweight data format."}]}
{"messages":[{"role":"user","content":"Explain JSONL"},{"role":"assistant","content":"JSONL is one JSON per line."}]}
{"messages":[{"role":"user","content":"What is Markdown?"},{"role":"assistant","content":"A plain text formatting language."}]}

Each line is one complete, independent training example, and the file continues that way, one example per line. A real JSONL file cannot contain comments or blank lines between records.
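
Before uploading a file like this, a quick structural check can catch malformed examples early. The sketch below infers its checks from the examples shown here (a `messages` array of objects with `role` and `content`); it is not OpenAI's official validation logic, and the `check_example` helper and `ALLOWED_ROLES` set are assumptions for illustration. The real API may accept additional roles and fields.

```python
import json

# Roles seen in the examples above; the real API may allow others.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_example(line: str) -> list[str]:
    """Return problems found in one training line; empty means it looks fine."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ['expected an object with a non-empty "messages" array']

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: content must be a string")

    # A chat example with no assistant turn gives the model nothing to imitate.
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


if __name__ == "__main__":
    print(check_example('{"messages":[{"role":"bot","content":"hi"}]}'))
```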

Python: Read & Write JSONL

jsonl_utils.py
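import json

# ── Read JSONL (assumes data.jsonl already exists) ──
with open('data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # tolerate stray blank lines instead of crashing
        obj = json.loads(line)
        print(obj)  # process each record

# ── Write JSONL ──────────────────
records = [
    {"role": "user", "text": "Hello"},
    {"role": "assistant", "text": "Hi!"}
]
with open('out.jsonl', 'w', encoding='utf-8') as f:
    for r in records:
        # ensure_ascii=False keeps non-ASCII text readable in the UTF-8 file
        f.write(json.dumps(r, ensure_ascii=False) + '\n')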

Who uses JSONL?

OpenAI
Fine-tuning for GPT-3.5 and GPT-4 family models uses JSONL with the messages format (system/user/assistant roles).
Google Vertex AI
Gemini and PaLM fine-tuning datasets are provided as JSONL files uploaded to Google Cloud Storage.
Hugging Face
The datasets library natively reads .jsonl files. Many community datasets are distributed in JSONL format.
Mistral / Llama
Open-source fine-tuning tools such as Axolotl, LLaMA-Factory, and Unsloth accept JSONL as a dataset format.

Markdown: The LLM Output Language

Created by John Gruber in 2004. Today it is the default output format of the major AI assistants: ChatGPT, Claude, Gemini, and Grok all output Markdown by default.

History Timeline

2004
Markdown Created
John Gruber (with Aaron Swartz) releases Markdown and Markdown.pl on Daring Fireball.
2008
Stack Overflow Adopts
Jeff Atwood brings Markdown to millions of developers via Stack Overflow.
2009
GitHub README Standard
GitHub makes Markdown the standard for README files, creating GitHub Flavored Markdown (GFM).
2014
CommonMark Born
Jeff Atwood, John MacFarlane and others publish CommonMark — an unambiguous, testable specification.
2017
GFM Formally Specified
GitHub releases the formal GFM specification based on CommonMark with tables, task lists, and strikethrough.
2020+
The LLM Era
AI language models default to Markdown output. Notion, Obsidian, Linear all adopt Markdown as primary input format.

Syntax Reference

example.md
# Heading 1
## Heading 2
### Heading 3

Plain paragraph text here.

**bold text**  *italic text*
`inline code`

```python
# fenced code block
print("hello world")
```

- Unordered list item
- Another item
  - Nested item

1. Ordered list
2. Second item

[Link text](https://example.com)

> Blockquote text here

<!-- Tables are a GFM extension -->
| Col 1 | Col 2 |
|-------|-------|
| data  | data  |

Why LLMs love Markdown
Markdown can reduce token consumption by 50–70% compared to equivalent HTML. Its hierarchical structure (headings, lists) gives LLMs a natural roadmap for organizing responses, and it is already abundant in their training data from GitHub, Reddit, and Stack Overflow.
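
To get a feel for where the savings come from, the sketch below compares the same small piece of content written as Markdown and as HTML. It counts characters, which is only a crude proxy for tokens: actual token savings depend on the tokenizer and on the content, so this is not a measurement of the 50–70% figure. The sample strings and the `size_saving` helper are invented for the example.

```python
# The same content, written twice. Both samples are made up for this sketch.
MARKDOWN = """# Results

- **Accuracy:** 92%
- **Latency:** 120 ms

See the [full report](https://example.com).
"""

HTML = """<h1>Results</h1>
<ul>
<li><strong>Accuracy:</strong> 92%</li>
<li><strong>Latency:</strong> 120 ms</li>
</ul>
<p>See the <a href="https://example.com">full report</a>.</p>
"""


def size_saving(markdown: str, html: str) -> float:
    """Fraction of characters saved by the Markdown version relative to HTML."""
    return 1 - len(markdown) / len(html)


if __name__ == "__main__":
    print(f"Markdown: {len(MARKDOWN)} chars, HTML: {len(HTML)} chars")
    print(f"Saving: {size_saving(MARKDOWN, HTML):.0%}")
```

Most of the difference comes from closing tags and attribute syntax, which Markdown replaces with one or two punctuation characters.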

Markdown Flavors

| Flavor | Creator | Year | Extra Features | Used By |
|--------|---------|------|----------------|---------|
| Original Markdown | John Gruber | 2004 | | Daring Fireball |
| CommonMark | MacFarlane, Atwood | 2014 | Unambiguous spec, test suite | GitHub, Reddit, Discourse |
| GFM | GitHub | 2017 | Tables, task lists, strikethrough | GitHub, GitLab |
| MDX | Community | 2018 | JSX components in Markdown | Next.js, Gatsby docs |

How AI Uses All Three Formats

From data collection to fine-tuning to deployment — JSON, JSONL, and Markdown each play a critical role in the modern LLM lifecycle.

Complete LLM Lifecycle

JSON (raw data collection) ──▶ Process (clean & format) ──▶ JSONL (training dataset) ──▶ Training (fine-tune model) ──▶ Deploy (LLM API) ──▶ Markdown (AI response)

Format Comparison

| Property | JSON | JSONL | Markdown |
|----------|------|-------|----------|
| Primary Use | API data, configs | Training datasets | Formatted text |
| Human Readable | ✓ With formatting | ✓ Line-by-line | ✓✓ Native |
| Streamable | Needs full parse | ✓✓ Line-by-line | ✓ Paragraph-by-paragraph |
| Token Efficiency | Medium | High (no wrappers) | Very High |
| Specification | RFC 8259, ECMA-404 | jsonlines.org (informal spec) | CommonMark, GFM |
| File Extension | .json | .jsonl | .md, .markdown |
| MIME Type | application/json | application/jsonl (de facto) | text/markdown |

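
The "Streamable" row is easiest to see in code. In this sketch, converting a JSON array requires parsing the whole document first, while the JSONL reader yields one record at a time. The function names `json_array_to_jsonl` and `iter_jsonl` are hypothetical, chosen for this example.

```python
import io
import json
from typing import Iterator, TextIO


def json_array_to_jsonl(array_text: str, out: TextIO) -> int:
    """Write each element of a JSON array as one line; return the count."""
    items = json.loads(array_text)  # the whole array must be parsed up front
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")
    for item in items:
        out.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


def iter_jsonl(source: TextIO) -> Iterator[object]:
    """Yield one parsed value per non-blank line, without loading everything."""
    for line in source:
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    buffer = io.StringIO()
    json_array_to_jsonl('[{"id": 1}, {"id": 2}]', buffer)
    buffer.seek(0)
    for record in iter_jsonl(buffer):
        print(record)
```

Because `iter_jsonl` is a generator, memory use stays flat no matter how large the file is, which is the practical reason training datasets favor JSONL.
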
🔌 RAG Pipelines
Retrieval-Augmented Generation uses JSON to store vector metadata, JSONL for document chunks, and Markdown to format the final prompt context sent to the LLM.
🤖 AI Agents
Agents receive tool results as JSON, communicate with humans using Markdown, and store memory/history logs as JSONL for future training.
📊 Evals & Benchmarks
Evaluation benchmarks such as HumanEval are distributed as JSONL. Results are commonly stored in JSON, and reports are written in Markdown.
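
As one illustration of the three formats meeting in a single workflow, the sketch below walks through the RAG case: document chunks stored as JSONL, a toy retrieval step, and a Markdown prompt assembled from the hits. Everything here is hypothetical, including the chunk fields (`id`, `source`, `text`) and the word-overlap `retrieve` function, which stands in for a real vector search.

```python
import json

# Hypothetical JSONL chunk store: one document chunk per line.
CHUNKS_JSONL = (
    '{"id": "doc1-0", "source": "guide.md", "text": "JSONL stores one JSON value per line."}\n'
    '{"id": "doc1-1", "source": "guide.md", "text": "Markdown is a plain-text formatting syntax."}\n'
)


def load_chunks(jsonl_text: str) -> list[dict]:
    """Parse every non-blank line of a JSONL string into a dict."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def retrieve(chunks: list[dict], query: str) -> list[dict]:
    """Toy retrieval: keep chunks that share a word with the query."""
    words = set(query.lower().split())
    return [c for c in chunks if words & set(c["text"].lower().split())]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks and the question as a Markdown prompt."""
    lines = ["## Context", ""]
    for chunk in chunks:
        lines.append(f"- ({chunk['source']}) {chunk['text']}")
    lines += ["", "## Question", "", question]
    return "\n".join(lines)


if __name__ == "__main__":
    hits = retrieve(load_chunks(CHUNKS_JSONL), "Explain JSONL")
    print(build_prompt("Explain JSONL", hits))
```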

Articles & Guides

Deep dives on JSON, JSONL, Markdown, and how they power modern AI systems.

OpenAI Fine-Tuning with JSONL: The Complete Guide
From zero to a fine-tuned GPT-4o model. We cover the exact JSONL format OpenAI expects, common mistakes, and how to validate your training data.
8 min read · JSONL · Fine-Tuning

Why ChatGPT and Claude Output Markdown by Default
Markdown saves 50–70% of tokens versus HTML. We explain why every major LLM defaults to Markdown and what that means for your applications.
5 min read · Markdown · AI

JSON vs JSONL: When to Use Which Format
JSON is for a single structured document. JSONL is for streaming collections. Here's the definitive guide on when each format is the right tool.
6 min read · JSON · JSONL

CommonMark vs GitHub Flavored Markdown: Key Differences
GFM adds tables, task lists, and strikethrough on top of CommonMark. Here's exactly what each spec adds, with examples and a compatibility matrix.
7 min read · Markdown · Spec

Building a RAG Pipeline: JSON for Metadata, JSONL for Chunks
How to structure your document ingestion pipeline using JSON for vector metadata and JSONL for document chunks, the format every major vector DB expects.
10 min read · JSONL · RAG

JSON Mode in GPT-4o and Claude: Forcing Structured Output
Structured JSON output is the key to building reliable AI agents. We cover JSON mode, function calling, and how to use Pydantic to validate AI responses.
9 min read · JSON · AI Agents