The AI Data Formats Reference

The Three Languages AI Speaks

JSON feeds data. JSONL trains models. Markdown communicates results. Master all three and you'll understand the complete lifecycle of modern AI.

Layer 1 · Data

{ JSON }

The universal data structure

Layer 2 · Training

{ JSONL }

The fine-tuning format

Layer 3 · Communication

# Markdown

The LLM output language

RFC 8259	JSON Standard
3 Rules	JSONL Spec
2004	Markdown Born
↓ 70%	Token Savings w/ MD

The AI Pipeline

How all three formats work together

Every large language model uses all three formats at different stages. Here's the complete picture.

🗃️ JSON (Raw Data) ──▶ 📄 JSONL (Training Set) ──▶ 🤖 GPU (Fine-tuning) ──▶ 🧠 LLM (Model) ──▶ 📝 Markdown (Output)

{ }

JSON — The Data Layer

Most API responses, many config files, and many datasets start as JSON. It's the common language for structured data across programming languages, standardized as RFC 8259 and ECMA-404.

⋮

JSONL — The Training Layer

When you fine-tune GPT, Mistral, or Llama models, you provide a .jsonl file with one JSON object per line. It's the upload format that fine-tuning services such as OpenAI's and Google Vertex AI's expect.

#

Markdown — The Output Layer

ChatGPT, Claude, and Gemini all default to Markdown output. It typically needs 50–70% fewer tokens than equivalent HTML, and it stays readable as plain text, for people and programs alike, even before it is rendered.

Quick Reference

Syntax at a glance

JSON — Data Types

Syntax	Type
"hello"	String
42	Number
true / false	Boolean
null	Null
{ "k": v }	Object
[ 1, 2, 3 ]	Array

JSONL — Format Rules

Rule	Meaning
UTF-8	Encoding required
1 obj / line	One JSON value per line
\n	Line terminator
.jsonl	File extension
no [ ]	No array wrapper
streamable	Line-by-line processing

Markdown — Key Syntax

Syntax	Meaning
# H1 ## H2	Headings
**bold**	Bold text
*italic*	Italic text
`code`	Inline code
- item	List item
[text](url)	Link

JSON Guide

JavaScript Object Notation

A lightweight, text-based, language-independent data interchange format. Standardized as RFC 8259 (IETF) and ECMA-404. Specified and popularized by Douglas Crockford in the early 2000s.

Complete JSON Example

model-config.json
{
  "model": "gpt-4o",
  "version": 2024,
  "active": true,
  "temperature": 0.7,
  "system_prompt": null,
  "capabilities": [
    "text", "vision", "code"
  ],
  "limits": {
    "max_tokens": 128000,
    "rate_limit": 10000
  }
}

The 6 Data Types

String

"hello AI"

Number

42 / 3.14

Boolean

true / false

Null

null

Object

{ "k": v }

Array

[ 1, 2, 3 ]
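
As a rough illustration of the six types, the sketch below parses a small document with Python's standard json module and records which Python type each value becomes. The sample document and its key names are made up for this example.

```python
import json

# Hypothetical document that touches all six JSON data types
# (Number appears twice: once as an integer, once as a decimal).
text = '{"s": "hello AI", "n": 42, "f": 3.14, "b": true, "z": null, "o": {"k": 1}, "a": [1, 2, 3]}'

parsed = json.loads(text)

# Map each key to the name of the Python type that json.loads produced.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)
# {'s': 'str', 'n': 'int', 'f': 'float', 'b': 'bool', 'z': 'NoneType', 'o': 'dict', 'a': 'list'}
```

Note that JSON has a single Number type; the split into int and float is a choice made by the Python parser.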

Standards

IETF RFC 8259	Current Internet Standard
ECMA-404	Grammar Specification
ISO/IEC 21778	International Standard (2017)

JSON in AI & Machine Learning

API Responses

The OpenAI, Anthropic, and Google AI APIs all accept JSON requests and return JSON responses. The APIs behind ChatGPT and Claude are JSON in, JSON out.

Model Config

Model architecture, hyperparameters, training configuration — all stored as JSON files (config.json in HuggingFace).

Structured Output

JSON mode and structured-output features in GPT-4o, Claude, and Gemini constrain the model to return valid JSON, which is critical for building agents and pipelines.
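
Even with structured output enabled, pipelines usually re-check what the model returned before acting on it. The following is a minimal sketch of that check using only the standard library; the function name parse_model_json, the reply text, and the required keys are all hypothetical and do not come from any vendor SDK.

```python
import json

def parse_model_json(raw: str, required_keys: set[str]) -> dict:
    """Parse a model reply as JSON and check that the expected keys exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data

# Hypothetical reply text, standing in for what a model might return.
reply = '{"intent": "refund", "confidence": 0.92}'
result = parse_model_json(reply, {"intent", "confidence"})
print(result["intent"])  # refund
```

A schema library would give richer checks (types, ranges); this sketch only confirms that the reply parses and carries the expected keys.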

JSONL Guide

JSON Lines
— The AI Training Format

One JSON object per line. No wrappers. Pure streaming. This is the format that powers fine-tuning for GPT, Llama, Mistral, and virtually every major LLM.

The 3 Official Rules (jsonlines.org)

01 UTF-8 Encoding — All JSONL files must be UTF-8 encoded. No byte order mark (BOM) allowed, per RFC 8259.
02 Each Line is a Valid JSON Value — Most commonly objects or arrays, though any JSON value (including null) is allowed. A blank line is not valid.
03 Line Terminator is \n — \r\n is also supported. A final newline is recommended but not required.
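
The three rules can be sketched as a small validator. This is an illustrative implementation under the rules as stated above, not a reference implementation of jsonlines.org; the function name validate_jsonl is an assumption made for this example.

```python
import json

def validate_jsonl(raw: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    # Rule 1: UTF-8 with no byte order mark.
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8: {exc}"]
    # Rule 3: lines end with \n. A final newline is optional, so drop the
    # single empty piece that split() leaves after a trailing \n.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    # Rule 2: every line is a valid JSON value; blank lines are rejected.
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number}: blank line")
            continue
        try:
            json.loads(line)  # a trailing \r from \r\n counts as JSON whitespace
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: {exc.msg}")
    return problems

print(validate_jsonl(b'{"a": 1}\nnull\n'))        # []
print(validate_jsonl(b'{"a": 1}\n\n{"b": 2}\n'))  # ['line 2: blank line']
```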

OpenAI Fine-Tuning Format

training-data.jsonl
{"messages":[{"role":"system","content":"You are a helpful AI."},{"role":"user","content":"What is JSON?"},{"role":"assistant","content":"JSON is a lightweight data format."}]}
{"messages":[{"role":"user","content":"Explain JSONL"},{"role":"assistant","content":"JSONL is one JSON per line."}]}
{"messages":[{"role":"user","content":"What is Markdown?"},{"role":"assistant","content":"A plain text formatting language."}]}

Each line is one complete, independent training example. A real .jsonl file contains no comments and no blank lines between records, since neither is valid JSONL.
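
Before uploading a file like this, it can help to check each line's shape locally. The sketch below is one way to do that. Its checks are assumptions inferred from the example above (a non-empty messages list, the three roles shown, string content, and a final assistant turn) and are not OpenAI's official validation rules. The names ALLOWED_ROLES and check_example are hypothetical.

```python
import json

# Roles that appear in the example above (an assumption, not an official list).
ALLOWED_ROLES = {"system", "user", "assistant"}

def check_example(line: str) -> list[str]:
    """Return problems with one training line; an empty list means it looks fine."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ['missing a non-empty "messages" list']
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: content must be a string")
    # Assumed convention: the example should end with the reply to learn.
    if not problems and messages[-1]["role"] != "assistant":
        problems.append("last message should come from the assistant")
    return problems

good = '{"messages":[{"role":"user","content":"Explain JSONL"},{"role":"assistant","content":"JSONL is one JSON per line."}]}'
print(check_example(good))  # []
```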

Python: Read & Write JSONL

jsonl_utils.py
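import json

# ── Read JSONL ──────────────────
# JSONL is UTF-8 by definition, so set the encoding explicitly
# instead of relying on the platform default.
with open('data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)  # trailing \n or \r\n is ignored as whitespace
        print(obj)  # process each

# ── Write JSONL ──────────────────
records = [
    {"role": "user", "text": "Hello"},
    {"role": "assistant", "text": "Hi!"}
]
# newline='\n' stops Windows from translating the terminator to \r\n
with open('out.jsonl', 'w', encoding='utf-8', newline='\n') as f:
    for r in records:
        # json.dumps never emits a raw newline, so each record stays on one line
        f.write(json.dumps(r) + '\n')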

Who uses JSONL?

OpenAI

Fine-tuning for GPT-3.5 and GPT-4 models uses JSONL with the messages format (system/user/assistant roles).

Google Vertex AI

Gemini and PaLM fine-tuning datasets are provided as JSONL files uploaded to Google Cloud Storage.

Hugging Face

The datasets library natively reads .jsonl files. Many community datasets are distributed in JSONL format.

Mistral / Llama

Open-source fine-tuning tools such as Axolotl, LLaMA-Factory, and Unsloth all accept JSONL as a data format.

Markdown Guide

Markdown —
The LLM Output Language

Created by John Gruber in 2004. Today ChatGPT, Claude, Gemini, and Grok all output Markdown by default, making it the de facto output format of major AI systems.

History Timeline

2004

Markdown Created

John Gruber (with Aaron Swartz) releases Markdown and Markdown.pl on Daring Fireball.

2008

Stack Overflow Adopts

Jeff Atwood brings Markdown to millions of developers via Stack Overflow.

2009

GitHub README Standard

GitHub makes Markdown the standard for README files, creating GitHub Flavored Markdown (GFM).

2014

CommonMark Born

Jeff Atwood, John MacFarlane and others publish CommonMark — an unambiguous, testable specification.

2017

GFM Formally Specified

GitHub releases the formal GFM specification based on CommonMark with tables, task lists, and strikethrough.

2020+

The LLM Era

AI language models default to Markdown output. Notion, Obsidian, Linear all adopt Markdown as primary input format.

Syntax Reference

example.md
# Heading 1
## Heading 2
### Heading 3

Plain paragraph text here.

**bold text**  *italic text*
`inline code`

```python
# fenced code block
print("hello world")
```

- Unordered list item
- Another item
  - Nested item

1. Ordered list
2. Second item

[Link text](https://example.com)

> Blockquote text here

Tables (GFM only):

| Col 1 | Col 2 |
|-------|-------|
| data  | data  |

Why LLMs love Markdown

Markdown reduces token consumption by 50–70% compared to HTML. Its hierarchical structure (headings, lists) gives LLMs a natural roadmap for organizing responses — and it's already in their training data from GitHub, Reddit, and Stack Overflow.
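
To get a feel for the size difference, the sketch below writes the same small document in Markdown and in HTML and compares their lengths. Character count is used only as a crude stand-in for tokens; real token counts depend on each model's tokenizer, and the saving will vary with the content. Both sample documents are made up for this example.

```python
# The same content written twice: once as Markdown, once as equivalent HTML.
markdown_doc = """# Results

- **Accuracy:** 0.91
- **Latency:** 120 ms

See the [report](https://example.com).
"""

html_doc = """<h1>Results</h1>
<ul>
<li><strong>Accuracy:</strong> 0.91</li>
<li><strong>Latency:</strong> 120 ms</li>
</ul>
<p>See the <a href="https://example.com">report</a>.</p>
"""

# Character count is only a rough proxy for tokens.
md_chars = len(markdown_doc)
html_chars = len(html_doc)
saving = 1 - md_chars / html_chars
print(f"Markdown: {md_chars} chars, HTML: {html_chars} chars, saving: {saving:.0%}")
```

For an actual token count, the same two strings could be passed through the tokenizer of the model in question.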

Markdown Flavors

Flavor	Creator	Year	Extra Features	Used By
Original Markdown	John Gruber	2004	—	Daring Fireball
CommonMark	MacFarlane, Atwood	2014	Unambiguous spec, test suite	GitHub, Reddit, Discourse
GFM	GitHub	2017	Tables, task lists, strikethrough	GitHub, GitLab
MDX	Community	2018	JSX components in Markdown	Next.js, Gatsby docs

AI & LLMs

How AI Uses All Three Formats

From data collection to fine-tuning to deployment — JSON, JSONL, and Markdown each play a critical role in the modern LLM lifecycle.

Complete LLM Lifecycle

🗃️ JSON (Raw Data Collection) → ⚙️ Process (Clean & Format) → 📄 JSONL (Training Dataset) → 🔥 Training (Fine-Tune Model) → 🧠 Deploy (LLM API) → 📝 Markdown (AI Response)

Format Comparison

Property	JSON	JSONL	Markdown
Primary Use	API data, configs	Training datasets	Formatted text
Human Readable	✓ With formatting	✓ Line-by-line	✓✓ Native
Streamable	Needs full parse	✓✓ Line-by-line	✓ Paragraph-by-paragraph
Token Efficiency	Medium	High (no wrappers)	Very High
Official Standard	RFC 8259, ECMA-404	jsonlines.org (community spec)	CommonMark, GFM
File Extension	.json	.jsonl	.md, .markdown
MIME Type	application/json	application/jsonl (de facto)	text/markdown
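
The "Streamable" row is the practical difference between JSON and JSONL. A minimal sketch, using in-memory text in place of real files: the JSONL records are consumed one at a time through a generator, while the JSON array must be parsed in full before any record is available. The helper name iter_jsonl and the sample records are assumptions for this example.

```python
import io
import json
from typing import Iterable, Iterator

def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record at a time without loading the whole input."""
    for line in lines:
        yield json.loads(line)

# In-memory stand-in for a large file; io.StringIO iterates line by line
# just like an open text file.
jsonl_stream = io.StringIO('{"tokens": 12}\n{"tokens": 30}\n{"tokens": 8}\n')

# Records are consumed one by one, so memory use stays flat as the file grows.
total_tokens = sum(record["tokens"] for record in iter_jsonl(jsonl_stream))
print(total_tokens)  # 50

# The equivalent JSON array has to be parsed in full before any record is usable.
json_array = '[{"tokens": 12}, {"tokens": 30}, {"tokens": 8}]'
all_records = json.loads(json_array)
print(sum(record["tokens"] for record in all_records))  # 50
```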

🔌 RAG Pipelines

Retrieval-Augmented Generation uses JSON to store vector metadata, JSONL for document chunks, and Markdown to format the final prompt context sent to the LLM.
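
A minimal sketch of the last two steps, assuming chunks are already stored as JSONL: each line is parsed into a chunk, and the selected chunks are formatted into a Markdown context block for the prompt. The field names (id, source, text) and the function build_context are hypothetical, and the retrieval step itself is left out.

```python
import io
import json

# Hypothetical document chunks, stored one per line as JSONL.
chunks_jsonl = io.StringIO(
    '{"id": "doc1-0", "source": "guide.md", "text": "JSONL stores one JSON value per line."}\n'
    '{"id": "doc1-1", "source": "guide.md", "text": "Markdown is a plain text formatting language."}\n'
)
chunks = [json.loads(line) for line in chunks_jsonl]

def build_context(selected: list[dict], question: str) -> str:
    """Format retrieved chunks and the question as a Markdown prompt."""
    parts = ["## Context", ""]
    for chunk in selected:
        parts.append(f"- **{chunk['source']}** ({chunk['id']}): {chunk['text']}")
    parts += ["", "## Question", "", question]
    return "\n".join(parts)

# Retrieval (embedding search) is out of scope here, so every chunk is "retrieved".
prompt = build_context(chunks, "What is JSONL?")
print(prompt)
```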

🤖 AI Agents

Agents receive tool results as JSON, communicate with humans using Markdown, and store memory/history logs as JSONL for future training.
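
The sketch below walks through one agent turn under those three roles: a tool result arrives as JSON, the reply to the user is built as Markdown, and both events are appended to a JSONL history log. The tool payload, the event fields, and the helper log_event are all made up for illustration; io.StringIO stands in for a log file opened in append mode.

```python
import io
import json

# Hypothetical tool result, as an agent might receive it.
tool_result_json = '{"tool": "weather", "city": "Oslo", "temp_c": 4}'
tool_result = json.loads(tool_result_json)

# Append-only history log: one JSON object per line.
history_log = io.StringIO()

def log_event(log: io.StringIO, event: dict) -> None:
    """Append one event to the log as a single JSONL line."""
    log.write(json.dumps(event) + "\n")

log_event(history_log, {"role": "tool", "content": tool_result})

# Reply to the human in Markdown.
reply_markdown = f"**{tool_result['city']}**: currently {tool_result['temp_c']} °C"
log_event(history_log, {"role": "assistant", "content": reply_markdown})

print(reply_markdown)  # **Oslo**: currently 4 °C
print(history_log.getvalue(), end="")
```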

📊 Evals & Benchmarks

Many evaluation benchmarks (HumanEval, for example) are distributed as JSONL. Results are stored in JSON. Reports are written in Markdown.
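
As a small end-to-end sketch of that flow, the code below reads per-item results from JSONL, aggregates them into a JSON summary, and renders a Markdown report. The result records and their fields (task, passed) are invented for this example and do not follow any particular benchmark's schema.

```python
import io
import json

# Hypothetical per-item eval results in JSONL form.
results_jsonl = io.StringIO(
    '{"task": "sum", "passed": true}\n'
    '{"task": "sort", "passed": false}\n'
    '{"task": "parse", "passed": true}\n'
    '{"task": "join", "passed": true}\n'
)
results = [json.loads(line) for line in results_jsonl]

# Aggregate into a JSON summary.
summary = {
    "total": len(results),
    "passed": sum(r["passed"] for r in results),
}
summary["accuracy"] = summary["passed"] / summary["total"]
summary_json = json.dumps(summary, indent=2)

# Render the summary as a Markdown report with a GFM table.
report = "\n".join([
    "# Eval Report",
    "",
    "| Metric | Value |",
    "|--------|-------|",
    f"| Total | {summary['total']} |",
    f"| Passed | {summary['passed']} |",
    f"| Accuracy | {summary['accuracy']:.0%} |",
])
print(summary_json)
print(report)
```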

Free Tools

Format Tools

Validate, convert, and preview JSON, JSONL, and Markdown directly in your browser. No signup required.

JSONL Validator Free

Paste JSONL below. Each line will be validated against the official jsonlines.org spec.

JSON Formatter Free

Paste minified JSON to pretty-print it, or paste formatted JSON to minify it.
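
The same two operations can be sketched with Python's standard json module: indent controls pretty-printing, and compact separators remove the optional whitespace. The sample input is made up for this example.

```python
import json

minified = '{"model":"gpt-4o","limits":{"max_tokens":128000}}'

# Pretty-print: parse, then serialize with indentation.
pretty = json.dumps(json.loads(minified), indent=2)

# Minify: serialize with the most compact separators (no spaces).
compact = json.dumps(json.loads(pretty), separators=(",", ":"))

print(pretty)
print(compact == minified)  # True
```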

Markdown Preview Free

Type Markdown on the left, see the rendered preview on the right instantly.

JSON → JSONL Converter Free

Convert a JSON array into JSONL format — one object per line, ready for LLM training.
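
The conversion itself is short. A minimal sketch, assuming the input is a top-level JSON array; the function name json_array_to_jsonl and the sample records are hypothetical.

```python
import json

def json_array_to_jsonl(json_text: str) -> str:
    """Convert a JSON array into JSONL text: one array element per line."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps never emits raw newlines, so each element stays on one line.
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in data)

array_text = '[{"prompt": "Hi", "completion": "Hello!"}, {"prompt": "Bye", "completion": "Goodbye!"}]'
jsonl_text = json_array_to_jsonl(array_text)
print(jsonl_text, end="")
# {"prompt": "Hi", "completion": "Hello!"}
# {"prompt": "Bye", "completion": "Goodbye!"}
```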

Blog

Articles & Guides

Deep dives on JSON, JSONL, Markdown, and how they power modern AI systems.

📄

JSONL LLM

OpenAI Fine-Tuning with JSONL: The Complete Guide

From zero to a fine-tuned GPT-4o model. We cover the exact JSONL format OpenAI expects, common mistakes, and how to validate your training data.

8 min read · JSONL · Fine-Tuning

#

Markdown LLM

Why ChatGPT and Claude Output Markdown by Default

Markdown saves 50–70% of tokens versus HTML. We explain why every major LLM defaults to Markdown and what that means for your applications.

5 min read · Markdown · AI

{ }

JSON JSONL

JSON vs JSONL: When to Use Which Format

JSON is for a single structured document. JSONL is for streaming collections. Here's the definitive guide on when each format is the right tool.

6 min read · JSON · JSONL

📋

Markdown

CommonMark vs GitHub Flavored Markdown: Key Differences

GFM adds tables, task lists, and strikethrough on top of CommonMark. Here's exactly what each spec adds, with examples and a compatibility matrix.

7 min read · Markdown · Spec

🤖

JSONL RAG

Building a RAG Pipeline: JSON for Metadata, JSONL for Chunks

How to structure your document ingestion pipeline using JSON for vector metadata and JSONL for document chunks — the format every major vector DB expects.

10 min read · JSONL · RAG

⚙️

JSON LLM

JSON Mode in GPT-4o and Claude: Forcing Structured Output

Structured JSON output is the key to building reliable AI agents. We cover JSON mode, function calling, and how to use Pydantic to validate AI responses.

9 min read · JSON · AI Agents
