What is Markdown?

Markdown is a lightweight markup language that lets you add formatting — headings, bold, italics, links, code — to plain text using simple, intuitive symbols. A # before a line makes it a heading. Wrapping text in **double asterisks** makes it bold. No HTML tags required.

The philosophy behind Markdown is that a document should be readable as plain text, even without being rendered. An email with *emphasized* text reads naturally whether you see the asterisks or the italics. This dual readability — human and machine — is exactly why every LLM has adopted it as its default output format.

2004
Year Created
0.31.2
CommonMark Ver.
↓70%
Tokens vs HTML
.md
File Extension
💡 The Core Idea

Markdown is both a writing format and a conversion tool. You write in plain text with simple punctuation-based conventions. A Markdown parser then converts that text into HTML, PDF, or any other output format. The source is always readable; the rendered output is always beautiful.

History & the CommonMark Standard

Markdown was created by John Gruber (with help from Aaron Swartz) in 2004 and released on his blog Daring Fireball. His goal was simple: create a writing format for the web that reads naturally as plain text and converts cleanly to HTML. It spread rapidly through the early blogging and developer communities.

2004
Markdown 1.0 Released
John Gruber publishes the Markdown syntax description and Markdown.pl — a Perl script that converts Markdown to HTML — on Daring Fireball.
2008
Stack Overflow Launches with Markdown
Jeff Atwood and Joel Spolsky choose Markdown for Stack Overflow. Suddenly millions of developers start writing and reading Markdown daily.
2009
GitHub Adopts Markdown for READMEs
GitHub makes Markdown the standard for README files. This becomes the single biggest driver of Markdown adoption, reaching tens of millions of developers.
2012
GitHub Flavored Markdown (GFM) Born
GitHub extends Markdown with tables, syntax highlighting, task lists, and strikethrough. GFM becomes the de-facto standard for developer documentation.
2014
CommonMark Founded
Jeff Atwood, John MacFarlane, and representatives from GitHub, Stack Overflow, and Reddit release CommonMark — a fully unambiguous, testable Markdown specification with 500+ test cases.
2017
GFM Formally Specified
GitHub publishes the formal GFM specification as a strict superset of CommonMark, adding tables, strikethrough, autolinks, and task lists as official extensions.
2020+
The AI Era — Markdown Goes Universal
ChatGPT, Claude, Gemini, and Grok all adopt Markdown as their default output format. Notion, Obsidian, Linear, and hundreds of apps make Markdown their native input language. The format is no longer niche — it is infrastructure.
Markdown Specifications Sources: CommonMark 0.31.2 · GFM Spec 2017
Original Spec
Daring Fireball (2004)
By John Gruber — intentionally underspecified
CommonMark Version
0.31.2 (Jan 2024)
500+ test cases, BSD-licensed reference impl.
GFM Spec
GitHub.com/gfm/ (2017)
Strict superset of CommonMark + 4 extensions
File Extension
.md or .markdown
.md is universal; .markdown is legacy
MIME Type
text/markdown
Registered in RFC 7763
Encoding
UTF-8 (recommended)
Any encoding is technically valid

Core Syntax Reference

Markdown documents are composed of two kinds of elements: block elements (headings, paragraphs, lists, code blocks — structural elements that occupy their own line) and inline elements (bold, italic, links, inline code — elements that appear within a block of text).

Block Elements

Syntax Result Notes
# Heading 1 <h1> — largest heading Space after # required in CommonMark
## Heading 2 <h2> Up to 6 levels (######)
Heading\n=== <h1> — setext style Alternative ATX style; === for h1, --- for h2
- item or * item Unordered list - + * are equivalent; consistent within list
1. item Ordered list Number value doesn't matter; auto-increments
> text Blockquote Nestable with >>; can contain any blocks
--- or *** Thematic break <hr> 3+ hyphens, asterisks, or underscores
indented Indented code block 4 spaces or 1 tab indent; no language tag
```lang Fenced code block Preferred; supports language identifier

Inline Elements

Syntax Result Notes
**bold** or __bold__ <strong> bold text ** is preferred; __ can conflict in words
*italic* or _italic_ <em> italic text * preferred for intraword; _ can cause issues
***bold italic*** <strong><em> Triple delimiter; widely supported
`inline code` <code> monospace Use `` backtick `` to include backtick in code
[text](url) Inline link Optional title: [text](url "title")
[text][ref] Reference link [ref]: url defined elsewhere in doc
![alt](url) <img> image Same as link but prefixed with !
\*escaped\* Literal * character Backslash escapes any punctuation character

Code Blocks in Depth

Fenced code blocks are the most important Markdown feature for technical writing and AI output. Three or more backticks open a block; the same number closes it. An optional info string (language identifier) immediately follows the opening fence and is used for syntax highlighting.

code-blocks.md
Markdown
// ── Fenced code block with language identifier ──────────────
```python
def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```

// ── Longer fence for code containing backticks ───────────────
````markdown
Here's an example with `backticks` inside.
````

// ── Info string is the language identifier ───────────────────
// Supported: python, javascript, json, bash, sql, rust, go ...
// Used by: GitHub, VS Code, Claude, ChatGPT for highlighting

// ── Indented code block (4 spaces) — no language tag ─────────
    def old_style():
        pass   # No syntax highlighting

GitHub Flavored Markdown (GFM)

GFM is a strict superset of CommonMark — it supports every CommonMark feature plus four official extensions. It is the most widely deployed Markdown flavor in the world, used by GitHub, GitLab, Discord, Reddit, Slack, and most AI assistants.

📌 GFM's 4 Official Extensions

Per the formal GFM specification: Tables, Task Lists, Strikethrough, and Autolinks. These four additions on top of CommonMark define GFM. Everything else (mentions, issue references, math) is GitHub-platform-specific, not part of the GFM spec.

gfm-extensions.md
GFM
// ── Extension 1: Tables ──────────────────────────────────────
| Format   | Use Case      | Streamable |
|----------|---------------|:----------:|
| JSON     | API responses | No         |
| JSONL    | Training data | Yes        |
| Markdown | LLM output    | Yes        |

// Alignment: :--- left  |  :---: center  |  ---: right

// ── Extension 2: Task Lists ──────────────────────────────────
- [x] Read the CommonMark spec
- [x] Understand GFM extensions
- [ ] Build the jsonl.ai website
- [ ] Sell domain for $100K

// ── Extension 3: Strikethrough ───────────────────────────────
~~deprecated text~~  or  ~single tilde~

// ── Extension 4: Autolinks ───────────────────────────────────
https://jsonl.ai           ← becomes a clickable link automatically
[email protected]           ← email also auto-linked
www.jsonl.ai               ← www. prefix also supported

GFM also adds GitHub-platform-specific features (not in the formal spec but widely recognized): @mentions, #issue-references, commit SHA autolinks, $LaTeX$ math with MathJax, Mermaid diagram blocks, and collapsible <details> sections.

Markdown Flavors Compared

There is no single canonical Markdown. Instead, there is an ecosystem of flavors — each adding features on top of Gruber's original spec. Here are the major ones you'll encounter:

Original Markdown
Gruber · 2004
The original spec. Intentionally underspecified. No tables, no fenced code blocks. Still the philosophical foundation.
CommonMark
MacFarlane, Atwood · 2014
Unambiguous, fully specified, 500+ test cases. The correct foundation for any new Markdown implementation. Version 0.31.2 is current (Jan 2024).
GFM
GitHub · 2017
CommonMark + tables, task lists, strikethrough, autolinks. The most widely used flavor globally. Used by Claude, ChatGPT, and most AI systems.
MDX
Community · 2018
Markdown + JSX components. Write React components inline in Markdown. Used by Next.js, Remix, and Astro for interactive documentation.
Markdown Extra
Gruber-compatible · 2004+
Adds footnotes, definition lists, abbreviations, and attribute blocks. Popular in PHP CMSes (Drupal, TYPO3).
Pandoc Markdown
MacFarlane · 2006+
The most feature-rich flavor. Adds citations, footnotes, math, tables, definition lists. Powers academic publishing workflows.
Feature Gruber CommonMark GFM MDX
Tables
Fenced code blocks
Task lists
Strikethrough ✓ (~~)
Autolinks ⚠ <url> ⚠ <url> ✓ bare URLs
JSX components
LLM default ⚠ Partial ✓ Yes

Why Every LLM Outputs Markdown

When you ask ChatGPT, Claude, Gemini, or Grok a question, the response almost always uses Markdown — headings, bullets, bold text, code blocks. This is not a coincidence or a style choice. It is a deliberate design decision rooted in token efficiency, training data distribution, and structural clarity.

Token Efficiency

LLMs are billed and limited by tokens. Markdown conveys the same structure as HTML using dramatically fewer characters — and therefore fewer tokens.

Relative token cost to express the same structured content
Markdown
~30%
HTML
~100% (baseline)
XML
~90%
Plain text
~20%

Markdown sits in the sweet spot: far cheaper than HTML (which adds <p>, <strong>, <ul><li> overhead everywhere), while retaining all structural meaning that plain text loses.

Training Data Distribution

LLMs are trained on massive text corpora scraped from the web. A enormous fraction of that data — GitHub READMEs, Stack Overflow posts, Reddit comments, Wikipedia, technical blogs, documentation sites — is already written in Markdown. The model has seen billions of examples of Markdown syntax, making it the path of least resistance when generating structured responses.

Structural Clarity for the Model

Markdown gives LLMs a natural vocabulary for organizing their thoughts. A # signals a main topic. ## signals a subtopic. Bullets organize parallel items. Code blocks isolate code from prose. This hierarchical structure helps the model organize long responses in a way that mirrors how the information is conceptually structured — not just how it looks.

📊 The LLMs.txt Standard (2024)

In September 2024, Jeremy Howard proposed llms.txt — a Markdown file at your site's root, written specifically for AI crawlers. The spec is intentionally built on Markdown because it is the "native language" of LLMs. Clean Markdown reduces hallucinations by 30–70% by eliminating HTML noise. Anthropic, Cursor, and Mintlify have already adopted it.

Python: Parse & Render Markdown

Python has excellent Markdown libraries. The most popular are markdown (original, basic), mistune (fast, customizable), and markdown-it-py (CommonMark-compliant, recommended for new projects).

Basic Rendering with markdown-it-py (CommonMark)

render_markdown.py
Python
# pip install markdown-it-py  (CommonMark-compliant)
from markdown_it import MarkdownIt

# ── CommonMark compliant renderer ─────────────────────────────
md = MarkdownIt()    # defaults to CommonMark

html = md.render("""
# Hello, AI

This is **bold** and this is *italic*.

```python
print("Hello from a code block")
```

- Item one
- Item two
""")
print(html)
# → <h1>Hello, AI</h1>
# → <p>This is <strong>bold</strong>...</p>
# → <pre><code class="language-python">...</code></pre>

# ── Enable GFM extensions ─────────────────────────────────────
from mdit_py_plugins.tasklists import tasklists_plugin

md_gfm = MarkdownIt("gfm-like").use(tasklists_plugin)
html_gfm = md_gfm.render("- [x] Done\n- [ ] Todo")

Processing LLM Output (Strip Markdown → Plain Text)

strip_markdown.py
Python
# pip install strip-markdown  (or use regex for simple cases)
import re

def strip_markdown(text: str) -> str:
    """Remove common Markdown syntax from LLM output."""
    # Remove fenced code blocks (keep content)
    text = re.sub(r'```[^\n]*\n(.*?)\n```', r'\1', text, flags=re.DOTALL)
    # Remove headings (keep text)
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
    # Remove bold/italic markers
    text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text)
    text = re.sub(r'_{1,3}(.+?)_{1,3}', r'\1', text)
    # Remove inline code backticks
    text = re.sub(r'`(.+?)`', r'\1', text)
    # Remove links (keep text)
    text = re.sub(r'\[(.+?)\]\(.+?\)', r'\1', text)
    # Remove list markers
    text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
    return text.strip()

# Useful when feeding LLM output to TTS or a non-Markdown display
llm_response = "## Summary\n\n**JSON** is used for *API responses*."
clean = strip_markdown(llm_response)
# → "Summary\n\nJSON is used for API responses."

Extract Code Blocks from LLM Output

extract_code.py
Python
import re
from dataclasses import dataclass

@dataclass
class CodeBlock:
    language: str
    code: str

def extract_code_blocks(markdown: str) -> list[CodeBlock]:
    """Extract all fenced code blocks from LLM Markdown output."""
    pattern = r'```(\w*)\n(.*?)```'
    matches = re.findall(pattern, markdown, flags=re.DOTALL)
    return [CodeBlock(lang or "text", code.strip()) for lang, code in matches]

# Usage: extract runnable code from LLM response
response = """
Here's the solution:

```python
x = 42
print(x)
```

And the config:

```json
{"debug": true}
```
"""

blocks = extract_code_blocks(response)
for block in blocks:
    print(f"{block.language}: {block.code[:40]}...")
# python: x = 42\nprint(x)...
# json: {"debug": true}...

Source vs Rendered: Side-by-Side

The best way to understand Markdown is to see the source and the rendered output together. Here are the most important patterns:

Markdown Source
# Heading 1
## Heading 2

Normal paragraph text.

**bold** and *italic* text.

- First item
- Second item
  - Nested item

> A blockquote
Rendered Output

Heading 1

Heading 2

Normal paragraph text.

bold and italic text.

  • First item
  • Second item
    • Nested item
A blockquote
GFM Table Source
| Format | Tokens | AI? |
|--------|:------:|----:|
| MD     | Low    | Yes |
| HTML   | High   | No  |
| JSON   | Medium | Yes |
Rendered Table
FormatTokensAI?
MDLowYes
HTMLHighNo
JSONMediumYes
Links & Images Source
[Visit jsonl.ai](https://jsonl.ai)

[Ref-style link][ref]

[ref]: https://jsonl.ai "JSONL Guide"

![Alt text](logo.png)
Rendered Links

Visit jsonl.ai

Ref-style link

[image renders here]

Common Mistakes

  • Forgetting the space after # — CommonMark and GFM require a space between the # and the heading text. #Heading is not a heading; # Heading is. This was formalized in CommonMark to remove ambiguity.
  • Using underscores for emphasis inside words_italic_ works for whole words but fails inside compound words like some_variable_name. Use *asterisks* when the emphasis marker might appear adjacent to non-whitespace characters.
  • Mixing list marker types inconsistently — Within a single list, use only -, only *, or only +. Mixing them creates separate lists in some renderers.
  • Missing blank lines around block elements — Headings, lists, and code blocks generally need a blank line before and after them to be parsed correctly. Leaving them out produces unexpected results across different renderers.
  • Indenting list items inconsistently — Nested list items need consistent indentation (typically 2–4 spaces). Different parsers have different rules; CommonMark specifies the exact algorithm.
  • Expecting comments to work — Markdown has no native comment syntax. The HTML comment <!-- comment --> technically works (since Markdown passes raw HTML through), but this is not portable across all renderers.
  • Tables without blank lines — GFM tables need a blank line before them to be correctly parsed. A table immediately following a paragraph may be treated as part of the paragraph in some parsers.
  • Using tabs for indentation — CommonMark defines tab handling very specifically (a tab equals spaces up to the next tab stop, every 4 columns). Mixing spaces and tabs causes inconsistent rendering. Use spaces.

Frequently Asked Questions

What is the difference between CommonMark and GFM?

CommonMark is the base specification — fully unambiguous, 500+ test cases, version 0.31.2 as of January 2024. GFM (GitHub Flavored Markdown) is a strict superset of CommonMark with four additional extensions: tables, task lists, strikethrough, and autolinks. If you're building a Markdown renderer, implement CommonMark first, then add the GFM extensions. GFM is what most AI systems, GitHub, and modern tools actually use.

Why do LLMs output Markdown even when I don't ask for it?

Because they were trained on data that is predominantly Markdown-formatted (GitHub, Stack Overflow, Reddit, documentation sites), and because Markdown was selected as the output format for chat interfaces like ChatGPT and Claude. The models have internalized Markdown as the "correct" way to structure a response. You can usually suppress it by explicitly asking for plain text output, or by setting a system prompt that says "respond in plain text without Markdown formatting."

Can I use Markdown for formatting in JSON strings?

Yes — and this is extremely common in AI APIs. The content field in OpenAI and Anthropic API responses is a plain JSON string that contains Markdown formatting. Your application is responsible for rendering the Markdown. This is why most chat interfaces render LLM output as Markdown rather than displaying the raw asterisks and pound signs.

What's the best Markdown parser for Python?

markdown-it-py is the recommended choice for new projects — it is CommonMark-compliant, actively maintained, and supports GFM extensions via plugins. For simpler use cases, the original markdown library is fine. For maximum performance at scale, mistune is the fastest option. Avoid the original Python-Markdown library for CommonMark compliance — it has many quirks that deviate from the spec.

Do I need to escape special characters in Markdown?

Yes, when you want a literal Markdown character rather than its formatting effect. Backslash-escape any of these: \ ` * _ { } [ ] ( ) # + - . !. For example, write \*literal asterisk\* to prevent italic formatting. Inside code spans and fenced code blocks, no escaping is needed — content is treated as literal.

Complete the Trilogy