What is Markdown?
Markdown is a lightweight markup language that lets you add formatting — headings, bold, italics, links, code — to plain text using simple, intuitive symbols. A # before a line makes it a heading. Wrapping text in **double asterisks** makes it bold. No HTML tags required.
The philosophy behind Markdown is that a document should be readable as plain text, even without being rendered. An email with *emphasized* text reads naturally whether you see the asterisks or the italics. This dual readability — human and machine — is exactly why every LLM has adopted it as its default output format.
Markdown is both a writing format and a conversion tool. You write in plain text with simple punctuation-based conventions. A Markdown parser then converts that text into HTML, PDF, or any other output format. The source is always readable; the rendered output is always beautiful.
History & the CommonMark Standard
Markdown was created by John Gruber (with help from Aaron Swartz) in 2004 and released on his blog Daring Fireball. His goal was simple: create a writing format for the web that reads naturally as plain text and converts cleanly to HTML. It spread rapidly through the early blogging and developer communities.
Core Syntax Reference
Markdown documents are composed of two kinds of elements: block elements (headings, paragraphs, lists, code blocks — structural elements that occupy their own line) and inline elements (bold, italic, links, inline code — elements that appear within a block of text).
Block Elements
| Syntax | Result | Notes |
|---|---|---|
# Heading 1 |
<h1> — largest heading | Space after # required in CommonMark |
## Heading 2 |
<h2> | Up to 6 levels (######) |
Heading\n=== |
<h1> — setext style | Alternative ATX style; === for h1, --- for h2 |
- item or * item |
Unordered list | - + * are equivalent; consistent within list |
1. item |
Ordered list | Number value doesn't matter; auto-increments |
> text |
Blockquote | Nestable with >>; can contain any blocks |
--- or *** |
Thematic break <hr> | 3+ hyphens, asterisks, or underscores |
indented |
Indented code block | 4 spaces or 1 tab indent; no language tag |
```lang |
Fenced code block | Preferred; supports language identifier |
Inline Elements
| Syntax | Result | Notes |
|---|---|---|
**bold** or __bold__ |
<strong> bold text | ** is preferred; __ can conflict in words |
*italic* or _italic_ |
<em> italic text | * preferred for intraword; _ can cause issues |
***bold italic*** |
<strong><em> | Triple delimiter; widely supported |
`inline code` |
<code> monospace | Use `` backtick `` to include backtick in code |
[text](url) |
Inline link | Optional title: [text](url "title") |
[text][ref] |
Reference link | [ref]: url defined elsewhere in doc |
 |
<img> image | Same as link but prefixed with ! |
\*escaped\* |
Literal * character | Backslash escapes any punctuation character |
Code Blocks in Depth
Fenced code blocks are the most important Markdown feature for technical writing and AI output. Three or more backticks open a block; the same number closes it. An optional info string (language identifier) immediately follows the opening fence and is used for syntax highlighting.
// ── Fenced code block with language identifier ────────────── ```python def load_jsonl(path): with open(path) as f: return [json.loads(line) for line in f] ``` // ── Longer fence for code containing backticks ─────────────── ````markdown Here's an example with `backticks` inside. ```` // ── Info string is the language identifier ─────────────────── // Supported: python, javascript, json, bash, sql, rust, go ... // Used by: GitHub, VS Code, Claude, ChatGPT for highlighting // ── Indented code block (4 spaces) — no language tag ───────── def old_style(): pass # No syntax highlighting
GitHub Flavored Markdown (GFM)
GFM is a strict superset of CommonMark — it supports every CommonMark feature plus four official extensions. It is the most widely deployed Markdown flavor in the world, used by GitHub, GitLab, Discord, Reddit, Slack, and most AI assistants.
Per the formal GFM specification: Tables, Task Lists, Strikethrough, and Autolinks. These four additions on top of CommonMark define GFM. Everything else (mentions, issue references, math) is GitHub-platform-specific, not part of the GFM spec.
// ── Extension 1: Tables ────────────────────────────────────── | Format | Use Case | Streamable | |----------|---------------|:----------:| | JSON | API responses | No | | JSONL | Training data | Yes | | Markdown | LLM output | Yes | // Alignment: :--- left | :---: center | ---: right // ── Extension 2: Task Lists ────────────────────────────────── - [x] Read the CommonMark spec - [x] Understand GFM extensions - [ ] Build the jsonl.ai website - [ ] Sell domain for $100K // ── Extension 3: Strikethrough ─────────────────────────────── ~~deprecated text~~ or ~single tilde~ // ── Extension 4: Autolinks ─────────────────────────────────── https://jsonl.ai ← becomes a clickable link automatically [email protected] ← email also auto-linked www.jsonl.ai ← www. prefix also supported
GFM also adds GitHub-platform-specific features (not in the formal spec but widely recognized): @mentions, #issue-references, commit SHA autolinks, $LaTeX$ math with MathJax, Mermaid diagram blocks, and collapsible <details> sections.
Markdown Flavors Compared
There is no single canonical Markdown. Instead, there is an ecosystem of flavors — each adding features on top of Gruber's original spec. Here are the major ones you'll encounter:
| Feature | Gruber | CommonMark | GFM | MDX |
|---|---|---|---|---|
| Tables | ✗ | ✗ | ✓ | ✓ |
| Fenced code blocks | ✗ | ✓ | ✓ | ✓ |
| Task lists | ✗ | ✗ | ✓ | ✓ |
| Strikethrough | ✗ | ✗ | ✓ (~~) | ✓ |
| Autolinks | ⚠ <url> | ⚠ <url> | ✓ bare URLs | ✓ |
| JSX components | ✗ | ✗ | ✗ | ✓ |
| LLM default | ✗ | ⚠ Partial | ✓ Yes | ✗ |
Why Every LLM Outputs Markdown
When you ask ChatGPT, Claude, Gemini, or Grok a question, the response almost always uses Markdown — headings, bullets, bold text, code blocks. This is not a coincidence or a style choice. It is a deliberate design decision rooted in token efficiency, training data distribution, and structural clarity.
Token Efficiency
LLMs are billed and limited by tokens. Markdown conveys the same structure as HTML using dramatically fewer characters — and therefore fewer tokens.
Markdown sits in the sweet spot: far cheaper than HTML (which adds <p>, <strong>, <ul><li> overhead everywhere), while retaining all structural meaning that plain text loses.
Training Data Distribution
LLMs are trained on massive text corpora scraped from the web. A enormous fraction of that data — GitHub READMEs, Stack Overflow posts, Reddit comments, Wikipedia, technical blogs, documentation sites — is already written in Markdown. The model has seen billions of examples of Markdown syntax, making it the path of least resistance when generating structured responses.
Structural Clarity for the Model
Markdown gives LLMs a natural vocabulary for organizing their thoughts. A # signals a main topic. ## signals a subtopic. Bullets organize parallel items. Code blocks isolate code from prose. This hierarchical structure helps the model organize long responses in a way that mirrors how the information is conceptually structured — not just how it looks.
In September 2024, Jeremy Howard proposed llms.txt — a Markdown file at your site's root, written specifically for AI crawlers. The spec is intentionally built on Markdown because it is the "native language" of LLMs. Clean Markdown reduces hallucinations by 30–70% by eliminating HTML noise. Anthropic, Cursor, and Mintlify have already adopted it.
Python: Parse & Render Markdown
Python has excellent Markdown libraries. The most popular are markdown (original, basic), mistune (fast, customizable), and markdown-it-py (CommonMark-compliant, recommended for new projects).
Basic Rendering with markdown-it-py (CommonMark)
# pip install markdown-it-py (CommonMark-compliant) from markdown_it import MarkdownIt # ── CommonMark compliant renderer ───────────────────────────── md = MarkdownIt() # defaults to CommonMark html = md.render(""" # Hello, AI This is **bold** and this is *italic*. ```python print("Hello from a code block") ``` - Item one - Item two """) print(html) # → <h1>Hello, AI</h1> # → <p>This is <strong>bold</strong>...</p> # → <pre><code class="language-python">...</code></pre> # ── Enable GFM extensions ───────────────────────────────────── from mdit_py_plugins.tasklists import tasklists_plugin md_gfm = MarkdownIt("gfm-like").use(tasklists_plugin) html_gfm = md_gfm.render("- [x] Done\n- [ ] Todo")
Processing LLM Output (Strip Markdown → Plain Text)
# pip install strip-markdown (or use regex for simple cases) import re def strip_markdown(text: str) -> str: """Remove common Markdown syntax from LLM output.""" # Remove fenced code blocks (keep content) text = re.sub(r'```[^\n]*\n(.*?)\n```', r'\1', text, flags=re.DOTALL) # Remove headings (keep text) text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE) # Remove bold/italic markers text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text) text = re.sub(r'_{1,3}(.+?)_{1,3}', r'\1', text) # Remove inline code backticks text = re.sub(r'`(.+?)`', r'\1', text) # Remove links (keep text) text = re.sub(r'\[(.+?)\]\(.+?\)', r'\1', text) # Remove list markers text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE) return text.strip() # Useful when feeding LLM output to TTS or a non-Markdown display llm_response = "## Summary\n\n**JSON** is used for *API responses*." clean = strip_markdown(llm_response) # → "Summary\n\nJSON is used for API responses."
Extract Code Blocks from LLM Output
import re from dataclasses import dataclass @dataclass class CodeBlock: language: str code: str def extract_code_blocks(markdown: str) -> list[CodeBlock]: """Extract all fenced code blocks from LLM Markdown output.""" pattern = r'```(\w*)\n(.*?)```' matches = re.findall(pattern, markdown, flags=re.DOTALL) return [CodeBlock(lang or "text", code.strip()) for lang, code in matches] # Usage: extract runnable code from LLM response response = """ Here's the solution: ```python x = 42 print(x) ``` And the config: ```json {"debug": true} ``` """ blocks = extract_code_blocks(response) for block in blocks: print(f"{block.language}: {block.code[:40]}...") # python: x = 42\nprint(x)... # json: {"debug": true}...
Source vs Rendered: Side-by-Side
The best way to understand Markdown is to see the source and the rendered output together. Here are the most important patterns:
# Heading 1 ## Heading 2 Normal paragraph text. **bold** and *italic* text. - First item - Second item - Nested item > A blockquote
Heading 1
Heading 2
Normal paragraph text.
bold and italic text.
- First item
- Second item
- Nested item
A blockquote
| Format | Tokens | AI? | |--------|:------:|----:| | MD | Low | Yes | | HTML | High | No | | JSON | Medium | Yes |
| Format | Tokens | AI? |
|---|---|---|
| MD | Low | Yes |
| HTML | High | No |
| JSON | Medium | Yes |
[Visit jsonl.ai](https://jsonl.ai) [Ref-style link][ref] [ref]: https://jsonl.ai "JSONL Guide" 
Common Mistakes
- Forgetting the space after
#— CommonMark and GFM require a space between the#and the heading text.#Headingis not a heading;# Headingis. This was formalized in CommonMark to remove ambiguity. - Using underscores for emphasis inside words —
_italic_works for whole words but fails inside compound words likesome_variable_name. Use*asterisks*when the emphasis marker might appear adjacent to non-whitespace characters. - Mixing list marker types inconsistently — Within a single list, use only
-, only*, or only+. Mixing them creates separate lists in some renderers. - Missing blank lines around block elements — Headings, lists, and code blocks generally need a blank line before and after them to be parsed correctly. Leaving them out produces unexpected results across different renderers.
- Indenting list items inconsistently — Nested list items need consistent indentation (typically 2–4 spaces). Different parsers have different rules; CommonMark specifies the exact algorithm.
- Expecting comments to work — Markdown has no native comment syntax. The HTML comment
<!-- comment -->technically works (since Markdown passes raw HTML through), but this is not portable across all renderers. - Tables without blank lines — GFM tables need a blank line before them to be correctly parsed. A table immediately following a paragraph may be treated as part of the paragraph in some parsers.
- Using tabs for indentation — CommonMark defines tab handling very specifically (a tab equals spaces up to the next tab stop, every 4 columns). Mixing spaces and tabs causes inconsistent rendering. Use spaces.
Frequently Asked Questions
What is the difference between CommonMark and GFM?
CommonMark is the base specification — fully unambiguous, 500+ test cases, version 0.31.2 as of January 2024. GFM (GitHub Flavored Markdown) is a strict superset of CommonMark with four additional extensions: tables, task lists, strikethrough, and autolinks. If you're building a Markdown renderer, implement CommonMark first, then add the GFM extensions. GFM is what most AI systems, GitHub, and modern tools actually use.
Why do LLMs output Markdown even when I don't ask for it?
Because they were trained on data that is predominantly Markdown-formatted (GitHub, Stack Overflow, Reddit, documentation sites), and because Markdown was selected as the output format for chat interfaces like ChatGPT and Claude. The models have internalized Markdown as the "correct" way to structure a response. You can usually suppress it by explicitly asking for plain text output, or by setting a system prompt that says "respond in plain text without Markdown formatting."
Can I use Markdown for formatting in JSON strings?
Yes — and this is extremely common in AI APIs. The content field in OpenAI and Anthropic API responses is a plain JSON string that contains Markdown formatting. Your application is responsible for rendering the Markdown. This is why most chat interfaces render LLM output as Markdown rather than displaying the raw asterisks and pound signs.
What's the best Markdown parser for Python?
markdown-it-py is the recommended choice for new projects — it is CommonMark-compliant, actively maintained, and supports GFM extensions via plugins. For simpler use cases, the original markdown library is fine. For maximum performance at scale, mistune is the fastest option. Avoid the original Python-Markdown library for CommonMark compliance — it has many quirks that deviate from the spec.
Do I need to escape special characters in Markdown?
Yes, when you want a literal Markdown character rather than its formatting effect. Backslash-escape any of these: \ ` * _ { } [ ] ( ) # + - . !. For example, write \*literal asterisk\* to prevent italic formatting. Inside code spans and fenced code blocks, no escaping is needed — content is treated as literal.