What is JSONL?

JSONL stands for JSON Lines — also called newline-delimited JSON (NDJSON). It is a text format in which each line contains exactly one valid JSON value (almost always a JSON object), with lines separated by the newline character (\n).

Think of it this way: if regular JSON is a single document, JSONL is a collection of documents — a log, a dataset, a conversation history — where every record is completely independent and can be read one at a time without parsing the entire file.

JSONL is the format of choice for LLM fine-tuning datasets because it can be streamed line by line — you can process a 100 GB training file without loading it all into memory. It's also a natural fit for log files, event streams, and any situation where you need to append records over time without rewriting the entire file.

💡 Quick Definition

JSONL = one JSON object per line, UTF-8 encoded, newline-separated. That's it. No array wrappers, no commas between records, no outer structure. Each line is completely self-contained.

The 3 Official Rules — Specification at jsonlines.org

The official JSONL specification is maintained at jsonlines.org by Ian Ward. It defines exactly three requirements that every valid JSONL file must satisfy:

01

UTF-8 Encoding

All JSONL files must be encoded in UTF-8. No byte order mark (BOM) is allowed; this is consistent with the JSON standard (RFC 8259, section 8.1). Text saved in other encodings is very unlikely to decode as valid UTF-8, so a mis-encoded file usually fails loudly at parse time rather than silently corrupting characters.

02

Each Line is a Valid JSON Value

The most common values are objects {} and arrays [], but any valid JSON value is permitted — including strings, numbers, booleans, and even null. A blank line is not a valid JSON value and must not appear in a JSONL file.

03

Line Terminator is \n

Lines are separated by the newline character \n (U+000A). The Windows-style \r\n is also acceptable because surrounding whitespace is implicitly ignored when parsing JSON values. Including a line terminator after the last line is strongly recommended but not required.

JSONL Format Specification (source: jsonlines.org)

File extension    .jsonl                    May also be .ndjson (newline-delimited JSON)
MIME type         application/jsonl         Not yet formally standardized
Encoding          UTF-8 (required)          No BOM allowed
Line separator    \n or \r\n                Trailing newline recommended
Line content      Valid JSON value          Object, array, string, number, bool, null
Compression       .jsonl.gz / .jsonl.bz2    gzip/bzip2 recommended for large files

JSONL vs JSON: Key Differences

The most common question developers ask is: when should I use JSONL instead of JSON? The answer depends on whether you're working with a single document or a collection of records.

Property           JSON (.json)                                   JSONL (.jsonl)
Structure          Single document                                One record per line
Streaming          ✗ Must parse entire file                       ✓ Line-by-line processing
Appending          ✗ Requires full rewrite                        ✓ Just append a new line
Memory usage       Loads entire file                              One line at a time
Best for           Config files, API responses, single documents  Training datasets, log files, event streams
Invalid line       Breaks entire file                             Only breaks that line
LLM fine-tuning    ⚠ Not standard                                 ✓ Universal standard
⚠ Common Mistake

Developers often try to wrap JSONL records in a JSON array [{...}, {...}]. This produces a valid .json file — not a valid .jsonl file. The JSONL format has no outer wrapper. Every line is independent.

JSONL in AI & LLM Fine-Tuning

JSONL has become the universal standard for LLM training datasets. The reason is straightforward: training datasets can contain millions of examples and be many gigabytes in size. JSONL allows training frameworks to stream examples one at a time without loading the entire dataset into GPU memory.

Every major LLM provider and open-source training framework uses JSONL as the required input format for fine-tuning.

OpenAI Fine-Tuning Format

OpenAI's fine-tuning API requires training data in JSONL format where each line contains a messages array — the same format used by the Chat Completions API. Each message has a role (system, user, or assistant) and content.

openai-training.jsonl
JSONL
// ── REQUIRED FORMAT for GPT-4o fine-tuning ───────────────────────────────
// Each line = one complete conversation example, on a single line
// (these // annotation lines are documentation only — a real JSONL file has no comments)
// Minimum: 10 examples | Recommended: 50–100+ examples

{"messages": [{"role": "system", "content": "You are a helpful customer support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password, go to Settings → Security → Reset Password and enter your registered email."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for Acme Corp."}, {"role": "user", "content": "What is your refund policy?"}, {"role": "assistant", "content": "We offer full refunds within 30 days of purchase, no questions asked."}]}
{"messages": [{"role": "user", "content": "Can I change my shipping address after ordering?"}, {"role": "assistant", "content": "Yes, if the order hasn't shipped yet. Contact us within 2 hours of placing the order."}]}
📌 OpenAI Requirements

The system role is optional per conversation. Each conversation must end with an assistant message. OpenAI recommends at least 50 training examples for meaningful fine-tuning results, with 100–500 being typical for good performance.
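These requirements can be checked programmatically before uploading. A minimal sketch (the `messages` key and the role names follow OpenAI's documented chat format; `check_openai_format` is an illustrative helper, not part of any SDK):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_openai_format(path: str) -> list[str]:
    """Return a list of problems found in an OpenAI-style chat JSONL file."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                problems.append(f"line {i}: blank line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append(f"line {i}: missing or empty 'messages' array")
                continue
            for m in messages:
                if m.get("role") not in VALID_ROLES:
                    problems.append(f"line {i}: unknown role {m.get('role')!r}")
            # each training example must finish with the model's turn
            if messages[-1].get("role") != "assistant":
                problems.append(f"line {i}: conversation must end with an assistant message")
    return problems
```

An empty return value means the file passed every check; otherwise each string pinpoints the offending line.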

Google Vertex AI, Mistral, and Llama Formats

While all platforms use JSONL, the internal schema of each line varies by provider. Here's what each major platform expects:

OpenAI / Azure OpenAI
{"messages": [{role, content}]}
Chat format with system/user/assistant roles. Used for GPT-3.5 Turbo, GPT-4o fine-tuning.
Google Vertex AI
{"messages": [{author, content}]}
Gemini and PaLM 2 fine-tuning. Uses "author" instead of "role". Upload to GCS bucket.
Mistral AI
{"messages": [{role, content}]}
Same format as OpenAI. Mistral-7B and Mixtral fine-tuning via mistral.ai platform.
Llama (via Axolotl)
{"instruction": "...", "output": "..."}
Alpaca-style format. Also supports ShareGPT conversation format for multi-turn.
Hugging Face
{"text": "..."} or custom
Flexible — any schema works with the datasets library. JSONL is the most common upload format.
Anthropic Claude
{"prompt": "...", "completion": "..."}
RLHF and Constitutional AI training data. Fine-tuning via the Anthropic API (enterprise).
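Because every provider schema above is still plain JSONL, converting between them is a per-line mapping. A sketch translating Alpaca-style records into the OpenAI-style chat schema (single-turn only; `alpaca_to_chat` is an illustrative name, and real Alpaca datasets may also carry an optional "input" field that this sketch ignores):

```python
import json

def alpaca_to_chat(in_path: str, out_path: str) -> int:
    """Map Alpaca-style {"instruction", "output"} lines to chat-style
    {"messages": [...]} lines. Returns the number of records written."""
    count = 0
    with open(in_path, "r", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip stray blank lines defensively
            rec = json.loads(line)
            chat = {"messages": [
                {"role": "user", "content": rec["instruction"]},
                {"role": "assistant", "content": rec["output"]},
            ]}
            fout.write(json.dumps(chat, ensure_ascii=False) + "\n")
            count += 1
    return count
```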

Python Code: Read, Write & Validate

Python's built-in json module handles JSONL perfectly. There's no need for a third-party library for basic use — just iterate over lines and call json.loads() on each one.

Reading JSONL

read_jsonl.py
Python
import json

# ── Method 1: Simple iteration ────────────────────────────────
with open("training_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:      # skip empty lines defensively
            continue
        record = json.loads(line)
        print(record)

# ── Method 2: Load all into a list ────────────────────────────
def load_jsonl(path: str) -> list:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = load_jsonl("training_data.jsonl")
print(f"Loaded {len(records)} training examples")

# ── Method 3: Generator for large files (memory-efficient) ────
def stream_jsonl(path: str):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in stream_jsonl("100gb_dataset.jsonl"):  # won't OOM
    process(record)  # process() stands in for your own per-record handler

Writing JSONL

write_jsonl.py
Python
import json

# ── Write a list to JSONL ─────────────────────────────────────
def save_jsonl(records: list, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# ensure_ascii=False preserves Unicode (emoji, CJK, etc.)

training_data = [
    {"messages": [{"role": "user", "content": "What is JSONL?"},
                   {"role": "assistant", "content": "JSON Lines — one JSON per line."}]},
    {"messages": [{"role": "user", "content": "Why use JSONL for training?"},
                   {"role": "assistant", "content": "It's streamable — process billions of rows without RAM issues."}]},
]
save_jsonl(training_data, "output.jsonl")

# ── Append to existing JSONL (no rewrite!) ───────────────────
def append_jsonl(record: dict, path: str) -> None:
    with open(path, "a", encoding="utf-8") as f:  # "a" = append mode
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# This is one of JSONL's key advantages over JSON arrays

Validating JSONL

validate_jsonl.py
Python
import json
from dataclasses import dataclass

@dataclass
class ValidationError:
    line_num: int
    message: str

def validate_jsonl(path: str) -> list[ValidationError]:
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            stripped = line.rstrip("\n\r")
            if not stripped:
                errors.append(ValidationError(i, "Empty line — not valid in JSONL"))
                continue
            try:
                json.loads(stripped)
            except json.JSONDecodeError as e:
                errors.append(ValidationError(i, str(e)))
    return errors

# Usage
errors = validate_jsonl("training_data.jsonl")
if not errors:
    print("✓ All lines valid")
else:
    for e in errors:
        print(f"✗ Line {e.line_num}: {e.message}")

Convert JSON Array → JSONL

convert_to_jsonl.py
Python
import json

def json_array_to_jsonl(input_path: str, output_path: str) -> int:
    """Convert a JSON array file to JSONL. Returns record count."""
    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError("Input JSON must be an array at the top level")

    with open(output_path, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    return len(data)

count = json_array_to_jsonl("data.json", "data.jsonl")
print(f"Converted {count} records to JSONL")
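The reverse conversion is just as short. A sketch that gathers JSONL records back into a pretty-printed JSON array, for tools that expect a plain .json file (it loads everything into memory, so it suits small and medium files):

```python
import json

def jsonl_to_json_array(input_path: str, output_path: str) -> int:
    """Gather JSONL records into a pretty-printed JSON array file.
    Returns the record count. Loads the whole file into memory."""
    with open(input_path, "r", encoding="utf-8") as f:
        data = [json.loads(line) for line in f if line.strip()]
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        f.write("\n")  # end the file with a newline
    return len(data)
```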

Valid vs Invalid JSONL Examples

Understanding what makes JSONL invalid helps you avoid the most common errors. Here are side-by-side examples:

✓ Valid JSONL
{"id": 1, "text": "Hello"}
{"id": 2, "text": "World"}
{"id": 3, "text": "Foo"}
3 lines, 3 valid JSON objects, no wrappers
✗ Invalid — JSON Array
[
  {"id": 1, "text": "Hello"},
  {"id": 2, "text": "World"}
]
This is valid JSON, not JSONL. No outer array in JSONL.
✓ Valid — Mixed JSON Types
{"name": "Alice", "score": 95}
null
[1, 2, 3]
"a plain string"
42
Any valid JSON value per line is allowed per the spec
✗ Invalid — Blank Line
{"name": "Alice"}

{"name": "Bob"}
The empty line on line 2 is not a valid JSON value
✓ Valid — Trailing Newline
{"a": 1}
{"b": 2}
{"c": 3}
↵ (final newline)
Trailing newline after last record is recommended
✗ Invalid — Trailing Comma
{"a": 1},
{"b": 2},
{"c": 3}
No commas between lines. Each line must be standalone JSON.

Common Mistakes to Avoid

  • Wrapping in a JSON array — JSONL has no outer [...] wrapper. If you need a JSON array, use .json not .jsonl.
  • Including blank lines — Per the spec, blank lines are not valid JSON values. Many parsers will throw an error or silently skip them. Strip them before uploading to a training API.
  • Using commas between records — Each line is a separate JSON value. Adding a trailing comma (,) after each line makes it invalid JSON on that line.
  • Wrong encoding — JSONL must be UTF-8. If you export from Excel or a legacy system, check the encoding with file -i yourfile.jsonl on Linux/macOS before uploading.
  • Multi-line JSON objects — Each JSON object must be on a single line. Pretty-printed JSON (with newlines and indentation) inside the record will break JSONL parsers.
  • Missing the final newline — Technically optional, but strongly recommended. Some tools misbehave when the last line lacks a trailing \n.
  • Using json.dump() instead of json.dumps() — When writing in Python, use json.dumps(record) + "\n", not json.dump(record, f) (which doesn't add the newline).
🚫 Most Common Fine-Tuning Error

The #1 mistake when uploading training data to OpenAI is sending a JSON array instead of JSONL. The API will reject the file with "Invalid file format." Verify each line is an independent JSON object before uploading.
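Several of the mistakes above (a BOM, blank lines, pretty-printed multi-line records, trailing commas) can be caught mechanically before uploading. A small preflight sketch; `preflight_jsonl` is an illustrative helper:

```python
import json

def preflight_jsonl(path: str) -> list[str]:
    """Catch a BOM, blank lines, and lines that are not standalone JSON."""
    issues = []
    with open(path, "rb") as f:
        if f.read(3) == b"\xef\xbb\xbf":
            issues.append("file starts with a UTF-8 BOM (not allowed in JSONL)")
    # utf-8-sig tolerates the BOM while scanning, so it isn't double-reported
    with open(path, "r", encoding="utf-8-sig") as f:
        for i, line in enumerate(f, 1):
            stripped = line.strip()
            if not stripped:
                issues.append(f"line {i}: blank line")
                continue
            try:
                json.loads(stripped)
            except json.JSONDecodeError:
                # a pretty-printed record spills across lines, so each
                # fragment fails to parse on its own
                issues.append(f"line {i}: not a standalone JSON value "
                              "(multi-line record or trailing comma?)")
    return issues
```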

Best Practices

  • Always validate before uploading — Run a validator (like the one on this site, or the Python code above) on your JSONL file before sending it to a fine-tuning API. A single bad line can reject the entire file.
  • Use ensure_ascii=False in Python — This preserves Unicode characters (emoji, CJK, Arabic, etc.) in their native form instead of escaping them as \uXXXX, which wastes tokens and makes data harder to inspect.
  • Compress large files — For datasets over 100 MB, use gzip compression (.jsonl.gz). Most training frameworks support compressed JSONL natively, and it reduces file size by 5–10× for text data.
  • Shuffle before fine-tuning — Random shuffle your JSONL records before training. Ordered data (all customer service questions first, then all coding questions) can cause catastrophic forgetting.
  • Split train/validation — Keep a 90/10 or 95/5 train/validation split in separate JSONL files. OpenAI's fine-tuning API accepts a separate validation file to report eval metrics.
  • Stream for large datasets — Use a generator function (see the Python examples above) rather than loading everything into memory. A 10 GB JSONL file can be processed with constant ~1 MB RAM usage.
  • Keep records atomic — Each JSONL line should be a self-contained, meaningful unit of training data. Avoid records that only make sense in the context of adjacent records.
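A few of these practices combined in one sketch: streaming a gzip-compressed file, then an in-memory shuffle and train/validation split (fine for files that fit in RAM; a disk-based shuffle is needed beyond that). Function names are illustrative:

```python
import gzip
import json
import random

def stream_jsonl_gz(path: str):
    """Stream records from a gzip-compressed JSONL file, one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def shuffle_and_split(in_path: str, train_path: str, val_path: str,
                      val_ratio: float = 0.1, seed: int = 42) -> tuple[int, int]:
    """Shuffle records and write train/validation JSONL files.
    Returns (train_count, val_count)."""
    with open(in_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)  # deterministic, reproducible shuffle
    n_val = int(len(records) * val_ratio)
    for path, recs in ((val_path, records[:n_val]), (train_path, records[n_val:])):
        with open(path, "w", encoding="utf-8") as f:
            for r in recs:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return len(records) - n_val, n_val
```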

Frequently Asked Questions

Is JSONL the same as ndjson?

Yes, essentially. ndjson (newline-delimited JSON) and JSONL (JSON Lines) describe the same format. The term "JSONL" and the .jsonl extension are more common in the AI/ML community; "ndjson" is more common in the data engineering community. Both follow the same rules.

Can a JSONL file contain arrays or just objects?

Per the official specification, any valid JSON value is permitted on each line — objects, arrays, strings, numbers, booleans, and null are all allowed. However, in practice and especially for LLM fine-tuning, every line should be a JSON object ({}). Arrays, strings, and primitives as top-level values are unusual and may not be accepted by training APIs.

How do I handle JSONL with Python without a library?

Python's built-in json module is all you need. Open the file, iterate over lines, strip whitespace, and call json.loads() on each line. See the Python code section above for complete examples.

What's the maximum file size for OpenAI fine-tuning?

OpenAI's fine-tuning API accepts JSONL files up to 1 GB. For very large datasets, you can provide multiple files or use dataset mixing. The minimum is 10 training examples; the recommended minimum for useful results is 50–100 examples.

Can I include comments in JSONL?

No. JSON does not support comments, and neither does JSONL. If you see // comment style lines in examples online, those are documentation annotations — they would make the JSONL file invalid. Use a separate metadata file or README to document your dataset.

Is JSONL an official standard?

JSONL is a de-facto standard maintained at jsonlines.org. It is not yet an IETF RFC or ECMA standard, though the maintainers are working toward a formal RFC for the MIME type registration. For practical purposes, jsonlines.org is the authoritative specification.
