What is JSONL?
JSONL stands for JSON Lines — also called newline-delimited JSON (NDJSON). It is a plain-text format in which every line contains exactly one valid JSON value (almost always a JSON object), with lines separated by the newline character (\n).
Think of it this way: if regular JSON is a single document, JSONL is a collection of documents — a log, a dataset, a conversation history — where every record is completely independent and can be read one at a time without parsing the entire file.
JSONL is the format of choice for LLM fine-tuning datasets because it can be streamed line-by-line — you can process a 100 GB training file without loading it all into memory. It's also the format for log files, event streams, and any situation where you need to append records over time without rewriting the entire file.
JSONL = one JSON object per line, UTF-8 encoded, newline-separated. That's it. No array wrappers, no commas between records, no outer structure. Each line is completely self-contained.
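The whole format fits in a few lines of Python — each line parses independently with the standard library:

```python
import json

# Three records, one JSON object per line — this string is a complete JSONL "file"
jsonl_text = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

# Parse each line on its own; no outer array, no commas between records
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(records)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```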
The 3 Official Rules (jsonlines.org Specification)
The official JSONL specification is maintained at jsonlines.org by Ian Ward. It defines exactly three requirements that every valid JSONL file must satisfy:
UTF-8 Encoding
All JSONL files must be encoded in UTF-8, with no byte order mark (BOM) — consistent with the JSON standard (RFC 8259, section 8.1). Text in other encodings is very unlikely to decode as valid UTF-8, so a wrongly encoded file typically fails loudly at parse time rather than being silently misread.
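Some Windows editors prepend a BOM when saving "UTF-8" files. A quick check before uploading (a minimal sketch; the path is whatever file you want to inspect):

```python
# The UTF-8 byte order mark, forbidden by the JSONL spec and RFC 8259
BOM = b"\xef\xbb\xbf"

def has_bom(path: str) -> bool:
    """Return True if the file starts with a UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == BOM

# If this returns True, re-save the file as plain UTF-8, or read it with
# encoding="utf-8-sig" in Python, which strips the BOM transparently.
```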
Each Line is a Valid JSON Value
The most common values are objects {} and arrays [], but any valid JSON value is permitted — including strings, numbers, booleans, and even null. A blank line is not a valid JSON value and must not appear in a JSONL file.
Line Terminator is \n
Lines are separated by the newline character \n (U+000A). The Windows-style \r\n is also acceptable because surrounding whitespace is implicitly ignored when parsing JSON values. Including a line terminator after the last line is strongly recommended but not required.
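The \r\n tolerance follows directly from the JSON grammar: whitespace around a value is ignored, so a stray carriage return left by a Windows line ending parses cleanly:

```python
import json

# A trailing \r from a Windows \r\n line ending is just surrounding
# whitespace as far as the JSON grammar is concerned
line_from_windows_file = '{"id": 1, "text": "Hello"}\r'
record = json.loads(line_from_windows_file)
print(record["text"])  # Hello
```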
JSONL vs JSON: Key Differences
The most common question developers ask is: when should I use JSONL instead of JSON? The answer depends on whether you're working with a single document or a collection of records.
| Property | JSON (.json) | JSONL (.jsonl) |
|---|---|---|
| Structure | Single document | One record per line |
| Streaming | ✗ Must parse entire file | ✓ Line-by-line processing |
| Appending | ✗ Requires full rewrite | ✓ Just append a new line |
| Memory usage | Loads entire file | One line at a time |
| Best for | Config files, API responses, single documents | Training datasets, log files, event streams |
| Invalid line | Breaks entire file | Only breaks that line |
| LLM fine-tuning | ⚠ Not standard | ✓ Universal standard |
Developers often try to wrap JSONL records in a JSON array [{...}, {...}]. This produces a valid .json file — not a valid .jsonl file. The JSONL format has no outer wrapper. Every line is independent.
JSONL in AI & LLM Fine-Tuning
JSONL has become the universal standard for LLM training datasets. The reason is straightforward: training datasets can contain millions of examples and be many gigabytes in size. JSONL allows training frameworks to stream examples one at a time without loading the entire dataset into GPU memory.
Every major LLM provider and open-source training framework uses JSONL as the required input format for fine-tuning.
OpenAI Fine-Tuning Format
OpenAI's fine-tuning API requires training data in JSONL format where each line contains a messages array — the same format used by the Chat Completions API. Each message has a role (system, user, or assistant) and content.
```jsonl
// ── REQUIRED FORMAT for GPT-4o fine-tuning ───────────────────────────────
// Each line = one complete conversation example
// Minimum: 10 examples | Recommended: 50–100+ examples
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password, go to Settings → Security → Reset Password and enter your registered email."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for Acme Corp."}, {"role": "user", "content": "What is your refund policy?"}, {"role": "assistant", "content": "We offer full refunds within 30 days of purchase, no questions asked."}]}
{"messages": [{"role": "user", "content": "Can I change my shipping address after ordering?"}, {"role": "assistant", "content": "Yes, if the order hasn't shipped yet. Contact us within 2 hours of placing the order."}]}
```
The system role is optional per conversation. Each conversation must end with an assistant message. OpenAI recommends at least 50 training examples for meaningful fine-tuning results, with 100–500 being typical for good performance.
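A minimal structural check for these rules — a messages array, roles limited to system/user/assistant, and a final assistant message — can be written in a few lines. This is a sketch mirroring the constraints stated above, not OpenAI's official validator:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_openai_example(line: str) -> list[str]:
    """Return a list of problems with one fine-tuning example (empty = OK)."""
    problems = []
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            problems.append(f"invalid role: {m.get('role')!r}")
    if messages[-1].get("role") != "assistant":
        problems.append("conversation must end with an assistant message")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(check_openai_example(good))  # []
print(check_openai_example(bad))   # ['conversation must end with an assistant message']
```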
Google Vertex AI, Mistral, and Llama Formats
While all platforms use JSONL, the internal schema of each line varies by provider. Here's what each major platform expects:
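As an illustration of how the per-line schema varies: Mistral's chat fine-tuning format closely mirrors OpenAI's messages array, while Google Vertex AI (Gemini tuning) uses a contents array with role/parts, where the model's turns use the role "model" rather than "assistant". Llama models are typically tuned through open-source frameworks, which accept several JSONL schemas of their own. The field names below reflect the provider documentation as best understood at the time of writing — treat them as illustrative and check the current docs before uploading:

```python
import json

# OpenAI / Mistral style: a "messages" array with role + content per message
openai_style = {"messages": [
    {"role": "user", "content": "What is JSONL?"},
    {"role": "assistant", "content": "One JSON object per line."},
]}

# Google Vertex AI (Gemini tuning) style: "contents" with role + parts,
# and model turns use role "model" instead of "assistant"
vertex_style = {"contents": [
    {"role": "user", "parts": [{"text": "What is JSONL?"}]},
    {"role": "model", "parts": [{"text": "One JSON object per line."}]},
]}

# Either way, each training example is serialized as a single JSONL line
for record in (openai_style, vertex_style):
    print(json.dumps(record, ensure_ascii=False))
```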
Python Code: Read, Write & Validate
Python's built-in json module handles JSONL perfectly. There's no need for a third-party library for basic use — just iterate over lines and call json.loads() on each one.
Reading JSONL
```python
import json

# ── Method 1: Simple iteration ────────────────────────────────
with open("training_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # skip empty lines defensively
            continue
        record = json.loads(line)
        print(record)

# ── Method 2: Load all into a list ────────────────────────────
def load_jsonl(path: str) -> list:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = load_jsonl("training_data.jsonl")
print(f"Loaded {len(records)} training examples")

# ── Method 3: Generator for large files (memory-efficient) ────
def stream_jsonl(path: str):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in stream_jsonl("100gb_dataset.jsonl"):  # won't OOM
    process(record)  # process() = your per-record handler
```
Writing JSONL
```python
import json

# ── Write a list to JSONL ─────────────────────────────────────
def save_jsonl(records: list, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            # ensure_ascii=False preserves Unicode (emoji, CJK, etc.)

training_data = [
    {"messages": [{"role": "user", "content": "What is JSONL?"},
                  {"role": "assistant", "content": "JSON Lines — one JSON per line."}]},
    {"messages": [{"role": "user", "content": "Why use JSONL for training?"},
                  {"role": "assistant", "content": "It's streamable — process billions of rows without RAM issues."}]},
]
save_jsonl(training_data, "output.jsonl")

# ── Append to existing JSONL (no rewrite!) ────────────────────
def append_jsonl(record: dict, path: str) -> None:
    with open(path, "a", encoding="utf-8") as f:  # "a" = append mode
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
# This is one of JSONL's key advantages over JSON arrays
```
Validating JSONL
```python
import json
from dataclasses import dataclass

@dataclass
class ValidationError:
    line_num: int
    message: str

def validate_jsonl(path: str) -> list[ValidationError]:
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            stripped = line.rstrip("\n\r")
            if not stripped:
                errors.append(ValidationError(i, "Empty line — not valid in JSONL"))
                continue
            try:
                json.loads(stripped)
            except json.JSONDecodeError as e:
                errors.append(ValidationError(i, str(e)))
    return errors

# Usage
errors = validate_jsonl("training_data.jsonl")
if not errors:
    print("✓ All lines valid")
else:
    for e in errors:
        print(f"✗ Line {e.line_num}: {e.message}")
```
Convert JSON Array → JSONL
```python
import json

def json_array_to_jsonl(input_path: str, output_path: str) -> int:
    """Convert a JSON array file to JSONL. Returns record count."""
    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Input JSON must be an array at the top level")
    with open(output_path, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(data)

count = json_array_to_jsonl("data.json", "data.jsonl")
print(f"Converted {count} records to JSONL")
```
Valid vs Invalid JSONL Examples
Understanding what makes JSONL invalid helps you avoid the most common errors. Here are side-by-side examples:
✓ Valid — one JSON object per line:

```jsonl
{"id": 1, "text": "Hello"}
{"id": 2, "text": "World"}
{"id": 3, "text": "Foo"}
```

✗ Invalid — a JSON array, not JSONL (no outer wrapper, no commas):

```json
[
{"id": 1, "text": "Hello"},
{"id": 2, "text": "World"}
]
```

✓ Valid — any JSON value is permitted per line (though objects are the norm):

```jsonl
{"name": "Alice", "score": 95}
null
[1, 2, 3]
"a plain string"
42
```

✗ Invalid — blank line between records:

```jsonl
{"name": "Alice"}

{"name": "Bob"}
```

✓ Valid — with the recommended final newline:

```jsonl
{"a": 1}
{"b": 2}
{"c": 3}
```

(the file ends with a trailing \n)

✗ Invalid — commas after each record:

```jsonl
{"a": 1},
{"b": 2},
{"c": 3}
```

Common Mistakes to Avoid
- Wrapping in a JSON array — JSONL has no outer `[...]` wrapper. If you need a JSON array, use `.json`, not `.jsonl`.
- Including blank lines — Per the spec, blank lines are not valid JSON values. Many parsers will throw an error or silently skip them. Strip them before uploading to a training API.
- Using commas between records — Each line is a separate JSON value. Adding a trailing comma (`,`) after each line makes it invalid JSON on that line.
- Wrong encoding — JSONL must be UTF-8. If you export from Excel or a legacy system, check the encoding with `file -i yourfile.jsonl` on Linux/macOS before uploading.
- Multi-line JSON objects — Each JSON object must be on a single line. Pretty-printed JSON (with newlines and indentation) inside the record will break JSONL parsers.
- Missing the final newline — Technically optional, but strongly recommended. Some tools misbehave when the last line lacks a trailing `\n`.
- Using `json.dump()` instead of `json.dumps()` — When writing in Python, use `json.dumps(record) + "\n"`, not `json.dump(record, f)` (which doesn't add the newline).
The #1 mistake when uploading training data to OpenAI is sending a JSON array instead of JSONL. The API will reject the file with "Invalid file format." Verify each line is an independent JSON object before uploading.
Best Practices
- Always validate before uploading — Run a validator (like the one on this site, or the Python code above) on your JSONL file before sending it to a fine-tuning API. A single bad line can get the entire file rejected.
- Use `ensure_ascii=False` in Python — This preserves Unicode characters (emoji, CJK, Arabic, etc.) in their native form instead of escaping them as `\uXXXX`, which wastes tokens and makes data harder to inspect.
- Compress large files — For datasets over 100 MB, use `gzip` compression (`.jsonl.gz`). Most training frameworks support compressed JSONL natively, and it reduces file size by 5–10× for text data.
- Shuffle before fine-tuning — Randomly shuffle your JSONL records before training. Ordered data (all customer service questions first, then all coding questions) can cause catastrophic forgetting.
- Split train/validation — Keep a 90/10 or 95/5 train/validation split in separate JSONL files. OpenAI's fine-tuning API accepts a separate validation file to report eval metrics.
- Stream for large datasets — Use a generator function (see the Python examples above) rather than loading everything into memory. A 10 GB JSONL file can be processed with constant ~1 MB RAM usage.
- Keep records atomic — Each JSONL line should be a self-contained, meaningful unit of training data. Avoid records that only make sense in the context of adjacent records.
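The shuffle and train/validation-split steps above can be sketched together (file paths and the 90/10 ratio are placeholders):

```python
import json
import random

def shuffle_and_split(path: str, train_path: str, val_path: str,
                      val_fraction: float = 0.1, seed: int = 42) -> tuple[int, int]:
    """Shuffle a JSONL dataset and write a train/validation split.

    Returns (train_count, val_count). Loads the whole file into memory,
    so for very large datasets shuffle with an external tool instead.
    """
    with open(path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    random.Random(seed).shuffle(records)  # seeded so the split is reproducible
    n_val = max(1, int(len(records) * val_fraction))
    val, train = records[:n_val], records[n_val:]

    for out_path, subset in ((train_path, train), (val_path, val)):
        with open(out_path, "w", encoding="utf-8") as f:
            for record in subset:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(train), len(val)
```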
Frequently Asked Questions
Is JSONL the same as ndjson?
Yes, essentially. ndjson (newline-delimited JSON) and JSONL (JSON Lines) describe the same format. The term "JSONL" and the .jsonl extension are more common in the AI/ML community; "ndjson" is more common in the data engineering community. Both follow the same rules.
Can a JSONL file contain arrays or just objects?
Per the official specification, any valid JSON value is permitted on each line — objects, arrays, strings, numbers, booleans, and null are all allowed. However, in practice and especially for LLM fine-tuning, every line should be a JSON object ({}). Arrays, strings, and primitives as top-level values are unusual and may not be accepted by training APIs.
How do I handle JSONL with Python without a library?
Python's built-in json module is all you need. Open the file, iterate over lines, strip whitespace, and call json.loads() on each line. See the Python code section above for complete examples.
What's the maximum file size for OpenAI fine-tuning?
OpenAI's fine-tuning API accepts JSONL files up to 1 GB. For very large datasets, you can provide multiple files or use dataset mixing. The minimum is 10 training examples; the recommended minimum for useful results is 50–100 examples.
Can I include comments in JSONL?
No. JSON does not support comments, and neither does JSONL. If you see // comment style lines in examples online, those are documentation annotations — they would make the JSONL file invalid. Use a separate metadata file or README to document your dataset.
Is JSONL an official standard?
JSONL is a de-facto standard maintained at jsonlines.org. It is not yet an IETF RFC or ECMA standard, though the maintainers are working toward a formal RFC for the MIME type registration. For practical purposes, jsonlines.org is the authoritative specification.