
Multi-Format Annotation: An In-Depth Guide to YOLO/VOC/COCO Formats

TjMakeBot Team · Technical Tutorial · 11 min

Introduction: Why Formats Matter

The annotation format is the "data interface protocol" of any AI project. With the same set of images and bounding boxes, a format mismatch typically doesn't just "slightly degrade training performance" — it causes outright failures:

  • Training scripts can't read the data: Path structure, field names, or category mappings don't match the framework's conventions.
  • Coordinate semantics are misinterpreted: xywh vs. xyxy, normalized vs. pixel, top-left vs. center — at best the boxes shift; at worst everything is wrong.
  • Category IDs are completely misaligned: For example, treating category_id=1 as class 1 when your training config starts from 0.
  • Maintenance becomes painful: When handing off across teams or tools, missing metadata (category tables, image dimensions, segmentation info) leads to repeated rework.

This article focuses on object detection (bounding box) as the main thread, and dives deep into three mainstream formats: YOLO, VOC (Pascal VOC), and COCO. You will learn:

  • What each format actually expresses (coordinate systems, file organization, metadata)
  • The conventions training frameworks care about most (category mapping, dataset splits, empty annotations)
  • Key details for format conversion (formulas, boundary clamping, mapping tables)

YOLO Format

Format Characteristics

The most common "YOLO annotation format" used by the YOLO series (especially YOLOv5/v8/v9/v10) is essentially: one .txt file per image, one object box per line. It prioritizes simplicity and high throughput, so metadata is typically placed in a separate configuration file (e.g., data.yaml).

File Structure:

dataset/
├── images/
│   ├── image001.jpg
│   └── image002.jpg
└── labels/
    ├── image001.txt
    └── image002.txt

A more training-friendly organization usually includes train/val/test (or train/valid/test):

dataset/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
├── labels/
│   ├── train/
│   ├── val/
│   └── test/
└── data.yaml

Annotation File Format (image001.txt):

class_id center_x center_y width height
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2

Format Description:

  • class_id: Category ID (typically starting from 0, strictly corresponding one-to-one with the category table)
  • center_x, center_y: Bounding box center coordinates, normalized to [0,1] relative to image width and height
  • width, height: Bounding box width and height, normalized to [0,1] relative to image width and height

Coordinate System (The Most Common Pitfall)

YOLO's xywh uses "center point + width/height" and is typically normalized:

  • Converting from YOLO (normalized) to pixel xyxy (W, H are the image width and height in pixels):

    • x_center = center_x * W, y_center = center_y * H
    • w = width * W, h = height * H
    • x_min = x_center - w/2, y_min = y_center - h/2
    • x_max = x_center + w/2, y_max = y_center + h/2
  • Converting from pixel xyxy to YOLO (normalized):

    • center_x = (x_min + x_max) / (2 * W), center_y = (y_min + y_max) / (2 * H)
    • width = (x_max - x_min) / W, height = (y_max - y_min) / H

Two things to do when converting/exporting:

  • Clamp: Ensure results fall within the valid range of [0,1] (normalized) or [0,W]/[0,H] (pixels) to avoid errors or NaN during training.
  • Preserve sufficient decimal places: Generally 6 decimal places is enough; excessive rounding can cause small objects to "jitter."
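The conversion formulas above, together with the clamping advice, fit in a few lines. This is a minimal sketch; the function name yolo_to_xyxy is our own, not from any framework:

```python
def yolo_to_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert one normalized YOLO center-format box to pixel xyxy, clamped."""
    xmin = (cx - w / 2.0) * img_w
    ymin = (cy - h / 2.0) * img_h
    xmax = (cx + w / 2.0) * img_w
    ymax = (cy + h / 2.0) * img_h
    # Clamp so downstream tools never see negative or out-of-image coordinates.
    xmin = max(0.0, min(xmin, img_w))
    ymin = max(0.0, min(ymin, img_h))
    xmax = max(0.0, min(xmax, img_w))
    ymax = max(0.0, min(ymax, img_h))
    return xmin, ymin, xmax, ymax
```

For a 640x480 image, `yolo_to_xyxy(0.5, 0.5, 0.25, 0.5, 640, 480)` yields `(240.0, 120.0, 400.0, 360.0)`.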

Key Features:

  • Uses normalized coordinates (0-1)
  • Simple format, small file size
  • Ideal for YOLO series models

Category Table and data.yaml

YOLO training typically also requires a data configuration file (using the Ultralytics/YOLOv5 ecosystem as an example):

# data.yaml
path: /abs/path/to/dataset
train: images/train
val: images/val
test: images/test

names:
  0: car
  1: person

The names order/ID must exactly match your class_id. If you change the category order between different datasets or exports, the model will "learn the wrong labels" without throwing an error — this is one of the most insidious and costly pitfalls.

Empty Annotations and Missing Files

  • Images with no annotations: It's recommended to keep the corresponding empty .txt file (file exists but is empty), or handle it per the training framework's requirements; don't "just leave the file missing," as the data loading logic may treat it as an exception.
  • Image and label share the same name: image001.jpg corresponds to image001.txt. Changing the extension (jpg/png) usually doesn't matter, but the filename (without extension) must match.
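A quick pairing check catches mismatched stems before training ever starts. find_unpaired is a hypothetical helper that works on filename stems with extensions already stripped:

```python
def find_unpaired(image_stems, label_stems):
    """Return (image stems missing a label file, label stems missing an image)."""
    images, labels = set(image_stems), set(label_stems)
    return sorted(images - labels), sorted(labels - images)
```

Under the "keep empty .txt files" convention, both returned lists should be empty for a healthy dataset.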

Use Cases

Recommended for:

  • YOLOv5/v8/v9/v10 training
  • Object detection projects
  • Projects requiring fast training

Not recommended for:

  • Projects requiring pixel-level precision
  • Projects requiring detailed metadata

Additional note: The YOLO format doesn't "not support" more complex tasks. For example, YOLOv8 appends polygon points/keypoint fields to the label line for segmentation/keypoint tasks. However, once you enter these extended formats, convention differences across tools and training code become significantly larger — it's recommended to use a unified toolchain for export and training.

VOC Format

Format Characteristics

VOC (Pascal VOC) uses XML to describe annotation information for a single image. It offers strong readability and explicit fields, making it suitable for "manual inspection" and traditional detection frameworks (such as early Faster R-CNN toolchains).

File Structure:

dataset/
├── images/
│   ├── image001.jpg
│   └── image002.jpg
└── annotations/
    ├── image001.xml
    └── image002.xml

The classic VOC dataset directory is typically more "standardized":

VOCdevkit/
└── VOC2007/
    ├── JPEGImages/
    ├── Annotations/
    └── ImageSets/
        └── Main/
            ├── train.txt
            ├── val.txt
            └── test.txt

Here, train.txt/val.txt/test.txt contain image IDs (without extensions), used to explicitly define dataset splits.

Annotation File Format (image001.xml):

<annotation>
    <filename>image001.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>car</name>
        <bndbox>
            <xmin>100</xmin>
            <ymin>50</ymin>
            <xmax>300</xmax>
            <ymax>200</ymax>
        </bndbox>
    </object>
</annotation>

Format Description:

  • Absolute pixel coordinates: Typically with the image's top-left corner as the origin (0,0), x-axis pointing right, y-axis pointing down.
  • Bounding box semantics are xyxy: xmin,ymin,xmax,ymax (top-left and bottom-right corners).
  • Includes image dimensions: size/width,height,depth for validating coordinate correctness and visualization.
  • Extensible metadata: Fields like pose, truncated, difficult are still read by many VOC toolchains.

Key Field Reference (Commonly Used but Often Overlooked)

A more complete object typically looks like this:

<object>
  <name>car</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
    <xmin>100</xmin>
    <ymin>50</ymin>
    <xmax>300</xmax>
    <ymax>200</ymax>
  </bndbox>
</object>
  • truncated: Whether the object is truncated (e.g., cut off by the image boundary).
  • difficult: Hard sample flag. Some evaluation/training pipelines choose to ignore objects with difficult=1.

The "Closed Interval" Controversy of VOC Coordinates

Different tools interpret whether xmax/ymax "includes the pixel" differently (a historical legacy). The safest practice is:

  • Ensure xmax > xmin and ymax > ymin during conversion
  • Avoid generating out-of-bounds coordinates (less than 0 or greater than width/height)
  • Confirm in visual spot-checks whether boxes are offset by 1 pixel (if so, fix uniformly at the export stage)
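These three checks can be automated in a lightweight linter. check_voc_box is our own name, and the rules are a sketch to adapt to your pipeline:

```python
def check_voc_box(xmin, ymin, xmax, ymax, img_w, img_h):
    """Return a list of problems found for one VOC bndbox; empty list = OK."""
    issues = []
    if not (xmax > xmin and ymax > ymin):
        issues.append("degenerate box (xmax <= xmin or ymax <= ymin)")
    if xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
        issues.append("out-of-bounds coordinates")
    return issues
```

Running this over every object before export turns the "1-pixel offset" debugging session into a pre-flight report.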

Key Features:

  • Uses absolute coordinates
  • Includes detailed metadata
  • Ideal for Faster R-CNN and similar models

Use Cases

Recommended for:

  • Faster R-CNN training
  • Projects requiring detailed metadata
  • Projects requiring good readability

Not recommended for:

  • YOLO training (requires conversion)
  • Projects requiring fast training

Additional note: If your project requires frequent cross-tool collaboration (annotation -> QA -> training -> feedback), VOC's "human-readable XML" advantage is very significant. However, when data volume is large, the read/write performance and management cost of massive XML files will gradually become apparent.

COCO Format

Format Characteristics

COCO is a "data-engineering-friendly" annotation format: it uses one (or a few) JSON files to describe all images, categories, and annotations. It's naturally suited for statistics, querying, filtering, merging, and version management. It supports not only detection boxes but also natively supports instance segmentation and keypoints.

File Structure:

dataset/
├── images/
│   ├── image001.jpg
│   └── image002.jpg
└── annotations/
    └── instances_train.json

Annotation File Format (instances_train.json):

{
    "images": [
        {
            "id": 1,
            "file_name": "image001.jpg",
            "width": 640,
            "height": 480
        }
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100, 50, 200, 150],
            "area": 30000,
            "iscrowd": 0
        }
    ],
    "categories": [
        {
            "id": 1,
            "name": "car"
        }
    ]
}

Format Description:

  • bbox semantics are xywh: bbox=[x, y, w, h], where x,y is the top-left pixel coordinate, not the center point.
  • image_id: The image this annotation belongs to (linked to images[].id).
  • category_id: The category this annotation belongs to (linked to categories[].id). Note: COCO's category_id often doesn't start from 0 and doesn't need to be consecutive.
  • area: Object area, commonly w*h (detection box) or polygon area (segmentation).
  • iscrowd: Whether it's a "crowd/dense object," related to segmentation's RLE/polygon representation.

segmentation / keypoints (Why COCO Is More "Versatile")

Even if you're currently only doing detection, it's worth understanding COCO's two power fields, since many public datasets and training frameworks rely on them:

  • segmentation: Instance segmentation masks
    • When iscrowd=0, commonly uses polygon arrays: [[x1,y1,x2,y2,...], ...]
    • When iscrowd=1, commonly uses RLE (Run-Length Encoding) compressed masks
  • keypoints: Keypoint tasks (e.g., human pose), typically an array of length 3K of (x, y, v) triples, where v indicates visibility
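For polygon-style segmentation (iscrowd=0), the area field can be cross-checked with the shoelace formula. polygon_area below is a plain-Python sketch of the computation that libraries such as pycocotools normally do for you:

```python
def polygon_area(poly):
    """Shoelace area of one COCO-style polygon [x1, y1, x2, y2, ...]."""
    xs, ys = poly[0::2], poly[1::2]
    total = 0.0
    n = len(xs)
    for i in range(n):
        j = (i + 1) % n  # wrap around to close the polygon
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(total) / 2.0
```

A unit square `[0, 0, 1, 0, 1, 1, 0, 1]` gives an area of 1.0, as expected.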

COCO's "Global Consistency" Constraints (Critical for Engineering)

COCO JSON typically needs to satisfy:

  • images[].id must be unique, annotations[].id must be unique
  • annotations[].image_id must be findable in images
  • annotations[].category_id must be findable in categories

When you "merge multiple datasets" or "incrementally append annotations," ID conflicts are the most common issue. Best practice: rebase id values during merging and establish a stable category mapping table.

Key Features:

  • Uses absolute coordinates
  • Structured data
  • Suitable for multiple model types

Use Cases

Recommended for:

  • COCO dataset training
  • Projects requiring structured data
  • Projects requiring rich metadata

Not recommended for:

  • YOLO training (requires conversion)
  • Projects requiring fast training

Additional note: COCO isn't "slow" — what's slow is parsing a massive JSON file in full at every epoch. In practice, engineering teams typically cache, index, or convert to faster internal formats, but COCO remains the go-to for "external delivery/standard alignment."

Format Comparison

| Feature               | YOLO             | VOC            | COCO            |
|-----------------------|------------------|----------------|-----------------|
| Coordinate System     | Normalized (0-1) | Absolute Pixel | Absolute Pixel  |
| File Format           | TXT              | XML            | JSON            |
| File Size             | Small            | Medium         | Large           |
| Readability           | Low              | High           | Medium          |
| Metadata              | Minimal          | Moderate       | Rich            |
| Target Models         | YOLO Series      | Faster R-CNN   | Multiple Models |
| Conversion Difficulty | Easy             | Medium         | Hard            |

Additional dimensions (closer to real project selection):

| Dimension                    | YOLO                                             | VOC                                   | COCO                                  |
|------------------------------|--------------------------------------------------|---------------------------------------|---------------------------------------|
| Natively Supported Tasks     | Detection (extensible to segmentation/keypoints) | Primarily detection                   | Detection + Segmentation + Keypoints  |
| Category System              | Relies on external names mapping                 | <name> writes class name directly     | category_id requires lookup           |
| Dataset Splitting            | Relies on directories or list files              | Traditionally uses ImageSets/Main/*.txt | Commonly multiple JSONs (train/val) |
| Merging/Filtering/Statistics | Requires traversing many small files             | Requires traversing many XMLs         | Most data-engineering-friendly        |
| Manual Inspection            | Average                                          | Very convenient                       | Requires tools or scripts             |

Format Selection Guide

You can use this "decision framework" for quick selection:

  • Training YOLO ecosystem (Ultralytics/YOLOv5 family) for detection only -> Prefer YOLO
  • Doing instance segmentation/keypoints, or aligning data with mainstream public datasets -> Prefer COCO
  • Relying on human readability, need to attach business fields in XML, or using traditional detection toolchains -> VOC still works well

Choosing YOLO Format

Suitable scenarios:

  • Using YOLO series models
  • Need fast training
  • File size sensitive

Advantages:

  • Simple format
  • Small files
  • Fast training

Considerations:

  • Ensure names and class_id always match (strongly recommend version control)
  • Update labels in sync after data augmentation/cropping (especially mosaic, random crop)

Choosing VOC Format

Suitable scenarios:

  • Using Faster R-CNN
  • Need detailed metadata
  • Need readability

Advantages:

  • Good readability
  • Rich metadata
  • Good compatibility

Considerations:

  • Maintain xmin<xmax, ymin<ymax, and ensure no out-of-bounds values
  • Confirm xmax/ymax boundary semantics across tools to avoid 1-pixel offsets

Choosing COCO Format

Suitable scenarios:

  • Using COCO dataset
  • Need structured data
  • Need rich metadata

Advantages:

  • Structured data
  • Rich metadata
  • Compatible with multiple models

Considerations:

  • Category id may not be consecutive; training typically requires building a category_id -> 0..N-1 mapping
  • Merging datasets requires handling images.id/annotations.id conflicts

Format Conversion

Three "Must-Dos" Before Converting

No matter which direction you're converting between YOLO/VOC/COCO, lock down these three points first for stability:

  1. Category Mapping Table:
    • VOC uses class names, COCO uses category_id, YOLO uses class_id.
    • Recommended: maintain a unified class_name -> yolo_id mapping, then derive VOC/COCO mappings from it.
  2. Coordinate Convention:
    • YOLO: Normalized center-point xywh
    • VOC: Pixel xyxy
    • COCO: Pixel xywh (top-left corner + width/height)
  3. Boundary and Empty Annotation Strategy:
    • Uniformly clamp out-of-bounds boxes, filter boxes with too-small area (e.g., w < 1 or h < 1)
    • Whether to keep empty annotation samples (recommended: yes)
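Point 1 can be made concrete by deriving every per-format mapping from one canonical ordered class list. build_mappings is a hypothetical helper; the COCO side reads the standard categories list:

```python
def build_mappings(class_names, coco_categories):
    """Derive YOLO and COCO mappings from one canonical ordered class list."""
    # The list order defines the YOLO class_id once and for all.
    name_to_yolo = {name: i for i, name in enumerate(class_names)}
    # COCO category_id values may start at 1 and need not be consecutive.
    coco_id_to_yolo = {c["id"]: name_to_yolo[c["name"]] for c in coco_categories}
    return name_to_yolo, coco_id_to_yolo
```

Version-control the class list itself; every export then derives its mapping from the same source of truth.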

YOLO -> VOC

Conversion Steps:

  1. Read the YOLO format file
  2. Get image dimensions
  3. Calculate absolute coordinates
  4. Generate XML file

Code Example:

def yolo_to_voc(yolo_file, img_width, img_height):
    # Read YOLO format: one "class_id cx cy w h" line per object
    with open(yolo_file, 'r') as f:
        lines = [line.strip() for line in f if line.strip()]

    # Convert to VOC pixel xyxy, clamping to the image bounds
    annotations = []
    for line in lines:
        parts = line.split()
        class_id = int(parts[0])  # class_id is an integer index, not a float
        cx, cy, w, h = map(float, parts[1:5])
        xmin = max(0, int(round((cx - w / 2) * img_width)))
        ymin = max(0, int(round((cy - h / 2) * img_height)))
        xmax = min(img_width, int(round((cx + w / 2) * img_width)))
        ymax = min(img_height, int(round((cy + h / 2) * img_height)))
        annotations.append((class_id, xmin, ymin, xmax, ymax))

    return annotations

VOC -> YOLO

Conversion Steps:

  1. Read the XML file
  2. Get image dimensions
  3. Calculate normalized coordinates
  4. Generate TXT file

Code Example (minimal working version demonstrating core formulas; you'll need to add your own class_name -> class_id mapping):

import xml.etree.ElementTree as ET

def voc_to_yolo(xml_file, class_name_to_id):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    yolo_lines = []
    for obj in root.findall("object"):
        name = obj.find("name").text.strip()
        if name not in class_name_to_id:
            continue

        bnd = obj.find("bndbox")
        xmin = float(bnd.find("xmin").text)
        ymin = float(bnd.find("ymin").text)
        xmax = float(bnd.find("xmax").text)
        ymax = float(bnd.find("ymax").text)

        # clamp & sanity
        xmin = max(0.0, min(xmin, img_w))
        xmax = max(0.0, min(xmax, img_w))
        ymin = max(0.0, min(ymin, img_h))
        ymax = max(0.0, min(ymax, img_h))
        if xmax <= xmin or ymax <= ymin:
            continue

        cx = (xmin + xmax) / 2.0 / img_w
        cy = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h

        class_id = class_name_to_id[name]
        yolo_lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")

    return yolo_lines

COCO -> YOLO

Conversion Steps:

  1. Read the JSON file
  2. Get image dimensions
  3. Calculate normalized coordinates
  4. Generate TXT files

Code Example (demonstrating COCO bbox xywh -> YOLO normalized center-point xywh):

import json
from collections import defaultdict

def coco_to_yolo(coco_json_path, category_id_to_yolo_id):
    with open(coco_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    images = {img["id"]: img for img in data.get("images", [])}
    ann_by_image = defaultdict(list)
    for ann in data.get("annotations", []):
        ann_by_image[ann["image_id"]].append(ann)

    # Returns: file_name -> yolo_lines
    result = {}
    for image_id, img in images.items():
        W, H = float(img["width"]), float(img["height"])
        file_name = img["file_name"]
        lines = []

        for ann in ann_by_image.get(image_id, []):
            cid = ann["category_id"]
            if cid not in category_id_to_yolo_id:
                continue
            x, y, w, h = map(float, ann["bbox"])  # COCO: top-left xywh
            if w <= 0 or h <= 0:
                continue

            cx = (x + w / 2.0) / W
            cy = (y + h / 2.0) / H
            nw = w / W
            nh = h / H

            # clamp
            cx = max(0.0, min(1.0, cx))
            cy = max(0.0, min(1.0, cy))
            nw = max(0.0, min(1.0, nw))
            nh = max(0.0, min(1.0, nh))
            if nw == 0 or nh == 0:
                continue

            yolo_id = category_id_to_yolo_id[cid]
            lines.append(f"{yolo_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")

        result[file_name] = lines

    return result

Recommended: Post-Conversion Validation

After format conversion, always run a lightweight validation process (it can save a huge amount of training time):

  • Random visual spot-check: Sample at least 50 images to verify boxes aren't systematically shifted or abnormally sized.
  • Statistical check: Verify that per-class object counts, per-image object counts, and box area distributions match expectations.
  • Empty annotation ratio: A sudden spike in empty annotations usually means something went wrong with matching/mapping (e.g., filename mismatch or broken ID links).
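The statistical and empty-annotation checks can run directly on parsed YOLO labels. label_stats is a sketch that takes a mapping of file name to label lines rather than touching the filesystem:

```python
from collections import Counter

def label_stats(labels):
    """labels: file_name -> list of 'class_id cx cy w h' lines.
    Returns (per-class object counts, fraction of empty annotation files)."""
    per_class = Counter()
    empty = 0
    for lines in labels.values():
        if not lines:
            empty += 1
        for line in lines:
            per_class[int(line.split()[0])] += 1
    empty_ratio = empty / max(len(labels), 1)
    return per_class, empty_ratio
```

Compare these numbers against the pre-conversion dataset; a class that drops to zero or an empty ratio that jumps usually means a broken mapping.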

Using TjMakeBot for Multi-Format Annotation

TjMakeBot's Advantages:

  1. Multi-Format Support

    • YOLO format
    • VOC format
    • COCO format
    • CSV format
  2. Format Conversion

    • Cross-format conversion
    • One-click export to multiple formats
    • Compatible with mainstream training frameworks

    What you really need to focus on: Whether the category mapping and coordinate conventions are "locked down" during export. A good tool should be able to:

    • Solidify the category table (carry/generate names, categories, etc. during export)
    • Unify coordinate conventions (clearly YOLO normalized center-point, COCO top-left xywh)
    • Handle out-of-bounds/abnormal boxes in a controlled way (clamp, filter, or alert)
  3. Batch Processing

    • Batch export
    • Batch conversion
    • Improved efficiency
  4. Free (Basic Features)

    • No usage limits
    • No feature restrictions
    • Low barrier to entry

Start Using TjMakeBot for Multi-Format Annotation for Free ->

Conclusion

Choosing the right annotation format is fundamental to AI project success. YOLO, VOC, and COCO each have their strengths — choose the format that best fits your project requirements.

Remember:

  • YOLO format is ideal for YOLO series models
  • VOC format is ideal for Faster R-CNN
  • COCO format is suitable for multiple models
  • Format conversion is always feasible

Additional advice (from real project lessons learned):

  • Manage your "category table" like code: versioned, traceable, and rollback-capable.
  • Don't wait until training to discover problems: make visual spot-checks and statistical checks a standard post-export step.
  • Unify coordinate conventions across your team: document clearly in your project README or data specification (xywh/xyxy, normalized or not, boundary-inclusive or not).

Choose TjMakeBot for multi-format annotation and conversion!


Legal Disclaimer: The content of this article is for reference only and does not constitute any legal, commercial, or technical advice. When using any tools or methods, please comply with applicable laws and regulations, respect intellectual property rights, and obtain necessary authorizations. All company names, product names, and trademarks mentioned in this article are the property of their respective owners.

About the Author: The TjMakeBot team focuses on AI data annotation tool development, committed to supporting multiple annotation formats to meet diverse project needs.

Keywords: YOLO format, VOC format, COCO format, annotation format, format conversion, multi-format annotation, TjMakeBot