Introduction: Why Formats Matter
The annotation format is the "data interface protocol" of any AI project. With the same set of images and bounding boxes, a format mismatch typically doesn't just "slightly degrade training performance" — it causes outright failures:
- Training scripts can't read the data: Path structure, field names, or category mappings don't match the framework's conventions.
- Coordinate semantics are misinterpreted: (xywh) vs (xyxy), normalized vs pixel, top-left vs center — at best the boxes shift; at worst everything is wrong.
- Category IDs are completely misaligned: For example, treating `category_id=1` as class 1 when your training config starts from 0.
- Maintenance becomes painful: When handing off across teams or tools, missing metadata (category tables, image dimensions, segmentation info) leads to repeated rework.
This article focuses on object detection (bounding box) as the main thread, and dives deep into three mainstream formats: YOLO, VOC (Pascal VOC), and COCO. You will learn:
- What each format actually expresses (coordinate systems, file organization, metadata)
- The conventions training frameworks care about most (category mapping, dataset splits, empty annotations)
- Key details for format conversion (formulas, boundary clamping, mapping tables)
YOLO Format
Format Characteristics
The most common "YOLO annotation format" used by the YOLO series (especially YOLOv5/v8/v9/v10) is essentially: one .txt file per image, one object box per line. It prioritizes simplicity and high throughput, so metadata is typically placed in a separate configuration file (e.g., data.yaml).
File Structure:
dataset/
├── images/
│ ├── image001.jpg
│ └── image002.jpg
└── labels/
├── image001.txt
└── image002.txt
A more training-friendly organization usually includes train/val/test (or train/valid/test):
dataset/
├── images/
│ ├── train/
│ ├── val/
│ └── test/
├── labels/
│ ├── train/
│ ├── val/
│ └── test/
└── data.yaml
Annotation File Format (image001.txt):
class_id center_x center_y width height
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
Format Description:
- `class_id`: Category ID (typically starting from 0, strictly corresponding one-to-one with the category table)
- `center_x`, `center_y`: Bounding box center coordinates, normalized to ([0,1]) relative to image width and height
- `width`, `height`: Bounding box width and height, normalized to ([0,1]) relative to image width and height
Coordinate System (The Most Common Pitfall)
YOLO's xywh uses "center point + width/height" and is typically normalized:
- Converting from YOLO (normalized) to pixel (xyxy):
  - (x_{center}=center_x \times W), (y_{center}=center_y \times H)
  - (w=width \times W), (h=height \times H)
  - (x_{min}=x_{center}-\frac{w}{2}), (y_{min}=y_{center}-\frac{h}{2})
  - (x_{max}=x_{center}+\frac{w}{2}), (y_{max}=y_{center}+\frac{h}{2})
- Converting from pixel (xyxy) to YOLO (normalized):
  - (x_{center}=\frac{x_{min}+x_{max}}{2W}), (y_{center}=\frac{y_{min}+y_{max}}{2H})
  - (width=\frac{x_{max}-x_{min}}{W}), (height=\frac{y_{max}-y_{min}}{H})
Two things to do when converting/exporting:
- Clamp: Ensure results fall within the valid range of ([0,1]) or ([0,W/H]) to avoid errors or NaN during training.
- Preserve sufficient decimal places: Generally 6 decimal places is enough; excessive rounding can cause small objects to "jitter."
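The conversion formulas above can be sketched as a pair of helper functions (a minimal sketch; the function names are illustrative, not from any particular library):

```python
def yolo_to_xyxy(cx, cy, w, h, img_w, img_h):
    """YOLO normalized center-point xywh -> pixel xyxy corners."""
    bw, bh = w * img_w, h * img_h          # box size in pixels
    xc, yc = cx * img_w, cy * img_h        # center in pixels
    return (xc - bw / 2, yc - bh / 2, xc + bw / 2, yc + bh / 2)

def xyxy_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Pixel xyxy corners -> YOLO normalized center-point xywh."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    return (cx, cy, (xmax - xmin) / img_w, (ymax - ymin) / img_h)
```

Running the example box from `image001.txt` (`0 0.5 0.5 0.3 0.4`) through `yolo_to_xyxy` with a 640x480 image gives pixel corners (224, 144) and (416, 336); converting back reproduces the original normalized values.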
Key Features:
- Uses normalized coordinates (0-1)
- Simple format, small file size
- Ideal for YOLO series models
Category Table and data.yaml
YOLO training typically also requires a data configuration file (using the Ultralytics/YOLOv5 ecosystem as an example):
# data.yaml
path: /abs/path/to/dataset
train: images/train
val: images/val
test: images/test
names:
0: car
1: person
The names order/ID must exactly match your class_id. If you change the category order between different datasets or exports, the model will "learn the wrong labels" without throwing an error — this is one of the most insidious and costly pitfalls.
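A cheap guard against this pitfall is to verify, before training, that every `class_id` appearing in the label files exists in the `names` table. A minimal sketch (the function name and directory layout are assumptions):

```python
from pathlib import Path

def check_class_ids(labels_dir, names):
    """Return (filename, class_id) pairs whose class_id is missing from names.

    names: a dict {id: name}, mirroring the names section of data.yaml.
    An empty return list means every label's class_id is covered.
    """
    valid = set(names)
    bad = []
    for txt in Path(labels_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            if not line.strip():
                continue  # empty lines are fine (image with no objects)
            class_id = int(line.split()[0])
            if class_id not in valid:
                bad.append((txt.name, class_id))
    return bad
```

This does not catch the worst case (IDs that exist but point to the wrong class after a reorder), which is why versioning the category table itself is still essential.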
Empty Annotations and Missing Files
- Images with no annotations: It's recommended to keep the corresponding empty `.txt` file (file exists but is empty), or handle it per the training framework's requirements; don't just leave the file missing, as the data loading logic may treat it as an exception.
- Image and label share the same name: `image001.jpg` corresponds to `image001.txt`. Changing the image extension (jpg/png) usually doesn't matter, but the filename (without extension) must match.
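Backfilling the empty label files can be automated. A minimal sketch (function name and directory layout are assumptions; adjust extensions to your dataset):

```python
from pathlib import Path

def ensure_label_files(images_dir, labels_dir, exts=(".jpg", ".png")):
    """Create an empty .txt for every image that has no label file yet."""
    labels = Path(labels_dir)
    labels.mkdir(parents=True, exist_ok=True)
    created = []
    for img in sorted(Path(images_dir).iterdir()):
        if img.suffix.lower() not in exts:
            continue
        txt = labels / (img.stem + ".txt")  # same stem, .txt extension
        if not txt.exists():
            txt.touch()  # empty file = "image has no objects"
            created.append(txt.name)
    return created
```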
Use Cases
Recommended for:
- YOLOv5/v8/v9/v10 training
- Object detection projects
- Projects requiring fast training
Not recommended for:
- Projects requiring pixel-level precision
- Projects requiring detailed metadata
Additional note: The YOLO format doesn't "not support" more complex tasks. For example, YOLOv8 appends polygon points/keypoint fields to the label line for segmentation/keypoint tasks. However, once you enter these extended formats, convention differences across tools and training code become significantly larger — it's recommended to use a unified toolchain for export and training.
VOC Format
Format Characteristics
VOC (Pascal VOC) uses XML to describe annotation information for a single image. It offers strong readability and explicit fields, making it suitable for "manual inspection" and traditional detection frameworks (such as early Faster R-CNN toolchains).
File Structure:
dataset/
├── images/
│ ├── image001.jpg
│ └── image002.jpg
└── annotations/
├── image001.xml
└── image002.xml
The classic VOC dataset directory is typically more "standardized":
VOCdevkit/
└── VOC2007/
├── JPEGImages/
├── Annotations/
└── ImageSets/
└── Main/
├── train.txt
├── val.txt
└── test.txt
Here, train.txt/val.txt/test.txt contain image IDs (without extensions), used to explicitly define dataset splits.
Annotation File Format (image001.xml):
<annotation>
<filename>image001.jpg</filename>
<size>
<width>640</width>
<height>480</height>
<depth>3</depth>
</size>
<object>
<name>car</name>
<bndbox>
<xmin>100</xmin>
<ymin>50</ymin>
<xmax>300</xmax>
<ymax>200</ymax>
</bndbox>
</object>
</annotation>
Format Description:
- Absolute pixel coordinates: Typically with the image's top-left corner as the origin ((0,0)), x-axis pointing right, y-axis pointing down.
- Bounding box semantics are (xyxy): `xmin, ymin, xmax, ymax` (top-left and bottom-right corners).
- Includes image dimensions: `size/width`, `height`, `depth` for validating coordinate correctness and visualization.
- Extensible metadata: Fields like `pose`, `truncated`, `difficult` are still read by many VOC toolchains.
Key Field Reference (Commonly Used but Often Overlooked)
A more complete object typically looks like this:
<object>
<name>car</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>100</xmin>
<ymin>50</ymin>
<xmax>300</xmax>
<ymax>200</ymax>
</bndbox>
</object>
- `truncated`: Whether the object is truncated (e.g., cut off by the image boundary).
- `difficult`: Hard sample flag. Some evaluation/training pipelines choose to ignore objects with `difficult=1`.
The "Closed Interval" Controversy of VOC Coordinates
Different tools interpret whether xmax/ymax "includes the pixel" differently (a historical legacy). The safest practice is:
- Ensure (xmax > xmin) and (ymax > ymin) during conversion
- Avoid generating out-of-bounds coordinates (less than 0 or greater than width/height)
- Confirm in visual spot-checks whether boxes are offset by 1 pixel (if so, fix uniformly at the export stage)
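The first two checks are easy to automate with a small sanitizer applied at export time (a minimal sketch; the function name is illustrative):

```python
def sanitize_voc_box(xmin, ymin, xmax, ymax, img_w, img_h):
    """Clamp a VOC xyxy box into the image; return None if it degenerates."""
    xmin = max(0, min(xmin, img_w))
    xmax = max(0, min(xmax, img_w))
    ymin = max(0, min(ymin, img_h))
    ymax = max(0, min(ymax, img_h))
    if xmax <= xmin or ymax <= ymin:
        return None  # degenerate box after clamping: drop (or log) it
    return (xmin, ymin, xmax, ymax)
```

The 1-pixel offset question, by contrast, can only be settled by visual spot-checks, since both interpretations produce "valid" numbers.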
Key Features:
- Uses absolute coordinates
- Includes detailed metadata
- Ideal for Faster R-CNN and similar models
Use Cases
Recommended for:
- Faster R-CNN training
- Projects requiring detailed metadata
- Projects requiring good readability
Not recommended for:
- YOLO training (requires conversion)
- Projects requiring fast training
Additional note: If your project requires frequent cross-tool collaboration (annotation -> QA -> training -> feedback), VOC's "human-readable XML" advantage is very significant. However, when data volume is large, the read/write performance and management cost of massive XML files will gradually become apparent.
COCO Format
Format Characteristics
COCO is a "data-engineering-friendly" annotation format: it uses one (or a few) JSON files to describe all images, categories, and annotations. It's naturally suited for statistics, querying, filtering, merging, and version management. It supports not only detection boxes but also natively supports instance segmentation and keypoints.
File Structure:
dataset/
├── images/
│ ├── image001.jpg
│ └── image002.jpg
└── annotations/
└── instances_train.json
Annotation File Format (instances_train.json):
{
"images": [
{
"id": 1,
"file_name": "image001.jpg",
"width": 640,
"height": 480
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [100, 50, 200, 150],
"area": 30000,
"iscrowd": 0
}
],
"categories": [
{
"id": 1,
"name": "car"
}
]
}
Format Description:
- bbox semantics are (xywh): `bbox=[x, y, w, h]`, where `x, y` is the top-left pixel coordinate, not the center point.
- `image_id`: The image this annotation belongs to (linked to `images[].id`).
- `category_id`: The category this annotation belongs to (linked to `categories[].id`). Note: COCO's `category_id` often doesn't start from 0 and doesn't need to be consecutive.
- `area`: Object area, commonly `w*h` (detection box) or polygon area (segmentation).
- `iscrowd`: Whether it's a "crowd/dense object," related to segmentation's RLE/polygon representation.
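The top-left convention is worth pinning down in code, since it's the single most common COCO misreading (a minimal sketch; function names are illustrative):

```python
def coco_bbox_to_xyxy(bbox):
    """COCO bbox [x, y, w, h] (top-left corner + size) -> pixel xyxy corners."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

def coco_bbox_area(bbox):
    """The area field for a plain detection box is simply w * h."""
    _, _, w, h = bbox
    return w * h
```

Applied to the example JSON above, `bbox=[100, 50, 200, 150]` yields corners (100, 50) and (300, 200), and an area of 30000, matching the `area` field.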
segmentation / keypoints (Why COCO Is More "Versatile")
Even if you're currently only doing detection, it's worth understanding COCO's two power fields, since many public datasets and training frameworks rely on them:
- `segmentation`: Instance segmentation masks
  - When `iscrowd=0`, commonly uses polygon arrays: `[[x1,y1,x2,y2,...], ...]`
  - When `iscrowd=1`, commonly uses RLE (Run-Length Encoding) compressed masks
- `keypoints`: Keypoint tasks (e.g., human pose), typically an array of length (3K) of `x, y, v` triplets, where `v` indicates visibility
COCO's "Global Consistency" Constraints (Critical for Engineering)
COCO JSON typically needs to satisfy:
- `images[].id` must be unique, `annotations[].id` must be unique
- `annotations[].image_id` must be findable in `images`
- `annotations[].category_id` must be findable in `categories`
When you "merge multiple datasets" or "incrementally append annotations," ID conflicts are the most common issue. Best practice: rebase id values during merging and establish a stable category mapping table.
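The id-rebasing idea can be sketched as follows (a minimal sketch that assumes both files already share the same `categories` table; real merges also need to reconcile differing category tables):

```python
def merge_coco(a, b):
    """Merge two COCO dicts, rebasing b's image/annotation ids past a's.

    New ids for b are its old ids plus the max id in a, which keeps them
    unique as long as each input file's ids were unique to begin with.
    """
    img_offset = max((img["id"] for img in a["images"]), default=0)
    ann_offset = max((ann["id"] for ann in a["annotations"]), default=0)
    merged = {
        "images": list(a["images"]),
        "annotations": list(a["annotations"]),
        "categories": a["categories"],  # assumed identical in a and b
    }
    for img in b["images"]:
        merged["images"].append({**img, "id": img["id"] + img_offset})
    for ann in b["annotations"]:
        merged["annotations"].append({
            **ann,
            "id": ann["id"] + ann_offset,
            "image_id": ann["image_id"] + img_offset,  # keep the link intact
        })
    return merged
```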
Key Features:
- Uses absolute coordinates
- Structured data
- Suitable for multiple model types
Use Cases
Recommended for:
- COCO dataset training
- Projects requiring structured data
- Projects requiring rich metadata
Not recommended for:
- YOLO training (requires conversion)
- Projects requiring fast training
Additional note: COCO isn't "slow" — what's slow is parsing a massive JSON file in full at every epoch. In practice, engineering teams typically cache, index, or convert to faster internal formats, but COCO remains the go-to for "external delivery/standard alignment."
Format Comparison
| Feature | YOLO | VOC | COCO |
|---|---|---|---|
| Coordinate System | Normalized (0-1) | Absolute Pixel | Absolute Pixel |
| File Format | TXT | XML | JSON |
| File Size | Small | Medium | Large |
| Readability | Low | High | Medium |
| Metadata | Minimal | Moderate | Rich |
| Target Models | YOLO Series | Faster R-CNN | Multiple Models |
| Conversion Difficulty | Easy | Medium | Hard |
Additional dimensions (closer to real project selection):
| Dimension | YOLO | VOC | COCO |
|---|---|---|---|
| Natively Supported Tasks | Detection (extensible to segmentation/keypoints) | Primarily detection | Detection + Segmentation + Keypoints |
| Category System | Relies on external `names` mapping | `<name>` writes class name directly | `category_id` requires lookup |
| Dataset Splitting | Relies on directories or list files | Traditionally uses `ImageSets/Main/*.txt` | Commonly multiple JSONs (train/val) |
| Merging/Filtering/Statistics | Requires traversing many small files | Requires traversing many XMLs | Most data-engineering-friendly |
| Manual Inspection | Average | Very convenient | Requires tools or scripts |
Format Selection Guide
You can use this "decision framework" for quick selection:
- Training YOLO ecosystem (Ultralytics/YOLOv5 family) for detection only -> Prefer YOLO
- Doing instance segmentation/keypoints, or aligning data with mainstream public datasets -> Prefer COCO
- Relying on human readability, need to attach business fields in XML, or using traditional detection toolchains -> VOC still works well
Choosing YOLO Format
Suitable scenarios:
- Using YOLO series models
- Need fast training
- File size sensitive
Advantages:
- Simple format
- Small files
- Fast training
Considerations:
- Ensure `names` and `class_id` always match (strongly recommend version control)
- Update labels in sync after data augmentation/cropping (especially mosaic, random crop)
Choosing VOC Format
Suitable scenarios:
- Using Faster R-CNN
- Need detailed metadata
- Need readability
Advantages:
- Good readability
- Rich metadata
- Good compatibility
Considerations:
- Maintain `xmin < xmax`, `ymin < ymax`, and ensure no out-of-bounds values
- Confirm `xmax/ymax` boundary semantics across tools to avoid 1-pixel offsets
Choosing COCO Format
Suitable scenarios:
- Using COCO dataset
- Need structured data
- Need rich metadata
Advantages:
- Structured data
- Rich metadata
- Compatible with multiple models
Considerations:
- Category `id` may not be consecutive; training typically requires building a `category_id -> 0..N-1` mapping
- Merging datasets requires handling `images.id` / `annotations.id` conflicts
Format Conversion
Three "Must-Dos" Before Converting
No matter which direction you're converting between YOLO/VOC/COCO, lock down these three points first for stability:
- Category Mapping Table:
  - VOC uses class names, COCO uses `category_id`, YOLO uses `class_id`.
  - Recommended: maintain a unified `class_name -> yolo_id` mapping, then derive VOC/COCO mappings from it.
- Coordinate Convention:
- YOLO: Normalized center-point (xywh)
- VOC: Pixel (xyxy)
- COCO: Pixel (xywh) (top-left corner + width/height)
- Boundary and Empty Annotation Strategy:
- Uniformly clamp out-of-bounds boxes, filter boxes with too-small area (e.g., (w<1) or (h<1))
- Whether to keep empty annotation samples (recommended: yes)
YOLO -> VOC
Conversion Steps:
- Read the YOLO format file
- Get image dimensions
- Calculate absolute coordinates
- Generate XML file
Code Example:
def yolo_to_voc(yolo_file, img_width, img_height):
    # Read YOLO format: one "class_id cx cy w h" line per object
    with open(yolo_file, 'r') as f:
        lines = f.readlines()
    # Convert normalized center-point xywh to pixel xyxy
    annotations = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines (empty-annotation files)
        class_id = int(parts[0])  # class_id is an integer, not a float
        cx, cy, w, h = map(float, parts[1:5])
        xmin = int((cx - w / 2) * img_width)
        ymin = int((cy - h / 2) * img_height)
        xmax = int((cx + w / 2) * img_width)
        ymax = int((cy + h / 2) * img_height)
        # Clamp to the image to avoid out-of-bounds coordinates
        xmin, ymin = max(0, xmin), max(0, ymin)
        xmax, ymax = min(img_width, xmax), min(img_height, ymax)
        annotations.append((class_id, xmin, ymin, xmax, ymax))
    return annotations
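The final step (writing the XML) can be handled with the standard library. A minimal sketch, where `annotations` are the `(class_id, xmin, ymin, xmax, ymax)` tuples produced above and `id_to_name` is your own `class_id -> class_name` table (an assumption you must supply):

```python
import xml.etree.ElementTree as ET

def write_voc_xml(out_path, filename, img_w, img_h, annotations, id_to_name):
    """Serialize converted annotations into a minimal VOC XML file."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(img_w)
    ET.SubElement(size, "height").text = str(img_h)
    ET.SubElement(size, "depth").text = "3"
    for class_id, xmin, ymin, xmax, ymax in annotations:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = id_to_name[int(class_id)]
        bnd = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(val)
    ET.ElementTree(root).write(out_path, encoding="utf-8")
```

This writes only the fields shown in the annotation example earlier; add `pose`/`truncated`/`difficult` if your downstream toolchain expects them.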
VOC -> YOLO
Conversion Steps:
- Read the XML file
- Get image dimensions
- Calculate normalized coordinates
- Generate TXT file
Code Example (minimal working version demonstrating core formulas; you'll need to add your own class_name -> class_id mapping):
import xml.etree.ElementTree as ET
def voc_to_yolo(xml_file, class_name_to_id):
tree = ET.parse(xml_file)
root = tree.getroot()
size = root.find("size")
img_w = float(size.find("width").text)
img_h = float(size.find("height").text)
yolo_lines = []
for obj in root.findall("object"):
name = obj.find("name").text.strip()
if name not in class_name_to_id:
continue
bnd = obj.find("bndbox")
xmin = float(bnd.find("xmin").text)
ymin = float(bnd.find("ymin").text)
xmax = float(bnd.find("xmax").text)
ymax = float(bnd.find("ymax").text)
# clamp & sanity
xmin = max(0.0, min(xmin, img_w))
xmax = max(0.0, min(xmax, img_w))
ymin = max(0.0, min(ymin, img_h))
ymax = max(0.0, min(ymax, img_h))
if xmax <= xmin or ymax <= ymin:
continue
cx = (xmin + xmax) / 2.0 / img_w
cy = (ymin + ymax) / 2.0 / img_h
w = (xmax - xmin) / img_w
h = (ymax - ymin) / img_h
class_id = class_name_to_id[name]
yolo_lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
return yolo_lines
COCO -> YOLO
Conversion Steps:
- Read the JSON file
- Get image dimensions
- Calculate normalized coordinates
- Generate TXT files
Code Example (demonstrating COCO bbox (xywh) -> YOLO normalized center-point (xywh)):
import json
from collections import defaultdict
def coco_to_yolo(coco_json_path, category_id_to_yolo_id):
data = json.load(open(coco_json_path, "r", encoding="utf-8"))
images = {img["id"]: img for img in data.get("images", [])}
ann_by_image = defaultdict(list)
for ann in data.get("annotations", []):
ann_by_image[ann["image_id"]].append(ann)
# Returns: file_name -> yolo_lines
result = {}
for image_id, img in images.items():
W, H = float(img["width"]), float(img["height"])
file_name = img["file_name"]
lines = []
for ann in ann_by_image.get(image_id, []):
cid = ann["category_id"]
if cid not in category_id_to_yolo_id:
continue
x, y, w, h = map(float, ann["bbox"]) # COCO: top-left xywh
if w <= 0 or h <= 0:
continue
cx = (x + w / 2.0) / W
cy = (y + h / 2.0) / H
nw = w / W
nh = h / H
# clamp
cx = max(0.0, min(1.0, cx))
cy = max(0.0, min(1.0, cy))
nw = max(0.0, min(1.0, nw))
nh = max(0.0, min(1.0, nh))
if nw == 0 or nh == 0:
continue
yolo_id = category_id_to_yolo_id[cid]
lines.append(f"{yolo_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")
result[file_name] = lines
return result
Recommended: Post-Conversion Validation
After format conversion, always run a lightweight validation process (it can save a huge amount of training time):
- Random visual spot-check: Sample at least 50 images to verify boxes aren't systematically shifted or abnormally sized.
- Statistical check: Verify that per-class object counts, per-image object counts, and box area distributions match expectations.
- Empty annotation ratio: A sudden spike in empty annotations usually means something went wrong with matching/mapping (e.g., filename mismatch or broken ID links).
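The statistical and empty-annotation checks are a few lines of code for YOLO-format output (a minimal sketch; the function name and directory layout are assumptions):

```python
from collections import Counter
from pathlib import Path

def label_stats(labels_dir):
    """Per-class box counts and empty-file ratio over a YOLO labels directory."""
    per_class, empty, total = Counter(), 0, 0
    for txt in Path(labels_dir).glob("*.txt"):
        total += 1
        lines = [l for l in txt.read_text().splitlines() if l.strip()]
        if not lines:
            empty += 1  # empty file = image with no objects
        for line in lines:
            per_class[int(line.split()[0])] += 1
    empty_ratio = empty / total if total else 0.0
    return per_class, empty_ratio
```

Compare the returned counts against the source dataset's statistics; a class that suddenly drops to zero usually means a broken mapping-table entry rather than missing data.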
Using TjMakeBot for Multi-Format Annotation
TjMakeBot's Advantages:
- Multi-Format Support
  - YOLO format
  - VOC format
  - COCO format
  - CSV format
- Format Conversion
  - Cross-format conversion
  - One-click export to multiple formats
  - Compatible with mainstream training frameworks
What you really need to focus on: Whether the category mapping and coordinate conventions are "locked down" during export. A good tool should be able to:
- Solidify the category table (carry/generate `names`, `categories`, etc. during export)
- Unify coordinate conventions (clearly YOLO normalized center-point, COCO top-left (xywh))
- Handle out-of-bounds/abnormal boxes in a controlled way (clamp, filter, or alert)
- Batch Processing
  - Batch export
  - Batch conversion
  - Improved efficiency
- Free (Basic Features)
  - No usage limits
  - No feature restrictions
  - Low barrier to entry
Start Using TjMakeBot for Multi-Format Annotation for Free ->
Related Reading
- YOLO Dataset Complete Guide: From Zero to Model Training
- Why Do 90% of AI Projects Fail? Data Labeling Quality Is Key
- Free vs Paid Annotation Tools: How to Choose the Right One?
Conclusion
Choosing the right annotation format is fundamental to AI project success. YOLO, VOC, and COCO each have their strengths — choose the format that best fits your project requirements.
Remember:
- YOLO format is ideal for YOLO series models
- VOC format is ideal for Faster R-CNN
- COCO format is suitable for multiple models
- Format conversion is always feasible
Additional advice (from real project lessons learned):
- Manage your "category table" like code: versioned, traceable, and rollback-capable.
- Don't wait until training to discover problems: make visual spot-checks and statistical checks a standard post-export step.
- Unify coordinate conventions across your team: document clearly in your project README or data specification ((xywh)/(xyxy), normalized or not, boundary-inclusive or not).
Choose TjMakeBot for multi-format annotation and conversion!
Legal Disclaimer: The content of this article is for reference only and does not constitute any legal, commercial, or technical advice. When using any tools or methods, please comply with applicable laws and regulations, respect intellectual property rights, and obtain necessary authorizations. All company names, product names, and trademarks mentioned in this article are the property of their respective owners.
About the Author: The TjMakeBot team focuses on AI data annotation tool development, committed to supporting multiple annotation formats to meet diverse project needs.
Recommended Reading
- Agricultural AI: A Practical Guide to Crop Pest Detection Annotation
- Semantic Segmentation vs Instance Segmentation: In-Depth Analysis and Annotation Strategy Guide
- AI-Assisted vs Manual Annotation: An In-Depth Cost-Benefit Analysis
- China's Data Labeling Market: Application Characteristics and User Needs
- Retail E-Commerce AI: Practical Methods for Product Recognition Annotation
- Cognitive Bias in Data Labeling: How to Avoid Annotation Errors
- Edge Computing and Lightweight Models: Optimization Strategies for Annotation Data
- The Evolution of Data Annotation Tools
Keywords: YOLO format, VOC format, COCO format, annotation format, format conversion, multi-format annotation, TjMakeBot
