# 2. Overview of the Dataset

COCO is a large-scale dataset for object detection, segmentation, and captioning. It targets scene understanding: its images are drawn mainly from complex everyday scenes, and objects are localized with precise segmentation masks. The original paper defines 91 object categories across 328,000 images with 2,500,000 labeled instances. The released annotations cover 80 object categories over more than 330,000 images, of which over 200,000 are annotated, with more than 1.5 million object instances in total, making COCO one of the largest segmentation datasets available.

The MS COCO dataset has many releases; as of June 26, 2019, they are as follows:

2014 Train/Val： Detection 2015, Captioning 2015, Detection 2016, Keypoints 2016
2014 Testing： Captioning 2015

2015 Testing： Detection 2015, Detection 2016, Keypoints 2016
2017 Train/Val/Test： Detection 2017, Keypoints 2017, Stuff 2017, Detection 2018, Keypoints 2018, Stuff 2018, Panoptic 2018
2017 Unlabeled： [optional data for any competition]

## Dataset Format

COCO provides five types of annotations: object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. Each corresponds to one JSON file. The JSON is a single large dictionary, and all five share the following top-level keys:

{
"info" : info,
"images" : [image],
"annotations" : [annotation],
"licenses" : [license],
}


info{
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime,
}


image{
"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"flickr_url" : str,
"coco_url" : str,
"date_captured" : datetime,
}


license{
"id" : int,
"name" : str,
"url" : str,
}
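Putting the shared structures above together, a minimal COCO-style file can be sketched as follows (all field values here are illustrative, not taken from a real dataset):

```python
import json

# A minimal COCO-style JSON skeleton; every value below is a made-up example
coco = {
    "info": {
        "year": 2017,
        "version": "1.0",
        "description": "toy example",
        "contributor": "example",
        "url": "http://cocodataset.org",
        "date_created": "2017-09-01",
    },
    "images": [
        {"id": 1, "width": 640, "height": 480,
         "file_name": "000000000001.jpg",
         "flickr_url": "", "coco_url": "",
         "date_captured": "2017-09-01"},
    ],
    "annotations": [],
    "licenses": [{"id": 1, "name": "CC BY 2.0", "url": ""}],
}

# Round-trip through JSON to confirm the skeleton is serializable
loaded = json.loads(json.dumps(coco))
```

Dates are written as strings here because JSON has no native datetime type; the official files store them the same way.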


### Object Detection

Each object instance annotation contains a series of fields, including the category id and segmentation mask of the object. The segmentation format depends on whether the instance represents a single object (iscrowd=0 in which case polygons are used) or a collection of objects (iscrowd=1 in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object (box coordinates are measured from the top left image corner and are 0-indexed). Finally, the categories field of the annotation structure stores the mapping of category id to category and supercategory names. See also the detection task.

annotation{
"id" : int,
"image_id" : int,
"category_id" : int,
"segmentation" : RLE or [polygon],
"area" : float,
"bbox" : [x,y,width,height],
"iscrowd" : 0 or 1,
}

categories[{
"id" : int,
"name" : str,
"supercategory" : str,
}]
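A common first step with these annotations is to group them by image and convert the `[x, y, width, height]` boxes into corner coordinates. The sketch below uses toy annotations with made-up ids, mimicking what the pycocotools `COCO.getAnnIds` lookup does without requiring the real files:

```python
from collections import defaultdict

# Toy detection annotations following the structure above (ids are made up)
annotations = [
    {"id": 1, "image_id": 10, "category_id": 1,
     "bbox": [100.0, 50.0, 40.0, 80.0], "area": 2900.0, "iscrowd": 0},
    {"id": 2, "image_id": 10, "category_id": 3,
     "bbox": [0.0, 0.0, 20.0, 20.0], "area": 350.0, "iscrowd": 0},
    {"id": 3, "image_id": 11, "category_id": 1,
     "bbox": [5.0, 5.0, 10.0, 10.0], "area": 100.0, "iscrowd": 1},
]

# Group annotation ids per image via the image_id field
anns_by_image = defaultdict(list)
for ann in annotations:
    anns_by_image[ann["image_id"]].append(ann["id"])

def bbox_to_corners(bbox):
    """Convert [x, y, width, height] (0-indexed, top-left origin) to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]
```

Note that `area` is the mask area, not the bbox area, so the two generally differ for non-rectangular objects.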


### Keypoint Detection

A keypoint annotation contains all the data of the object annotation
(including id, bbox, etc.) and two additional fields. First,
“keypoints” is a length 3k array where k is the total number of
keypoints defined for the category. Each keypoint has a 0-indexed
location x,y and a visibility flag v defined as v=0: not labeled (in
which case x=y=0), v=1: labeled but not visible, and v=2: labeled and
visible. A keypoint is considered visible if it falls inside the
object segment. “num_keypoints” indicates the number of labeled
keypoints (v>0) for a given object (many objects, e.g. crowds and
small objects, will have num_keypoints=0). Finally, for each category,
the categories struct has two additional fields: “keypoints,” which is
a length k array of keypoint names, and “skeleton”, which defines
connectivity via a list of keypoint edge pairs and is used for
visualization. Currently keypoints are only labeled for the person
category (for most medium/large non-crowd person instances). See also
the keypoint detection task.

annotation{
"keypoints" : [x1,y1,v1,...],
"num_keypoints" : int,
"[cloned]" : ...,
}

categories[{
"keypoints" : [str],
"skeleton" : [edge],
"[cloned]" : ...,
}]

"[cloned]": denotes fields copied from object detection annotations defined above.
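The flat `[x1,y1,v1,...]` layout described above is easy to unpack in a few lines; this sketch splits the array into `(x, y, v)` triplets and recomputes `num_keypoints` from the visibility flags (the sample array is made up):

```python
def parse_keypoints(flat):
    """Split a flat [x1, y1, v1, x2, y2, v2, ...] array into (x, y, v) triplets."""
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def count_labeled(flat):
    """num_keypoints counts keypoints with v > 0 (labeled, visible or not)."""
    return sum(1 for _, _, v in parse_keypoints(flat) if v > 0)

# Toy 3-keypoint annotation: first keypoint unlabeled (v=0, so x=y=0),
# second labeled but not visible (v=1), third labeled and visible (v=2)
kp = [0, 0, 0, 120, 45, 1, 130, 60, 2]
labeled = count_labeled(kp)  # 2 of the 3 keypoints are labeled
```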


### Stuff Segmentation

The stuff annotation format is identical and fully compatible with the
object detection format above (except iscrowd is unnecessary and set
to 0 by default). We provide annotations in both JSON and png format
for easier access, as well as conversion scripts between the two
formats. In the JSON format, each category present in an image is
encoded with a single RLE annotation (see the Mask API for more
details). The category_id represents the id of the current stuff
category. For more details on stuff categories and supercategories,
see the stuff evaluation page.
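The compressed RLE strings stored in the JSON files are meant to be decoded with the Mask API (pycocotools), but the underlying run-length idea can be sketched on the uncompressed variant: `counts` alternates run lengths of 0s and 1s, starting with 0s, over pixels stored in column-major order. A toy decoder, assuming that layout:

```python
def decode_uncompressed_rle(counts, height, width):
    """Decode uncompressed COCO-style RLE into a height x width binary mask.

    counts alternates run lengths of 0s and 1s, starting with 0s;
    pixels are stored in column-major (Fortran) order.
    """
    flat = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    assert len(flat) == height * width, "counts must cover the whole mask"
    # Column-major: pixel (row r, col c) sits at flat index c * height + r
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]

# 2x3 mask: two background pixels, then three foreground, then one background
mask = decode_uncompressed_rle([2, 3, 1], 2, 3)
```

Real annotation files store the compressed string form, so this is only a conceptual sketch; use pycocotools' `mask.decode` in practice.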

### Panoptic Segmentation

For the panoptic task, each annotation struct is a per-image
annotation rather than a per-object annotation. Each per-image
annotation has two parts: (1) a PNG that stores the class-agnostic
image segmentation and (2) a JSON struct that stores the semantic
information for each image segment. In more detail:

To match an annotation with an image, use the image_id field (that is
annotation.image_id == image.id). For each annotation, per-pixel
segment ids are stored as a single PNG at annotation.file_name. The
PNGs are in a folder with the same name as the JSON, i.e.,
annotations/name/ for annotations/name.json. Each segment (whether
it’s a stuff or thing segment) is assigned a unique id. Unlabeled
pixels (void) are assigned a value of 0. Note that when you load the
PNG as an RGB image, you will need to compute the ids via
ids = R + G*256 + B*256^2. For each annotation, per-segment info is stored
in annotation.segments_info. segment_info.id stores the unique id of
the segment and is used to retrieve the corresponding mask from the
PNG (ids==segment_info.id). category_id gives the semantic category
and iscrowd indicates the segment encompasses a group of objects
(relevant for thing categories only). The bbox and area fields provide
additional info about the segment. The COCO panoptic task has the same
thing categories as the detection task, whereas the stuff categories
differ from those in the stuff task (for details see the panoptic
evaluation page). Finally, each category struct has two additional
fields: isthing that distinguishes stuff and thing categories and
color that is useful for consistent visualization.

annotation{
"image_id" : int,
"file_name" : str,
"segments_info" : [segment_info],
}

segment_info{
"id" : int,
"category_id" : int,
"area" : int,
"bbox" : [x,y,width,height],
"iscrowd" : 0 or 1,
}

categories[{
"id" : int,
"name" : str,
"supercategory" : str,
"isthing" : 0 or 1,
"color" : [R,G,B],
}]
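The id encoding described above (ids = R + G*256 + B*256^2) and its inverse can be written directly; this sketch maps a panoptic PNG pixel to a segment id and back (the sample ids are arbitrary):

```python
def rgb_to_segment_id(r, g, b):
    """Recover a panoptic segment id from an RGB pixel: id = R + G*256 + B*256^2."""
    return r + g * 256 + b * 256 ** 2

def segment_id_to_rgb(segment_id):
    """Inverse mapping, useful when writing panoptic PNGs."""
    return (segment_id % 256,
            (segment_id // 256) % 256,
            segment_id // 256 ** 2)

# Small ids fit entirely in the R channel; larger ids spill into G and B.
# Id 0 is reserved for unlabeled (void) pixels.
sid = rgb_to_segment_id(133, 0, 0)
```

The per-segment masks are then simply `pixels whose decoded id equals segment_info.id`.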


### Image Captioning

These annotations are used to store image captions. Each caption
describes the specified image, and each image has at least 5 captions.
See also the captioning task.

annotation{
"id" : int,
"image_id" : int,
"caption" : str,
}