[PyTorch] Simple 3D Pose Baseline implementation (ICCV’17)

Riley Learning
Aug 4, 2022


In this post, I review Simple 3D Pose Baseline ("A simple yet effective baseline for 3d human pose estimation", also known as SIM), which was presented at ICCV'17.

Introduction

Most approaches to single-person human pose estimation work from a single image or video, and this model is no exception. The authors set out to build a system that predicts 3d joint positions given 2d joint locations, in order to understand the sources of error in 3d pose estimation. The resulting network is lightweight and fast, processing around 300 frames per second, and despite being a relatively simple deep feedforward network it outperforms the best previously reported result on Human3.6M by about 30%.

Network Design

Figure 1: A diagram of the approach.

The input to the system is an array of 2d joint positions, and the output is a series of joint positions in 3d. After extracting the 2d joint locations, the authors use a simple neural network with a small number of parameters that can be trained easily. The basic building block is a linear layer followed by batch normalization, dropout and a ReLU activation. This unit is repeated twice, and the two units are wrapped in a residual connection; the resulting outer block is itself repeated twice.
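To make this concrete, here is a minimal PyTorch sketch of that architecture. The class names, the hidden size of 1024, the dropout rate of 0.5, and the use of 16 joints (32 input values, 48 output values) are assumptions based on the paper's defaults, not the reference implementation verbatim.

import torch
import torch.nn as nn

class LinearBlock(nn.Module):
    # One residual block: (Linear -> BatchNorm -> ReLU -> Dropout) applied twice,
    # wrapped in a skip connection.
    def __init__(self, hidden=1024, p_dropout=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_dropout),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_dropout),
        )

    def forward(self, x):
        return x + self.layers(x)  # residual connection around the two units

class SimpleBaseline(nn.Module):
    # 2d joints (16 x 2 = 32 values) -> 1024 -> two residual blocks -> 3d joints (16 x 3 = 48 values)
    def __init__(self, n_joints=16, hidden=1024, p_dropout=0.5):
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, hidden)
        self.blocks = nn.Sequential(LinearBlock(hidden, p_dropout), LinearBlock(hidden, p_dropout))
        self.out = nn.Linear(hidden, n_joints * 3)

    def forward(self, x):
        return self.out(self.blocks(self.inp(x)))

model = SimpleBaseline()
y = model(torch.randn(64, 32))  # a batch of 64 2d poses -> shape (64, 48)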

Experimental evaluation

The authors use two protocols for evaluation.

Protocol #1

Protocol #1 computes the average error in millimetres between the ground truth and the prediction across all joints and cameras, after alignment of the root joint. The detailed results under Protocol #1 are as follows:
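As a side note, here is a minimal sketch of how this Protocol #1 metric (MPJPE) can be computed; the (n_frames, n_joints, 3) layout and the root joint at index 0 are assumptions here.

import numpy as np

def mpjpe(pred, gt, root_idx=0):
    # Protocol #1: mean per-joint position error in millimetres after aligning
    # the root joint of the prediction and the ground truth.
    pred = pred - pred[:, root_idx:root_idx + 1, :]
    gt = gt - gt[:, root_idx:root_idx + 1, :]
    return np.linalg.norm(pred - gt, axis=-1).mean()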

Protocol #2

Protocol #2 computes the error after a rigid transformation. In mathematics, a rigid transformation (also called a Euclidean transformation) is a geometric transformation of a Euclidean space that preserves the Euclidean distance between every pair of points (Wikipedia). The detailed results under Protocol #2 are as follows:
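For illustration, a minimal sketch of the rigid (Procrustes-style) alignment behind Protocol #2; note that some implementations also solve for scale, so treat this as a sketch rather than the reference code.

import numpy as np

def p_mpjpe(pred, gt):
    # Protocol #2: error after rigidly aligning (rotation + translation) the
    # predicted pose to the ground truth. pred, gt: (n_joints, 3) arrays in mm.
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    X, Y = pred - mu_pred, gt - mu_gt
    U, s, Vt = np.linalg.svd(X.T @ Y)  # orthogonal Procrustes problem
    R = U @ Vt
    if np.linalg.det(R) < 0:           # make sure R is a rotation, not a reflection
        Vt[-1, :] *= -1
        R = U @ Vt
    aligned = X @ R + mu_gt
    return np.linalg.norm(aligned - gt, axis=-1).mean()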

The authors also show example outputs on the test set of Human3.6M. The left figures are the 2d observations, the middle ones are the 3d ground truth, and the right ones are the 3d predictions.

Implementation

With the help of this repository, I implemented the model in PyTorch and would like to share the results.

Dataset : Human3.6M

Human3.6M is one of the largest datasets for the 3D human pose estimation task. It contains 3.6 million 3D human poses and their corresponding images, covering 11 professional actors (6 male, 5 female) performing 15 actions each. Both 2d joint locations and 3d ground-truth positions are available.

For data preprocessing, the authors rotate and translate the 3d ground truth according to the inverse transform of the camera. They use utility functions to deal with the Human3.6M cameras, as sketched below.
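A minimal sketch of that camera transform, assuming the usual Human3.6M convention in which R is the camera rotation matrix and T is the camera centre in world coordinates (the parameter names here are mine, not the repository's):

import numpy as np

def world_to_camera(points_world, R, T):
    # Rotate and translate 3d joints from world coordinates into the camera frame.
    # points_world: (n_joints, 3); R: (3, 3) rotation; T: (3,) camera centre in world coords.
    return (points_world - T) @ R.T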

The first thing to do is to download the dataset. Although Human3.6M is available on the official site, it takes time for a registration to be confirmed, so I downloaded a preprocessed dataset here. The preprocessed dataset consists of:

  • train_2d.pth.tar: The subjects for the training process in 2D
  • train_2d_ft.pth.tar: The subjects for the training process in 2D with the Stacked Hourglass detection
  • train_3d.pth.tar: The subjects for the training process in 3D
  • test_2d.pth.tar: The subjects for the validation process in 2D.
  • test_2d_ft.pth.tar: The subjects for the validation process in 2D with the Stacked Hourglass detection.
  • test_3d.pth.tar: Contains the subjects for the validation process in 3D.
  • stat_3d.pth.tar: The mean/std of the 2D inputs and 3D outputs, used to unnormalize the data and calculate MPJPE (a loading sketch follows this list).
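Each of these files is a pickled dictionary that torch.load can read directly. Below is a minimal sketch of loading the statistics and un-normalizing the network outputs before computing MPJPE; the key names ('mean', 'std', 'dim_use') are assumptions and may differ in the actual files.

import numpy as np
import torch

stat_3d = torch.load('stat_3d.pth.tar')  # dictionary holding the 3d mean/std statistics

def unnormalize(y, mean, std, dim_use):
    # Undo the z-score normalisation of the 3d outputs so MPJPE is reported in mm.
    out = np.tile(mean, (y.shape[0], 1))              # start from the mean pose
    out[:, dim_use] = y * std[dim_use] + mean[dim_use]
    return out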

Train

I first downloaded the dataset and the repository and put them into my Google Drive. After setting the directories, I ran main.py. You can change the running options in opt.py.

# ===============================================================
# Running options
# ===============================================================
self.parser.add_argument('--use_hg', dest='use_hg', action='store_true', help='whether use 2d pose from hourglass')
self.parser.add_argument('--lr', type=float, default=1.0e-3)
self.parser.add_argument('--lr_decay', type=int, default=100000, help='# steps of lr decay')
self.parser.add_argument('--lr_gamma', type=float, default=0.96)
self.parser.add_argument('--epochs', type=int, default=200)
self.parser.add_argument('--dropout', type=float, default=0.5, help='dropout probability, 1.0 to make no dropout')
self.parser.add_argument('--train_batch', type=int, default=64)
self.parser.add_argument('--test_batch', type=int, default=64)
self.parser.add_argument('--job', type=int, default=8, help='# subprocesses to use for data loading')
self.parser.add_argument('--no_max', dest='max_norm', action='store_false', help='if use max_norm clip on grad')
self.parser.add_argument('--max', dest='max_norm', action='store_true', help='if use max_norm clip on grad')
self.parser.set_defaults(max_norm=True)
self.parser.add_argument('--procrustes', dest='procrustes', action='store_true', help='use procrustes analysis at testing')
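
Reading these options, the learning rate starts at 1e-3 and is decayed by a factor of 0.96 every 100,000 steps, with optional gradient-norm clipping. A minimal sketch of how the options might drive the optimizer (the repository's exact schedule may differ):

import torch

model = SimpleBaseline()  # the network sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-3)

def adjust_lr(optimizer, step, base_lr=1.0e-3, decay=100000, gamma=0.96):
    # Exponential decay: multiply the learning rate by gamma every `decay` steps.
    lr = base_lr * gamma ** (step / decay)
    for group in optimizer.param_groups:
        group['lr'] = lr
    return lr

# With --max (max_norm=True), gradients can be clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)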

I trained for 200 epochs, which takes approximately 10 to 15 hours on Google Colaboratory (Pro+). After training, checkpoint files are created and ckpt_best.pth.tar is used for testing.

Test

I tested the trained model by loading ckpt_best.pth.tar.

%run main.py --load ./test/ckpt_best.pth.tar --test
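Under the hood, evaluation amounts to restoring the best checkpoint and switching the model to eval mode. A minimal sketch (the checkpoint key 'state_dict' is an assumption about how it was saved):

import torch

ckpt = torch.load('./test/ckpt_best.pth.tar', map_location='cpu')
model = SimpleBaseline()                  # the network sketched earlier
model.load_state_dict(ckpt['state_dict'])
model.eval()  # disable dropout and use running batch-norm statistics at test time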

The mean error of the original implementation is 45.5 mm, and I got 44.4 mm.

Papers with Code

The errors for each action are as follows:

Thank you for reading. I hope this post is useful.
