Pose estimation process
A video is a sequence of images, each of which is a frame. The more frames per second (fps) a video has, the smoother it appears. Typical frame rates are 24, 30, or 60 fps. For this POC, we chose 24 fps: fewer frames per second means fewer predictions to run, so the analysis is faster.
In computer vision, when analyzing a video for pose estimation, we process it frame by frame using a library like OpenCV. For each frame, we run a pose estimation model (YOLO, in our case), which returns the (x, y) image coordinates of various keypoints in the frame. Keypoints include points like the right knee, left knee, left hip, right hip, and more. Knowing the position of each keypoint in every frame allows us to perform calculations and extract useful information.
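The loop below is a minimal sketch of this step, assuming the Ultralytics YOLO package and an input file named `squat.mp4` (both are illustrative; the POC's actual model weights and file names may differ):

```python
import cv2
from ultralytics import YOLO  # assumes the Ultralytics package is installed

model = YOLO("yolov8n-pose.pt")  # illustrative pose checkpoint

cap = cv2.VideoCapture("squat.mp4")
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    results = model(frame)  # run pose estimation on this single frame
    # keypoints.xy holds (x, y) pixel coordinates,
    # one row of 17 points per detected person
    keypoints = results[0].keypoints.xy
    frame_idx += 1
cap.release()
```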
YOLO predicts 17 different keypoints, but in our application we focus on 8 of them: left hip, right hip, left knee, right knee, left shoulder, right shoulder, left elbow, and right elbow.
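YOLO pose models follow the standard COCO keypoint ordering, so the 8 keypoints we need can be selected by index. The mapping below uses the COCO indices; the helper function name is ours, not from the POC:

```python
# COCO keypoint indices used by YOLO pose models; we keep only the 8 we need
KEYPOINT_INDICES = {
    "left_shoulder": 5,
    "right_shoulder": 6,
    "left_elbow": 7,
    "right_elbow": 8,
    "left_hip": 11,
    "right_hip": 12,
    "left_knee": 13,
    "right_knee": 14,
}

def select_keypoints(xy):
    """Return {name: (x, y)} for our 8 keypoints from one person's (17, 2) array."""
    return {name: (float(xy[i][0]), float(xy[i][1]))
            for name, i in KEYPOINT_INDICES.items()}
```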
In the pose prediction step, a folder is created containing an image for each frame of the video, along with a per-frame JSON file that stores the predicted keypoints and the knee and back angles for that frame.
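A common way to compute a joint angle from keypoints is to measure the angle at the joint between the two segments that meet there. The sketch below shows one such computation and the per-frame JSON dump; the field names and file layout are illustrative, not the POC's exact schema:

```python
import json
import math
from pathlib import Path

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360.0 - ang if ang > 180.0 else ang

def save_frame_data(out_dir, frame_idx, keypoints, angles):
    """Write one frame's keypoints and angles to a JSON file (illustrative schema)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"frame_{frame_idx:05d}.json", "w") as f:
        json.dump({"frame": frame_idx, "keypoints": keypoints, "angles": angles}, f)
```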
This information is then used in the performance evaluation, which calculates the number of repetitions and checks whether the angles stay within the expected thresholds, among other data.
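Repetition counting can be done with a simple state machine over the per-frame knee angle: a rep is counted each time the angle drops below a "bottom" threshold and then rises back above an "up" threshold. The thresholds below are illustrative, not the POC's actual values:

```python
def count_repetitions(knee_angles, down_threshold=90.0, up_threshold=160.0):
    """Count squat reps from per-frame knee angles (thresholds are illustrative)."""
    reps = 0
    at_bottom = False
    for angle in knee_angles:
        if angle < down_threshold:
            at_bottom = True           # reached the bottom of the squat
        elif at_bottom and angle > up_threshold:
            reps += 1                  # stood back up: one full repetition
            at_bottom = False
    return reps
```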
The data extracted by the performance evaluation follows this format, which aids in analyzing squat technique: