Gesture Perception

The video above demonstrates the gesture perception node, which was developed to perceive the start command and the pointing posture.

Overview

The main purpose of the gesture perception application is to sense human behavior. To live alongside people, a robot must understand human behavior and interact with humans. A posture can express many things, such as an emotion, a desire, or an intended action, so gesture perception is necessary for a robot that lives with humans. A secondary purpose is to command the robot: because a posture can express a desire, it is a basic way to give the robot commands. Developers can define additional commands to suit their needs, although whether a command is useful depends on how suitable it is for real-life use.

ROS 2 system architecture

The gesture detection node provides two services, which the behavior root node calls when the detection sequence runs.

Installation

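  1. cv_bridge on Ubuntu LTS

    sudo apt-get install ros-${ROS_DISTRO}-cv-bridge
    sudo apt-get install ros-${ROS_DISTRO}-vision-opencv

    ROS_DISTRO holds your ROS 2 distribution name (for example, humble) once the ROS 2 setup script has been sourced.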
    
  2. TensorFlow 2.9.1 (tensorflow==2.9.1)

    See the installation guide: https://www.tensorflow.org/install

Example

  • Clone the repository from GitHub:

    git clone https://github.com/MBSE-2022-1/Software-Team.git -b gesture-perception
    
  • Build the package from the workspace directory:

    colcon build --symlink-install
    

    Note

    Always build from the workspace directory. The --symlink-install option is required. After building, source install/setup.bash before running the package.

  • Run the gesture detection node:

    ros2 run gesture_detection gesturedetection.py
    
  • Launch the RealSense camera:

    ros2 launch realsense2_camera rs_launch.py \
                rgb_camera.profile:=640x480x30 \
                depth_module.profile:=640x480x30 \
                pointcloud.enable:=true
    
  • Call a service:

    ros2 service call <service name> std_srvs/srv/Empty
    

API Reference

Gesture detection program structure

The gesture detection pipeline consists of an image subscriber, image feature extraction, preprocessing, and a classifier model.

  • MediaPipe

    Image features are extracted with the MediaPipe Hands library. MediaPipe returns 21 hand landmarks, each composed of x, y, and z. x and y are normalized to [0.0, 1.0] by the image width and height, respectively. z represents the landmark depth, with the depth at the wrist as the origin; the smaller the value, the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x. The preprocessing function sets the wrist position as the origin, subtracts it from the other 20 points, and then normalizes the positions.

    See the reference API here: https://google.github.io/mediapipe/solutions/hands.html#python-solution-api

  • Preprocessing functions

    calc_landmark_list(self, landmarks)

    MediaPipe hand landmarks are normalized to [0.0, 1.0]. This function converts the normalized values to positions in the image.

    Parameters:

    landmarks: normalized hand landmarks from the MediaPipe result

    Returns:

    A list of the same length as the input, with each landmark scaled to the image size in the format [x, y, z]
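The following is a minimal sketch of this conversion. It is not the node's actual code. It assumes each landmark behaves like MediaPipe's `NormalizedLandmark` (attributes `x`, `y`, `z`); a `namedtuple` stands in for it here. The explicit `image_width` and `image_height` arguments are hypothetical, because the documented signature takes only `landmarks`. Scaling z by the image width is also an assumption, based on MediaPipe's note that z uses roughly the same scale as x.

```python
from collections import namedtuple

# Stand-in for MediaPipe's NormalizedLandmark; only x, y and z are used.
Landmark = namedtuple("Landmark", ["x", "y", "z"])


def calc_landmark_list(landmarks, image_width, image_height):
    """Convert normalized [0.0, 1.0] landmarks to image coordinates.

    Hypothetical signature: the image size is passed in explicitly here.
    """
    landmark_list = []
    for lm in landmarks:
        # Scale to pixels and clamp into [0, size - 1] so that values such
        # as 1.0 (or slightly outside the range) stay inside the image.
        x = min(max(int(lm.x * image_width), 0), image_width - 1)
        y = min(max(int(lm.y * image_height), 0), image_height - 1)
        # Assumption: z shares the scale of x, so it is scaled by the width.
        z = lm.z * image_width
        landmark_list.append([x, y, z])
    return landmark_list


points = [Landmark(0.5, 0.25, 0.0), Landmark(1.0, 0.5, -0.25)]
print(calc_landmark_list(points, 640, 480))
# [[320, 120, 0.0], [639, 240, -160.0]]
```

With the MediaPipe Python solution API, `landmarks` would typically be `results.multi_hand_landmarks[i].landmark`. The image size can be read from `image.shape[:2]` (height, width).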

    pre_process_landmark(self, landmark_list)

    This function preprocesses the hand landmarks by subtracting the wrist position from every hand keypoint and chaining the x, y, and z positions together.

    Parameters:

    landmark_list: list of hand landmarks scaled to the image size, in the format [x, y, z]

    Returns:

    A one-dimensional list of scaled hand landmarks, with the x position followed by the y and z positions
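The following is a minimal sketch of this preprocessing step, under stated assumptions. It keeps only x and y for each point. That matches the 42-element classifier input: 21 landmarks × 2 coordinates give 42, whereas keeping z would give 63. Index 0 is assumed to be the wrist, as in MediaPipe's landmark numbering. Dividing by the largest absolute value is also an assumption, because the text does not say which normalization the node uses.

```python
import itertools


def pre_process_landmark(landmark_list):
    """Make landmarks wrist-relative, flatten them, and scale to [-1, 1].

    Assumption: entries are [x, y] or [x, y, z] in pixels; z is dropped,
    and index 0 is the wrist.
    """
    base_x, base_y = landmark_list[0][0], landmark_list[0][1]
    # Subtract the wrist position so the wrist becomes the origin.
    relative = [[x - base_x, y - base_y] for x, y, *_ in landmark_list]
    # Chain the points into one flat list: x0, y0, x1, y1, ...
    flat = list(itertools.chain.from_iterable(relative))
    # Divide by the largest magnitude; fall back to 1 if everything is 0.
    max_value = max(map(abs, flat)) or 1
    return [v / max_value for v in flat]


print(pre_process_landmark([[10, 20], [13, 24], [7, 16]]))
# [0.0, 0.0, 0.75, 1.0, -0.75, -1.0]
```

Making the points wrist-relative and scale-normalized makes the features independent of where the hand appears in the frame and of how far it is from the camera. This is a common reason for this kind of preprocessing.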

    calc_bounding_rect(self, landmarks)

    This function computes a bounding box around the MediaPipe landmarks, which is used to draw debugging output on the image.

    Parameters:

    landmarks: normalized hand landmarks from the MediaPipe result

    Returns:

    [x, y, x + w, y + h] in image coordinates
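The following is a minimal sketch of the bounding-box computation. It again assumes `NormalizedLandmark`-like inputs and uses hypothetical `image_width` and `image_height` arguments that the documented signature does not have. The box is simply the minimum and maximum pixel coordinates over all landmarks.

```python
from collections import namedtuple

# Stand-in for MediaPipe's NormalizedLandmark.
Landmark = namedtuple("Landmark", ["x", "y", "z"])


def calc_bounding_rect(landmarks, image_width, image_height):
    """Return [x, y, x + w, y + h] enclosing all landmarks, in pixels."""
    # Convert to pixel coordinates, clamped to the image.
    xs = [min(max(int(lm.x * image_width), 0), image_width - 1) for lm in landmarks]
    ys = [min(max(int(lm.y * image_height), 0), image_height - 1) for lm in landmarks]
    x, y = min(xs), min(ys)
    w, h = max(xs) - x, max(ys) - y
    return [x, y, x + w, y + h]


hand = [Landmark(0.125, 0.25, 0.0), Landmark(0.5, 0.75, 0.0), Landmark(0.25, 0.5, 0.0)]
print(calc_bounding_rect(hand, 200, 100))
# [25, 25, 100, 75]
```

For debugging, the result could be drawn with OpenCV, for example `cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)`.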

  • Classifier model architecture

Gesture model architecture

Input: an array of length 42

Output: hand class, one of [‘Open’, ‘Start_cmd’, ‘Pointer’, ‘Close’, ‘OK’]
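The model layers appear only in the figure, so the sketch below does not reproduce them. It only illustrates how a five-way model output (for example, softmax probabilities) could be mapped to the class labels listed above, in that order. The `decode_prediction` helper and its 0.5 confidence threshold are hypothetical.

```python
CLASS_NAMES = ["Open", "Start_cmd", "Pointer", "Close", "OK"]


def decode_prediction(probabilities, threshold=0.5):
    """Return the most likely class name, or None if it is not confident.

    Hypothetical helper: the threshold value is only illustrative.
    """
    if len(probabilities) != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} scores, got {len(probabilities)}")
    # Index of the highest score (argmax without NumPy).
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None
    return CLASS_NAMES[best]


print(decode_prediction([0.05, 0.80, 0.05, 0.05, 0.05]))  # Start_cmd
print(decode_prediction([0.30, 0.20, 0.20, 0.15, 0.15]))  # None
```

With a Keras model, `probabilities` would typically be one row of `model.predict(batch)`. Rejecting low-confidence outputs is one way to avoid acting on ambiguous hand shapes.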

Subsystem Verification

Detection range

On the first version of the robot, the camera is mounted 160 cm above the floor. According to our calculation with the RealSense D455 extrinsic matrix, this means the maximum detection range should be more than 230 cm.

  • Test conditions

    • Distances were estimated using 2.5D anchors.

    • MediaPipe's minimum detection confidence and minimum tracking confidence were both set to 0.5.

  • Results

Figure: detection range test results (gesturetest1.png)

Processing time testing

  • Test conditions

    • Running on the CPU of the robot's Intel NUC

  • Results

    • Each node uses 300 MB of memory.

    • Each node uses 10% of the CPU when processing at the maximum frequency.

Figure: processing time test results (gesturetest2.png)

Problems and future plans

Problems

  • The classifier model has low performance.

  • Hands can only be detected in front of the robot.

  • If several hands appear in the camera view, there is no way to indicate which hand should be detected.

Gesture perception tasks in RoboCup@Home

Task 5.1 Carry My Luggage

  • Detect start command

  • Detect pointing position

Task 5.3 Farewell

  • Detect a tired user

  • Detect calling gestures and the signal to leave

Task 6.4 Hand Me That

  • Detect pointing position

Task 6.6 Restaurant

  • Detect calling and waving

Task 6.7 Smoothie Chef

  • Learn how to prepare the smoothie, then follow the learned steps

Task 6.8 Stickler for the Rules

  • Identify rule breakers

Note

This semester, we plan to work only on Task 5.1.