Robot Arm Sorting with Online Learning via Human Interaction


Video Demo


Overview

In this project, I developed a novel interactive learning system that enables a robot arm to learn sorting behaviors through natural human interaction. The system combines computer vision, online learning, and human feedback to create a robot that can learn and refine its sorting criteria throughout operation.

At its core, the system utilizes a Franka Panda robotic arm equipped with a depth camera for precise block detection and manipulation. A separate camera monitors the workspace for human interventions, enabling natural corrections through physical demonstrations. Finally, a combined ECG/IMU sensor worn by the operator enables corrections to the robot’s behavior via gesture control.

The system architecture integrates three neural networks:

  1. Complex sorting network for four-category block classification
  2. Binary gesture network for yes/no feedback validation
  3. Complex gesture network for direct category corrections

This multi-network approach enables fluid human-robot interaction, where users can either confirm the robot’s decisions through gestures or physically move blocks to demonstrate correct sorting. The system continuously learns from these interactions, improving its sorting accuracy over time. The project demonstrates how combining traditional robotic control with adaptive learning and natural interaction modes can create systems that efficiently learn from and collaborate with human operators.

Sorting Process Overview

The full sorting procedure is summarized in the block diagram below.

  1. System Architecture
    • Uses ROS 2 framework with both C++ (robot control) and Python (ML/vision) nodes
    • Integrates MoveIt for motion planning with the Franka Panda robot arm
    • Uses depth cameras for visual perception
    • Implements custom neural networks for both object sorting and gesture recognition
  2. Main Control Flow
    • A sorting demonstration can optionally be given by the user to help pre-train the sorting network
    • The sorting starts when the ‘sort_blocks’ action is called
    • The robot moves to a scanning pose above the workspace
    • A RealSense depth camera scans for blocks and creates 3D markers for detected objects
    • Robot moves to scan the first detected block
    • Sorting network predicts classification
    • Robot grabs block and moves to wait position
    • Gesture recognition checks for human agreement/disagreement
    • Block is placed in appropriate pile based on final decision
    • The process repeats until all blocks are sorted
  3. Interactive Learning
    • Three neural networks work in parallel
    • Sorting network: Classifies blocks based on visual features (4 categories)
    • Binary Gesture network: Interprets human gestures for yes/no feedback
    • Complex Gesture network: Interprets human gestures for sorting classifications (4 categories)
    • Networks are continuously trained during operation
  4. Human-Robot Interaction
    • Allows gesture-based corrections of sorting decisions
    • Monitors for physical intervention (hand detection)
    • Supports continuous learning from human feedback

The system combines autonomous operation with human oversight, enabling efficient sorting with continuous improvement through human feedback.
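To make this flow concrete, here is a heavily simplified sketch of the main sorting loop. The function names (scan_for_blocks, wait_for_feedback, and so on) are hypothetical stand-ins for the project’s actual ROS 2 actions and services.

```python
# Hypothetical, simplified sketch of the main sorting loop.
# All helper names are illustrative stand-ins for the real ROS 2 actions/services.

def sort_blocks(robot, camera, sorting_net, gesture_net):
    robot.move_to_scan_pose()                      # hover above the workspace
    blocks = camera.scan_for_blocks()              # detect blocks, create 3D markers

    for block in blocks:
        robot.move_above(block)                    # scan the individual block
        image = camera.capture_block_image()
        category = sorting_net.predict(image)      # initial classification

        robot.grab(block)
        robot.move_to_wait_pose()

        feedback = gesture_net.wait_for_feedback() # yes/no or a direct category
        if feedback is not None and feedback != "yes":
            sorting_net.train_on(image, feedback)  # learn from the correction
            category = feedback

        robot.place_in_pile(block, category)       # place block in the chosen pile
```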

Below is a summary of the nodes used for this task:

Neural Network for Block Sorting

The system uses a sorting network built in PyTorch, designed for multi-category block classification. The network processes RGB images from the RealSense camera mounted on the robot’s end-effector and outputs classification decisions into four distinct categories.

Network Architecture

The architecture consists of three convolutional layers with batch normalization and max pooling for feature extraction, followed by three fully connected layers. Key features include:

Input: 128x128x3 RGB images

Convolutional Layers

  1. Conv2D(16 filters, 3x3) -> BN -> ReLU -> MaxPool(2x2) -> 64x64x16
  2. Conv2D(32 filters, 3x3) -> BN -> ReLU -> MaxPool(2x2) -> 32x32x32
  3. Conv2D(64 filters, 3x3) -> BN -> ReLU -> MaxPool(2x2) -> 16x16x64

Output Layers
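Putting the convolutional stack and output layers together, a minimal PyTorch sketch of the network might look like the following. The convolutional dimensions follow the list above; the fully connected layer widths are assumptions, since only the layer count (three) and the four output categories are stated.

```python
import torch
import torch.nn as nn

class SortingNet(nn.Module):
    """Sketch of the 4-category block sorting CNN (FC widths are assumptions)."""

    def __init__(self, num_classes: int = 4):
        super().__init__()

        def block(c_in, c_out):
            # Conv2D -> BatchNorm -> ReLU -> MaxPool, halving the spatial size
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            block(3, 16),    # 128x128x3 -> 64x64x16
            block(16, 32),   # 64x64x16  -> 32x32x32
            block(32, 64),   # 32x32x32  -> 16x16x64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 64, 256),  # assumed width
            nn.ReLU(),
            nn.Linear(256, 64),            # assumed width
            nn.ReLU(),
            nn.Linear(64, num_classes),    # four block categories
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: classify a single 128x128 RGB image tensor
# logits = SortingNet()(torch.randn(1, 3, 128, 128))
```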

Online Learning System

Pretraining by Demonstration

The system provides a demonstration-based pretraining phase to establish initial sorting knowledge:

The pretraining phase provides the network with an initial understanding of visual features associated with each category, creating a higher baseline for further learning through interaction.
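As a rough illustration of this phase, the sketch below pretrains the network on demonstrated (image, category) pairs. The batching, epoch count, and learning rate are assumptions, not the project’s actual settings.

```python
import torch
import torch.nn as nn

def pretrain_from_demo(net, demo_images, demo_labels, epochs: int = 20, lr: float = 1e-3):
    """Pretrain the sorting network on user-demonstrated (image, category) pairs.

    demo_images: tensor of shape (N, 3, 128, 128); demo_labels: tensor of shape (N,).
    Epoch count and learning rate are illustrative assumptions.
    """
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(net(demo_images), demo_labels)
        loss.backward()
        optimizer.step()
    return net
```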

Gesture Feedback Sensor

Human-robot interactions in the project were facilitated by a wearable sensor created by Yayun Du of the Simpson Querrey Biomedical Research Center at Northwestern. The sensor features both a single-channel ECG sensor and a 6-axis IMU, providing robust information about the behavior of the user. It can be placed in a variety of locations on the body to sense different things, from hand gestures to eye movements. For this application, it made the most sense to place it on the wrist, where it can easily capture data about hand gestures for guiding the robot in the sorting task.

Below are images showing the sensor itself and the positioning of the sensor on the arm. Once in the right position, the sensor is secured with adhesive pads to hold it in place.

Two different types of gestures were used for this project. First, a yes/no binary gesture classification was used to confirm or correct a prediction made by the robot. I made the “yes” gesture simply holding the hand stationary, while the “no” gesture is a cutting motion with the hand. This way, if the robot is sorting correctly, the user can simply remain still and the robot will proceed with sorting. The GIF below shows examples of the two gestures.

For direct sorting feedback, a four-way gesture network (the “complex” gesture network) was used. The network itself is described in more detail below. The four gestures are shown in the GIF below, and each was selected to take advantage of the specific sensing modalities of the sensor. The gestures produce very different ECG and IMU signals, due to the clenching and unclenching of the fist and the different directions of motion, which made classification accurate and robust after training.

The sensor communicates with the computer via Bluetooth. I created a ROS 2 node to interface between the sensor and the rest of the ROS system. The node publishes the sensor’s data to topics so it can be processed by the network_node for gesture predictions.
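A minimal sketch of such a bridge node is shown below. The topic name, publish rate, message layout, and the read_sample() helper are hypothetical; the real node depends on the sensor’s Bluetooth protocol.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import Float32MultiArray

class SensorBridge(Node):
    """Hypothetical sketch: publish ECG/IMU samples from the wearable sensor."""

    def __init__(self):
        super().__init__('sensor_bridge')
        # Topic name and rate are assumptions, not the project's actual interface.
        self.pub = self.create_publisher(Float32MultiArray, 'gesture_sensor/data', 10)
        self.timer = self.create_timer(0.02, self.publish_sample)  # ~50 Hz, assumed

    def publish_sample(self):
        sample = self.read_sample()  # hypothetical: one ECG + IMU sample over Bluetooth
        msg = Float32MultiArray()
        msg.data = sample
        self.pub.publish(msg)

    def read_sample(self):
        # Placeholder: the real node would read from the Bluetooth link here.
        return [0.0] * 7  # e.g. 1 ECG channel + 6 IMU axes

def main():
    rclpy.init()
    rclpy.spin(SensorBridge())

if __name__ == '__main__':
    main()
```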

Neural Networks for Gesture Control

The system implements two complementary gesture networks for human feedback - a binary network for yes/no feedback and a complex network for direct category selection:

Gesture Classification Data Processing

Gesture Data Normalization
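The exact preprocessing pipeline is not reproduced here; a plausible sketch, assuming per-channel z-score normalization of each 4x1000 sensor window, is shown below.

```python
import numpy as np

def normalize_window(window: np.ndarray) -> np.ndarray:
    """Per-channel z-score normalization of a (4, 1000) sensor window.

    This is an assumed preprocessing step, not necessarily the project's exact one.
    """
    mean = window.mean(axis=1, keepdims=True)
    std = window.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (window - mean) / std
```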

Binary Gesture Network (Yes/No)

Input: 4 channels x 1000 timepoints (reshaped from 20x50)

Convolutional Layers

  1. Conv1D(32 filters, k=5, s=2) -> BN -> ReLU -> MaxPool -> 250 timepoints
  2. Conv1D(64 filters, k=5, s=2) -> BN -> ReLU -> MaxPool -> 62 timepoints

Output Layers

Complex Gesture Network (Multi-Category)

Input: 4 channels x 1000 timepoints (reshaped from 20x50)

Convolutional Layers

  1. Conv1D(64 filters, k=5, s=2) -> BN -> ReLU -> MaxPool -> 250 timepoints
  2. Conv1D(128 filters, k=5, s=2) -> BN -> ReLU -> MaxPool -> 62 timepoints

Output Layers
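Because both gesture networks share the same 1D convolutional pattern and differ mainly in filter counts and output size, a single parameterized PyTorch sketch can cover them. The convolutional dimensions follow the lists above; the fully connected widths are assumptions.

```python
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    """Sketch of the 1D-CNN gesture classifiers (FC widths are assumptions)."""

    def __init__(self, num_classes: int, base_filters: int):
        super().__init__()

        def block(c_in, c_out):
            # Conv1D(k=5, s=2) -> BatchNorm -> ReLU -> MaxPool, quartering the length
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, stride=2, padding=2),
                nn.BatchNorm1d(c_out),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )

        self.features = nn.Sequential(
            block(4, base_filters),                 # 1000 -> 250 timepoints
            block(base_filters, base_filters * 2),  # 250  -> 62 timepoints
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(base_filters * 2 * 62, 128),  # assumed width
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # x: (batch, 4 channels, 1000 timepoints)
        return self.classifier(self.features(x))

# Binary yes/no network and 4-category "complex" network:
binary_gesture_net = GestureNet(num_classes=2, base_filters=32)
complex_gesture_net = GestureNet(num_classes=4, base_filters=64)
```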

Network Integration

Both networks operate in parallel during deployment:

Franka Panda Control with MoveIt

The system integrates with MoveIt through the MoveGroupInterface and PlanningSceneInterface, which provide core functionality for motion planning and collision avoidance. Key custom functions were developed to handle common operations:

Core Movement Functions:

The system uses MoveIt’s planning scene to maintain a real-time representation of the workspace:

All motion planning is then executed through MoveIt’s pipeline using the MoveIt C++ API.

Block Detection

The block detection system employed computer vision techniques to accurately detect blocks, their dimensions, positions, and orientations. Key aspects, sketched in the code example after this list, included:

  1. Camera Setup:
    • Utilized both color and depth information from a RealSense D405 camera mounted on the robot gripper. This camera is optimized for close-range, high-accuracy sensing.
    • To scan, the robot hand hovers above the workspace with the gripper and camera pointing straight down, where it can see the blocks.
    • Depth data provides the 3D position of the table surface, while color data helps with block detection and segmentation.
  2. Depth Segmentation:
    • Implemented a custom depth segmentation algorithm to isolate the table surface.
    • Used Gaussian blur and Canny edge detection on the depth image.
    • Applied Hough line transform to identify table edges.
    • Created a mask to separate the table surface from the background.
  3. Color-based Block Segmentation:
    • Applied color thresholding to isolate potential block regions.
    • Used morphological operations (opening and closing) to reduce noise and refine block shapes.
  4. Contour Analysis:
    • Detected contours in the segmented image using cv2.findContours().
    • Analyzed contour properties (area, aspect ratio) to filter out non-block objects.
    • Used cv2.minAreaRect() to find the oriented bounding box of each block.
  5. 3D Pose Estimation:
    • Calculated block orientation using the angle of the minimum area rectangle.
    • Estimated block dimensions using the contour’s bounding rectangle and depth information.
    • Depth data was used to estimate the height of the block.
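Below is a condensed sketch of the color segmentation and contour analysis steps. The HSV thresholds and minimum contour area are illustrative values, and the depth-based table segmentation is reduced to a precomputed mask for brevity.

```python
import cv2
import numpy as np

def detect_blocks(color_image: np.ndarray, table_mask: np.ndarray):
    """Find oriented bounding boxes of candidate blocks on the table.

    color_image: BGR image from the D405; table_mask: binary mask of the table
    surface from the depth segmentation step. Threshold values are illustrative.
    """
    # Color thresholding in HSV to isolate saturated (non-table) regions.
    hsv = cv2.cvtColor(color_image, cv2.COLOR_BGR2HSV)
    color_mask = cv2.inRange(hsv, (0, 80, 60), (180, 255, 255))
    mask = cv2.bitwise_and(color_mask, table_mask)

    # Morphological opening/closing to remove noise and fill small holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Contour analysis: filter by area and fit oriented bounding boxes.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blocks = []
    for contour in contours:
        if cv2.contourArea(contour) < 200:  # assumed minimum block size in pixels
            continue
        (cx, cy), (w, h), angle = cv2.minAreaRect(contour)
        blocks.append({"center": (cx, cy), "size": (w, h), "angle": angle})
    return blocks
```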

Interaction and Block Correction

The system continuously monitors the workspace to enable natural human interaction in the sorting process. This allows users to physically adjust block placements while the system learns from these corrections:

Vision-Based Monitoring:

Interaction Detection Pipeline:

  1. Pre-Interaction State
    • Captures baseline color and depth images when hands enter the workspace
    • Tracks specific monitoring regions around current and potential block positions
    • Uses depth thresholding to detect significant changes (see the sketch after this list)
  2. Movement Detection
    • Analyzes depth changes in monitored regions after hands leave
    • Identifies block removal from original position
    • Detects block addition in new stack locations
    • Uses configurable thresholds for robust change detection
  3. State Updates
    • Updates internal block tracking based on detected movements
    • Triggers network retraining based on human corrections
    • Maintains consistency between physical state and system’s internal representation
    • Updates all three networks when corrections occur, using the correction services described in the next section
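A simplified version of the depth-change check on a single monitored region might look like the sketch below; the threshold and changed-area fraction are illustrative values, not the project’s tuned parameters.

```python
import numpy as np

def region_changed(depth_before: np.ndarray, depth_after: np.ndarray,
                   region: tuple, threshold_m: float = 0.02,
                   min_changed_fraction: float = 0.3) -> bool:
    """Return True if the depth in a monitored region changed significantly.

    region: (x, y, w, h) in pixels; threshold_m and min_changed_fraction are
    illustrative values, not the project's tuned parameters.
    """
    x, y, w, h = region
    before = depth_before[y:y + h, x:x + w].astype(np.float32)
    after = depth_after[y:y + h, x:x + w].astype(np.float32)
    changed = np.abs(after - before) > threshold_m
    return changed.mean() > min_changed_fraction

# Example: a block removed from its original position raises the measured depth
# (the camera now sees the table), so that region reports a significant change.
```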

Network Corrections

The network_node offers services to correct erroneous training of each of the networks. These services reverse the incorrect training update and retrain the network with the correct classification. This enables continuous learning from human feedback via physical interventions, even after the robot has already trained and updated the networks.
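One way such a correction could be approximated is sketched below: take a negative-gradient step against the earlier, incorrect label and then retrain on the corrected one. This is an illustrative assumption about the mechanism, not the project’s exact implementation, and the function signature is hypothetical.

```python
import torch
import torch.nn as nn

def correct_sorting_example(net, optimizer, image, wrong_label: int, correct_label: int,
                            undo_steps: int = 1, relearn_steps: int = 1):
    """Hypothetical sketch of a correction: counteract the earlier (incorrect) update,
    then retrain on the correct classification.

    The negative-gradient "undo" and the step counts are assumptions about how the
    reversal could be approximated. image: tensor of shape (1, 3, 128, 128).
    """
    loss_fn = nn.CrossEntropyLoss()
    wrong = torch.tensor([wrong_label])
    right = torch.tensor([correct_label])

    for _ in range(undo_steps):                    # roughly undo the earlier update
        optimizer.zero_grad()
        (-loss_fn(net(image), wrong)).backward()   # negative of the original loss
        optimizer.step()

    for _ in range(relearn_steps):                 # retrain with the human-provided label
        optimizer.zero_grad()
        loss_fn(net(image), right).backward()
        optimizer.step()
```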

Camera Localization

The system uses AprilTag-based localization to accurately determine the D435 monitoring camera’s position in the workspace:
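As an illustration, estimating the camera pose from a single tag of known size and known workspace pose could look like the sketch below. Using the pupil_apriltags package is an assumption; the project may use a different AprilTag detector.

```python
import numpy as np
from pupil_apriltags import Detector

def locate_camera(gray_image, fx, fy, cx, cy, tag_size_m, T_world_tag):
    """Estimate the monitoring camera's pose in the workspace from one AprilTag.

    T_world_tag is the known 4x4 pose of the tag in the workspace frame.
    Using pupil_apriltags here is an assumption about the detector library.
    """
    detector = Detector(families="tag36h11")
    detections = detector.detect(gray_image, estimate_tag_pose=True,
                                 camera_params=(fx, fy, cx, cy), tag_size=tag_size_m)
    if not detections:
        return None

    det = detections[0]
    T_cam_tag = np.eye(4)                  # tag pose in the camera frame
    T_cam_tag[:3, :3] = det.pose_R
    T_cam_tag[:3, 3] = det.pose_t.ravel()

    # Camera pose in the workspace: T_world_cam = T_world_tag * inv(T_cam_tag)
    return T_world_tag @ np.linalg.inv(T_cam_tag)
```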

Network Training and Evaluation

The network_training node provides functionality for testing and training the networks without using the robot system:

This node acts as both a training and evaluation tool, enabling training sessions to improve network performance and evaluation of trained networks.
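For example, evaluating a trained network on a held-out set of labeled samples could be as simple as the following sketch (the dataset handling is an assumption; the real node also drives training sessions, which are omitted here).

```python
import torch

@torch.no_grad()
def evaluate(net, inputs, labels) -> float:
    """Return classification accuracy on a batch of labeled samples.

    inputs/labels are assumed to be pre-batched tensors.
    """
    net.eval()
    predictions = net(inputs).argmax(dim=1)
    return (predictions == labels).float().mean().item()
```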

Online Training Results

The gesture networks were pretrained before deployment.

Binary Gestures:

Complex Gestures:

Below are the online training results for the solid color blocks. The first graph shows the outcomes from the video at the top of this post. The second set of graphs shows an extended training session using the network_training node.

Demo:

Extended Training:

After the network was trained, it could then be redeployed and retrained using a new sorting method. Below are the results from redeploying the network in the demo video, along with results from an extended training session in the network_training node. For comparison, I have also provided an example of an extended training session with no pretraining, showing that a pretrained network learns faster than an untrained one, even when learning a sorting method different from the one it was pretrained on.

Extended Training with Pretraining:

Extended Training without Pretraining:

Finally, the network can be deployed again to learn another new sorting method.

Demo with Pretraining:

Extended Training with Pretraining:

Extended Training without Pretraining:

Conclusion and Future Work

This project successfully integrated robotics, computer vision, and machine learning to create an adaptive block sorting system. The system features a unique approach to training neural networks online via feedback and interaction with a human operator, as well as interaction between the different networks. Its ability to learn and adapt to new sorting criteria demonstrates potential for intelligent automation in industrial settings. Future work on this project could include:

  1. Extending the system to handle a wider variety of objects with more complex shapes and materials, perhaps using an object detection model or vision-language-action model.
  2. Implementing more advanced reinforcement learning algorithms to further improve adaptability and efficiency.
  3. Improving the gesture networks to categorize more gestures and to generalize more robustly across different users.

By combining adaptive learning with precise robotic control, this project lays the groundwork for more intelligent automation systems. The potential applications span various industries, from manufacturing and logistics to healthcare and beyond.

Acknowledgements

Thank you to Matt Elwin and Yayun Du for their guidance and support with this project!

GitHub