Voice Response Based Emotion Intensity Classification

This interactive dashboard summarizes research by Hoashalarajh Rajendran et al. on enabling assistive robots to classify human emotion intensities. By moving from simple emotion detection to intensity classification, robots can respond with greater empathy and social alignment.

The Core Problem

No More Numbers: "I Need Classes"

Emotions are widely classified from images, from speech signals, and even from multimodal inputs. The intensity of those emotions, however, has been studied only as a continuous scale, such as emotion temperature, and the rationale behind such a scale is hard to interpret. At the same time, social media and the internet hold a rapidly growing volume of speech signals covering a wide range of emotional intensities. We classified emotion intensity from speech signals alone into four categories (Neutral, Onset, Offset and Apex) using machine learning algorithms. These four distinct levels allow a robot to distinguish between a "Mildly Sad" user and a "Severely Distressed" one.

🎙

1.2s Windowing

Optimal segment length for acoustic stability.

17 Features

Statistical features (Mean & Std Dev).

📈

Plurality Logic

Failure probability falls exponentially with the number of segments.

🎯

System Highlights

  • 99.47% Accuracy

    Achieved with Random Forest on intensive voice datasets, by converting unstructured data into tabular data.

  • Human Alignment

    An experimental study compared robot predictions against human perception.

  • Theoretical Bound

    Failure probability bounded by Hoeffding’s Inequality.

Audio Analysis Pipeline

The architecture transforms raw acoustic waveforms into classified intensity labels. Each module below lists its technical processing requirements for real-time assistive robotics.

Pre-Processing

🎙

Signal VAD

Voice Activity Detection isolates speech from background noise.
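
The source does not say which VAD method the system uses. As a minimal sketch, assuming a plain energy threshold stands in for it, speech can be isolated by keeping only the frames whose RMS energy exceeds a cutoff. The function name `energy_vad`, the 480-sample frame length and the 0.02 threshold are illustrative assumptions, not values from the research.

```python
import math


def energy_vad(samples, frame_len=480, threshold=0.02):
    """Return only the samples from frames whose RMS energy exceeds threshold.

    Assumption: a simple energy gate, used here purely for illustration.
    Trailing samples that do not fill a whole frame are discarded.
    """
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms > threshold:
            voiced.extend(frame)
    return voiced
```

A fixed threshold is fragile in practice; a deployed system would more likely adapt the cutoff to the measured noise floor.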

1.2s Windowing

Partitions audio into N non-overlapping segments.
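
The partitioning step can be sketched as follows. This assumes the audio is already a flat sequence of samples at a known sample rate, and that a trailing piece shorter than 1.2 s is dropped; the source does not say how a remainder is handled. `split_into_segments` is a hypothetical name.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.2):
    """Split samples into N non-overlapping segments of segment_seconds each.

    Assumption: any trailing remainder shorter than one segment is dropped.
    """
    segment_len = int(round(segment_seconds * sample_rate))
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(samples) // segment_len
    return [
        samples[i * segment_len:(i + 1) * segment_len]
        for i in range(n_segments)
    ]
```

For example, three seconds of audio at 16 kHz yields two full segments of 19,200 samples each, with the last 0.6 s discarded.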

Feature Engine

📊

Statistical Extractor

Frames: 40 per segment

Vector Vi: 17 Dimensions

MFCCs, Amplitude, RMS Energy, Zero-crossing rate, Fundamental frequency

Classifier

🗽

Plurality Decision

Aggregates segment-wise labels to determine final response intensity.
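
A minimal sketch of the aggregation step, assuming each segment has already been assigned one of the four intensity labels by the classifier. The tie-breaking rule (the tied label that appears first in the response wins) is an assumption; the source does not specify one. `plurality_label` is a hypothetical name.

```python
from collections import Counter


def plurality_label(segment_labels):
    """Return the label predicted for the largest number of segments.

    Assumption: on a tie, the tied label seen earliest in the response wins.
    """
    if not segment_labels:
        raise ValueError("need at least one segment label")
    counts = Counter(segment_labels)
    top = max(counts.values())
    for label in segment_labels:
        if counts[label] == top:
            return label
```

Usage: `plurality_label(["Onset", "Apex", "Apex", "Neutral"])` selects `"Apex"` because it has the most segment votes, even though it does not hold a majority.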


The Empirical Method & The Theoretical Guarantee

Why is 1.2s windowing and plurality voting effective? The research provides a theoretical guarantee that the classification error decreases exponentially with signal length.

1 Statistical Feature Vector

Each 1.2s segment i is transformed into a vector Vi representing the temporal mean (μ) and standard deviation (σ) of 17 low-level descriptors.

Vi = [μ(Xi), σ(Xi)]


17-Dimensional Feature Space

  • Amplitude
  • RMS Energy
  • Zero-crossing rate
  • Fundamental frequency
  • MFCCs (1-13)
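
The mean and standard deviation step can be sketched for a subset of the descriptors. This example computes only amplitude, RMS energy and zero-crossing rate per frame, because fundamental frequency and MFCCs need a signal-processing library and are left out. It splits a segment into 40 frames as stated in the pipeline, then returns the per-descriptor means followed by the standard deviations. The function names, the use of population standard deviation, and the ordering of the output are all assumptions. Note also that concatenating both statistics doubles the number of values per descriptor; how the source arrives at its 17-dimensional figure is not spelled out here.

```python
import math
from statistics import fmean, pstdev


def frame_descriptors(frame):
    """Amplitude, RMS energy and zero-crossing rate for one frame of samples."""
    amplitude = max(abs(s) for s in frame)
    rms = math.sqrt(fmean([s * s for s in frame]))
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [amplitude, rms, zcr]


def segment_vector(segment, frames_per_segment=40):
    """Return [means..., stds...] of the per-frame descriptors of one segment."""
    frame_len = len(segment) // frames_per_segment
    if frame_len < 2:
        raise ValueError("segment too short for the requested frame count")
    rows = [
        frame_descriptors(segment[i * frame_len:(i + 1) * frame_len])
        for i in range(frames_per_segment)
    ]
    columns = list(zip(*rows))  # one tuple of per-frame values per descriptor
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std: an assumption
    return means + stds
```

Each call produces one row of the tabular dataset, so stacking the rows for all segments gives the table a classifier can be trained on.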

2 Theoretical Guarantee

Based on Hoeffding's Inequality, the plurality decision Ŷ for a response with N segments satisfies a failure bound:

Pr(Failure) ≤ (C-1) · exp(-Nμ²/2)

C: Number of classes (4)

N: Segments of 1.2s duration

μ: Classifier confidence margin (not the temporal mean μ used in the feature vector above)
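
One standard route to a bound of this shape is sketched below. It assumes that segment-level predictions are independent and that, for every segment, the probability of predicting the true class y exceeds the probability of predicting any single competing class j by at least μ. These assumptions and the auxiliary variable Z are introduced here for illustration; the paper's exact conditions may differ.

```latex
% Z_i^{(j)} = \mathbf{1}[\hat{y}_i = y] - \mathbf{1}[\hat{y}_i = j] \in [-1, 1],
% with \mathbb{E}[Z_i^{(j)}] \ge \mu for every segment i and every j \ne y.
\begin{align*}
\Pr\bigl(\text{class } j \text{ ties or beats } y\bigr)
  &= \Pr\Bigl(\sum_{i=1}^{N} Z_i^{(j)} \le 0\Bigr)
   \le \exp\!\Bigl(-\frac{2\,(N\mu)^2}{N \cdot 2^2}\Bigr)
   = \exp\!\Bigl(-\frac{N\mu^2}{2}\Bigr), \\
\Pr(\text{Failure})
  &\le \sum_{j \ne y} \Pr\bigl(\text{class } j \text{ ties or beats } y\bigr)
   \le (C-1)\,\exp\!\Bigl(-\frac{N\mu^2}{2}\Bigr).
\end{align*}
```

The first line applies Hoeffding's Inequality to N variables of range 2, and the second is a union bound over the C-1 competing classes.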

Exponential Convergence
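
To make the decay concrete, the bound Pr(Failure) ≤ (C-1) · exp(-Nμ²/2) can be evaluated directly. The sketch below uses hypothetical function names, caps the result at 1 because it is a probability (a cap added here, not part of the source formula), and inverts the bound to estimate how many segments are needed for a target failure rate.

```python
import math


def hoeffding_failure_bound(n_segments, margin, n_classes=4):
    """Upper bound (C-1) * exp(-N * mu^2 / 2), capped at 1."""
    bound = (n_classes - 1) * math.exp(-n_segments * margin ** 2 / 2)
    return min(1.0, bound)


def segments_needed(margin, target, n_classes=4):
    """Smallest N for which the uncapped bound is at most target.

    Solves (C-1) * exp(-N * mu^2 / 2) <= target for N.
    """
    if not (0 < target < 1) or margin <= 0:
        raise ValueError("need 0 < target < 1 and margin > 0")
    n = 2 * math.log((n_classes - 1) / target) / margin ** 2
    return max(1, math.ceil(n))


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, hoeffding_failure_bound(n, margin=0.5))
```

Because N counts 1.2 s segments, multiplying `segments_needed(...)` by 1.2 gives a rough response duration in seconds for a chosen margin and target, under the same assumptions as the bound itself.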

Experimental Results

Performance was evaluated across four machine learning models. Random Forest emerged as the best-performing algorithm on the 17-dimensional acoustic feature space.

Model Accuracy Comparison

Human Alignment Study

Robot predictions were compared against the judgments of 15 human participants (ages 20-30). Results showed significant alignment in intensity perception.
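
The source does not state how the alignment figure was computed. As a minimal sketch, assuming alignment is measured as plain label agreement between the system and a human rater over the same responses, it could be calculated like this. `agreement_rate` is a hypothetical name, and the labels in the usage note are made up for illustration.

```python
def agreement_rate(system_labels, human_labels):
    """Fraction of responses on which the system and a human give the same label."""
    if len(system_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not system_labels:
        raise ValueError("need at least one labeled response")
    matches = sum(s == h for s, h in zip(system_labels, human_labels))
    return matches / len(system_labels)
```

With the system saying `["Apex", "Onset", "Neutral", "Offset"]` and a participant saying `["Apex", "Onset", "Apex", "Offset"]`, three of four labels match. Averaging such rates across participants is one possible way to get a group-level figure.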

  • Random Forest (System): 99.47%
  • Human Peer Group Alignment: 78.86%

"The proposed system performs significantly better on test data... revealing significant alignment with data extracted from human participants."