Voice Response Based Emotion Intensity Classification
This interactive dashboard summarizes research by Hoashalarajh Rajendran et al. on enabling assistive robots to classify human emotion intensities. By transitioning from simple emotion detection to intensity classification, robots can respond with higher empathy and social alignment.
No More Numbers: "I Need Classes"
Emotions are widely classified from images, from speech signals, and even from multimodal inputs. Their intensities, however, have mostly been studied only on a continuous scale such as emotion temperature. The rationale behind such a continuous scale is hard to interpret, while social media and the internet supply a rapidly growing volume of speech with widely varying emotional intensity. We classified emotion intensity from speech signals alone into four categories (Neutral, Onset, Offset and Apex) using machine learning algorithms. Classifying intensity into four distinct levels allows a robot to distinguish between a "Mildly Sad" user and a "Severely Distressed" one.
1.2s Windowing
Optimal segment length for acoustic stability.
17 Features
Statistical features (Mean & Std Dev).
Plurality Logic
Exponentially reducing failure probability.
System Highlights
✓ 99.47% Accuracy
Achieved with Random Forest on intensive voice datasets by converting unstructured audio data into tabular data.
✓ Human Alignment
An experimental study compared robot predictions against human perception.
✓ Theoretical Bound
Failure probability bounded by Hoeffding’s Inequality.
Audio Analysis Pipeline
The architecture transforms raw acoustic waveforms into classified intensity labels.
Pre-Processing
Signal VAD
1.2s Windowing
Feature Engine
Statistical Extractor
Frames: 40 per segment
Vector Vi: 17 Dimensions
Classifier
Plurality Decision
Aggregates segment-wise labels to determine final response intensity.
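The pre-processing and decision stages above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the choices of non-overlapping segments, 40 equal non-overlapping frames, dropping a short trailing remainder, and breaking ties by first occurrence are all assumptions that the source does not specify. Voice activity detection and the classifier itself are left out.

```python
from collections import Counter

SEGMENT_SECONDS = 1.2      # window length stated in the pipeline
FRAMES_PER_SEGMENT = 40    # frames per segment stated in the pipeline


def split_into_segments(samples, sample_rate):
    """Cut a signal into consecutive, non-overlapping 1.2 s segments.

    A trailing remainder shorter than one segment is dropped (assumption).
    """
    seg_len = round(SEGMENT_SECONDS * sample_rate)
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def split_into_frames(segment):
    """Cut one segment into 40 equal, non-overlapping frames (assumption)."""
    frame_len = len(segment) // FRAMES_PER_SEGMENT
    return [segment[i * frame_len:(i + 1) * frame_len]
            for i in range(FRAMES_PER_SEGMENT)]


def plurality_decision(segment_labels):
    """Return the label assigned to the most segments.

    Ties go to the label that appears first in the input (assumption).
    """
    return Counter(segment_labels).most_common(1)[0][0]
```

For a 5 s signal sampled at 16 kHz, `split_into_segments` yields four segments of 19,200 samples, and each segment splits into 40 frames of 480 samples (30 ms) under the assumptions above.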
The Empirical Method & The Theoretical Guarantee
Why is 1.2s windowing and plurality voting effective? The research provides a theoretical guarantee that the classification error decreases exponentially with signal length.
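The effect of signal length on the plurality decision can be checked with a small simulation. The sketch below is illustrative only and is not taken from the research: it assumes segments are labelled independently, that each segment is correct with a fixed probability `p_correct`, that errors fall uniformly on the other three classes, and that ties count as failures. The function name and parameters are hypothetical.

```python
import random
from collections import Counter

LABELS = ["Neutral", "Onset", "Offset", "Apex"]


def simulate_failure_rate(n_segments, p_correct, trials=2000, seed=0):
    """Estimate how often a plurality vote over n_segments is wrong.

    Each segment gets the true label with probability p_correct, otherwise
    one of the three other labels chosen uniformly. Ties count as failures.
    """
    rng = random.Random(seed)
    true_label = "Apex"
    wrong_labels = [label for label in LABELS if label != true_label]
    failures = 0
    for _ in range(trials):
        votes = Counter(
            true_label if rng.random() < p_correct else rng.choice(wrong_labels)
            for _ in range(n_segments)
        )
        top = max(votes.values())
        winners = [label for label, count in votes.items() if count == top]
        if winners != [true_label]:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    for n in (1, 5, 15, 25):
        print(n, simulate_failure_rate(n, p_correct=0.6))
```

Under these assumptions the estimated failure rate falls quickly as the number of segments grows, which is the qualitative behaviour the theoretical guarantee describes.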
1 Statistical Feature Vector
Each 1.2s segment i is transformed into a vector Vi representing the temporal mean (μ) and standard deviation (σ) of 17 low-level descriptors.
Vi = [μ(Xi), σ(Xi)]
// 17-Dimensional Feature Space
• Amplitude
• RMS Energy
• Zero-crossing rate
• Fundamental frequency
• MFCCs (1-13)
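The reduction from per-frame descriptors to one tabular row per segment can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, the descriptor values are taken as already extracted, and the population standard deviation is an arbitrary choice. The sketch simply concatenates the means and the standard deviations; it makes no claim about how the source arrives at its stated 17-dimensional vector.

```python
from statistics import mean, pstdev


def feature_vector(frames):
    """Collapse per-frame descriptors into one row of segment statistics.

    `frames` is a list of frames, each a list of descriptor values in a
    fixed order. Returns the per-descriptor means followed by the
    per-descriptor standard deviations.
    """
    columns = list(zip(*frames))              # one tuple per descriptor
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # population std dev (assumption)
    return means + stds
```

Calling `feature_vector` once per 1.2 s segment and stacking the results gives one table row per segment, which is the kind of tabular input a Random Forest classifier expects.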
2 Theoretical Guarantee
Based on Hoeffding's Inequality, the plurality decision Ŷ for a response with N segments satisfies a failure bound that depends on the following quantities:
• C: Number of classes (4)
• N: Segments of 1.2s duration
• μ: Classifier confidence margin
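The bound itself is not reproduced in this text. As a hedged illustration, one standard Hoeffding-style argument gives a bound of the following shape. It assumes the N segment labels ŷ_i are independent and reads μ as the smallest gap between the probability that a segment receives the true label Y and the probability that it receives any one other label c. The helper variable Z and this reading of μ are assumptions introduced here, and the constants in the original paper may differ.

```latex
\begin{aligned}
Z_i^{(c)} &= \mathbf{1}[\hat{y}_i = Y] - \mathbf{1}[\hat{y}_i = c] \in [-1, 1],
\qquad \mathbb{E}\big[Z_i^{(c)}\big] \ge \mu \quad (c \ne Y), \\
\Pr\Big[\tfrac{1}{N}\textstyle\sum_{i=1}^{N} Z_i^{(c)} \le 0\Big]
&\le \exp\!\Big(-\frac{2 N \mu^2}{(1 - (-1))^2}\Big)
 = \exp\!\Big(-\frac{N \mu^2}{2}\Big), \\
\Pr\big[\hat{Y} \ne Y\big]
&\le (C - 1)\,\exp\!\Big(-\frac{N \mu^2}{2}\Big).
\end{aligned}
```

The first line defines a per-segment vote difference between the true class and a competitor, the second applies Hoeffding's Inequality to its average, and the third takes a union bound over the C − 1 competing classes. The failure probability therefore shrinks exponentially in N for any fixed positive margin.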
Experimental Results
Performance evaluation across four machine learning models. Random Forest emerged as the superior algorithm for handling the 17-dimensional acoustic feature space.
Model Accuracy Comparison
Human Alignment Study
Robot predictions were compared against 15 human participants (ages 20-30). Results showed significant alignment in intensity perception.