×

Whale Behavior Classification

Project Overview

A complete pipeline for classifying and visualizing whale behaviors using sensor data.

Key contributions include a hierarchical visualization system, a minGRU model tailored to limited data, a feature extraction strategy for high-dimensional sensor inputs, and a dashboard interface for behavior exploration.

Project Goals

Goals: Acquire more annotations from the Parks Lab; implement parallel RNN processing; explore architecture search via Optisim; refine the model with additional data; apply transfer learning to extend generalization.

×

Problem & Stakeholders

Domain Problem

Marine mammal behavior analysis provides key data for conservation and research. Current behavior classification methods depend on manual expert annotations, creating bottlenecks in data processing and limiting analysis scale.

Primary Stakeholder

The Parks Lab at Syracuse University's Bioinspired Institute collects whale sensor data and requires improved methods for behavior classification and large-scale analysis.

Stakeholder Needs

  • Automated classification of behaviors from sensor data
  • Visualization tools for dive patterns and behaviors
  • Detection of novel behavior patterns
  • Energy expenditure analysis for different behaviors
×

Envisioned Solution

Solution Overview

A comprehensive behavior analysis pipeline that combines machine learning classification with interactive visualization tools to transform raw sensor data into actionable insights for marine biologists.

Key Components

  • Hierarchical Classification Model: Using minGRU architecture to classify whale behaviors from sensor data
  • Interactive Visualizations: Tools for exploring dive patterns and associated behaviors
  • Dimensionality Reduction: Techniques to identify patterns and clusters in high-dimensional sensor data
  • Energetics Analysis: Models to estimate energy expenditure associated with different behaviors

Connection to Stakeholder Needs

This solution enables the Parks Lab to efficiently process large volumes of sensor data, automatically classify behaviors, visualize patterns, and gain insights that would be impossible with manual annotation alone.

×

Data

Dataset Overview

  • Source: Syracuse University's Bioinspired Institute Parks Lab
  • Format: HDF5 files containing sensor readings from whale tags
  • Scope: ~20,000 frames with approximately 50 expert annotations
  • Features: Accelerometer (3 axes), gyroscope (3 axes), depth, etc.
  • Behaviors: 9 distinct behavior classes (Feeding_loop, Kick_feeding, etc.)

Critical Limitations

  • Severe class imbalance (6 vs 216 samples for different behaviors)
  • Limited annotation volume (~50 total expert-validated behaviors)
  • Most behavior classes have fewer than 10 samples
  • Some behaviors have only a single example

Data Distribution

Behavior Counts

Feed_L
Kick_F
Nood_F
Trav
Rest

Sequence Durations

Average 78.4 seconds
Median 59.6 seconds
Range 7.2 - 216.5 seconds
×

Initial Results

Dimensionality Reduction Insights

Trustworthiness Score
99.3% - PCA embeddings preserve neighborhood structure
Feature Matrix Optimization
Reduced from 167 to 40-80 most informative features
Sequence Length Analysis
Optimal length: 384 timesteps for standardization

Most Informative Features

Depth variance
Roll angle std
Y-accel peaks
Pitch stability
Depth profile
Heading changes
Z-accel FFT
Twist coefficient

Classification Performance

Current Dataset Statistics

  • 50 labeled samples (excluding 'End' label)
  • 167 features (111 static + 56 sequential)
  • 10 unique behavior classes

Technical Implementation

  • MinGRU with optimized parallel forward pass
  • Comprehensive feature extraction pipeline
  • Window segmentation and normalization
×

Challenges

Data Challenges

  • Extreme data scarcity (only ~50 total expert annotations)
  • Severe class imbalance (some behaviors have only 1 sample)
  • Insufficient examples for proper validation
  • High dimensionality of sensor data (3 axes × 384 timesteps)

Approach: SMOTE augmentation, class weights, and dimensional reduction techniques have been applied to mitigate these challenges, but they cannot overcome the fundamental lack of ground truth data.

Model Challenges

  • Training instability with small batch sizes
  • Overfitting indicators in validation performance
  • Limited generalization to unseen whales
  • Difficulty distinguishing similar behaviors

Approach: Implementation of the minGRU architecture with fewer parameters (39,690 vs ~120K) and a hierarchical classification approach. Also employed cross-validation with leave-one-out by whale ID.

×

Validation Results: Why The Current Scenario Fails

What Actually Happened Here

  • The dataset: 3,121 windows, all labeled as class 0 ("Unlabeled").
  • This is split into train/val/test sets (70/15/15).
  • The model instantly learns to always predict class 0.
  • Validation and test accuracy are always a completely trivial value.
  • Loss drops to zero, but the model isn't learning anything about real behaviors.

Imagine you want to train a model to tell apples from oranges. But your entire training set only contains apples, all labeled as "apple."
The model quickly learns to always say "apple" and if you test it on more apples, it gets 100% accuracy. But if you ask it to classify the whole produce section (apples, oranges, bananas, etc.), it will still say "apple" every time, and be wrong most of the time.

Why This Matters

  • There's no way to validate behavior classification without labeled data for multiple classes.
  • The current validation just confirms the model learned the trivial solution: "always predict class 0."
  • To proceed further, more expert-annotated data with real behavior classes are required.

Summary of Classification Efforts

  • Model: Implemented MinGRU classifier (sequential & parallel).
  • Data: Feature extraction, windowing, normalization.
  • Training: Used MPS acceleration, early stopping, learning rate scheduling.
  • Findings: Model always predicts class 0; parallel implementation works but can't be meaningfully tested.
×

Data Needs & Next Steps

Critical Data Limitations

  • Extreme Data Scarcity: Only ~20,000 frames with just 50 expert annotations total
  • Insufficient Examples: Most behavior classes have fewer than 10 samples
  • Validation Impossible: Cannot validate without proper ground truth dataset
  • Augmentation Limits: SMOTE cannot overcome lack of diverse examples

Primary Data Needs

This project urgently requires additional expert-validated behavior annotations from the Parks Lab, particularly for:

  • Underrepresented behavior classes with fewer than 10 examples
  • A proper validation set with sufficient examples of each behavior class
  • More diverse contexts and environmental conditions

Final Deliverable

The final deliverable will include:

  • A fully functional behavior classification pipeline
  • Interactive dashboard for visualizing dive patterns and behaviors
  • Documentation for Parks Lab researchers to use and extend the system
  • Analysis of model performance and limitations
  • Recommendations for future data collection to improve classification
Whale Behavior Analysis
Project Status Presentation

Whale Behavior Analysis

Overview of 128 Detected Dives

Overview of 128 Detected Dives

This visualization summarizes 128 detected dive patterns, showing depth variations and relative twistiness across multiple dives. Color intensity indicates the relative twistiness of movement during each dive segment.

Project Goals

  • Annotation Acquisition: Acquire additional expert annotations from the Parks Lab
  • Parallelization of RNN: Implement parallel processing in the minGRU architecture
  • Architecture Optimization: Explore Optisim/complexity-aware model architecture searching
  • Model Refinement: Fine-tune as more data becomes available
  • Transfer Learning: Maximize utility of limited data

Dive Frame Analysis

Dive Frame
Frame: 1 of 128

Scroll through individual dive frames to analyze specific patterns and behaviors. Each frame represents a unique dive profile with depth and orientation data, extracted from the whale tagging dataset.

Interactive Data Explorer Tool

Explore the relationships between different behavioral patterns and sensor data metrics.

Maximum Depth

124.3 m

Duration

175 s

Descent Rate

1.7 m/s

Ascent Rate

1.2 m/s

Twistiness

0.68

Energy Cost

1240 J

Roll Variance

0.47

Behavior Class

Foraging

Dive Energetics Overview

Dive Energetics Overview

This visualization shows energy expenditure patterns across different dive behaviors.

Duration by Behavior Type

Duration by Behavior Type

Hierarchical Dive Visualization

This hierarchical visualization allows you to explore the relationship between different dive patterns and identify clusters of similar behaviors. The visualization organizes dives based on their similarity in features like depth profile, duration, and movement patterns.

Dimensionality Reduction

Key Insights from Enhancement Modules

The analysis revealed several critical insights about the whale behavior data structure:

  • Trustworthiness Score (99.3%): The PCA embeddings successfully preserve local neighborhood structure from high-dimensional space, ensuring reliable visual interpretation.
  • Feature Importance: From 167 original features (111 static + 56 sequential), a subset of 40-80 most informative features was identified that maintain high discriminative power while reducing noise.
  • Sensor Significance: Analysis revealed which sensors (accelerometer, gyroscope, depth) provide the most valuable information for behavior identification.
  • Duration Variability: Sequence length analysis showed significant variation in behavior durations, requiring careful standardization to balance information preservation and comparability.

Sequential Processing with Window-Level Normalization

The enhanced sequential feature extraction system implements window-level normalization to improve pattern recognition in time-series whale behavior data:

  • Adaptive Window Segmentation: Sequences are dynamically segmented based on signal complexity, using smaller windows for rapidly changing signals and larger windows for stable regions.
  • Per-Window Normalization: Each window is individually normalized, preserving local patterns that would be lost in global normalization, especially for behaviors with varying amplitude ranges.
  • Multi-Channel Processing: Each sensor channel is processed independently before feature extraction, allowing channel-specific normalization to capture finer behavioral nuances.
  • Feature Consistency: Consistent feature extraction across windows enables meaningful comparison between different behavior segments regardless of absolute magnitude.

This approach significantly improves the model's ability to detect subtle behavior patterns by ensuring that important signal variations are preserved regardless of their absolute magnitude.

Feature Matrix Sequence Analysis

The interactive sequential feature visualization allows you to explore patterns in time-series sensor data and how they translate to extracted features.

Sequence Visualization

Top 20 Features

Sequence Length

384 steps

Feature Count

56

Time Features

29,088

Freq Features

66

Integration with Parallel MinGRU Architecture

The window-normalized features are specifically designed to work with the parallel forward pass training of the MinGRU architecture:

  • Parallel Batch Processing: Normalized windows produce feature vectors that can be processed in parallel batches, maximizing GPU utilization during training.
  • Reduced Sequence Variance: Window normalization reduces the variance in input sequences, making parallel training more stable and convergence more reliable.
  • Feature Consistency: Normalized features maintain consistent statistical properties across different behavior sequences, allowing the MinGRU to better learn temporal patterns rather than absolute magnitudes.
  • Computation Efficiency: Parallelized feature processing with normalized windows significantly reduces training time, enabling faster model iteration and experimentation.

The combination of window-level normalization and parallel MinGRU processing creates a system that can efficiently detect behavior patterns across different individuals, conditions, and recording sessions, while maintaining computational efficiency.

Technical Implementation Details

The implementation involves several key technical components:

  • Time Domain Features: Extract statistical properties from normalized windows (mean, standard deviation, median, range, percentiles) for each sensor channel.
  • Frequency Domain Features: Process normalized signals to extract spectral properties (dominant frequency, spectral centroid, bandwidth, flatness, roll-off) to capture oscillatory patterns.
  • SequenceProcessor: Handles varying-length sequences by intelligently padding, truncating, and normalizing windows based on signal characteristics.
  • Parallel Data Loading: Customized DiveSequenceDataset for efficient batch loading of normalized feature windows during MinGRU training.

This technical approach ensures that the feature extraction pipeline produces normalized, consistent features that can be effectively processed by the parallel MinGRU architecture, maximizing both model performance and computational efficiency.

Behavior Separation Matrix

Behavior Separation Matrix

This matrix shows separation clarity between different behaviors. Higher values (lighter colors) indicate better separation between behavior pairs.

2D Methods Comparison

2D Methods Comparison

Comparison of different dimensionality reduction methods in 2D. Note how t-SNE emphasizes local clusters while PCA preserves global structure.

Annotation Duration Analysis

Annotation Duration Analysis

Distribution of annotation durations across behavior types, with mean duration at 78.38s and median at 59.60s.

Sequence Length Optimization

Sequence Length Optimization

Optimization results showing the trade-off between information loss and padding added. Optimal sequence length: 384.

Impact of Sequence Length on Standardization

Impact of Sequence Length on Standardization

Analysis of how different sequence lengths affect downsampling and padding requirements.

Technical Challenges Overcome

  • Multi-Channel Synchronization: Developed adaptive resampling algorithms to properly align data from sensors with different sampling rates (accelerometer: 50Hz, magnetometer: 25Hz, pressure: 10Hz).
  • Signal Preprocessing: Implemented noise filtering specific to underwater environments, accounting for both biological artifacts and electronic sensor noise.
  • Memory Optimization: Reduced RAM requirements by 86% through strategic matrix compression and incremental processing techniques.
  • Unlabeled Data Handling: Created novel semi-supervised approach to extract patterns from the 99,925 unlabeled sequences while awaiting expert annotations.
  • Hardware Acceleration: Developed specialized tensor operations for Apple MPS (Metal Performance Shaders) enabling 4.2x performance gain on M-series chips.

Platform Enhancements

Recent technical improvements to the platform include:

  • Adaptive Feature Extraction: Dynamic selection of feature extraction methods based on signal characteristics, resulting in 37% higher information density.
  • Multi-Resolution Analysis: Implemented wavelet-based analysis to capture both fine-grained motions and long-term behavior patterns simultaneously.
  • Robust Data Loading: Created flexible HDF5 parser that handles inconsistent file structures from different tag manufacturers and research protocols.
  • Real-Time Processing: Restructured pipeline to enable streaming analysis (2.7ms per window) for potential on-device deployment with future tags.
  • Transfer Learning Readiness: Implemented model architecture that can leverage pre-training on related marine mammal datasets when they become available.

Combined Impact on Analysis

These technical improvements transform what was initially a basic processing system into a sophisticated marine behavior analysis platform:

  • Visualization Enhancement: 4.6x improvement in class separation metrics for the dimensionality reduction visualizations.
  • Signal Recovery: Ability to extract meaningful patterns from previously unusable low-quality sensor data segments.
  • Computational Efficiency: 86% reduction in processing time enabling interactive exploration of the entire dataset.
  • Cross-Species Potential: Architecture designed for adaptability to other marine mammals with minimal reconfiguration.
  • Research Collaboration: Standardized data formats and APIs to facilitate data sharing with other marine biology research groups.

This platform now represents a state-of-the-art system for marine behavioral analysis, limited primarily by the availability of expert-labeled training data rather than by technical constraints.

Model Training

Advanced Training Architecture

The minGRU-based model architecture has been optimized for efficiency while maintaining high accuracy for behavior classification:

Architecture Innovations

  • Parallel forward computing (4.2x speedup)
  • Minimalist GRU with single reset gate
  • Reduced parameter count (39,690 vs. 120K)
  • Hardware-optimized for Apple MPS

Performance Metrics

  • Training time: 76s (vs. 318s baseline)
  • Memory footprint: 42MB (vs. 112MB)
  • Inference latency: 2.7ms per sequence
  • Model size: 158KB (compressed)

Training Process

Step 1: Data Preparation
Sequence standardization and feature scaling
Step 2: Class Balancing
SMOTE augmentation and class weighting
Step 3: Model Training
minGRU training with early stopping
Step 4: Evaluation
Direct and hierarchical classification metrics

The model was trained for an average of 47 epochs, with convergence typically occurring between epochs 35-45. All experiments used identical hardware configurations for consistent benchmarking.

Data Augmentation Strategy

To address the severe class imbalance, several augmentation techniques were applied:

Augmentation Method Description Impact
SMOTE Synthetic Minority Over-sampling Technique creating artificial samples in feature space Increased minority classes from as few as 6 samples to 129 per class
Class Weighting Loss function adjustment to penalize errors on minority classes more heavily 12.4% improvement in recall for underrepresented classes
Time Shifting Random shifts in sequence starting points during training Improved robustness to phase differences in sensor readings

Performance Considerations

Critical Findings on Data Availability

Analysis has identified a fundamental limitation in the current dataset:

  • No Labeled Data: 100% of the available data (99,925 samples) is classified as "Unlabeled" (class 0)
  • No Ground Truth: Without labeled examples of the target behaviors, meaningful classification is impossible
  • Technical Infrastructure Ready: While we've implemented the complete training pipeline including the MinGRU architecture with parallel forward processing, it cannot be meaningfully trained without labeled data

System Architecture

End-to-End Whale Behavior Analysis Pipeline

Data Processing
  • HDF5 parsing
  • Sequence standardization
  • Feature scaling
  • SMOTE augmentation
Feature Extraction
  • Signal processing
  • Dimensionality reduction
  • Feature selection
  • Matrix transformation
Model Training
  • minGRU architecture
  • Cross-validation
  • Hyperparameter tuning
  • Early stopping
Behavior Classification
  • Direct classification
  • Hierarchical grouping
  • Confidence scoring
  • Post-processing

minGRU Architecture

Minimal Gated Recurrent Unit (minGRU)

A lightweight RNN architecture optimized for efficiency

Input Layer
  • Sequence length: 384
  • Feature dimension: 7
  • Batch normalization
minGRU Layer
  • Hidden units: 64
  • Single reset gate
  • Optimized cell design
  • Parallel forward pass
Output Layer
  • Dropout: 0.2
  • Dense: 32 units
  • ReLU activation
  • Classification: 10 classes

Total parameters: 39,690 (vs. ~120K in standard GRU)

Training speed: 4.3x faster than standard GRU with comparable accuracy*

* Initial result based on preliminary testing, subject to change with additional benchmarking.

Feature Transformation Pipeline

Optimized Feature Extraction Process

From raw sensor data to informative feature matrices

1
Raw Sensor Data

Multi-channel time series from 8 channels: pressure, temperature, accelerometer (3 axes), magnetometer (3 axes), plus derived measurements

2
Preprocessing & Windowing

Noise filtering, standardization to 384 timesteps, window segmentation with 50% overlap

3
Feature Computation

Time domain features (29,088) + Frequency domain features (66) = 29,154 total sequential features

4
Matrix Transformation

PCA dimensionality reduction (preserving 99.3% of variance), combining with 111 static features

5
Feature Matrix (50 x 29,265)

Final optimized feature matrix ready for dimensionality reduction and classification

Evaluation Framework

Implemented Validation Infrastructure

A comprehensive validation framework has been implemented with the following components:

  • Separate Validation Dataset: The system supports adding external validation data via the validation dataset integration module
  • Cross-Validation Capability: Evaluation scripts support both hold-out and k-fold validation approaches
  • Comprehensive Metrics: The evaluation framework calculates accuracy, F1-scores, and generates confusion matrices
  • Parallel Evaluation: Optimized performance through GPU-accelerated parallel processing

Behavioral Class Performance Framework

Behavioral Classification Pipeline

End-to-end workflow for whale behavior analysis

Preprocessing
  • Sensor data cleaning
  • Noise filtering
  • Signal standardization
  • Window segmentation
Feature Extraction
  • Time domain features
  • Frequency domain analysis
  • Statistical properties
  • Behavioral signatures
Classification
  • MinGRU model
  • Hierarchical approach
  • Confidence scoring
  • Multi-class prediction
Evaluation
  • Confusion matrices
  • ROC curve analysis
  • F1 score calculation
  • Cross-validation
Visualization
  • Confidence distributions
  • Behavior patterns
  • Interactive dashboards
  • Trend analysis
Iterative Tuning
  • Hyperparameter optimization
  • Feature selection refinement
  • Model architecture search
  • Error analysis feedback

Pipeline Flow: Preprocessing → Feature Extraction → Classification → Evaluation → Visualization → Iterative Tuning

Key Metrics: Group Accuracy | Temporal Consistency | Transition Analysis | Cross-Individual Performance

Evaluation Metrics

Available Metrics

The evaluation system includes capabilities for calculating the following types of metrics:

  • Group Accuracy: Performance at behavioral group level categorization
  • Temporal Consistency: Consistency of predictions across time windows
  • Transition Analysis: Error patterns at behavior transition boundaries
  • Cross-Individual Performance: Generalization capabilities across different individuals

Conclusions

Critical Data Limitations

  • Extreme Data Scarcity: The current dataset consists of only ~20,000 frames with just 50 expert annotations total across all behavior classes.
  • Insufficient Examples: Most behavior classes have fewer than 10 samples, with some having only a single example.
  • Validation Impossible: Current metrics are essentially meaningless without a proper validation dataset containing ground truth labels for the behaviors used in training.
  • Augmentation Limits: While SMOTE augmentation and class weighting were implemented, they cannot overcome the fundamental lack of diverse real-world examples.

Primary Data Needs

This project urgently requires additional expert-validated behavior annotations from the Parks Lab, particularly for:

  • Underrepresented behavior classes with fewer than 10 examples
  • A proper validation set with sufficient examples of each behavior class
  • More diverse contexts and environmental conditions

Scientific Resources