Knowledge Distillation Meets Reinforcement Learning

A two-stage training framework that first compresses a large "teacher" vision-language model into a smaller "student" using standard knowledge distillation, then further improves the student using reinforcement learning rewards.

Remote Sensing · Reinforcement Learning · Domain Shift · Knowledge Distillation · Vision-Language Model

Problem

Smaller models are easier to deploy, but classic knowledge distillation can be too "one-size-fits-all" for real-world data that varies widely (different sensors, regions, patients, etc.), so the student may generalize poorly under domain shift.

Approach

  • Stage 1 (KD): train the student to mimic the teacher using logit matching + feature alignment + task loss.
  • Stage 2 (RL refinement): treat the student as an "agent" that earns rewards for (a) moving its features closer to teacher cluster centers and (b) producing confident predictions, while gradually shifting weight away from pure KD over time (dynamic balance).
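The two stages above can be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's exact formulation: the loss weights `alpha` and `beta`, the reward shaping, and the linear schedule in `dynamic_balance` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
            labels, T=2.0, alpha=0.5, beta=0.5):
    """Stage 1: logit matching + feature alignment + task loss."""
    # Logit matching: KL(teacher || student) on temperature-softened distributions
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    logit_match = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(-1).mean() * T * T
    # Feature alignment: mean squared error between student and teacher embeddings
    feat_align = ((student_feat - teacher_feat) ** 2).mean()
    # Task loss: cross-entropy against ground-truth labels
    probs = softmax(student_logits)
    task = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return task + alpha * logit_match + beta * feat_align

def rl_reward(student_feat, cluster_centers, student_logits, labels):
    """Stage 2: reward closeness to teacher cluster centers + confidence."""
    centers = cluster_centers[labels]                     # [B, D] center per class
    closeness = -((student_feat - centers) ** 2).sum(-1)  # higher when closer
    confidence = softmax(student_logits).max(-1)          # prob. of predicted class
    return closeness + confidence

def dynamic_balance(step, total_steps):
    """Shift weight from pure KD toward the RL objective over training."""
    lam = min(1.0, step / total_steps)  # hypothetical linear schedule
    return 1.0 - lam, lam               # (kd_weight, rl_weight)
```

In this reading, "dynamic balance" simply anneals the mixing coefficient: early steps optimize mostly the KD loss, later steps mostly the (negated) RL reward, so the student first copies the teacher and is then coached by the rewards.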

Results

  • Zero-shot classification: KDRL reports strong results across multiple remote-sensing benchmarks, e.g. RESISC (70.30), PatternNet (80.08), RSI-CB (48.85).
  • Image-text retrieval: high Recall@10 on RSITMD (67.44% image→text, 74.76% text→image) and RSICD (52.61% image→text, 57.04% text→image).

Highlights
  • "Teach, then coach." First the small model learns by copying a bigger expert model.
  • Rewards are intuitive: the student is rewarded for staying close to the teacher's feature clusters and for making confident predictions.
  • Strong wins on retrieval and classification benchmarks, though not across the board.