Jinfa Huang 黄锦发

✨Hello, I am a Ph.D. candidate in the Department of Computer Science at the University of Rochester (UR), advised by Prof. Jiebo Luo.

Before that, I received my master's degree from Peking University (PKU), advised by Prof. Li Yuan and Prof. Jie Chen, and my honors bachelor's degree from the University of Electronic Science and Technology of China (UESTC).


My long-term goal is to build multimodal, interactive AI systems that ground, reason, and generate within a closed loop. I conceptualize this pursuit as the Prometheus framework:

(1) Distilling the Spark
Grounding autonomous intelligence in real-world environments via continuous human feedback.
(2) Self-Evolving
Aligning generative models with reward mechanisms to transition from passive recognition to understanding.
(3) Agentic Autonomy
Developing Agentic AI with planning, memory, and tool-use capabilities to autonomously learn and evolve.

Research Map: A schematic overview of my research vision.

News

Education

University of Rochester (UR), USA
Ph.D. Student in Computer Science • Sep. 2023 - Present
Advisor: Prof. Jiebo Luo
Peking University (PKU), China
Master's Degree in Computer Science • Sep. 2020 - Jun. 2023
Advisors: Prof. Li Yuan & Prof. Jie Chen
University of Electronic Science and Technology of China (UESTC)
Bachelor's Degree in Software Engineering • Sep. 2016 - Jun. 2020
Advisor: Xucheng Luo

Selected Research Experience

Google Research, USA
Student Researcher • Sep. 2025 - Present
Advisor: Dr. Junfeng He
International Machine Learning, Amazon, USA
Applied Scientist Intern • Jul. 2025 - Sep. 2025
Advisors: Dr. Yang Liu, Dr. Chien-Chih Wang, Dr. Huidong Liu
Core ML Applied ML, Google, USA
Student Researcher • Jan. 2025 - May 2025
Advisors: Jiageng Zhang, Dr. Eric Li
Seed Foundation Model, ByteDance, USA
Research Intern • May 2024 - Aug. 2024
Advisors: Dr. Quanzeng You, Dr. Yongfei Liu, Dr. Jianbo Yuan

Selected Publications

My current research mainly focuses on vision+language and generative models. (*Equal Contribution)

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan*, Jinfa Huang*, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo
TPAMI 2025 (IEEE Transactions on Pattern Analysis and Machine Intelligence)
Impact Factor: 18.6 (GitHub repo: 1,300+ stars 🌟)

Area: Text-to-Video Generation, Diffusion Model, Time-lapse Videos
Existing text-to-video generation models do not adequately encode physical knowledge of the real world, so the videos they generate tend to have limited motion and poor variation. We propose MagicTime, a metamorphic time-lapse video generation model that learns real-world physics from time-lapse videos and enables metamorphic video generation.

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen
CVPR 2023 (Conference on Computer Vision and Pattern Recognition)
(Highlight, Top 2.5%)

Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval, Video Captioning
To capture fine-grained alignment between video and text, we model cross-modal interaction as a multivariate cooperative game in which video frames and text words act as players. We propose Hierarchical Banzhaf Interaction (HBI), which uses the Banzhaf interaction index to value the possible correspondence between frames and words at multiple semantic levels.
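
For intuition, below is a minimal, self-contained sketch of the pairwise Banzhaf interaction index that this line of work builds on. The characteristic function `v`, the player set, and the toy payoff are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations

def banzhaf_interaction(v, players, i, j):
    """Pairwise Banzhaf interaction index between players i and j.

    v:       characteristic function mapping a frozenset coalition to a payoff
    players: iterable of all players
    Averages the synergy of {i, j} over every coalition S excluding both:
        v(S|{i,j}) - v(S|{i}) - v(S|{j}) + v(S)
    """
    rest = [p for p in players if p not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):      # enumerate all 2^(n-2) coalitions
        for S in combinations(rest, r):
            S = frozenset(S)
            total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / 2 ** len(rest)

# Toy payoff: players 0 and 1 pay off only when they appear together.
v = lambda S: 1.0 if {0, 1} <= S else 0.0
print(banzhaf_interaction(v, range(4), 0, 1))   # -> 1.0, strong positive synergy
```

In the paper itself, roughly speaking, the payoff is derived from cross-modal similarity scores rather than a hand-written function, and the interaction is computed hierarchically over progressively merged video and text tokens.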

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin*, Jinfa Huang*, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen
NeurIPS 2022 (Conference on Neural Information Processing Systems)
(Spotlight Presentation, Top 5%)

Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval, Video Captioning
To close the modality gap in the video-text feature space, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. An Expectation-Maximization procedure finds a compact set of bases for the latent space, so that features can be concisely represented as linear combinations of these bases.
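
As a rough illustration, here is a minimal sketch of the EM-style basis estimation described above; the softmax responsibilities, the temperature `tau`, and the function name `em_bases` are assumptions for illustration, not EMCL's exact released implementation.

```python
import torch
import torch.nn.functional as F

def em_bases(X, K=32, iters=4, tau=0.05):
    """EM-style estimation of K bases for a feature matrix X of shape (N, D).

    E-step: softly assign each feature to the bases (softmax responsibilities).
    M-step: re-estimate each basis as the responsibility-weighted mean.
    Returns the bases (K, D) and the compact reconstruction of X (N, D).
    """
    X = F.normalize(X, dim=-1)                           # work on the unit sphere
    mu = F.normalize(torch.randn(K, X.size(1)), dim=-1)  # random initial bases
    for _ in range(iters):
        resp = F.softmax(X @ mu.t() / tau, dim=-1)       # E-step: (N, K)
        mu = F.normalize(resp.t() @ X, dim=-1)           # M-step: (K, D)
    X_rec = F.normalize(resp @ mu, dim=-1)               # linear combination of bases
    return mu, X_rec

mu, X_rec = em_bases(torch.randn(100, 512))              # toy usage
```

Alternating the soft assignment (E-step) and the basis update (M-step) yields a small dictionary of bases; representing features as combinations of that dictionary is what makes the representation compact.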

Selected Surveys

I maintain several repositories to track the latest research.

Selected Benchmarks

Invited Talks

Academic Service

  • Program Member: MUCG@ACMMM2025, ER@NeurIPS2025
  • PC Member: CVPR'23-26, NeurIPS'22-25, ICLR'23-26, ICCV'23/25, ACM MM'24/25, ECCV'24, AAAI'25/26, COLM'25, ACL'25/26
  • Journal Reviewer: IEEE TPAMI (x2), IJCV, IEEE TCSVT, NEJM AI
  • Volunteer: NeurIPS 2025

Teaching

Personal Interests

Life