Jinfa Huang ( 黄锦发 )

✨Hello! I am a Ph.D. candidate in the Department of Computer Science at the University of Rochester (UR), advised by Prof. Jiebo Luo.

My research interests focus on building autonomous intelligence, specifically:

  • World Models: Learning physical laws and simulating complex, dynamic real-world environments via advanced video generation techniques.
  • Agentic MLLMs: Empowering interactive multimodal agents with advanced reasoning and long-horizon planning within these simulated worlds and software sandboxes.

Prior to that, I received my master's degree from Peking University (PKU), advised by Prof. Li Yuan and Prof. Jie Chen. I obtained my bachelor's degree with honors from the University of Electronic Science and Technology of China (UESTC).


🗺️ Research Map: A schematic overview of my research vision.
🌟 Project Prometheus: Fetching the "fire🔥" of real-world physics to spark autonomous AI.

News

Education

University of Rochester (UR), USA Ph.D. Student in Computer Science • Sep. 2023 - Present
Advisor: Prof. Jiebo Luo
Peking University (PKU), China Master's Degree in Computer Science • Sep. 2020 - Jun. 2023
Advisors: Prof. Li Yuan & Prof. Jie Chen
University of Electronic Science and Technology of China (UESTC), China Bachelor's Degree in Software Engineering • Sep. 2016 - Jun. 2020
Advisor: Xucheng Luo

Selected Research Internships

Google Research, HCAI-ML, USA Student Researcher • Sep. 2025 - Dec. 2025
Advisor: Dr. Junfeng He
Amazon, International Machine Learning, USA Applied Scientist Intern • Jul. 2025 - Sep. 2025
Advisors: Dr. Yang Liu, Dr. Chien-Chih Wang, Dr. Huidong Liu
Google, Core ML Applied ML, USA Student Researcher • Jan. 2025 - May 2025
Advisors: Jiageng Zhang, Dr. Eric Li
ByteDance, Seed Foundation Model, USA Research Intern • May 2024 - Aug. 2024
Advisors: Dr. Quanzeng You, Dr. Yongfei Liu, Dr. Jianbo Yuan

Selected Publications

My current research mainly focuses on multimodal understanding and generation. (*Equal Contribution)

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan*, Jinfa Huang*, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo
TPAMI 2025 (IEEE Transactions on Pattern Analysis and Machine Intelligence)
Impact Factor: 18.6 (GitHub repo: 1300+ stars🌟)

Area: Text-to-Video Generation, Diffusion Model, Time-lapse Videos
Existing text-to-video generation models do not adequately encode physical knowledge of the real world, so their generated videos tend to exhibit limited motion and poor variation. We propose MagicTime, a metamorphic time-lapse video generation model that learns real-world physics from time-lapse videos and performs metamorphic video generation.

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen
CVPR 2023 (Conference on Computer Vision and Pattern Recognition) (Highlight, Top 2.5%)

Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval
To move beyond coarse-grained global interactions, we explicitly model video-text pairs as game players using cooperative game theory. We propose Hierarchical Banzhaf Interaction (HBI) to value fine-grained correspondences between video frames and text words, enabling sensitive and explainable cross-modal contrast across different semantic levels.

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin*, Jinfa Huang*, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen
NeurIPS 2022 (Conference on Neural Information Processing Systems) (Spotlight Presentation, Top 5%)

Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval, Video Captioning
To address the modality gap in the video-text feature space, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. We use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that features can be concisely represented as linear combinations of these bases.
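The basis-finding idea can be illustrated with a toy sketch. This is my own simplification for intuition only, not the paper's objective or code: it soft-assigns normalized features to k learned bases with EM-style alternating updates, then re-expresses each feature as a linear combination of those bases.

```python
import numpy as np

# Toy sketch (NOT the EMCL implementation): EM-style alternating updates
# that learn k bases for a feature space, then represent each feature as
# a linear combination (soft-assignment weights) of those bases.

def em_bases(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    B = X[rng.choice(n, k, replace=False)]           # init bases from data points
    for _ in range(iters):
        # E-step: soft responsibility of each basis for each feature (softmax over similarities)
        sim = X @ B.T                                # (n, k)
        W = np.exp(sim - sim.max(axis=1, keepdims=True))
        W /= W.sum(axis=1, keepdims=True)
        # M-step: update each basis as the responsibility-weighted mean of features
        B = (W.T @ X) / W.sum(axis=0)[:, None]
        B /= np.linalg.norm(B, axis=1, keepdims=True) + 1e-8
    coeffs = W                                       # linear-combination weights
    X_hat = coeffs @ B                               # compact reconstruction
    return B, coeffs, X_hat

X = np.random.default_rng(1).normal(size=(64, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm features
B, C, X_hat = em_bases(X, k=4)
```

Each row of `C` sums to one, so every feature is expressed as a convex combination of the four learned bases, which is the "compact representation" intuition in miniature.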

Selected Surveys

I maintain several repositories to track the latest research.

Selected Benchmarks

Invited Talks

Academic Service

  • PC Member: CVPR'23-26, NeurIPS'22-25, ICLR'23-26, ICCV'23/25, ACM MM'24/25, ECCV'24/26, AAAI'25/26, COLM'25, ACL'25/26
  • Journal Reviewer: IEEE TPAMI (×2), IJCV, IEEE TCSVT, NEJM AI
  • Volunteer: NeurIPS 2025

Teaching

Personal Interests

Life