π’₯𝒾𝓃𝒻𝒢 β„‹π“Šπ’Άπ“ƒβ„Š      

✨Bonjour, I am a Ph.D. student in the Department of Computer Science at the University of Rochester (UR), advised by Prof. Jiebo Luo.

I aim to build multimodal interactive AI systems that not only ground, reason, and generate over external world signals to understand human language, but also assist humans in decision-making and in efficiently addressing societal needs, e.g., robotics and medicine. As steps toward this goal, my research interests include, but are not limited to, multimodal understanding, multimodal generation, and multimodal foundation model post-training.

Prior to that, I received my master's degree from Peking University (PKU) in 2023, advised by Prof. Li Yuan and Prof. Jie Chen, and my bachelor's degree with honors from the University of Electronic Science and Technology of China (UESTC) in 2020.

Email  /  Google Scholar  /  Github  /  Twitter  /  Zhihu  /  LinkedIn

Winter 2024, Puerto Rico✨

News

  • [2025/01]    2 papers (1 Poster and 1 Spotlight) are accepted by ICLR 2025.
  • [2025/01]    1 paper (Medical LLM Survey) is accepted by Nature Reviews Bioengineering 2025.
  • [2025/01]    Started a research internship at Google, USA, supervised by Jiageng Zhang and Dr. Eric Li.
  • [2024/12]    Happy New Year πŸ₯³! 1 paper is accepted by TPAMI 2025.
  • [2024/12]    1 paper is accepted by AAAI 2025.
  • [2024/12]    1 short paper is accepted by COLING 2025.
  • [2024/11]    1 paper is accepted by ACM Transactions on Intelligent Systems and Technology (TIST) 2024.
  • [2024/11]    Winter is coming ❄️! 1 paper is accepted by npj Digital Medicine (Impact Factor: 15.357).
  • [2024/11]    1 survey is accepted by CAAI Transactions on Intelligence Technology (Impact Factor: 8.4), which aims to promote camouflaged object detection and related tasks: GitHub Repo stars Awesome Concealed Object Segmentation.
  • [2024/10]    πŸ”₯πŸ”₯πŸ”₯ We release a GitHub repository and survey aimed at promoting the application of autoregressive models in the vision domain: GitHub Repo stars Awesome Autoregressive Models in Vision.
  • [2024/09]    1 paper (Spotlight) is accepted by NeurIPS 2024 Datasets & Benchmarks Track.
  • [2024/09]    1 paper is accepted by EMNLP 2024 Findings.
  • [2024/06]   πŸ”₯πŸ”₯πŸ”₯ We are excited to present ChronoMagic-Bench, a benchmark for metamorphic evaluation of text-to-video generation, which provides valuable insights for T2V model selection. GitHub Repo stars
  • [2024/05]    Started a research internship at ByteDance Seed, Bellevue, USA, supervised by Quanzeng You, Yongfei Liu, and Jianbo Yuan.
  • [2024/05]    1 paper is accepted by ACL 2024 Findings.
  • [2024/04]   πŸ”₯πŸ”₯πŸ”₯ We are thrilled to present MagicTime, a metamorphic time-lapse video generation model, together with a new dataset, ChronoMagic, supporting both U-Net- and DiT-based T2V frameworks. GitHub Repo stars
  • [2024/01]    1 paper is accepted by ICLR 2024.
  • [2023/11]   πŸ”₯πŸ”₯πŸ”₯ We release a GitHub repository to promote medical Large Language Model research, with the vision of applying LLMs to real-life medical scenarios: GitHub Repo stars A Practical Guide for Medical Large Language Models.
  • [2023/11]   πŸ”₯πŸ”₯πŸ”₯ How could LMMs contribute to social good? We are excited to release a preliminary exploration of GPT-4V(ision) for social multimedia: GPT-4V(ision) as A Social Media Analysis Engine.
  • [2023/09]   Joined the VIStA Lab as a Ph.D. student working on vision and language.
  • [2023/07]   1 paper is accepted by ACMMM 2023.
  • [2023/05]   I was awarded the 2023 Peking University Excellent Graduation Thesis.
  • [2023/04]   1 paper is accepted by TIP 2023.
  • [2023/04]   1 paper is accepted by IJCAI 2023.
  • [2023/02]   1 paper (Top 10% Highlight) is accepted by CVPR 2023.
  • [2022/09]   1 paper is accepted by ICRA 2023.
  • [2022/09]   1 paper (Spotlight) is accepted by NeurIPS 2022.

    Education

    University of Rochester (UR), USA
    Ph.D. Student in Computer Science      • Sep. 2023 - Present
    Advisor: Prof. Jiebo Luo

    Peking University (PKU), China
    Master's Degree in Computer Science      • Sep. 2020 - Jun. 2023
    Advisors: Prof. Li Yuan and Prof. Jie Chen

    University of Electronic Science and Technology of China (UESTC), China
    Bachelor's Degree in Software Engineering      • Sep. 2016 - Jun. 2020
    Advisor: Prof. Xucheng Luo

    Research Experience

    Core ML Applied ML, Google, USA
    Student Researcher       • Jan. 2025 - Present
    Advisors:   Jiageng Zhang and Dr. Eric Li.

    Seed-Foundation-Model, ByteDance
    Research Intern       • May 2024 - Aug. 2024
    Advisors:   Dr. Quanzeng You & Dr. Yongfei Liu & Dr. Jianbo Yuan

    Artificial Intelligence Center, Pengcheng Lab
    Research Intern       • Sep. 2020 - Aug. 2022
    Advisors:   Dr. Guoli Song & Prof. Jie Chen

    Multimedia Computing Team, KDDI Research
    Research Intern       • Nov. 2019 - Feb. 2020
    Advisors:   Dr. Yanan Wang & Dr. Jianming Wu

    X-Data Research Group, Tencent IEG
    Engineering Intern       • Jan. 2019 - Jul. 2019
    Advisors:   Boya Yin & Dr. Yang Chao

    Selected Publications
    MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
    Shenghai Yuan*, Jinfa Huang*, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo
    arXiv preprint
    (GitHub Repo 1300+ Stars 🌟)
    [Paperlink], [Code], [Page], GitHub Repo stars
    Area: Text-to-Video Generation, Diffusion Model, Time-lapse Videos

    Existing text-to-video generation models have not adequately encoded physical knowledge of the real world, so the generated videos tend to exhibit limited motion and poor variation. In this paper, we propose MagicTime, a metamorphic time-lapse video generation model that learns real-world physics knowledge from time-lapse videos and implements metamorphic video generation.

    Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
    Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen
    IEEE International Conference on Computer Vision and Pattern Recognition, CVPR 2023
    (Highlight, Top 2.5%)
    [Paperlink], [Code], [Page], GitHub Repo stars
    Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval, Video Captioning

    To achieve fine-grained video-text alignment, we formulate cross-modal representation learning as a cooperative game between video clips and text words, and propose Hierarchical Banzhaf Interaction to measure the degree of cooperation between them, yielding more interpretable and better-aligned video-and-language representations.

    Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
    Peng Jin*, Jinfa Huang*, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen
    Conference on Neural Information Processing Systems, NeurIPS 2022
    (Spotlight Presentation, Top 5%)
    [Paperlink], [Code], GitHub Repo stars
    Area: Video-and-Language Representation, Machine Learning, Video-Text Retrieval, Video Captioning

    To solve the problem of the modality gap in video-text feature space, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. We use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features could be concisely represented as the linear combinations of these bases.

    All Publications [Google Scholar]

    My current research mainly focuses on multimodal generation and understanding. (*Equal Contribution)

    arXiv preprints

    [1] Shenghai Yuan*, Jinfa Huang*, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo. "MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators" [PDF][Code][Project page] GitHub Repo stars

    [2] Bin Zhu, Peng Jin, Munan Ning, Bin Lin, Jinfa Huang, Qi Song, Mingjun Pan, Li Yuan. "LLMBind: A Unified Modality-Task Integration Framework" [PDF][Code] GitHub Repo stars

    [3] Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, Li Yuan. "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models" [PDF][Code] GitHub Repo stars

    [4] Cong Jin, Jingru Fan, Jinfa Huang, Jinyuan Fu, Yi Zhang, Tao Mei, Li Yuan, Jiebo Luo. "Next-Gen AIGC: Harnessing Advanced Multimodal Foundation Models for Text-to-Media Innovations"

    [5] Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji. "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension" [PDF][Code] GitHub Repo stars

    [6] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan. "Identity-Preserving Text-to-Video Generation by Frequency Decomposition" [PDF][Code][Page] GitHub Repo stars

    [7] Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong. "Autoregressive Models in Vision: A Survey" [PDF][Code] GitHub Repo stars

    2025

    [1] Jinfa Huang*, Jinsheng Pan*, Zhongwei Wan, Hanjia Lyu, Jiebo Luo. "Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection", COLING 2025, short paper, [PDF] [Code] [Poster] GitHub Repo stars

    [2] Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang. "MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval", AAAI 2025, [PDF][Code] GitHub Repo stars

    [3] Fenglin Liu, Xian Wu, Jinfa Huang, Kim Branson, Patrick Schwab, Lei Clifton, Ping Zhang, Jiebo Luo, Yefeng Zheng, and David A. Clifton. "Aligning, Autoencoding and Prompting Large Language Models for Novel Thorax Disease Reporting", TPAMI 2025, [PDF][Code] GitHub Repo stars

    [4] Hongjian Zhou*, Fenglin Liu*, Boyang Gu*, Xinyu Zou*, Jinfa Huang*, Jinge Wu, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, David A. Clifton. "A Survey of Large Language Models in Medicine: Principles, Applications, and Challenges", Nature Reviews Bioengineering 2025, [PDF][Code] GitHub Repo stars

    [5] Shaofeng Zhang, Qiang Zhou, Sitong Wu, Haoru Tan, Zhibin Wang, Jinfa Huang, Junchi Yan. "CR2PQ: Continuous Relative Rotary Positional Query for Dense Visual Representation Learning", ICLR 2025, [PDF][Code] GitHub Repo stars

    [6] Chunming He, Chengyu Fang, Yulun Zhang, Longxiang Tang, Jinfa Huang, Kai Li, Zhenhua Guo, Xiu Li, Sina Farsiu. "Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model", ICLR 2025 Spotlight, [PDF][Code] GitHub Repo stars

    [7] Fenglin Liu, Zheng Li, Qingyu Yin, Jinfa Huang, Xian Wu, Anshul Thakur, Kim Branson, Patrick Schwab, Bing Yin, Yefeng Zheng, Jiebo Luo, and David A. Clifton. "A Multimodal Multidomain Multilingual Medical Foundation Model for Zero-Shot Clinical Diagnosis", npj Digital Medicine, [PDF][Github] GitHub Repo stars

    2024

    [1] Meng Cao*, Haoran Tang*, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li. "RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter", ACL 2024 Findings, [PDF]

    [2] Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan. "Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach", ICLR 2024, [PDF][Code] GitHub Repo stars

    [3] Zhongwei Wan*, Ziang Wu*, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan. "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference", EMNLP 2024 Findings, [PDF][Code] GitHub Repo stars

    [4] Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan. "ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation", NeurIPS 2024 D&B Spotlight, [PDF][Code][Project page] GitHub Repo stars

    [5] Fengyang Xiao, Sujie Hu, Yuqi Shen, Chengyu Fang, Jinfa Huang, Chunming He, Longxiang Tang, Ziyun Yang, Xiu Li. "A Survey of Camouflaged Object Detection and Beyond", CAAI Transactions on Intelligence Technology 2024, [PDF][Github] GitHub Repo stars

    [6] Hanjia Lyu*, Jinfa Huang*, Daoan Zhang*, Yongsheng Yu*, Xinyi Mou*, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo. "GPT-4V (ision) as a Social Media Analysis Engine", ACM Transactions on Intelligent Systems and Technology (TIST) 2024, [PDF][Code] GitHub Repo stars

    2023

    [1] Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen. "Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning", CVPR 2023 Highlight, [PDF][Code][Project page] GitHub Repo stars

    [2] Jingyi Wang, Jinfa Huang, Can Zhang, Zhidong Deng. "Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs", ICRA 2023, [PDF][Code] GitHub Repo stars

    [3] Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen. "Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment", IJCAI 2023, [PDF][Code] GitHub Repo stars

    [4] Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen. "Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual Question Answering", TIP 2023, [PDF]

    [5] Jingyi Wang, Can Zhang, Jinfa Huang, Botao Ren, Zhidong Deng. "Improving Scene Graph Generation with Superpixel-Based Interaction Learning", ACMMM 2023, [PDF]

    2022 and Earlier

    [1] Peng Jin*, Jinfa Huang*, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen. "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS 2022 Spotlight, [PDF][Code] GitHub Repo stars

    [2] Yingmei Guo, Jinfa Huang, Yanlong Dong, Mingxing Xu. "Guoym at SemEval-2020 task 8: Ensemble-based Classification of Visuo-lingual Metaphor in Memes", SemEval-2020, [PDF]

    [3] Yanan Wang, Jianming Wu, Jinfa Huang, Gen Hattori, Yasuhiro Takishima, Shinya Wada, Rui Kimura, Jie Chen, Satoshi Kurihara. "LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding", ICMI 2020, [PDF][Code] GitHub Repo stars

    Selected Honors & Scholarships

  • Peking University Excellent Graduation Thesis (Top 10%), PKU  2023
  • Outstanding Graduate of University of Electronic Science and Technology of China (UESTC),  2020
  • Selected participant, Google Machine Learning Winter Camp 2019 (100 participants worldwide),  2019
  • National Inspirational Scholarship,  2018
  • China Collegiate Programming Contest (ACM-CCPC), Jilin, Bronze,  2018

    Talks

  • "Can Video Generation Models as World Simulators?β€œ, 3D视觉ε·₯坊, 2025.01, [Live]

    Teaching

  • Teaching Assistant, CSC 240/440 Data Mining, Prof. Thaddeus E. Pawlicki, University of Rochester, 2025 Spring
  • Teaching Assistant, CSC 240/440 Data Mining, Prof. Monika Polak, University of Rochester, 2024 Fall

    Personal Interests

    Anime: In my spare time, I watch a lot of Japanese anime about romance, sports, and sci-fi.

    Literature: My favorite writer is Xiaobo Wang; the wisdom of his life inspires me. My favorite philosopher is Friedrich Wilhelm Nietzsche, and I am grateful that his philosophy has accompanied me through many difficult times in my life.

    Academic Service

  • PC Member:   CVPR'23/24/25, NeurIPS'22/23, ICLR'23/24/25, ICCV'23/25, ACM MM'24/25, ECCV'24, AAAI'25, COLM'25
  • Journal Reviewer:   IEEE TCSVT, IEEE TPAMI, NEJM AI


  • My hometown is in Guangdong; you can call me by my Cantonese name, Gamfaat Wong.
    Last updated in February 2025.

    This awesome template is inspired by this good man.