[Temporal Video Segmentation]
[Video Content Processing and Understanding]
[Content-Based Image/Video Retrieval]

Temporal Video Segmentation

Shot Boundary Detection and Classification

A video shot is defined as a sequence of video frames with a continuous background setting. It is the smallest video unit that carries temporal semantics, such as an action or a dialog. In this work, I have proposed a multi-level framework that detects the transition locations between video shots by analyzing feature plots at different temporal scales of the input video sequence. The shot transitions are further classified as either “abrupt” or “gradual” based on the cumulative changes in the neighborhood of each transition.
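The two-scale idea can be illustrated with a minimal sketch. It assumes each frame is described by a normalized color histogram and uses only two temporal scales (adjacent frames and a short window); the function names, the L1 distance, and all threshold values are hypothetical choices for illustration, not the settings of the actual framework.

```python
def hist_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_transitions(hists, window=4, abrupt_thr=1.0, gradual_thr=1.0):
    """Return (frame_index, kind) pairs, where kind is "abrupt" or "gradual".

    All parameter values are illustrative assumptions.
    """
    transitions = []
    n = len(hists)
    i = 1
    while i < n:
        # Fine scale: change between two adjacent frames.
        fine = hist_distance(hists[i - 1], hists[i])
        if fine >= abrupt_thr:
            transitions.append((i, "abrupt"))
            i += 1
            continue
        # Coarse scale: small changes accumulated over a neighborhood.
        end = min(i + window - 1, n - 1)
        steps = [hist_distance(hists[j - 1], hists[j]) for j in range(i, end + 1)]
        accumulated = hist_distance(hists[i - 1], hists[end])
        if fine > 0 and max(steps) < abrupt_thr and accumulated >= gradual_thr:
            transitions.append((i, "gradual"))
            i = end + 1  # skip past the frames covered by this transition
        else:
            i += 1
    return transitions
```

A cut shows up as one large frame-to-frame change, whereas a dissolve or fade only becomes visible once the small per-frame changes are accumulated over the window.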



Scene Segmentation of Video Sequences

A scene is defined as a group of video shots that are related to the same subject, e.g., chapters in movies or stories in news programs. In this work, I have developed a general framework for temporal scene segmentation in various video domains. The framework is formulated statistically and uses the Markov chain Monte Carlo (MCMC) technique to determine the boundaries between video scenes. In this approach, an arbitrary set of scene boundaries is initialized at random locations and then automatically revised using two types of updates: diffusion and jumps. The proposed framework has two major advantages: 1) it is able to find weak boundaries as well as strong ones, i.e., it does not rely on a fixed threshold; 2) it can be applied to different video domains. The proposed scene segmentation framework has been applied to home videos and feature films, and very promising results have been obtained for both domains.
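The diffusion and jump updates can be sketched with a small Metropolis-style sampler. This is a simplified illustration under stated assumptions, not the formulation used in the work: each shot is reduced to one scalar feature, the score is the negative within-scene squared deviation minus a hypothetical per-boundary penalty, and both proposals are symmetric so that no proposal-ratio correction is needed.

```python
import math
import random


def log_score(values, boundaries, penalty):
    """Higher is better: homogeneous scenes and few boundaries (assumed score)."""
    cost = 0.0
    edges = [0] + sorted(boundaries) + [len(values)]
    for start, stop in zip(edges, edges[1:]):
        scene = values[start:stop]
        mean = sum(scene) / len(scene)
        cost += sum((v - mean) ** 2 for v in scene)
    return -cost - penalty * len(boundaries)


def mcmc_segment(values, iterations=5000, penalty=2.0, seed=0):
    """Return the best-scoring boundary set visited (needs at least 2 shots)."""
    rng = random.Random(seed)
    candidates = list(range(1, len(values)))  # a boundary sits before this shot
    current = set(rng.sample(candidates, min(3, len(candidates))))
    current_score = log_score(values, current, penalty)
    best, best_score = set(current), current_score
    for _ in range(iterations):
        proposal = set(current)
        if rng.random() < 0.5:
            # Diffusion: shift one existing boundary by a single shot.
            if proposal:
                b = rng.choice(sorted(proposal))
                moved = b + rng.choice((-1, 1))
                if 1 <= moved < len(values) and moved not in proposal:
                    proposal.remove(b)
                    proposal.add(moved)
        else:
            # Jump: add a boundary if absent, remove it if present.
            proposal.symmetric_difference_update({rng.choice(candidates)})
        score = log_score(values, proposal, penalty)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if score >= current_score or rng.random() < math.exp(score - current_score):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = set(proposal), score
    return sorted(best)
```

Because acceptance depends on a score difference rather than on a fixed cut-off, a weak boundary can survive whenever keeping it explains the data better than removing it.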



Story Detection in News Videos

Detecting stories in news videos is of particular interest. The results of news story segmentation can be applied to further tasks such as video summarization, indexing, and retrieval. In this work, I have developed a new framework for segmenting news programs into different story topics. The proposed method is built on the Shot Connectivity Graph and utilizes both the visual and the textual content of the video. Through a series of anchor detection, weather and sports news localization, and story merging processes, the input news video is finally segmented into stories, each of which consists of semantically coherent content. This work achieved very high accuracy in the TRECVID 2004 evaluation, and the UCF vision team was invited to give an oral presentation at the forum.



Scene Structuring in Continuously Recorded Videos

Instead of finding abrupt scene boundaries, this approach represents video scenes by representative feature values, such as color statistics, and describes each portion of the video by a fuzzy number computed from membership functions with respect to those representative values. The representative feature values are obtained by applying a spectral clustering technique. The scene segments are then determined according to the preferred criteria. Unlike shot-based methods, the proposed method finds scene boundaries based not only on the data (video shots) but also on user preference.
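As a rough illustration of the fuzzy representation, the sketch below assumes that the representative values are already available (for example from a clustering step) and that each video portion is a single scalar feature. The inverse-distance membership formula is borrowed from fuzzy c-means as a stand-in, and `min_confidence` is a hypothetical knob standing for the user preference; neither is taken from the original method.

```python
def fuzzy_memberships(value, centers, fuzziness=2.0):
    """Degree to which one feature value belongs to each scene; sums to 1."""
    distances = [abs(value - c) for c in centers]
    exact = [d == 0.0 for d in distances]
    if any(exact):
        # The value coincides with one or more representatives.
        return [1.0 / sum(exact) if hit else 0.0 for hit in exact]
    exponent = 2.0 / (fuzziness - 1.0)
    weights = [d ** -exponent for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def label_portions(values, centers, min_confidence=0.6):
    """Assign each portion to a scene, or None when no membership is strong enough."""
    labels = []
    for value in values:
        memberships = fuzzy_memberships(value, centers)
        best = max(range(len(centers)), key=lambda k: memberships[k])
        labels.append(best if memberships[best] >= min_confidence else None)
    return labels
```

Raising `min_confidence` leaves more portions unassigned and so yields more conservative scene segments, which is one way a user preference could influence the result.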


Video Content Processing and Understanding

Spatiotemporal Visual Attention Detection

The human visual system actively seeks interesting regions in images and videos to reduce the search effort in object detection tasks. Similarly, prominent actions in video sequences are more likely to attract a viewer's attention first than their surroundings. In this project, we have developed a spatiotemporal video attention detection framework for detecting the attended regions that correspond to both interesting objects and actions in video sequences. Homographies are estimated between video frames and used to detect motion saliency, and a hierarchical structure is constructed for the color-based spatial saliency computation. The temporal and spatial saliency maps are then fused dynamically to generate the spatiotemporal visual attention model, with a bias towards the temporal model. This work has been tested on multiple sequences to highlight the target objects and/or activities.
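One possible reading of the dynamic fusion is sketched below. It assumes both maps are same-sized grids with values in [0, 1]; the weighting function (motion contrast raised to the power 1/bias) is a hypothetical choice that merely reproduces the stated behavior, namely that the temporal map gains weight quickly once noticeable motion is present.

```python
def fuse_saliency(temporal, spatial, bias=2.0):
    """Fuse two same-sized saliency maps given as nested lists of values in [0, 1]."""
    flat = [v for row in temporal for v in row]
    motion_contrast = max(flat) - min(flat)  # 0 when nothing moves differently
    # Hypothetical dynamic weight: concave in the contrast, so it favors
    # the temporal map as soon as some motion contrast exists.
    w_temporal = motion_contrast ** (1.0 / bias)
    return [
        [w_temporal * t + (1.0 - w_temporal) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(temporal, spatial)
    ]
```

With no motion contrast the result falls back to the spatial map alone, and with maximal contrast it reduces to the temporal map.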



Semantic Linkage of News Stories

In this project, we have developed a novel framework for the semantic linking of news topics. Unlike conventional video content linking methods, which are based only on video shots, the proposed framework links news stories across different sources. The semantic linkage between news stories is computed from their visual and textual similarities. The visual similarity is computed on story key-frames both with and without detected faces. The textual similarity is computed using the automatic speech recognition (ASR) output of the video sequences. The output of the story linking method can be used to compute the ranking or interestingness of a news story. The developed method has been tested on a large open-benchmark dataset from TRECVID 2003 by NIST, and very satisfactory results have been obtained for both of the proposed tasks.

Yun Zhai and Mubarak Shah, "Tracking News Stories Across Different Sources", ACM Multimedia 2005, Singapore, November 6-12.
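The textual side of the linkage can be sketched as follows. The example assumes that ASR transcripts are plain strings, uses bag-of-words cosine similarity as a stand-in for the textual measure, and combines it with an externally supplied visual similarity through a hypothetical weight; the source does not specify how the two similarities are actually combined.

```python
import math
from collections import Counter


def text_similarity(transcript_a, transcript_b):
    """Cosine similarity between bag-of-words vectors of two ASR transcripts."""
    a = Counter(transcript_a.lower().split())
    b = Counter(transcript_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty transcript is similar to nothing
    return dot / (norm_a * norm_b)


def story_linkage(visual_sim, textual_sim, text_weight=0.5):
    """Weighted combination of both similarities (the weight is an assumption)."""
    return text_weight * textual_sim + (1.0 - text_weight) * visual_sim
```

Stories from different broadcasters whose combined score exceeds a chosen level could then be linked, and the number of links a story receives is one conceivable basis for ranking it.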



Movie Scene Classification Using Finite State Machines

Among the many genres of video production, feature films are a vital field for the application of video understanding tools. Feature films are produced in accordance with “film grammar”, a set of rules for how films should be made to reveal their story lines. In this work, we utilized the knowledge of film grammar and modeled movie scenes using Finite State Machines (FSMs). Three scene categories are modeled: action, dialog, and suspense scenes. The method analyzes the structural information of the scenes based on low-level and mid-level features. Experiments on over 80 movie scenes have demonstrated the usefulness of FSMs, achieving high recall and precision.
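To make the FSM idea concrete, here is a minimal sketch for the dialog category only. The shot labels ("A" and "B" for shots showing one of two speakers, anything else for other shots), the state names, and the rule that a dialog needs an A-B-A or B-A-B alternation are all hypothetical; the actual machines and features used in the work are not described in this summary.

```python
# Hypothetical transition table: state -> {shot label -> next state}.
DIALOG_FSM = {
    "start": {"A": "a", "B": "b"},
    "a": {"A": "a", "B": "ab"},
    "b": {"B": "b", "A": "ba"},
    "ab": {"B": "ab", "A": "dialog"},
    "ba": {"A": "ba", "B": "dialog"},
}


def is_dialog_scene(shot_labels, transitions=DIALOG_FSM):
    """Return True if the label sequence drives the machine into the accepting state."""
    state = "start"
    for label in shot_labels:
        # Any label without a transition (e.g. an unrelated shot) resets the machine.
        state = transitions[state].get(label, "start")
        if state == "dialog":
            return True
    return False
```

For instance, `is_dialog_scene(["A", "B", "A"])` is accepted, while a run of shots showing a single speaker is not. Action and suspense scenes would need their own tables over different mid-level labels.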


Content-Based Image/Video Retrieval

Relevance Feedback Using Keyword and Region-Based Refinement

In this project, we have developed an on-line content-based video retrieval system, PEGASUS. It retrieves relevant video shots from a news database according to user queries. The system indexes the video data using speech features (obtained through ASR) and image features (key-frame regions). In addition, the system provides a relevance feedback mechanism, based on query expansion and region-based image matching, that allows users to refine their search results through an iterative process. The PEGASUS system has been built on more than 43,000 video shots of news programs. This system can be accessed on the internet at the address of



TREC Video Retrieval Forum

TRECVID is an annual research forum organized by the US National Institute of Standards and Technology (NIST). It encourages research in the classification, searching, and retrieval of news video data, and provides a worldwide communication platform for researchers to share their ideas. We, the UCF vision team, have participated in the tasks of shot boundary detection, news story segmentation, camera motion classification, high-level semantic concept detection, interactive topic search, and TV rushes exploitation. I was personally deeply involved in all of these tasks except high-level feature detection. We achieved top performance in the detection of the “beach” feature and in the story segmentation task.