Monday, July 3, 2006

Multimodal Video Characterization and Summarization

Smith, Michael, and Kanade, Takeo. (2005) Multimodal Video Characterization and Summarization. Kluwer Academic Publishers. Boston, USA.
Keywords: summarization, video terminology.

This is a technical book covering aspects of video production, analysis and further processing, at both research and production levels. In this context the term summarization is mainly related to compression techniques and pattern recognition, though there is a section where the authors classify the summarization of content. Its chapters are: 1) Introduction, 2) Video Structure and Terminology, 3) Multimodal Video Characterization, 4) Video Summarization, 5) Visualization Techniques and 6) Evaluation.
Useful quotes.
  • Video Terminology. (from production standards, and image, audio and video processing technology) P18-20
  • Video-Feed-Stream (audio and image). The terms video, feed and stream will be used to represent a sequence of images and audio. These terms will seldom be used to describe the image track without audio. During production or broadcast, the feed is often described separately as an audio feed or video feed.
  • Full-Length Video. The term full-length is used to describe a video that is produced professionally with studio quality. This usually implies feature films, documentaries or news programs.
  • Shot (image only) This term defines a single camera shot. It is also used to describe change or borders of image content. Cuts, Fades, Dissolves and Wipes for example, delineate shots. [Zhang95]
  • Frames (images). Frames refer to the actual image portion from the image track of the video.
  • Show-Story-Segment (image and audio). The terms show, story and segment may be used interchangeably, but in all cases they represent a video that is independent of other content. Shows are typically shown on broadcast television for home use. An example might be a common sitcom, movie, talk show, etc. A segment or story usually represents a smaller portion of a show, such as a single topic in a full-length news broadcast.
  • Scene (image and audio) A scene is a subset of the video that makes up a semantic unit. This unit may consist of several shots and phrases.
  • Audio Track. (audio) The audio track refers to the actual audio stream in the video. This may be compressed, although many applications still use uncompressed WAV files for faster analysis.
  • Word (audio or text) A word is a portion of the audio signal or text transcript which represents a single word.
  • Transcript (text) The transcript is an ASCII representation of the spoken words in an audio or video signal. It is usually provided through speech recognition, closed-captions, production notes, or an actual script. The transcript may be included in production notes and it generally refers to that part of the audio that is spoken. Some production notes contain information about sound and imagery effects.
  • Phrase (audio or text) A phrase is a collection of words that make a semantic unit of audio or text. A phrase may bound the range of a sentence, but this is not a necessary condition. A sub-region of a sentence may also serve as a phrase.
  • The notation for video timing and duration follows a common format; hours, minutes, seconds and frames are listed respectively. The format is Hours (0-N): Minutes (0-59): Seconds (0-59): Frames (0-29). Example: 0:08:56:15. P20
  • Video Categories used in the book: Documentaries, Broadcast News, Feature films, Sports, Play-based sports, Continuous Action Sports, Sports Replay. (Cf. P21-29)
  • Video Captions- Text in video provides significant information as to the content of a scene. For example, statistical numbers and titles are not usually spoken but are included in captions for viewer inspection. Moreover, this information does not always appear in closed captions so detection in the image is crucial for identifying potential skim regions. P36
  • The position of the subject is an important and simple procedure for conveying a specific theme in video. Two common procedures, which concentrate on positioning, are Viewer Dialogues and Close-ups. (P41)
  • Viewer Dialogues – This involves a character on screen talking directly to the viewing audience. This effect was popular in suspense movies prior to the 1950’s. It is seldom seen today and is mostly used as a method of comic relief. Broadcast news employs constant anchorperson and viewer dialogue, but not as a special effect. (P41)
  • Close-up Shots. When an object or person is placed close to the camera, it consumes the majority of the viewing space and serves as the dominant subject in the scene. This effect is used in varying degrees with all types of video. (P41)
  • Other Technical features (Cf. P43-48)
  • Angle shots. High and Low Angle Shots, wide-angle shot
  • Camera focus. Pulling Focus, Isolation focus, shallow focus
  • Lighting and Mattes. Low lighting, Dramatic lighting, Silhouettes, Matte effect
  • Grayscale video
  • Audio effect
  • Video Summaries have great potential for use in digital video libraries, as well as other mediums that use video. A number of summary representations similar to the video skims research from Carnegie Mellon University [Smith 1997] are used in broadcast television. Examples of these representations include the following:
  • News Summaries – Many news stations provide a short summary of regional, state, or world news. An example of a news summary is the “World in a Minute” from WPXI, Channel 11 News, Pittsburgh, Pennsylvania. As the title indicates, this one-minute segment displays several world news stories, each usually 8 seconds in duration. The purpose of the news summary is to display the most important news from around the world in a short time.
  • Sports Highlights – Sports highlights are common in local and world news… A segment shows several isolated portions of a sporting event with commentary as audio. The purpose of the highlight video is to convey points of interest in a short amount of time.
  • Recorded Sports Broadcasts – A sporting event that is rebroadcast at some later time is often edited to remove ambiguous content or to shorten the duration. In Japan, baseball games that are rebroadcast need to fit into a one-hour time slot. The plays are parsed manually to create a shorter version of the original game… For these broadcasts highlights are aired that usually include scores, long plays and turnovers.
  • Movie trailers and previews – Short video trailers and previews are produced to attract potential customers for feature films. Conveying the content truthfully is not the primary motivation. The selection of segments is based more on producer preferences, which include exciting and climactic video.
  • Introductory segments. Most documentaries include a short video abstract prior to the full segment. They contain image and some audio portions from the full segment. The purpose of the abstract is to convey the overall content of the later segments. P49-50
  • Video summarization selects appropriate keywords or keyphrases and a corresponding set of images to create a shortened video abstraction or video skim. P111
  • A skim is a compilation of “important” audio and image regions from a video into a smaller semantic unit. It results in a motion video that is much shorter than the original and retains the same semantic meaning. P112
  • Visualization techniques provide user interfaces for applications in video characterization. They describe content and present video for specific types of summarization applications. Video skims, poster frames, thumbnails, text titles, and segmentation applications are described in this chapter. P149
  • Browsing. A number of commercial browsing systems exist today, both as software and hardware devices. They include:
    DVD – chapter skipping and digital Fast Forward and Reverse
    Analog Fast Forward – A noisy medium that distorts
    Accelerated Playback – Pitch control technology enables the acceleration and deceleration of audio or video playback without annoying high frequency artefacts.
    Roll Bar. Cylindrical Rolling tool in many analog videocassettes
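As a side note to the timing notation quoted above (hours:minutes:seconds:frames, P20), here is a minimal Python sketch of how such a timecode could be turned into an absolute frame count and back. It is not taken from the book. The function names are hypothetical, and the rate of 30 frames per second is an assumption inferred from the quoted frame range of 0-29; drop-frame timecode is deliberately ignored.

```python
import re

# Assumption: 30 frames per second, inferred from the quoted frame range 0-29.
FRAMES_PER_SECOND = 30

# Hours may have any number of digits; the other fields have one or two.
_TIMECODE = re.compile(r"^(\d+):(\d{1,2}):(\d{1,2}):(\d{1,2})$")


def timecode_to_frames(timecode, fps=FRAMES_PER_SECOND):
    """Convert 'H:MM:SS:FF' into a frame count from the start of the video."""
    match = _TIMECODE.match(timecode.strip())
    if match is None:
        raise ValueError(f"not an H:MM:SS:FF timecode: {timecode!r}")
    hours, minutes, seconds, frames = (int(group) for group in match.groups())
    # Enforce the ranges given in the notation: 0-59, 0-59 and 0-(fps - 1).
    if minutes > 59 or seconds > 59 or frames >= fps:
        raise ValueError(f"field out of range in timecode: {timecode!r}")
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def frames_to_timecode(total_frames, fps=FRAMES_PER_SECOND):
    """Convert a frame count back into the 'H:MM:SS:FF' notation."""
    if total_frames < 0:
        raise ValueError("frame count must be non-negative")
    total_seconds, frames = divmod(total_frames, fps)
    total_minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}:{frames:02d}"


if __name__ == "__main__":
    count = timecode_to_frames("0:08:56:15")  # the example quoted from P20
    print(count, frames_to_timecode(count))
```

Under these assumptions the quoted example 0:08:56:15 is 8 minutes, 56 seconds and 15 frames into the video, and converting the frame count back yields the same string.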
Further reading
  • Smith, M., Kanade, T. “Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques”. Computer Vision and Pattern Recognition. San Juan, PR, June 1997.
  • Ju, S. X., Black, M. J., Minneman, S., and Kimber, D. “Summarization of Videotaped Presentations: Automatic Analysis of Motion and Gesture”. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, September 1998.
Alberto Ramirez Martinell