Exploring and Constructing Video in Improvisational Manner

Paul Nemirovsky
The Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
e-mail: pauln@media.mit.edu

Gene Shuman
The Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
e-mail: gshuman@mit.edu
Abstract
How can machines help us to manipulate and structure audiovisual media
in ways that are always novel and are uniquely ours? How can such construction
happen in real time, with no precise planning or guidance given by the user?
The Emonic Environment (EE), the system described in this paper, enables
improvisational construction and navigation of media space, both by individuals
and by groups. Participants either control the system directly (e.g., real-time
recording, processing, and performance of audio, video, and text, or exchange
with remote users and online databases), or provide only a higher-level
structural guidance, letting the underlying genetic algorithms control the
low-level details. The system’s behaviour and content are controlled using a
keyboard or mouse, as well as microphones, cameras, sensors, MIDI controllers,
and cell phones.
This paper focuses on the newest aspect of the EE: a capability to
manipulate video, and the methods by which the system’s control structures,
assisted by genetic algorithms, make it possible for the participants to
manipulate video without having to attend to the minute parametric details.
Characteristics of improvisational action are described, as well as the rationale for our particular design. Two architectural notions aimed at encouraging real-time structural thinking in the creative context are introduced: those of content abstraction and structural control. This is followed by a description of a relevant subset of the EE’s implementation: temporal and new video elements. We conclude with remarks on future work.
1. Introduction
The Emonic Environment (EE) is a system for media creation and
exploration based on characteristics of improvisational action, detailed below
and in [1]. The EE’s architecture allows its users to manipulate video, audio
and text in a nearly identical fashion. Additionally, the EE utilizes genetic
algorithms to introduce mutations of the media space, including both the
content and its controlling structures, that might be of interest to its users.
The EE’s aim is to draw people into exploring structural behaviours and
strategies rather than individual aspects of the media space.
One of the primary characteristics of improvisational action is content
abstraction. In the context of human-computer interaction, content
abstraction means presenting users with tools for structural control of the
media space (i.e., control of its density, intensity, and behaviour over time),
while making the particulars of media manipulation controllable by the machine
(or other users). Two conditions are necessary for content abstraction: (1)
functions used to control media need to be as medium-independent as possible
(video, sound, and text controlled in the same fashion) and (2) it should be
possible to easily make links between these functions and an abstract network
that provides the structural control. The EE’s premise is to abstract away from
the low-level control of particular media properties and towards a higher-level
functional view of the media space, leading the participants to consider such
configurations of the space that they would not arrive at on their own. To
facilitate such a functional view of the space, its components need to be
modular, easy to interconnect, and capable of manipulating the entire spectrum
of media. The previous version of the EE was limited in that regard – it worked
with audio only. In this paper we describe our attempt at bringing the EE
closer to the idea of content abstraction by incorporating video manipulation.
The rest of the paper is organized as follows: we start by presenting an updated description of the characteristics of improvisational action. We then discuss the rationale for our design, articulating the key notions of content abstraction and structural control. This is followed by a description of a relevant subset of the EE’s implementation: temporal and new video elements. We conclude with remarks on future work.
2. Characteristics of Improvisational Action
Improvisational action, defined here in the context of human-computer
interaction, implies that no predefined rules, plans, or objectives of the
action exist. Instead, the media space being explored is regarded as an
evolving structure, its configuration guided by real-time genetic algorithms
and human feedback to continuously restructure itself. The ten characteristics
that follow are inspired by a mix of non-idiomatic improvisational and experimental
music traditions ([2], [3]).
1. Changing, Multi-level Focus. Whether deciding what to do next, or reflecting on
past actions, improvisers employ different levels of abstraction
simultaneously. Switching between representations, they attend to the minute
details at one moment, only to shift to looking at an overall structural
development (e.g., a climax) a second later.
2. Dynamic Structural Rules. Improvisation, like composition, has rules concerning
elements’ interrelationship in terms of time, volume, and other perceptual
characteristics. Unlike in composition, however, these rules and the structure
they comprise need not be predetermined. Instead, they are often created and
modified on the spot, defining the improvisation’s character. For instance, an
improviser may spontaneously decide to repeat a given motif every few seconds,
or increasingly desynchronize two ongoing motifs, regardless of what the actual
motifs are at the given moment.
3. Absence of Static Plan. Choosing the path to follow in exploring
and structuring media is a dynamic process that happens as the improvisation
unfolds. Instead of first creating the structure and then filling in the
content, improvisational plans are generated and evaluated on the spot. As a
result, improvisers are not too concerned with following an existing framework
in a perfect manner, focusing instead on creating new plans and learning from
unintended mistakes and unexpected successes.
4. Absence of Authoritative Score and Price of Mistakes. An improvisational performance, unlike that
of a composed piece, cannot be quantitatively compared with a pre-existing
deterministic score. The notion of a “mistake” therefore shifts from
non-compliance with predetermined solutions to a higher-level non-compliance
with aesthetic expectations. Because aesthetic expectations, unlike an
authoritative score, cannot be evaluated on a note-per-note basis, initiating
an unorthodox action becomes less threatening and risk-taking is encouraged.
Generating new actions becomes easier than following predetermined ones, with
improvisers free to exercise as much control over the creative action as they
desire while avoiding the responsibility that conformance to, or creation of,
a deterministic score implies.
5. Process, not Artefact Production, as the Goal. An improviser, unlike a feature-film
cinematographer, a Western composer, or a product designer, is not concerned
with producing a final viable artefact – a movie, a sonata, a pop song, or a
chair. While improvisation might be recorded and, as such, seen as a fixed
construct, improvisation is primarily a process of exploring, contextualizing
and interrelating memories, perceptions, and actions. Improvisers weave
together an array of “sketches” which gain relevance and meaning only as the
improvisation unfolds. The importance of individual elements lessens in favour
of that of the paths by which these elements appear, become significant, and
disappear – that is, the strategies for exploring the overarching structure.
Improvisers employ these strategies to find structure within chaos - only to
break it again a moment later, and start looking anew.
6. Relevance of Context. Improvisers do not follow a score; as a result, their decision-making is
guided both by explicit actions (their own and others’) and by their perception of
the moment in its entirety. In other words, improvisation is not formed in a
vacuum; it merges explicit decision-making with implicit context-gathering
from the environment in which the improvisation is being created.
7. Distributed Responsibility and Control. In an improvisational performance, no fixed contract
specifying responsibilities of control (i.e., a balance of power) exists
between the performers. The degree of control assumed over the improvisation by
each participant is set dynamically, following implicit and explicit
negotiations. Improvisers are always free to renegotiate what and how they
control during the course of improvisation, thus freeing them from
preoccupation with every aspect of the creative action, and making
experimentation easier.
8. Audience as a Participant. From the passive audience of linear storytelling
systems to the nearly equally passive audience of multiple-choice interactive
environments, a strict giver / taker dichotomy has been enforced between the
consumer (the audience) and the producer (the performer). In the context of
improvisation such a distinction is obsolete; anyone can co-improvise, so long
as the effect of his activity is seen or heard in some way by the other
performers. As a result, any audience becomes a pool of potential participants
who, even when not actively participating in the act of media creation, are
regarded as part of the improvisational circle.
9. Timeframe and Obligation to Participate. In the compositional paradigm, a performer’s
participation is expected to last as long as is required for the performance of the
piece to be completed. Walking out in the middle of the creative action means
that the action is stopped until the active participants’ return. With time,
social norms have emerged that prevent such walkouts, as manifested in events
such as concerts of classical music: events that force the participants into a
highly ritualized act of performance with no escape until the end of the action
is reached. An improvisational setting, on the other hand, allows for a more
relaxed participatory mode: people are no longer forced into a particular length
of non-stop participation and are free to commence or end their participation
whenever desired.
10. Immediacy of Feedback. Unlike the compositional paradigm, with its sequential two-layered
creative process (compose, then perform), improvisational action blends the
two. As a result, both structures and content are evaluated and incorporated
(or rejected) in real time, whether coming from one of the improvisers or from
an external source.
3. Design Rationale
3.1. Time-based Content Abstraction
Media systems today are typically limited in their treatment of time and context[1]
when considering the placement and relevance of a given piece of media content. In
traditional media systems, content elements follow each other in a fixed,
predetermined order. In loop-based systems, content elements occur at set
intervals for the duration of the loop condition, with temporal relationships
defined between individual pieces of content. In both cases, the focus is on
the content rather than on the structure. As a result, temporal characteristics
are hard to abstract (e.g., instead of arranging events on a timeline, specify
that element A is to occur m times for each n occurrences of
element B). The bigger problem however is that the temporal structures (e.g.,
the relationship between elements A & B at any given point of the
performance) are fixed in their location and interrelationship with other
temporal elements unless a manual change is performed by their users. In
designing content abstraction into the EE’s temporal elements, our first
requirement therefore was to provide a unified temporal treatment for audio,
video, and text. The elements had to be modular and capable of dynamically
changing their placement in the events chain.
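As a minimal illustration of the kind of ratio-based temporal relation mentioned above, the following sketch (in Java, with hypothetical names not taken from the EE) maintains an m : n occurrence ratio between two elements without pinning either to an absolute timeline:

// Hypothetical sketch: keep element A occurring m times for every n occurrences
// of element B, computed incrementally rather than laid out on a fixed timeline.
public class RatioSchedule {
    private final int m, n;     // target ratio: m firings of A per n firings of B
    private int countB = 0;

    public RatioSchedule(int m, int n) { this.m = m; this.n = n; }

    // Called whenever B occurs; returns how many times A is due right now
    // so that the cumulative counts stay as close to m : n as possible.
    public int onB() {
        countB++;
        return (countB * m) / n - ((countB - 1) * m) / n;
    }
}

For m = 3 and n = 2, successive calls return 1, 2, 1, 2, and so on, giving three occurrences of A for every two of B.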
Media content is often viewed as pattern-able (e.g., the endless libraries of
available drum loops). However, learning, making apparent, and manipulating action
patterns (elements’ interrelationships over time) is still absent from
real-time interaction in a computer context. Our second requirement for content
abstraction was therefore to make the EE’s temporal elements responsive to real-time
evolutionary processes and to users’ feedback.
Finally, synchronization of media is also typically limited to a
within-component synch. For instance, audio and video may be synchronized
within a video clip, yet the ‘behaviour’ of their synchronization – that is,
the change in synch between the two over time – is rarely scriptable on its
own. Our third requirement was to make construction of such action patterns
possible, both by users and by the machine.
3.2. Structural Video Manipulation
Most people with no expertise in sound and video editing see the two as
qualitatively different. Sound is seen as fluid and open to manipulation
(anybody can create audio effects with their own mouth), while video is seen as
fixed, its manipulation considered a complex and cumbersome affair. By
incorporating video within the Emonic Environment we hope to show that video
can be manipulated much like sound: in real time and unconstrained by its
initial shape.
Today most of the off-the-shelf video editors regard video manipulation
as an editing process, performed in stages and aimed at producing a
fixed result: first shoot, then digitize, then position on a timeline, then cut
into fixed pieces and connect them with fixed transitions. The end output is similarly
fixed, and, intriguingly, almost always square in shape. More than twenty years
after the release of interfaces such as FairLight [4], real-time video
manipulation interfaces mostly remain complex, expensive, and without the type
of instantaneous plug-camera-and-manipulate interaction that would be appealing
to many of today’s computer users. Systems that do allow such interaction
require custom hardware or a solid understanding
of visual manipulation principles (e.g., VJamm [5]).
The EE is currently focused on sample-based media rather than on purely
algorithmic processing. The rationale is quite simple: people love seeing and
hearing themselves and may possess a more intuitive understanding of the core
elements behind sample-based media (i.e., photos, videos) than of the
parameters behind purely generative art. Operating on samples provides the
added benefit of being able to draw on the vast amounts of freely available
sampled media.
Some of the concepts mentioned in this paper will be familiar to users of
the more novel video editors (e.g., Jitter [6]), which allow, effect-wise,
video manipulation similar to, and beyond, that described here. The
core difference between such editors and the Emonic Environment is the added
possibility of structural control, assisted by genetic algorithms. Structural
control implies that participants can focus on higher-level control strategies,
leaving the manipulation of component-specific parameters to the machine (or to
collaborating users). In that way, media contributed by a participant can be
changed in ways its contributor would not consciously consider or imagine. To
effect change, the user only needs to interact with generalized structural
controllers (unless he desires precise control over a particular function).
Today, media spaces (e.g., a video clip or an audio database) are
typically viewed as consisting of discrete elements upon which media
manipulations are performed. Integration of structural control and content
abstraction is aimed at presenting a network view of the media space; a network
of structural elements controlling media behaviour over time, regardless of the
particulars (media type, content, etc.). If we succeed in getting people closer
to realizing that the same structure can be used to control differing content,
we bring our users a step closer to becoming creators who think about context
and structural development.
We believe structural control and content abstraction provide a
powerful way to encourage users to think structurally. Traditional editors
require users to first learn abstract concepts and then apply these to media
content in order to design finished structures (pieces). In the EE, users play
with content while figuring out how it constitutes a structure. By browsing
through disjoint content elements and manipulating how the content is
controlled over time, users end up creating structural control networks (or use
the ones suggested by the machine).
4. Implementation
The EE, written in Java, manipulates audio, video, and text in shared
and individual contexts, and utilizes sensors, microphones, cameras, and cell
phones as input devices. The features described below represent only the
fraction of the EE relevant to understanding the video manipulation
capabilities presented later in this paper. We start by describing the overall
architecture and then describe a number of elements that constitute the
core of medium-independent processing.
The EE consists of two main processing layers, Perceptual and Structural,
each represented as an independent network of interconnected elements, and an
auxiliary Mediated layer, connecting the two main layers. Simply put,
the Perceptual layer defines what and how we hear, see, and read,
while the Structural layer defines when and why. The operation of
the system is purely real-time, with no offline processing.
The participants interact with each layer in the following three ways: (1) directly setting the individual component properties within each layer, (2) providing feedback to the built-in genetic algorithms that are evolving the network’s state, and (3) contributing and exploring media and defining its interrelationships with the control structures.
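As an illustrative and necessarily hypothetical reading of this organization, the skeleton below sketches the two layers as Java interfaces; the names and signatures are ours, not the EE’s actual source:

// Perceptual layer: an emon generates or modifies media when triggered
// (by a beat, by a Mediating Action, or by the evolutionary process).
interface Emon {
    void trigger();
}

// Structural layer: a node holds a continuously decaying activation level
// and exchanges stimuli with the nodes it is connected to.
interface Node {
    void stimulate(double strength);
    double activation();
}

// A Mediator (Section 4.3) would map node activity onto emon properties,
// keeping the two layers independent of each other.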
4.1. Perceptual Layer
The Perceptual layer is populated with emons[2]
of various types, each with its own set of features. The emons control how
the media is generated, modified, and played back (e.g., speed or volume of a
sound, rotation angle of a video frame, semantic relationships of a piece of
text). Emons’ modular architecture (ability to interconnect with other emon
types) allows for the creation of nested processing structures. Overall, emons
can be thought of as a media-processing engine where, by combining multiple
types of emons, the Perceptual layer can be built to taste and reconfigured in
real time. Viewed sequentially, the processing can be seen to originate from a
single, repetitive beat; propagating through tempo adjusters to become faster
or off-beat; propagating again through filters which alter the overall pattern
of emons’ processing; and finally ending with a collage of temporal signals
interpreted by effect producers, retrieval mechanisms, and audio/video/text
players, with the result being output into one or more physical environments.
4.2. Structural Layer
The Structural layer is populated with nodes, structural
constructs entwined to create a network providing the participant with a
higher-level abstracted instrument for observing and influencing the ongoing
activity within the EE. The layer is modelled as a recurrent neural network,
allowing evaluation of events’ propagation over time and concerned solely with
change of its elements’ activities. As a purely abstract system for controlling
objects, it plays no role unless connected to the Perceptual layer. Each node
has its own activation level, its value continuously decaying. The nodes
communicate by sending stimuli, which can originate at any point in the
network. When a node is triggered, implying that its activation level is beyond
a propagation threshold, it sends stimuli of proportional strength to all the
connected nodes. The growth and decay in the nodes’ activation levels can be
controlled individually or en masse, by participants, other nodes, MIDI
controllers, sensors, or the ongoing evolutionary process.
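A compact sketch of this behaviour might look as follows (hypothetical class and parameter names; the reset-after-firing step and the 0.5 propagation scaling are our assumptions, not documented details of the EE):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a structural node with a decaying activation level
// that propagates stimuli once a propagation threshold is exceeded.
class StructuralNode {
    private double activation = 0.0;
    private final double decayPerTick;          // how fast activation fades
    private final double propagationThreshold;  // firing point
    private final List<StructuralNode> connections = new ArrayList<>();

    StructuralNode(double decayPerTick, double propagationThreshold) {
        this.decayPerTick = decayPerTick;
        this.propagationThreshold = propagationThreshold;
    }

    void connect(StructuralNode other) { connections.add(other); }

    // A stimulus may originate from a user, a sensor, another node, or the GA.
    void stimulate(double strength) {
        activation += strength;
        if (activation > propagationThreshold) {
            // Send stimuli of proportional strength to all connected nodes.
            for (StructuralNode n : connections) n.stimulate(activation * 0.5);
            activation = 0.0;   // assumption: activation resets after firing
        }
    }

    // Called continuously: activation decays over time.
    void tick() { activation = Math.max(0.0, activation - decayPerTick); }
}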
4.3. Mediator
The Mediated layer maps the neural activity of the Structural layer onto
the media processing activity of the Perceptual layer. The mapping is reconfigurable
in real time and unifies the layers into a framework for improvisational
action. The Mediator allows each Structural node to have an unlimited
number of thresholds that indicate that some action is to be taken. Each time a
node passes one of its thresholds, a corresponding Mediating Action is
performed. Mediating Actions consist of one or more functions that use the
information received from the associated node to modify one or more properties of
the Perceptual emons. Mediating Actions may be very precise, controlling
low-level particulars of a given emon (e.g., ‘change the strength of the green
channel of the associated video file to the next within the provided envelope
of data points’) or high-level (e.g., ‘desynchronize the ongoing audio events a
bit more’). Mediating Actions are also tradable with other participants within
the EE. The Mediator insulates the two main layers, allowing for their
independence and thus making it potentially possible to replace either layer
with a different type of controller or media system if desired. Our method for
navigating (evolving) media networks utilizes the Mediator to access the
properties of components located at both Structural and Perceptual layers. The
genetic algorithms that facilitate the navigation mutate the properties, biased
by the participants’ feedback and currently active constraints on the
evolutionary process.
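The threshold-to-action mapping could be sketched as follows (hypothetical names; in the EE the actions would set emon properties such as channel strength or playback speed):

import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

// Illustrative sketch: one structural node's thresholds paired with Mediating
// Actions that adjust Perceptual-layer properties when a threshold is crossed.
class MediatorEntry {
    private static class ThresholdAction {
        final double threshold;
        final DoubleConsumer action;   // receives the node's current activation
        ThresholdAction(double threshold, DoubleConsumer action) {
            this.threshold = threshold; this.action = action;
        }
    }

    private final List<ThresholdAction> actions = new ArrayList<>();
    private double lastActivation = 0.0;

    void addAction(double threshold, DoubleConsumer action) {
        actions.add(new ThresholdAction(threshold, action));
    }

    // Called whenever the associated node's activation changes; fires every
    // Mediating Action whose threshold was just crossed upward.
    void onActivation(double activation) {
        for (ThresholdAction ta : actions) {
            if (lastActivation <= ta.threshold && activation > ta.threshold) {
                ta.action.accept(activation);
            }
        }
        lastActivation = activation;
    }
}

A Mediating Action might then be registered as entry.addAction(0.7, a -> sample.setOpacity(a)), where sample.setOpacity stands in for whatever property setter the target emon exposes.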
4.4. Emons
Multiple emon types exist within the EE. For brevity’s sake, we only
describe two categories of emons: (1) system emons, which deal with
temporal properties of the EE and apply equally to audio, video, and text, and
(2) video emons.
All the emons can be activated by (1) connected Tempo emons, (2) Actions
designed by the participants, or (3) the ongoing evolutionary process driven by
built-in genetic algorithms and users’ feedback.
Emon :: Master Beat (MB). MB is the emon that provides the pulse to be used by any other elements
within the Perceptual network as a synchronization reference. On each of its
beats, it fires signals to all of the directly connected emons, thus prompting
action within the rest of the system. MB’s tempo can be changed at any time: a
slower tempo results in a network in which numerous actions occur within each
beat, while a faster tempo contributes to a fast-paced but less intricate
network. Changing the MB’s tempo can thus affect the entire ‘attitude’ of the
network.
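A minimal sketch of such a pulse source, reusing the hypothetical Emon interface from the skeleton above (timing via Thread.sleep is our simplification, not the EE’s actual scheduler):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch: a master pulse that fires all directly connected emons
// once per beat; its tempo can be changed while it is running.
class MasterBeat implements Runnable {
    private volatile double beatsPerMinute;
    private volatile boolean running = true;
    private final List<Emon> connected = new CopyOnWriteArrayList<>();

    MasterBeat(double bpm) { this.beatsPerMinute = bpm; }
    void connect(Emon e) { connected.add(e); }
    void setTempo(double bpm) { this.beatsPerMinute = bpm; }
    void stop() { running = false; }

    @Override public void run() {
        while (running) {
            for (Emon e : connected) e.trigger();           // fire one beat
            try {
                Thread.sleep((long) (60_000.0 / beatsPerMinute));
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}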
Emon :: Tempo Cycle (TC). TC emons depend on a supplied ‘parent beat’, breaking it into smaller,
more distinct bits of time. By nesting TC emons, participants create complex
polyrhythmic temporal structures without having to understand concepts of music
theory and composition that deal with time. TC emons can also introduce a time
delay, thus propagating a beat off-synch from the original beat. Such time
variance can be utilized to create an echo or stagger effect and make
asynchronous behaviours possible.
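A possible sketch of beat subdivision with an optional delay, again using the hypothetical Emon interface from the earlier skeleton (threads and millisecond arithmetic are our simplifications):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch: a Tempo Cycle that splits each parent beat into
// 'divisions' evenly spaced sub-beats, optionally offset by a delay, and
// triggers its children on each sub-beat; cycles can be nested into polyrhythms.
class TempoCycle implements Emon {
    private final int divisions;
    private final long parentBeatMillis;
    private final long delayMillis;
    private final List<Emon> children = new CopyOnWriteArrayList<>();

    TempoCycle(int divisions, long parentBeatMillis, long delayMillis) {
        this.divisions = divisions;
        this.parentBeatMillis = parentBeatMillis;
        this.delayMillis = delayMillis;
    }

    void connect(Emon child) { children.add(child); }

    @Override public void trigger() {                // called by the parent beat
        new Thread(() -> {
            try {
                Thread.sleep(delayMillis);            // optional off-synch offset
                for (int i = 0; i < divisions; i++) {
                    for (Emon child : children) child.trigger();
                    Thread.sleep(parentBeatMillis / divisions);
                }
            } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
        }).start();
    }
}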
Emon :: Action Filter (AF). No system is complete without the ability to ignore
the directives provided by others. Such ‘resistance’ of an element within a
network can be formalized using the concept of a mask or a filter. The Filter
emon implements a mutable ‘pattern of resistance’, propagating events at its
own discretion and filtering the incoming data in accordance with a preset
or a dynamically defined mask. The filter is a binary array, signalling which
events should and should not propagate. It operates in a circular fashion,
looping through the array in synch with the input events it is filtering.
Filters are useful in media performance scenarios, allowing for variation in
frequency of the output. For example, instead of a simple repetitive chain of
events playing an audio sample every beat, the filter allows for a structured
but non-monotonous repetition.
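The circular mask could be sketched as follows (hypothetical names, reusing the Emon interface from the earlier skeleton):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch: a filter that lets events through according to a circular
// binary mask. A mask of {true, false, false, true}, for instance, passes the
// first and fourth of every four incoming events.
class ActionFilter implements Emon {
    private final boolean[] mask;
    private int position = 0;
    private final List<Emon> downstream = new CopyOnWriteArrayList<>();

    ActionFilter(boolean[] mask) { this.mask = mask; }
    void connect(Emon e) { downstream.add(e); }

    @Override public void trigger() {
        boolean pass = mask[position];
        position = (position + 1) % mask.length;   // loop through the mask circularly
        if (pass) for (Emon e : downstream) e.trigger();
    }
}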
Emon :: VisualSample (VS). VS allows playback of video and still pictures. As such, it parallels
the functionality of AudioSample and TextSample emons. To play a picture or a
video, participants associate the emon with a video or an image, stored locally
or remotely, that they would like to manipulate. Using Quicktime [7],
individual frames of the video are extracted and passed to JOGL [8], an OpenGL
library we use for video manipulation. VS emon’s properties include (1) Start / Stop cues defining the
limits at which the playback begins and ends; (2) individually controllable
prominence of the video’s red, blue, and green channels; and (3) opacity
control defining the video’s prominence within the Master Visual Player emon’s
display space – its visibility given multiple videos layered on top and within
each other.
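As an illustration of what per-channel prominence and opacity amount to at the pixel level, the following sketch scales one frame’s channels with plain BufferedImage arithmetic; the EE itself performs such manipulation through JOGL, so this is a stand-in, not its actual code:

import java.awt.image.BufferedImage;

// Illustrative sketch: scale the red, green, and blue channels of one video frame
// by independent prominence factors and an overall opacity, as a VisualSample
// emon's properties describe.
class FrameTint {
    static BufferedImage apply(BufferedImage frame, double r, double g, double b, double opacity) {
        BufferedImage out = new BufferedImage(frame.getWidth(), frame.getHeight(),
                                              BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < frame.getHeight(); y++) {
            for (int x = 0; x < frame.getWidth(); x++) {
                int argb = frame.getRGB(x, y);
                int a  = (int) (((argb >>> 24) & 0xFF) * opacity);
                int rr = (int) (((argb >> 16) & 0xFF) * r);
                int gg = (int) (((argb >> 8) & 0xFF) * g);
                int bb = (int) ((argb & 0xFF) * b);
                out.setRGB(x, y, (clamp(a) << 24) | (clamp(rr) << 16) | (clamp(gg) << 8) | clamp(bb));
            }
        }
        return out;
    }
    private static int clamp(int v) { return Math.min(255, Math.max(0, v)); }
}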
Emon :: Master Visual Player (MVP). MVP emon is responsible for the overall video playback, resulting from the sum of all the ongoing visual manipulations. As such, it parallels the Master Audio Player and the Master Text Player emons. Its properties include (1) window size; (2) display ID to allow multi-display scenarios with the network placed on one screen and the visual result on another; (3) coordinates on screen (modifiable by Actions and evolutionary parameters to make the video move on screen); and (4) layering of active VisualSample emons.
Emon :: Visual Subdivide (VSD). VSD emon divides and replicates its input (a
VisualSample or another VSD emon). VSD emon’s properties are two arrays of
ratios defining how the original size of the incoming image will be divided on
X and Y axes, with each division becoming a clone of the original image. For
example, setting the X property to [1, 2, 3] will result in three horizontal
copies of the source signal, arranged side by side, with the ratio of their
relative sizes being 1 : 2 : 3. Such subdivision of space can be seen as
analogous to subdividing time by Tempo emons.
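The ratio arithmetic behind such a subdivision can be sketched as follows (hypothetical helper; widths(600, {1, 2, 3}) yields 100, 200, and 300 pixels):

// Illustrative sketch: compute the pixel widths of the clones produced by an
// X-axis subdivision ratio such as [1, 2, 3].
class SubdivideRatios {
    static int[] widths(int totalWidth, int[] ratios) {
        int sum = 0;
        for (int r : ratios) sum += r;
        int[] widths = new int[ratios.length];
        int used = 0;
        for (int i = 0; i < ratios.length; i++) {
            // Give the last clone whatever remains, so rounding never loses pixels.
            widths[i] = (i == ratios.length - 1) ? totalWidth - used
                                                 : totalWidth * ratios[i] / sum;
            used += widths[i];
        }
        return widths;
    }
}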
Emon :: Spatial Position (SP). The SP emon takes its input and performs a spatial
positioning transform on it. In other words, it takes a video or a picture, and
reshapes it according to the 3D space coordinates (X, Y, and Z). SP emon’s
properties are (1) the X, Y, and Z coordinates that define (translate) how the
input signal is transformed in 3D, and (2) the angles theta (θ), phi (φ), and
psi (ψ) that define the input signal’s rotation about the X, Y, and Z axes,
respectively. Multiple SP emons can be chained off each other, easily creating
complex transformations of the source signal.
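A sketch of such a chainable transform holder (hypothetical names; in the EE the stored values would ultimately be applied as OpenGL model-view transforms via JOGL):

// Illustrative sketch: a chainable spatial transform holding a translation
// (x, y, z) and rotations (theta, phi, psi) about the X, Y, and Z axes.
// Several of these can be linked off each other; at render time the transforms
// would be applied in chain order to the incoming video or picture.
class SpatialPosition {
    double x, y, z;                 // translation
    double theta, phi, psi;         // rotation about X, Y, Z (degrees)
    SpatialPosition next;           // optional downstream transform in the chain

    SpatialPosition(double x, double y, double z,
                    double theta, double phi, double psi) {
        this.x = x; this.y = y; this.z = z;
        this.theta = theta; this.phi = phi; this.psi = psi;
    }

    // Returns the downstream transform so chains can be built fluently.
    SpatialPosition chain(SpatialPosition downstream) {
        this.next = downstream;
        return downstream;
    }
}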
Emon :: Visual Mask (VM). VM emon allows the participants to define sub-areas within a particular
video that will be played (with the rest of the image ignored). The rationale
for the existence of the VM emon is that sometimes only a particular part of
the image is of interest, or, alternatively, placing only a particular part
of the image in a given context results in the construction of a new meaning.
VM emon’s properties are a set of coordinate points (presented as a
black-and-white mask) that define the boundaries of the region that will be
processed.
Emon :: Time Scale (TS). TS emon takes an incoming video stream and slows it down or speeds it
up by a requested factor. To exemplify the modularity of emon construction,
let us consider two possible scenarios (a code sketch follows the list):
1. The participant makes three clones of a VisualSample emon playing the same
video sample and defines each of them to display only one of the RGB channels.
Inserting the TS emon in the pipeline between just one of the clones and the
MasterVisualPlayer emon results in that channel being desynchronized from the
others by a fixed amount.
2. The participant takes
the configuration described in #1, and adds a VisualMask emon in between the
VisualSample and TS emons. Now, only a part of this particular channel will be
visible as it is scaled, thus allowing for time-scaling of subparts of a single
video stream.
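The wiring of scenario 2 can be sketched with minimal stand-in classes (hypothetical names; real emons would pass image data rather than frame identifiers):

import java.util.ArrayList;
import java.util.List;

// Illustrative wiring of scenario 2: a chain of processing stages, where inserting
// a TimeScale stage after a VisualMask stage time-shifts only the masked part of
// one colour channel before it reaches the master player.
public class PipelineExample {
    // Minimal stand-in for an emon in the visual pipeline: it forwards a frame id
    // downstream and logs each hop.
    static class Stage {
        final String name;
        final List<Stage> downstream = new ArrayList<>();
        Stage(String name) { this.name = name; }
        Stage connect(Stage next) { downstream.add(next); return next; }
        void process(String frame) {
            System.out.println(name + " <- " + frame);
            for (Stage s : downstream) s.process(frame);
        }
    }

    public static void main(String[] args) {
        Stage redChannel = new Stage("VisualSample(red)");
        Stage mask       = new Stage("VisualMask");
        Stage timeScale  = new Stage("TimeScale(x0.5)");
        Stage player     = new Stage("MasterVisualPlayer");

        // Scenario 2: sample -> mask -> time scale -> master player.
        redChannel.connect(mask).connect(timeScale).connect(player);
        redChannel.process("frame-001");
    }
}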
Emon :: Visual Input Grabber (VIG). VIG allows real-time acquisition of video for
subsequent manipulation, storage and/or broadcast. The source of the video can
be either a video camera attached to the computer or a web source (a streaming
file). VIG emon is useful for introducing a continuously changing media source
into the performance and parallels the AudioInputGrabber emon. The number of
VIG emons is limited only by the number of live video sources available.
Emon :: OutputStream (OS). Sometimes it is useful to aggregate everything that happens in a given network in terms of the audio, video, and text manipulation, and forward it, sans any structural information, somewhere else. For example, if one participant creates networks that consist solely of percussive rhythms, another participant may want to use these as a rhythmic basis for his own creations. This is where the OutputStream emon comes in. Any participant may connect his MasterPlayer emons into the OS emon, thus indicating that anything that happens to be played or displayed by the MasterPlayer emons is to be made available for broadcast. With the OS emon activated, anyone who uses Audio/VideoInputGrabber emon in their network may point it to the address of the broadcast computer and get the stream as the input.
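One plausible, and entirely hypothetical, transport for such a broadcast is a plain TCP socket carrying the aggregated media bytes; the EE’s actual streaming mechanism is not described here, so the sketch below only illustrates the idea:

import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch: expose the Master players' aggregated output on a TCP port;
// a remote Input Grabber pointed at this address would receive the raw media bytes,
// with no structural information attached.
class BroadcastServer {
    private final ServerSocket server;

    BroadcastServer(int port) throws IOException { this.server = new ServerSocket(port); }

    // Blocks until one listener connects, then forwards every media buffer handed to it.
    void serveOneListener(byte[][] mediaBuffers) throws IOException {
        try (Socket listener = server.accept();
             OutputStream out = listener.getOutputStream()) {
            for (byte[] buffer : mediaBuffers) out.write(buffer);
            out.flush();
        }
    }
}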
5. Conclusion
This paper discusses the Emonic Environment, a system built on the
principles of structural control and content abstraction. We argue that these
two principles are essential in a media system if we are to effectively change
how people think about creating and exploring media.
We describe our work extending the EE’s improvisational structure to
accommodate video manipulation. We present a solution that (1) allows such
manipulation in real time, (2) parallels the currently available audio and text
manipulation capabilities of our system, and (3) allows participants to move
between being intimately involved in the details of control and attending only
to the high-level aspects of what is going on.
The next steps in the development of the EE include making it more accessible for novice users as well as conducting experiments that utilize new video capabilities.
Acknowledgements
The authors thank Glorianna Davenport, Ariadna Quattoni, and Becca
Luger-Guillaume for their help.
References
[1] Nemirovsky, P., Luger-Guillaume, B. (2004) Improvisational Media Space :: Architecture and Strategies for Evolution. In Proceedings of EuroGP2004, Coimbra, Portugal.
[2] Nyman, M. (1999) Experimental Music: Cage and Beyond. Cambridge University Press, Cambridge.
[3] Cage, J. (1966) Silence: Lectures and Writings. MIT Press, Cambridge.
[4] FairLight Computer Video Instrument, audiovisualizers.com/toolshak/vidsynth/fair_cvi/fair_cvi.htm
[5] VJamm, http://vjammpro.com/vjammpro/
[6] Jitter, http://www.cycling74.com/products/jitter.html
[7] Quicktime, http://quicktime.apple.com
[8] JOGL, https://jogl.dev.java.net
Notes
[1] The EE incorporates an extensive array of tools for sampling the context; discussion of these capabilities is beyond the scope of this paper.
[2] Emon: a media-processing functional primitive; combined together, emons form interconnected structures for the generation, modification, and presentation of media.