Image driven sound generation
Akemi Ishijima
Composer, London, UK
email: akemi@city.ac.uk

Yoshiyuki Abe
Artist, Tokyo, Japan
email: y.abe@ieee.org
Abstract
With the object of creating abstract computer graphics animation with electroacoustic sound, an automatic image-driven sound generator was developed. An image sampling method which, in principle, samples one datum from each animation frame was tested and proved effective for generating pitch, dynamic structure and stereo image. The wavelet transform, which provides multi-resolution information about the signal built from the image-frame data, was useful for deriving other modulation signals, including temporal cues. The generative method of creating music for animation also involves human judgement. At the system development stage, special attention was paid to pitch and loudness, since our perception of these factors varies with register and is not always proportional to the measured values of frequency and amplitude. Knowledge of orchestration proved useful for creating a tangible and effective melodic and harmonic structure. A single script file carries out the whole process, from making an animation clip and generating a CD-quality sound track to combining image and sound into a movie file for the final product.
1. Introduction
Sound plays an important role in films and video works by providing a spatial environment for the audience. It also enhances the visual message and adds reality to animation films. Many film masterpieces are remembered for their theme music. In most cases, sound creation and image making are developed separately, and ordinarily a composer starts to work after receiving the screenplay or seeing dailies, so that the composer understands the atmosphere and theme of each scene.
Animation clips often have a human voice, concrete sound, or some electronic music on the sound track. With our earlier animation clip of flying geometric objects, the composer referred to the design charts, which specify the timings of object spins and changing view positions [1]. "The timing chart made the compositional work easy, because it could be used as an on/off chart for musical events. All I had to do was determine the suitable sound to be triggered at each point (Ishijima)." How do you start sound-making for algorithmically generated abstract video clips? Unlike films, there is no script, no studio scene, and no actors on screen. Images you have never seen are there. The impression they leave is the only material for your creative work.
This paper presents the second phase of our research on computer generated animation with abstract images. This time, we tried to generate the whole sound stream from data sampled from each image frame, without any involvement of a video/sound editing console. An image clip is generated first; then the visual data is fed into the sound generation; finally, the image and the sound are combined into a movie file. We have tested several methods of sampling data from a computer generated animation consisting of uncompressed 640x480 images. [Fig.1]
In the attempt to make sound from data sampled from images, the first problem is the discrepancy between the data files: while sound has to be a continuous data stream, video consists of a sequence of discrete image frames.
Image [2]                                Sound
Image format      TARGA                  Sampling freq.   44100 Hz
Image size        640x480                Sampling res.    16 bits
Colour depth      24 bits                Channels         2
Frames/sec        30                     Data/min/ch      2,646,000 samples
Frames/min        1800                   Data/min         10,584,000 bytes
Data size/frame   921,600 bytes
Data size/sec     27,648,000 bytes
Data size/min     1,658,880,000 bytes

Fig.1 Specifications of image and sound files.
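The figures in Fig.1 follow directly from the image and audio formats. As a quick check, the arithmetic can be written out as below (Python, for illustration only):

# Data rates implied by Fig.1 (straightforward arithmetic, shown for clarity).
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 480, 3       # 24-bit TARGA frames
FPS = 30
SR, SAMPLE_BYTES, CHANNELS = 44100, 2, 2           # 44.1 kHz, 16-bit, stereo

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL         # 921,600
image_bytes_per_min = bytes_per_frame * FPS * 60           # 1,658,880,000
audio_samples_per_min_per_ch = SR * 60                     # 2,646,000
audio_bytes_per_min = SR * 60 * SAMPLE_BYTES * CHANNELS    # 10,584,000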
2. Sound
2.1 Music Parameters
In any music, pitch is probably the most significant element for creating a tangible musical impression. Although other elements such as rhythm and timbre also play important roles, which we discuss later, pitch creates melody and harmony and is considered the most essential element of any organised sound. In this sense, the most basic information required for musical sound is pitch, amplitude and note length. Colour data, which represent the visual impact of each frame, are a suitable source for generating both pitch and amplitude information. The note length, i.e. the duration of each pitch, is derived from the various levels of the wavelet transform.
To express a pitch, one sine wave is enough, but for a more interesting timbre a single sine tone is too simple. To achieve a rich sound, layers of sine waves with different pitches and durations are superimposed. Tracking events in the left, centre and right areas of the image in the respective audio channels produces a coherent spatial emphasis in the visual and audio material. In order to capture the spatial properties of the image material, various combinations of sampling positions were tested. [Fig.2]
[Fig.2: seven sampling-position layouts, type 0 to type 6; '+' marks left-channel, 'o' left+right, '*' right-channel sampling points]

Fig.2 Sampling positions
For sound to be recognised as music, there must be a tangible temporal progression such as melody and rhythm. Harmony is a desirable element which enriches music both momentarily and progressively. A regular beat is not necessary, but it is useful for creating a sense of speed and progression. Dramatic visual developments such as a change of scene or an irregular movement ought to be synchronised with, or stand in some causal relation to, accentuations in the music.
2.2 Parameters for Digital Sound
The above discussion is based on the psychological variables of musical sound. For the sake of clarity, conventional terms such as note or melody were used, but they are not of central concern for the composer: the goal is to generate a spectromorphological sound track rather than a note-based composition. For digital, or any electronic, sound, information describing the physical variables of the sound phenomenon is required. Fig.3 shows the correspondence between musical and physical properties. Digital sound data consists of a series of amplitude values of the sound measured at a sampling frequency of 22.05 kHz, 44.1 kHz or 48 kHz. The ordinary dynamic range is 16 bits per channel.
Music property    Physical property
Pitch             Frequency
Loudness          Amplitude
Note length       Duration
Timbre            Waveform (sum of multiple sine waves)

Fig.3 Correspondence between musical and physical properties.
3. Sampling Image Data
For making 44,100 Hz CD-quality sound, 1470 data samples are required for the duration of one image frame, i.e. about 33 milliseconds. While there are several ways to collect this number of data from one image, the safe way is to apply a one-sample-from-one-image rule, because the 30 Hz overtones produced by a 33 ms repetitive data sequence have to be avoided. This rule is also effective for keeping the consistency of trans-frame events. This time, our solution for the main pitch stream is a simple linear interpretation of the colour data as a pitch value, which is held for the duration of one frame. One of the major advantages of this method is that it is free from 30 Hz pattern repetition. The length of a frame provides enough time for our ears to recognise the frequency as a pitch; at the same time, it is short enough to create the impression of a continuous pitch slide. In order to keep the audio transition smooth, the program checks the phase angle at the end of each 33 ms segment and makes the next waveform start in phase by adding a phase offset. This is a must for designing a smooth tone shift when synthesizing a waveform. [Fig.4]
Fig.4 Phase matching. The same signal without (upper) and with (lower) phase matching.
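A minimal sketch of this phase-matching rule is given below in Python. It illustrates the principle only; the function and variable names are our assumptions, not the actual program.

import numpy as np

SR = 44100                 # audio sampling rate (Hz)
FPS = 30                   # video frame rate
SEG = SR // FPS            # 1470 samples, about 33 ms per frame

def render_pitch_stream(frame_freqs, amp=0.5):
    """frame_freqs: one frequency (Hz) sampled from each image frame."""
    out = np.zeros(len(frame_freqs) * SEG)
    phase = 0.0
    t = np.arange(SEG) / SR
    for i, f in enumerate(frame_freqs):
        out[i * SEG:(i + 1) * SEG] = amp * np.sin(2 * np.pi * f * t + phase)
        # carry the phase angle over so that the next segment starts in phase
        phase = (phase + 2 * np.pi * f * SEG / SR) % (2 * np.pi)
    return out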
4. Data Transformation
What can be obtained from the visual image, in any case, is a set of raw data which is then transformed into a sound file. To construct the image-to-sound transformation, it is important to find a suitable source for each audio parameter. We used a wavelet analysis of the whole-length waveform as the material for some of the sub-tracks of this work [3].
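The decomposition used for these sub-tracks can be illustrated with PyWavelets; the fragment below is only a stand-in for the ufwt2 tool of Fig.8, using the same Daubechies wavelet (N=2, 'db2') and six levels as in Fig.5.

import pywt

def wavelet_bands(signal, levels=6, wavelet='db2'):
    # returns the low-pass residue (L6) and the detail bands H1..H6 of Fig.5
    coeffs = pywt.wavedec(signal, wavelet, level=levels)   # [L6, H6, H5, ..., H1]
    lowpass, details = coeffs[0], coeffs[1:]
    return lowpass, list(reversed(details))                # L6, then H1 .. H6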
4.1 Pitch and Harmony
Once a list of frequencies is obtained, each frequency can be harmonised by adding partials expressed as integer ratios such as 1/4, 1/2, 3/4, 2, 3, 4, 5, etc. relative to the original frequency. The problem here is how to define the harmonisation algorithm so that the result is audible and musical. The first thing to pay attention to is that the audible frequency range for human ears is roughly between 20 Hz and 20 kHz, so the resulting frequencies should fall within this range. Secondly, our perception of loudness and pitch separation is not linear. The traditional idea of instrumentation is helpful here: the contrabass part is often doubled by cellos playing an octave above (2nd partial), and the violas might play a fifth (3rd partial) or an octave (4th partial) above the cellos. In other words, sound harmonised according to the natural harmonic series creates an impression of harmonic stability. Building up a harmony on a lower frequency also helps to enhance the perceived level (loudness) of low-pitched sound, compensating for the fact that our perceived loudness is lower than the actual signal intensity in the low-frequency region. Regarding stability, another important principle is that the longer the note, the more stable the music.
In order to extract a principal pitch movement, colour data are taken from the image area of "type 4" [Fig.2]. What is translated into a higher frequency carries more meaning than what becomes a lower frequency, since the frequency values reflect the impression level of the image; 0 Hz means no colour. We therefore employed a method of harmonising 'downwards' rather than upwards. The frequency obtained is interpreted as the 8th partial of a natural harmonic series and harmonised with the 7th, 5th and 3rd partials. The amplitude of each partial is controlled by the density of red, green and blue respectively. A high-pass filter is applied to cut off inaudible signals below 20 Hz. In this way, an upper structure is created from the main pitch movement. Octave intervals are omitted here, because they are used for constructing a lower structure. The partials used for the harmonisation are summarised below.
Upper structure                            3       5       7       8
Lower structure    1/4     3/4     1
Example            C1      G2      C3      G4      (E5)    (Bb5)   C6
Frequency (Hz)     32.7    98.1    130.8   392.4   654     915.7   1046.5
What is interpreted as the fundamental frequency, that is three octaves lower than the initially obtained frequency, is harmonised with the 3/4 and 1/4 partials. This provides a harmonically stable base for the upper structure. A 20 Hz - 200 Hz band-pass filter is applied so that the final frequency range of the lower structure becomes equivalent to the range of the contrabass. Different sampling frame rates were applied to create rhythmic variety. Duration information for longer sounds is obtained from the 4th to 6th levels of the wavelet transform. [Fig.5]
When the upper and lower structures are mixed, the resulting sound shows good separation in terms of pitch and frequency range. When an audio event happens in a higher register, the mid to low range is suppressed except for one or two partials supporting the overall harmonic structure. In the middle frequency range, where our ears are most sensitive, most partial elements are present to provide a rich harmonic structure. Pitch separation between the harmonic elements is good throughout.
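The upper/lower structure construction described above can be summarised in a short sketch. The code below is illustrative only; the function name, the fixed weight of the 8th partial itself and the filtering written as simple thresholds are our assumptions.

def harmonise(freq_8th, r, g, b):
    # the sampled frequency is read as the 8th partial of a harmonic series
    f0 = freq_8th / 8.0                     # fundamental, three octaves lower
    upper = [(8 * f0, 1.0),                 # the sampled pitch itself (assumed weight)
             (7 * f0, r),                   # 7th partial, amplitude from red density
             (5 * f0, g),                   # 5th partial, amplitude from green density
             (3 * f0, b)]                   # 3rd partial, amplitude from blue density
    lower = [(1.00 * f0, 1.0),              # fundamental
             (0.75 * f0, 1.0),              # 3/4 partial
             (0.25 * f0, 1.0)]              # 1/4 partial
    # high-pass at 20 Hz for the upper structure,
    # 20-200 Hz band-pass (contrabass range) for the lower structure
    upper = [(f, a) for f, a in upper if f >= 20.0]
    lower = [(f, a) for f, a in lower if 20.0 <= f <= 200.0]
    return upper, lower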
[Fig.5: stacked waveform plots of the tracks listed in the caption below]
Fig.5 Waveforms. From top to bottom: S4: sampled image data (type 4); H1-H6: wavelet levels 1-6 (HPF); L6: wavelet level 6 (LPF); L6s: sliced L6; S5: sampled image data (type 5); S5xL6s; H4s: sliced H4; H6hold: held H6; S1+S4; S1: sampled image data (type 1); S4 (same as the top waveform). The waveforms of H1-H6 and L6 are stretched to the original signal's length. H1-H6 and L6 are S4 transformed by Daubechies' wavelet, N=2. S4 = H1+H2+H3+H4+H5+H6+L6.
4.2 Stereo sound
All sampling types produced good results, reflecting significant visual events such as object movement and colour change across the screen. As the number of sampling locations increases, the sampled data include more visual events, but the averaging results in a relatively lower resolution of each event. In order to capture minimal movement across the screen, types 2 and 4 are suitable, since most of the visual information is concentrated in the central area of the screen.
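As a sketch of how left/right channel data can be taken from the frames, the fragment below samples one pixel on each side of the screen; the pixel coordinates are placeholders, not the actual type definitions of Fig.2.

import numpy as np

def stereo_frame_samples(frames, y=240, x_left=160, x_right=480):
    """frames: iterable of 480x640x3 RGB arrays, one per animation frame."""
    left, right = [], []
    for img in frames:
        # mean of R, G and B as a rough 'impression level' of the sampled pixel
        left.append(img[y, x_left].mean() / 255.0)
        right.append(img[y, x_right].mean() / 255.0)
    return np.array(left), np.array(right)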
[Fig.6: left (L) and right (R) channel waveforms for sampling types 6, 4 and 1]
Fig.6 Animation with random abstract images. Channel data are type 6 (top), type 4 (middle) and type 1 (bottom). Type 1 gives good channel separation and is used for the sub-channel data. Because it misses events in the central part of the screen, the most important area, it is not used for the main signal.
[Fig.7: left (L) and right (R) channel waveforms]
Fig.7 Animation using geometric surfaces. The channel data (type 4) show clear channel separation and a variety of image data levels.
4.3 Synchronisation and Alienation
Transforming data into sound is not sufficient for the sound design required by animation. For a quality animation, the sound must not always follow the images. That means we need to detect visual events and then decide whether the sound should cooperate with them or alienate itself from them. It is very hard to give a rule to such a highly creative process. As an attempt, we used other image data and an AND/OR/XOR process for the decision making; a sketch of this step follows below. In the test version, we set a rule for switching the sound track between synchronisation and alienation modes. The cue sheet is made from a chart of the derivatives of the wavelet-transform data.
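A hypothetical illustration of that decision step is given below; the actual rule used in the test version is not reproduced here. Two boolean event streams, one from the main image data and one from another sampling area, are combined frame by frame.

def sync_cues(events_a, events_b, mode="xor"):
    # True marks a frame where the sound synchronises with the visual event,
    # False marks an alienation frame
    ops = {"and": lambda a, b: a and b,
           "or":  lambda a, b: a or b,
           "xor": lambda a, b: a != b}
    op = ops[mode]
    return [op(a, b) for a, b in zip(events_a, events_b)]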
5. Process Flow
This research developed all the necessary tools for the processes discussed above. A single script file carries out the whole process, from generating the image list to integrating the image and sound data into a movie file. [Fig.8][Fig.9]
#!/bin/sh
umkga2003b gaa 22 640 480 100 2000     # generate gaa00 .. gaa21, 640x480 image size
apol_light gc4 22 99 gaa00...gaa18     # gaa (22 files) --> gac (2101 files)
utga-ga2003v                           # raytracing
uanim2dat ggg 4 0 2048 ggg             # .tga --> ggg-4.dat (type 4)
ufwt2 gc4.fdat fwt1                    # ggg-4.dat --> fwt1.lpf|fwt1.hpf (level 1 wavelet)
ufwt2 fwt1.lpf fwt2                    # fwt1.lpf --> fwt2.lpf|fwt2.hpf (level 2 wavelet)
ufwt2 fwt2.lpf fwt3                    # fwt2.lpf --> fwt3.lpf|fwt3.hpf (level 3 wavelet)
ufwt2 fwt3.lpf fwt4                    # fwt3.lpf --> fwt4.lpf|fwt4.hpf (level 4 wavelet)
ufwt2 fwt4.lpf fwt5                    # fwt4.lpf --> fwt5.lpf|fwt5.hpf (level 5 wavelet)
ufwt2 fwt5.lpf fwt6                    # fwt5.lpf --> fwt6.lpf|fwt6.hpf (level 6 wavelet)
udat2env fwt4.hpf fwt4.env             # fwt4.hpf --> fwt4.env (envelope file)
udat-mod ggg-4.dat gwt4.env ggw-4.mdat # .dat + .env --> .mdat (modulation)
udat2aiff ggw-4.mdat 44100 16 2 ggg4   # ggw-4.mdat --> ggg4.aiff
umkavi gg4 0 2048 gac gc4              # .tga + .aiff --> gg4.avi
echo "finished"
Fig.8 Script file
Fig.9 uanim2dat generates a data file of animation images.
6. Conclusion
It is possible to create a sound track from data obtained from a succession of abstract images, and we have produced a full animation without interactive editing. To create the music, data representing the impression of the images were collected. Sampling colour data from different areas of the image also proved effective for creating stereo sound.
For a composer, this method of generating sound directly from images is an attractive alternative to MIDI and sampling, since it gives the composer a wider and more flexible range of frequencies, free from the restriction of the ordinary 12-note chromatic scale. Composers can also be freed from the lengthy manual endeavour of reshaping sample files with waveform editors.
The challenge for the present system is that it is still difficult to create realistic sound whose waveform has an overtone-rich transient at the attack. To further the variety and quality of the sound, we need to investigate methods of creating different timbres. Establishing a method to extract figurative and textural impressions would enrich the timbre quality and the correspondence between sound and image.
We recognised that, at the current stage, only very experienced composers and artists can control this kind of system; otherwise it can easily become a junk footage generator. This project gave us an opportunity to rethink why we make art. We need to understand that an image can provide material for sound design, but it also limits the freedom of creation: a double-edged sword.
7. Notes and References
[1] Ishijima, A. and Abe, Y., "Algorithmic process for time based arts," GA2002, Milan, 2002.
[2] The NTSC video system has 59.94 interlaced fields per second and a frame rate of 29.97 fps; PAL and SECAM use 50 fields and 25 frames per second.
[3] The wavelet transform provides a time-frequency representation. The computation cost (A×N) of the Fast Wavelet Transform is in theory less than that of the FFT (N log N), where the constant A depends on the chosen filters and N is the number of samples. The analysis ([X]) is a set of low-pass and high-pass filtering processes; a signal can be analysed on a time-frequency basis (more exactly a time-scale basis, because the relative frequency resolution is constant) by cascading the filtering process on the low-pass results, as shown below.
Signal-->[X]--(lf1)-->[X]--(lf2)-->[X]--(lf3)-->[X]-->lf4
          |            |            |            |
         hf1          hf2          hf3          hf4

Signal = hf1 + hf2 + hf3 + hf4 + lf4
[4] Ingrid Daubechies' web site has references on wavelets: http://www.princeton.edu/~icd/publications/