The term "automatic music transcription" was first used by audio researchers James A. Moorer, Martin Piszczalski, and Bernard Galler in 1977. With their knowledge of digital audio engineering, these researchers believed that a computer could be programmed to analyze a
digital recording of music such that the pitches of melody lines and chord patterns could be detected, along with the rhythmic accents of percussion instruments. The task of automatic music transcription concerns two separate activities: making an analysis of a musical piece, and printing out a score from that analysis. This was not a simple goal, but one that would encourage academic research for at least another three decades. Because of the close scientific relationship of speech to music, much academic and commercial research that was directed toward the more financially resourced
speech recognition technology would be recycled into research about music recognition technology. While many musicians and educators insist that manually doing transcriptions is a valuable exercise for developing musicians, the motivation for automatic music transcription remains the same as the motivation for sheet music: musicians who do not have intuitive transcription skills will search for sheet music or a chord chart, so that they may quickly learn how to play a song. A collection of tools created by this ongoing research could be of great aid to musicians. Since much recorded music does not have available sheet music, an automatic transcription device could also offer transcriptions that are otherwise unavailable in sheet music. To date, no software application can yet completely fulfill James Moorer’s definition of automatic music transcription. However, the pursuit of automatic music transcription has spawned the creation of many software applications that can aid in manual transcription. Some can slow down music while maintaining original pitch and octave, some can track the pitch of melodies, some can track the chord changes, and others can track the beat of music. Automatic transcription most fundamentally involves identifying the pitch and duration of the performed notes. This entails tracking pitch and identifying note onsets. After capturing those physical measurements, this information is mapped into traditional music notation, i.e., the sheet music.
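As a minimal illustration of that final mapping step, the sketch below converts a detected frequency to the nearest note name in scientific pitch notation. It assumes the standard A4 = 440 Hz tuning reference; the function name and the sample note event are hypothetical:

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_note(freq_hz):
    """Map a detected frequency to the nearest equal-tempered note name.

    Assumes the standard tuning reference A4 = 440 Hz (MIDI note 69).
    """
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    octave = midi // 12 - 1          # MIDI convention: note 60 is C4
    return f"{NOTE_NAMES[midi % 12]}{octave}"

# A hypothetical detected note event: (onset time in s, duration in s, frequency in Hz)
event = (0.50, 0.25, 220.0)
print(frequency_to_note(event[2]))   # A3
```

A full transcriber would also quantize the onset and duration against the detected beat grid before emitting notation.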
Digital signal processing is the branch of engineering that provides software engineers with the tools and algorithms needed to analyze a digital recording in terms of pitch (note detection of melodic instruments) and the energy content of un-pitched sounds (detection of percussion instruments). A musical recording is sampled at a given sampling rate, and its amplitude data is stored in a digital wave format; such a format represents sound by digital sampling.
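A minimal sketch of what digital sampling produces, assuming a common 44.1 kHz sampling rate and a pure 220 Hz tone (both values are illustrative only):

```python
import math

SAMPLE_RATE = 44100          # samples per second; a common CD-quality rate
FREQ = 220.0                 # A3
DURATION = 0.1               # seconds of audio to generate

# Digital sampling: the continuous waveform is measured at regular
# intervals, yielding a sequence of amplitude values.
samples = [math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE)
           for n in range(int(SAMPLE_RATE * DURATION))]

print(len(samples))          # 4410 amplitude values for 0.1 s of audio
```

A real recording would be read from a wave file rather than synthesized, but the result is the same kind of amplitude sequence.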
===Pitch detection===
Pitch detection is often the detection of individual notes that might make up a melody in music, or the notes in a chord. When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials.

For instance, if the note A3 (220 Hz) is played, the individual frequencies of the composite's harmonic series will start at 220 Hz as the fundamental frequency; 440 Hz would be the second harmonic, 660 Hz the third harmonic, 880 Hz the fourth harmonic, and so on. These are integer multiples of the fundamental frequency (for example, two times 220 is 440, the second harmonic). While only about eight harmonics are needed to audibly recreate the note, the total number of harmonics in this mathematical series can be large, although the higher the harmonic's number, the weaker its magnitude and contribution. Contrary to intuition, a musical recording at its lowest physical level is not a collection of individual notes but a collection of individual harmonics. That is why very similar-sounding recordings can be created with differing collections of instruments and their assigned notes: as long as the total harmonics of the recording are recreated to some degree, it does not really matter which instruments or which notes are used.

A first step in the detection of notes is the transformation of the sound file's digital data from the
time domain into the frequency domain, which enables the measurement of various frequencies over time. The graphic image of an audio recording in the frequency domain is called a spectrogram or sonogram. A musical note, as a composite of various harmonics, appears in a spectrogram like a vertically placed comb, with the individual teeth of the comb representing the various harmonics and their differing frequency values. A Fourier transform is the mathematical procedure used to create the spectrogram from the sound file's digital data.

The task of many note detection algorithms is to search the spectrogram for the occurrence of such comb patterns (a composite of harmonics) caused by individual notes. Once a note's particular comb pattern of harmonics is detected, the note's pitch can be measured by the vertical position of the comb pattern upon the spectrogram.

There are basically two different types of music which create very different demands for a
pitch detection algorithm: monophonic music and polyphonic music. Monophonic music is a passage with only one instrument playing one note at a time, while polyphonic music can have multiple instruments and vocals playing at once. Pitch detection on a monophonic recording is a relatively simple task, and its technology enabled the invention of guitar tuners in the 1970s. Pitch detection on polyphonic music, however, is much more difficult, because the image of its spectrogram appears as a vague cloud due to a multitude of overlapping comb patterns caused by each note's multiple harmonics.

Another method of pitch detection was invented by Martin Piszczalski in conjunction with Bernard Galler in the 1970s and has since been widely followed. It targets monophonic music. Central to this method is how pitch is determined by the human ear. The process attempts to roughly mimic the biology of the human inner ear by finding only a few of the loudest harmonics at a given instant. That small set of found harmonics is in turn compared against the harmonic sets of all possible resultant pitches, to hypothesize the most probable pitch given that particular set of harmonics.

To date, the complete note detection of polyphonic recordings remains an unsolved problem for audio engineers, although they continue to make progress by inventing algorithms which can partially detect some of the notes of a polyphonic recording, such as a melody or bass line.
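The comb-pattern search described above can be sketched in a simplified, monophonic form. The code below synthesizes a 220 Hz note with three harmonics, then scores candidate fundamentals by summing spectral magnitude at their harmonic positions. The naive DFT, the candidate range, and the four-harmonic comb are simplifying assumptions for illustration, not any published algorithm:

```python
import math

SAMPLE_RATE = 4000           # Hz; kept low so the naive DFT below stays fast
N = 800                      # 0.2 s of audio -> 5 Hz frequency resolution

# Synthesize a note at 220 Hz (A3) with three harmonics of decaying
# strength, mimicking the "comb" of partials described above.
x = [sum(a * math.sin(2 * math.pi * 220.0 * k * n / SAMPLE_RATE)
         for k, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
     for n in range(N)]

def magnitude(signal, b):
    """|DFT| of `signal` at frequency bin b (naive; for illustration only)."""
    re = sum(s * math.cos(2 * math.pi * b * n / len(signal))
             for n, s in enumerate(signal))
    im = sum(-s * math.sin(2 * math.pi * b * n / len(signal))
             for n, s in enumerate(signal))
    return math.hypot(re, im)

def detect_pitch(signal, lo_hz=100, hi_hz=300):
    """Harmonic-comb matching: score each candidate fundamental by summing
    the spectral magnitude at its first four harmonics; pick the best."""
    res = SAMPLE_RATE / len(signal)              # Hz per frequency bin
    best = max(range(int(lo_hz / res), int(hi_hz / res) + 1),
               key=lambda b: sum(magnitude(signal, k * b) for k in range(1, 5)))
    return best * res

print(detect_pitch(x))       # 220.0
```

Note how the candidate at 110 Hz scores lower than 220 Hz even though two of its predicted harmonics are present: summing over the whole comb is what disambiguates the true fundamental.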
===Beat detection===
Beat tracking is the determination of a repeating time interval between perceived pulses in music. Beat can also be described as 'foot tapping' or 'hand clapping' in time with the music. The beat is often a predictable basic unit in time for the musical piece and may vary only slightly during the performance. Songs are frequently measured in beats per minute (BPM) to determine the tempo of the music, whether fast or slow. Since notes frequently begin on a beat, or on a simple subdivision of the beat's time interval, beat tracking software has the potential to better resolve note onsets that may have been detected only crudely. Beat tracking is often the first step in the detection of percussion instruments.

Despite the intuitive nature of 'foot tapping', of which most humans are capable, developing an algorithm to detect those beats is difficult. Most current software algorithms for beat detection use a group of competing hypotheses for the beats-per-minute, as the algorithm progressively finds and resolves local peaks in volume, roughly corresponding to the foot-taps of the music.
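The competing-hypothesis idea can be sketched as follows, using a hypothetical list of detected volume peaks; real beat trackers are considerably more elaborate:

```python
# Hypothetical onset times (seconds) of local volume peaks found in a
# recording: a pulse roughly every 0.5 s, with one peak slightly late.
peaks = [0.00, 0.50, 1.00, 1.52, 2.00, 2.50]

def score(bpm, onsets):
    """Score a tempo hypothesis: each onset is penalized by its distance
    to the nearest beat of an ideal grid at that tempo."""
    period = 60.0 / bpm
    return -sum(min(t % period, period - t % period) for t in onsets)

# Let candidate tempos from 60 to 180 BPM compete; the best-scoring
# hypothesis wins.
best_bpm = max(range(60, 181), key=lambda b: score(b, peaks))
print(best_bpm)    # 120
```

Restricting candidates to a plausible range (here 60 to 180 BPM) sidesteps the ambiguity that a grid at double or half the true tempo also fits the peaks fairly well.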
===How automatic music transcription works===
To transcribe music automatically, several problems must be solved:

# Notes must be recognized. This is typically done by changing from the time domain into the frequency domain, which can be accomplished through the Fourier transform. Computer algorithms for doing this are common; the fast Fourier transform algorithm computes the frequency content of a signal and is useful in processing musical excerpts.
# A beat and tempo need to be detected (beat detection). This is a difficult, many-faceted problem; one method is proposed in Costantini et al. 2009.

The most successful pitch detection methods operate in the frequency domain, not the time domain. While time-domain methods have been proposed, they can break down for real-world musical instruments played in typically reverberant rooms.

The pitch-detection method invented by Piszczalski again mimics human hearing: it follows how only certain sets of partials "fuse" together in human listening, namely the sets that create the perception of a single pitch. Fusion occurs only when two partials are within 1.5% of being a perfect harmonic pair (i.e., their frequencies approximate a low-integer ratio such as 1:2 or 5:8). This near-harmonic match is required of all the partials in order for a human to hear them as a single pitch.

==See also==