US3786188A - Synthesis of pure speech from a reverberant signal - Google Patents
- Publication number
- US3786188A (application US00311731A)
- Authority
- US
- United States
- Prior art keywords
- speech
- speaker
- transfer function
- signal
- signals
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3013—Analogue, i.e. using analogue computers or circuits
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/50—Miscellaneous
- G10K2210/505—Echo cancellation, e.g. multipath-, ghost- or reverberation-cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Definitions
- ABSTRACT Speech that has been reverberated by the transfer function of a reverberant enclosure is analyzed to detect parameters from which an unreverberative synthetic version of the original speech may be constructed. The process involves continuously approximating the vocal tract transfer function of the speaker. The effect of this transfer function is then removed from the reverberant speech by inverse filtering, the residual signal being the glottis excitation signal reverberated by the room.
- The reverberant excitation function is then analyzed to determine when the speaker's driving function is voiced or unvoiced, the periodicity when voiced, and a unique gain factor. Then clean speech is synthesized using the foregoing three parameters operating on an all-pole filter that is continuously adapted to approximate the vocal tract transfer function.
- This invention relates to the removal of distortion from a speech signal.
- This invention relates to the synthesizing of a distortion-free speech signal from a signal originating in a reverberative enclosure.
- The first effect is the coloration or spectral distortion due to the summation occurring at the microphone of the directly received signal and its many delayed dispersive reflections from the numerous walls and surfaces in the room 10.
- The second effect, echo, is the temporal or time distortion arising from the slow decay of energy typically encountered in any moderately lossless room or cavity. These time distortions are closely related to the reverberation time of the room. For the subject who is not physically present in the chamber but listens through a connection to the microphone placed in the chamber, the effects of coloration and echo on intelligibility of the received signal are often severe. This condition is, unfortunately, frequently characteristic of hands-free telephonic transmissions.
- a further object of the invention is to realize a way of reconstructing an original speech signal by analysis of the reverberated speech. This object seeks to overcome prior art schemes wherein the parameters which control the synthesis of the undistorted signal are derived under unrealistic conditions, or are contingent on a stationary room transfer function.
- Another inventive object is to devise a speech processing system of the type alluded to in the foregoing object, that has the property of sensing or detecting the parameters that characterize the original speech, so that by other aspects of the inventive process, undistorted speech will be synthesized from a knowledge of these parameters.
- a still further object of the invention is to enlist and adapt the speech reconstruction method known generally as linear predictive filtering to novel use under reverberant conditions.
- Synthesis or the production of an original or preexisting speech signal from a set of more basic parameters, depends in general upon activating some device whose basic transfer properties are akin to those of the human vocal tract, by some excitation signal which is akin to the excitation which drives the human vocal tract.
- Atal and others have recognized that a short-time spectral analysis of the original speech signal does not readily yield control signal information for this excitation signal or driving function.
- Atal has realized more reliable control signals by modeling the human vocal tract as an acoustic tube of variable dimensions.
- the vowel and vowel-like sounds of the output at any instant of time are a weighted sum of a discrete number of recent past values of the output plus the value of the input or driving function at that instant of time.
- speech waveforms by exciting the all-pole filter with the proper combinations of quasi-periodic pulses and white noise, referred to herein as the excitation function e_n.
- the parameters of this filter are the weighting coefficients alluded to above, termed a_k, where a_k is the gain applied to the speech sample delayed by k samples.
- Atal involves bandwidth reduction.
- the parameters are derived in the Atal approach from an undistorted original or preexisting speech signal which is to be reproduced at some remote location.
- Inherent in the reverberation reduction situation is the availability of only reverberant speech as a source from which to derive parameters. It is not apparent that pure speech can be synthesized using only reverberative speech as a parameter source.
- the present invention in its broadest sense lies in the recognition that the time-varying vocal tract transfer function of a subject speaking within an enclosure, can indeed be sufficiently determined even after the speech has undergone severe reverberative distortion. This is the case, whether or not the room transfer function is also varying or is altogether unknown.
- the speech signal w(t), which pursuant to the present invention is to be dereverberated, results from an as yet unknown excitation signal e(t) driving a vocal tract as described above with transfer function T(ω) (where ω = 2π × frequency).
- the speech so produced, s(t), is then reverberated by the room's transfer function H(ω) to produce a reverberative speech signal w(t).
- the problem is to extract, from the reverberated speech w(t), information which can be used to reconstruct or synthesize the original speech signal s(t).
- any practical or typical room transfer function H(ω) has certain properties that make it possible to accurately determine the speaker's vocal tract transfer function T(ω) from the reverberative speech signal w(t).
- the principal property that makes the foregoing possible is that the mode structure, i.e., mode density, is almost always sufficiently great that the modes are closer than their bandwidths over the frequency range of useful speech information.
- the reverberation times, i.e., the 60 dB energy decay times, of the vast majority of office or room size reverberant enclosures are less than those which would damage the articulation because of echoes. In contrast, articulation damage could be expected to occur in the case of a large auditorium with hard walls.
- the vocal tract transfer function T(ω) of the speaker is continuously approximated. Then, the effect of the vocal tract transfer function is removed from the reverberant speech by inverse filtering, leaving only the spectrally flattened glottis excitation signal e(n) reverberated by the room transfer function H(ω). Pursuant to the invention, analysis is then performed at this point on the reverberant excitation function to determine when the driving function e(t) of the speaker is quasiperiodic (which is the voiced condition) or white noise (which is the unvoiced condition). The gain of the driving function e(t) and the period of the quasiperiodic source during voicing are also derived.
- this process is continuously performed digitally by a sampling of the reverberated speech at, nominally, a 10 kHz rate.
- the sampling and processing can occur at any point, such as at the transmitting station, the receiving station, or at some central point such as a central office if the system is telephonic.
- one speech processor pursuant to the present invention can be constructed to process a multiplicity of reverberative speech signals that are routed through the office.
- FIG. 1 is a schematic block diagram of the entire inventive process in combination with a communications transmission network.
- FIG. 2 is a schematic circuit diagram of a correlation computer.
- FIG. 3 is a schematic circuit diagram of an inverse filter.
- FIG. 4 is a schematic circuit diagram of an excitation analysis/synthesis unit.
- FIG. 5 is a schematic circuit diagram of a second correlation computer.
- FIG. 6 is a schematic circuit diagram of a synthetic speech generator.
- FIG. 7 is a schematic circuit diagram of a peak selector portion of said excitation analysis/synthesis unit.
- FIG. 8 is a graph depicting resonant frequencies.
- FIG. 9 is a graph depicting an aspect of a typical room transfer function.
- Excitation Signal or Driving Function e(t): In order to cause an output at the mouth from the human vocal tract, the vocal cords of the glottis are excited to produce pulses recurring at a quasiperiodic rate. The sounds so produced are voiced. Other sounds are unvoiced, such as sss, fff, p, and k. The latter are formed by turbulent air at the mouth, throat, and lips without vocal cord excitation. The voiced and unvoiced sounds in total are the signal source from which human speech originates, and are called the excitation signal e(t). In order to generate an output from a model of the vocal tract, such as a filter with transfer function T(ω), an excitation signal must be applied. Speech so produced is of course synthetic.
- the excitation signal which in sampled form is herein denoted e(n), may consist of a pulse generator with a variable pulse period and a white noise source, selectively applied to a variable gain amplifier.
- the pulse generator supplies the
- Vocal Tract Transfer Function T(ω): It has been demonstrated by Atal and Hanauer that the human vocal tract may be accurately modeled as an all-pole filter T(ω) which closely approximates the transfer properties of the vocal tract.
- Such a filter has a transfer function in the frequency domain given by:
- Equation (2) is the reciprocal of a polynomial whose zeros, formed for a given set of coefficient values (a_1, a_2, ..., a_p), determine the frequencies where T(ω) has its maximum values. The latter frequencies are the resonant frequencies or poles of the filter, shown in FIG. 8 as ω_1, etc.
- Equation (3) is an application of the method of linear digital filtering, and it states that the present output value s_n can be estimated from a weighted sum of past (or delayed) output values s_{n-k}, plus the new input value e_n of the driving function.
- Room Transfer Function H(ω): It is well known that an enclosure such as a room is a linear system. This means that the effect a room has on a signal such as speech is to cause numerous filtered delays of the signal in its travel to a stationary microphone, via many diverse path lengths. All the delayed signals are additively combined by a microphone placed in the enclosure. For example, two unit amplitude sinusoidal signals cos ω_1 t and cos ω_2 t launched in an enclosure will be recovered by a microphone, or perceived by a listener, with altered amplitudes b_1, b_2, and each will have been delayed by an amount expressible as respective phase angles φ_1 and φ_2. Thus:
- a room transfer function is an expression that in one respect describes how signals of various given frequencies will be relatively affected in amplitude and phase by being propagated in the room.
- FIG. 9 depicts a typical room transfer function.
- FIG. 9 illustrates the fact that propagated frequencies differing by as little as 2 Hz may differ substantially in power, by 40 dB, at a stationary point remote from the source. It can be seen that a room is a filter, and its transfer function is that of a filter.
- the output signal w(t) may be predicted as: w(t) = s(t) * h(t) (6), where the symbol * denotes convolution.
- the enclosure input signal s(t) by its Fourier transform S(ω) and the frequency response H(ω) of the enclosure, the output frequency response W(ω) is:
- the speaker at location 11 is characteristic of most male and female adults speaking English, his articulation times are commonly of the order of less than 1 second.
- the 60 dB energy decay time is less than one second; and room dimensions are at least times greater than the dimensions of a human vocal tract.
- FIG. 1 depicts the continuous operation of the inventive process, in a sequence of stages.
- the values of the a_k terms of Equation (1) are calculated from w_n by the correlation computer 31 and the coefficient computer 30.
- the a_k terms, which number typically 14, constitute an estimate of the time-varying filter which approximates, in its transfer function T(ω), the speaker's vocal tract.
- the reverberant speech is forwarded via a transmission network 34 to the intended receiving point such as telephone 39.
- the network 34 is shown as separate from the speech processor; but obviously the process could be located within network 34, such as in a central office.
- telephone 39 includes a direct connection to network 34 and an indirect connection thereto via the processor, thus indicating that the processor could be an add-on feature located at the telephone.
- the reverberant speech w(t) is first low-pass filtered in filter 37.
- the latter is a 5 kHz filter, designed on the premise that human speech information is sufficiently specified within the frequencies below 5 kHz.
- the low-pass filtered speech signal is then sampled in sampler 38 at a 10 kHz rate, in keeping with the Nyquist sampling theorem.
- the output of sampler 38 is a stepwise succession of voltages whose amplitudes are indicative of the low-pass filtered speech signal strength at times corresponding to the sampling times. This output is w_n, n being an integer analogous to time.
- the sampler 38 output w_n is fed to correlation computer 31.
- computer 31 forms the following combinations of the input:
- FIG. 2 depicts the structure of correlation computer 31.
- the sampled signal w_n is introduced through a shift register consisting of p stages.
- the samples are delayed one sample value per stage, z^-1 being notation signifying a delay of one sample.
- signals w_n, w_{n-1}, w_{n-2}, ..., w_{n-14} are present at a given time as outputs of successive delay stages of register 31a.
- the outputs of amplifiers 31b are fed respectively to low-pass filters 31c which are for example Hz-filters and which average over the respective filter inputs with a weighting defined by their impulse response h.
- the outputs of filters 31c, designated R_1, R_2, ..., R_14, are fed to coefficient computer 30.
- Coefficient computer 30 sets up and solves a set of linear simultaneous equations for the values (a_1, a_2, ..., a_p).
- FIG. 3 depicts the structure of inverse filter 32 as consisting of a delay line or shift register 32a of p stages, where p equals the number of stages, for example 14, of register 31a.
- the input to shift register 32a is the sampled signal w_n.
- as in shift register 31a, the samples are delayed one sample value per stage.
- the delayed samples are picked off successive stages of register 32a and respectively led to multipliers 32b.
- the inputs to respective multipliers 32b are the values (a_1, a_2, ..., a_p) calculated in computer 30.
- the outputs of all multipliers 32b are combined in adder 35a; and from this sum, the value of sampled speech w_n is subtracted in subtractor 35b.
- the subtractor 35b output, e_n, is a driving function in sampled form of the original unreverberant speech signal s_n, reverberated by the effects of enclosure 11.
- the next step in the inventive process involves dereverberating the excitation function e_n.
- the unit which performs the preceding is excitation analysis/synthesis unit 27, shown in FIG. 4. Its purpose is to synthesize, from a revamping of the driving function e_n, a clean driving function ê_n. Driving function e_n is first autocorrelated in correlation computer 20 to determine any dominant periodicities or lack thereof.
- This process involves the apparatus of FIG. 5 which computes the result:
- y is the impulse response of the low-pass filters 20c, and τ runs over the range of possible pitch periods, i.e., 3-13 ms.
- the sampled driving function e_n is introduced through a shift register 20a consisting of l stages, where l corresponds to delays of up to 13 ms in keeping with the largest pitch periods which may be encountered.
- the samples are delayed one sample value per stage.
- signals e_n, e_{n-1}, e_{n-2}, ..., e_{n-l} are present at a given time at the output of the successive stages of shift register 20a.
- These signals are each multiplied separately in respective multipliers 20b by the quantity e_n.
- the outputs of the respective multipliers 20b are low-pass filtered in respective filters 20c which are 20 Hz, for example, selected because of the inherent slowly varying nature of the correlation.
- the outputs of the respective filters 20c for each sample are a set of numbers R(τ_1), R(τ_2), ..., R(τ_l) which are a measure of the degree of correlation for delays τ_i.
- the delay τ_i corresponding to the maximum of the just-performed autocorrelation is now ascertained in peak picking selector 16 seen in FIG. 7.
- the maximum R(τ_i) value among R(τ_1), R(τ_2), ..., R(τ_l) is selected, and the delay τ_i associated with that largest value of R, denoted τ_max, is used as the pitch period parameter required by pulse generator 13.
- selector 16 includes a threshold detector 36, which inspects the values of each signal R(τ_i) to determine whether the driving function at that time is voiced or unvoiced.
- the output of threshold detector 36 is a binary level signal which is fed to voicing switch 15. Also, the output τ_max of selector 16, which represents the pitch period, is fed to pulse generator 13.
- the latter can, for example, be an astable oscillator of variable period, well known in the art.
- Pulse generator 13 waits τ_max samples with zero output, then produces a unit amplitude output.
- the output of generator 13 is connected to voicing switch 15.
- White noise generator 14 is a conventional noise generator creating power of all frequencies at equal levels. Its output is also connected to voicing switch 15. When threshold detector 36 determines that a voiced excitation has occurred, it supplies an order to voicing switch 15 to effect a connection to pulse generator 13. Otherwise, when detector 36 identifies presence of an unvoiced excitation, it causes a connection of voicing switch 15 to white noise source 14.
- the output side of voicing switch 15, denoted δ_n, is a fixed amplitude source signal which is either a sequence of pitch pulses at the given pitch period, or a burst of white noise.
- the third parameter derived in unit 27 is a gain factor, denoted G in FIG. 4, which amplitude modulates or multiplies the fixed amplitude source signal δ_n by an amount that makes the result ê_n identical in mean-squared (MS) level to the reverberant driving signal e_n.
- the latter is the quotient, calculated in divider 25, of a dividend MS(e_n) and a divisor MS(δ_n).
- MS(e_n) is generated by feeding the sampled signal e_n through squarer 21 and thence 20 Hz low-pass filter 22.
- the value MS(δ_n) is generated by feeding the sampled signal δ_n through squarer 23 and 20 Hz low-pass filter 24.
- the all-pole filter 33 seen in FIG. 6, is a vocal tract model such as taught by Atal in his U.S. Pat. No. 3,624,302.
- Filter 33 consists of the delay line shift register 33a having, for example, 14 stages, each stage causing a delay z^-1; and a corresponding number of multipliers 33b connected between the respective stages.
- the coefficients a_1, a_2, ..., a_p derived in coefficient computer 30 are supplied to the respective multipliers 33b.
- the combination of delay line shift register 33a and multipliers 33b, designated 29 in FIG. 6, is known in the art as a transversal delay line.
- in transversal delay line 29 the terms a_k s_{n-k} are calculated. They are then summed by summer 28 along with the clean driving function ê_n, giving the output stated in Equation (1). The result is the synthesized speech signal s_n, free of reverberative effects.
- the digital signal s_n at the output of all-pole filter 33 may be converted to an analog version by the conventional technique of low-pass filtering at half the sample frequency for use in driving the receiver of telephone 39, for example.
- Apparatus for synthesizing speech comprising: transducer means located within a reverberant enclosure remotely from a speaker therein, for receiving reverberated speech signals from said speaker;
- Apparatus for constructing an undistorted replica of a speaker's original speech uttered in a reverberant enclosure comprising means for continuously extracting from the reverberant speech signal an approximation of the vocal tract transfer function of said speaker;
- said extracting means further comprises means for recurrently estimating a sequence of weighting coefficients a_k which constitute a unique estimate of a time-varying filter that approximates in its transfer function T(ω).
- said removing means comprises an inverse filter having as its inputs said weighting coefficients a_k and said reverberant speech signal.
- said means for deriving said second parameter comprises means for autocorrelating said excitation function to determine a maximum value, and means for ascertaining a unique delay associated with said maximum value, said delay constituting the said pitch period parameter.
- said means for deriving said first parameter comprises a fixed threshold detector for inspecting the level of each of said maximum values resulting from said autocorrelation of said excitation function, and means for producing a voiced-unvoiced decision based on whether a given said maximum value exceeds or falls below said fixed threshold.
- Apparatus pursuant to claim 6 further comprising:
- a voicing switch responsive to the output of said threshold detector for selecting as its output either said white noise source or said pulse generator, thereby to produce a fixed amplitude source signal.
- said means for generating said third parameter comprises means for multiplying said fixed amplitude source signal by an amount that renders the speaker's said reverberated excitation function identical in mean-squared level with said fixed amplitude source signal, the result being a synthetic unreverberated excitation function.
- said combining means comprises an all-pole filter having as its inputs said weighting coefficients a_k and said unreverberated excitation function, the output of said combining means constituting the speaker's synthesized speech signal free of reverberative effects.
- a speech dereverberation system for an enclosure characterized by a fixed transfer function H(ω) comprising:
- a voiced-unvoiced interval ratio detector having a fixed threshold level; and means including said detector for applying said impulses to said recursive filter when said threshold is exceeded and otherwise for applying said noise to said filter.
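The excitation analysis described in the definitions above (autocorrelation over candidate pitch delays of 3 to 13 ms, peak picking, a fixed-threshold voicing decision, and a mean-squared gain match) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented circuit: the function names, the normalization by the zero-lag energy, the threshold value 0.3, and the use of a block sum in place of the running low-pass filters are all assumptions made here. The sketch also scales the amplitude by the square root of the mean-squared ratio so that the mean-squared levels of the two signals actually match.

```python
import math
import random


def pick_pitch(e, fs=10_000, threshold=0.3):
    """Return (voiced, period_in_samples) from the autocorrelation peak of e.

    Candidate delays cover 3-13 ms. The threshold value is hypothetical.
    """
    lo, hi = int(0.003 * fs), int(0.013 * fs)
    r0 = sum(x * x for x in e)
    if r0 == 0.0:
        return False, lo                      # silence: treat as unvoiced
    best_lag, best_r = lo, -1.0
    for lag in range(lo, hi + 1):
        # Block-sum autocorrelation, normalized by the zero-lag energy (assumption).
        r = sum(e[n] * e[n - lag] for n in range(lag, len(e))) / r0
        if r > best_r:
            best_lag, best_r = lag, r
    return best_r > threshold, best_lag


def fixed_amplitude_source(voiced, period, n, rng):
    """Unit pulses every `period` samples when voiced, else white noise."""
    if voiced:
        return [1.0 if (i + 1) % period == 0 else 0.0 for i in range(n)]
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mean_square(x):
    return sum(v * v for v in x) / len(x)


def clean_excitation(e, fs=10_000, threshold=0.3, seed=0):
    """Build a synthetic excitation whose mean-squared level matches that of e."""
    voiced, period = pick_pitch(e, fs, threshold)
    src = fixed_amplitude_source(voiced, period, len(e), random.Random(seed))
    ms_src = mean_square(src)
    # Square root of the MS ratio, so that MS(gain * src) equals MS(e).
    gain = math.sqrt(mean_square(e) / ms_src) if ms_src > 0.0 else 0.0
    return voiced, period, [gain * v for v in src]
```

Feeding a pulse train with an 8 ms period sampled at 10 kHz should yield a voiced decision with a period of 80 samples, while a white-noise input should fall below the threshold and select the noise source.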
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
Speech that has been reverberated by the transfer function of a reverberant enclosure is analyzed to detect parameters from which an unreverberative synthetic version of the original speech may be constructed. The process involves continuously approximating the vocal tract transfer function of the speaker. The effect of this transfer function is then removed from the reverberant speech by inverse filtering, the residual signal being the glottis excitation signal reverberated by the room. The reverberant excitation function is then analyzed to determine when the speaker's driving function is voiced or unvoiced, the periodicity when voiced, and a unique gain factor. Then clean speech is synthesized using the foregoing three parameters operating on an all-pole filter that is continuously adapted to approximate the vocal tract transfer function.
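The analysis and synthesis chain summarized in the abstract (estimate predictor weights, inverse filter to obtain the excitation, then drive an all-pole filter) can be illustrated with a minimal Python sketch. The function names are hypothetical, the block-sum autocorrelation stands in for the running low-pass-filtered products of the patent, and the Levinson-Durbin recursion is simply one standard way to solve the resulting simultaneous linear equations; the patent does not say which solution method its coefficient computer uses. The sign convention e_n = w_n - sum of a_k w_{n-k} is also an assumption of this sketch.

```python
def autocorrelation(w, max_lag):
    """R[k] = sum over n of w[n] * w[n - k], for k = 0..max_lag."""
    return [sum(w[n] * w[n - k] for n in range(k, len(w)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for predictor weights a_1..a_p.

    Assumes r[0] > 0, i.e. the analyzed segment is not silent.
    """
    a = []          # a[j - 1] holds a_j for the current model order
    err = r[0]      # prediction-error energy
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = acc / err
        a = [a[j - 1] - k * a[i - j - 1] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return a


def inverse_filter(w, a):
    """Residual e_n = w_n - sum_k a_k * w_{n-k}; samples before n = 0 count as zero."""
    p = len(a)
    return [w[n] - sum(a[k - 1] * w[n - k] for k in range(1, min(p, n) + 1))
            for n in range(len(w))]


def all_pole_filter(e, a):
    """Synthesis s_n = sum_k a_k * s_{n-k} + e_n."""
    p = len(a)
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[k - 1] * s[n - k] for k in range(1, min(p, n) + 1)))
    return s
```

With these helpers, `levinson_durbin(autocorrelation(w, 14), 14)` would give 14 weights for a block of samples `w`, and `all_pole_filter(inverse_filter(w, a), a)` reproduces `w`, which shows that the two filters are exact inverses of each other.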
Description
United States Patent [19]
Allen

SYNTHESIS OF PURE SPEECH FROM A REVERBERANT SIGNAL

[11] 3,786,188
[45] Jan. 15, 1974

[75] Inventor: Jont Brandon Allen, Fair Haven
[73] Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, Berkeley Heights, N.J.
[22] Filed: Dec. 7, 1972
[21] Appl. No.: 311,731
[52] U.S. Cl.: 179/1 SA
[51] Int. Cl.: G10l 1/00
[58] Field of Search: 179/1 SA, 1 J, 1 P, 179/15.55 R; 84/DIG. 26

[56] References Cited
UNITED STATES PATENTS
3,440,350  4/1969  Flanagan  179/1 SA
3,542,954  11/1970  Flanagan  179/1 J
3,662,108  5/1972  Flanagan  179/1 SA
3,715,512  2/1973  Kelly  179/1 SA

Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorney: C. E. Graves

10 Claims, 9 Drawing Figures

[FIG. 1 block diagram: reverberant enclosure; transmission network 34; 5 kHz low-pass filter 37; 10 kHz sampler 38; correlation computer 31; coefficient computer 30; inverse filter 32; excitation analysis-synthesis unit 27; time-varying all-pole filter 33; 5 kHz low-pass filter; telephone 39. Signal labels: reverberated speech, reverberated driving function, "clean" driving function, dereverberated speech.]

FIELD OF THE INVENTION
This invention relates to the removal of distortion from a speech signal. In particular, this invention relates to the synthesizing of a distortion-free speech signal from a signal originating in a reverberative enclosure.
BACKGROUND OF THE INVENTION
It is well known that speech, when produced as an acoustic signal in a reverberative chamber, reaches a remotely located microphone in that chamber at different times via a large number of paths of differing lengths. The signal received at the microphone will in general consist of the direct path energy, which is received first, followed closely by infinitely many delayed and filtered replicas of varying amplitudes. As perceived by the human ear, the effect is reverberative.
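The multipath picture just described can be made concrete with a toy Python sketch: each path contributes a delayed, scaled copy of the source, and the microphone sums the copies. The function name `reverberate` and the (delay, gain) path list are hypothetical, and the sketch simplifies the text in two ways: it uses a finite number of paths and it applies a plain gain to each replica instead of a per-path filter.

```python
def reverberate(s, paths):
    """Sum delayed, scaled copies of s.

    paths is a list of (delay_in_samples, gain) pairs; (0, 1.0) is the direct path.
    """
    n_out = len(s) + max(delay for delay, _ in paths)
    w = [0.0] * n_out
    for delay, gain in paths:
        for n, x in enumerate(s):
            w[n + delay] += gain * x      # the microphone adds every arrival
    return w
```

For instance, a unit impulse sent through a direct path plus one half-amplitude reflection arriving two samples later, `reverberate([1.0, 0.0, 0.0], [(0, 1.0), (2, 0.5)])`, comes out as the impulse followed by its attenuated replica.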
There are believed to be two separate effects present. The first effect is the coloration or spectral distortion due to the summation occurring at the microphone of the directly received signal and its many delayed dispersive reflections from the numerous walls and surfaces in the room 10. The second effect, echo, is the temporal or time distortion arising from the slow decay of energy typically encountered in any moderately lossless room or cavity. These time distortions are closely related to the reverberation time of the room. For the subject who is not physically present in the chamber but listens through a connection to the microphone placed in the chamber, the effects of coloration and echo on intelligibility of the received signal are often severe. This condition is, unfortunately, frequently characteristic of hands-free telephonic transmissions.
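As a hedged illustration of the coloration effect, consider the simplest possible case, which is not taken from the patent: the direct signal plus a single reflection of relative gain g (0 < g < 1) arriving τ seconds later. The symbols g and τ are introduced here for illustration only.

```latex
H(\omega) = 1 + g\,e^{-j\omega\tau},
\qquad
|H(\omega)|^{2} = 1 + g^{2} + 2g\cos(\omega\tau)
```

The magnitude ripples between 1 + g and 1 - g, with maxima spaced 1/τ Hz apart. For τ = 10 ms the peaks repeat every 100 Hz, and for g = 0.5 the peak-to-null ratio is 20 log10(1.5/0.5), about 9.5 dB. A real room sums a great many such reflections, so its spectral distortion is far more irregular than this single-reflection case.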
Numerous schemes have been proposed to remove these degradations perceived in reverberant speech signals. Examples of two such schemes are found in U.S. Pat. Nos. 3,440,350 and 3,662,108 issued to J. L. Flanagan. One drawback of prior art schemes is their lack of facility to adapt to a room transfer function that is continually time-varying. A second drawback is an inability to rely on only the reverberant speech signal itself as a source of information with which to reconstruct the clean" speech.
Accordingly, it is one object of this invention to remove from reverberant speech the spectral distortions altogether, and also those temporal distortions which are equal to or less than the articulation times of the original speech of interest.
A further object of the invention is to realize a way of reconstructing an original speech signal by analysis of the reverberated speech. This object seeks to overcome prior art schemes wherein the parameters which control the synthesis of the undistorted signal are derived under unrealistic conditions, or are contingent on a stationary room transfer function.
Another inventive object is to devise a speech processing system of the type alluded to in the foregoing object, that has the property of sensing or detecting the parameters that characterize the original speech, so that by other aspects of the inventive process, undistorted speech will be synthesized from a knowledge of these parameters.
A still further object of the invention is to enlist and adapt the speech reconstruction method known generally as linear predictive filtering to novel use under reverberant conditions.
The processes of the present invention are based on known properties of human speech and the theory of linear prediction as expounded, for example, in the article "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," B. S. Atal and S. L. Hanauer, Journal of the Acoustical Society of America, Vol. 50, pages 637-655 (1971); and in U.S. Pat. No. 3,624,302, issued to B. S. Atal on Nov. 30, 1971, both of which are hereby incorporated by reference. By way of understanding the general relevance of applicant's invention with respect to the cited prior work, the following brief review of this background art is in order.
Synthesis, or the production of an original or preexisting speech signal from a set of more basic parameters, depends in general upon activating some device whose basic transfer properties are akin to those of the human vocal tract, by some excitation signal which is akin to the excitation which drives the human vocal tract. For ongoing real-time speech synthesis, Atal and others have recognized that a short-time spectral analysis of the original speech signal does not readily yield control signal information for this excitation signal or driving function. Atal has realized more reliable control signals by modeling the human vocal tract as an acoustic tube of variable dimensions. In the Atal model, the vowel and vowel-like sounds of the output at any instant of time are a weighted sum of a discrete number of recent past values of the output plus the value of the input or driving function at that instant of time. Thus:
$$s_n = \sum_{k=1}^{p} a_k s_{n-k} + e_n \qquad (1)$$
which is equivalent to a linear all-pole filter. The latter can be made to behave like the human vocal tract by the proper choice of filter parameters. One may produce speech waveforms by exciting the all-pole filter with the proper combinations of quasi-periodic pulses and white noise, referred to herein as the excitation function e_n. The parameters of this filter are the weighting coefficients alluded to above, termed a_k, where a_k is the gain applied to the speech sample delayed by k samples.
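The recursion described above can be sketched in a few lines. This is a minimal illustration, assuming zero initial conditions and a coefficient list `a` holding a_1 through a_p; the function name `all_pole_filter` is chosen for this example only.

```python
def all_pole_filter(a, e):
    """Compute s[n] = sum_{k=1..p} a[k-1] * s[n-k] + e[n] with zero initial state."""
    p = len(a)
    s = []
    for n, e_n in enumerate(e):
        acc = e_n
        for k in range(1, p + 1):
            if n - k >= 0:          # samples before n = 0 are taken as zero
                acc += a[k - 1] * s[n - k]
        s.append(acc)
    return s


# With a single coefficient a_1 = 0.5, a unit impulse decays geometrically.
impulse_response = all_pole_filter([0.5], [1.0, 0.0, 0.0, 0.0])
print(impulse_response)  # [1.0, 0.5, 0.25, 0.125]
```

A practical vocal tract model would use on the order of 14 coefficients that are updated as the speech changes; the single fixed coefficient here only shows the feedback structure.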
One inventive embodiment of Atal involves bandwidth reduction. The parameters are derived in the Atal approach from an undistorted original or preexisting speech signal which is to be reproduced at some remote location. Inherent in the reverberation reduction situation, however, is the availability of only reverberant speech as a source from which to derive parameters. It is not apparent that pure speech can be synthesized using only reverberative speech as a parameter source.
SUMMARY OF THE INVENTION The present invention in its broadest sense lies in the recognition that the time-varying vocal tract transfer function of a subject speaking within an enclosure, can indeed be sufficiently determined even after the speech has undergone severe reverberative distortion. This is the case, whether or not the room transfer function is also varying or is altogether unknown.
The speech signal w(t), which pursuant to the present invention is to be dereverberated, results from an as yet unknown excitation signal e(t) driving a vocal tract as described above with transfer function T(ω) (where ω = 2π × frequency). The speech so produced, s(t), is then reverberated by the room's transfer function H(ω) to produce a reverberative speech signal w(t). The problem is to extract, from the reverberated speech w(t), information which can be used to reconstruct or synthesize the original speech signal s(t).
Pursuant to a prime aspect of the invention, it has been recognized that any practical or typical room transfer function H(ω) has certain properties that make it possible to accurately determine the speaker's vocal tract transfer function T(ω) from the reverberative speech signal w(t). The principal property that makes the foregoing possible is that the mode structure, i.e., mode density, is almost always sufficiently great that the modes are closer than their bandwidths over the frequency range of useful speech information. Further, the reverberation times, i.e., the 60 dB energy decay times, of the vast majority of office or room size reverberant enclosures are less than those which would damage the articulation because of echoes. In contrast, articulation damage could be expected to occur in the case of a large auditorium with hard walls.
By analysis of the reverberant speech, the vocal tract transfer function T(ω) of the speaker is continuously approximated. Then, the effect of the vocal tract transfer function is removed from the reverberant speech by inverse filtering, leaving only the spectrally flattened glottis excitation signal e(n) reverberated by the room transfer function H(ω). Pursuant to the invention, analysis is then performed at this point on the reverberant excitation function to determine when the driving function e(t) of the speaker is quasiperiodic (which is the voiced condition) or white noise (which is the unvoiced condition). The gain of the driving function e(t) and the period of the quasiperiodic source during voicing are also derived.
Then, pursuant to the invention, clean speech is synthesized using:
1. T(ω), the vocal tract transfer function;
2. a binary parameter denoting voiced or unvoiced information;
3. a parameter denoting the period of the voiced part of the speaker's vocal tract driving function e(t); and
4. a gain parameter denoting the mean-squared level of the driving function e(t).
Advantageously, this process is continuously performed digitally by a sampling of the reverberated speech at, nominally, a 10 kHz rate. In a given communications link, the sampling and processing can occur at any point, such as at the transmitting station, the receiving station, or at some central point such as a central office if the system is telephonic. In the latter case, one speech processor pursuant to the present invention can be constructed to process a multiplicity of reverberative speech signals that are routed through the office.
THE DRAWING FIG. 1 is a schematic block diagram of the entire inventive process in combination with a communications transmission network.
FIG. 2 is a schematic circuit diagram of a correlation computer.
FIG. 3 is a schematic circuit diagram of an inverse filter.
FIG. 4 is a schematic circuit diagram of an excitation analysis/synthesis unit.
FIG. 5 is a schematic circuit diagram of a second correlation computer.
FIG. 6 is a schematic circuit diagram of a synthetic speech generator.
FIG. 7 is a schematic circuit diagram of a peak selector portion of said excitation analysis/synthesis unit.
FIG. 8 is a graph depicting resonant frequencies.
FIG. 9 is a graph depicting an aspect of a typical room transfer function.
THEORY OF THE INVENTION A greater understanding of the illustrative embodiment will be gained by first more fully considering the theory of the invention and definitions of certain terms.
Excitation Signal or Driving Function e(t) In order to cause an output at the mouth from the human vocal tract, the vocal cords of the glottis are excited to produce pulses recurring at a quasiperiodic rate. The sounds so produced are voiced. Other sounds are unvoiced, such as sss, fff, p, and k. The latter are formed by turbulent air at the mouth, throat, and lips without vocal cord excitation. The voiced and unvoiced sounds in total are the signal source from which human speech originates, and are called the excitation signal e(t). In order to generate an output from a model of the vocal tract, such as a filter with transfer function T(ω), an excitation signal must be applied. Speech so produced is of course synthetic. The excitation signal, which in sampled form is herein denoted e(n), may be produced by a pulse generator with a variable pulse period and a white noise source, selectively applied to a variable gain amplifier. The pulse generator supplies the
Vocal Tract Transfer Function T(ω) It has been demonstrated by Atal and Hanauer that the human vocal tract may be accurately modeled as an all-pole filter T(ω) which closely approximates the transfer properties of the vocal tract. Such a filter has a transfer function in the frequency domain given by:
$$T(\omega) = \frac{1}{1 - \sum_{k=1}^{14} a_k z^{-k}} \qquad (2)$$
where
$$z = \exp(i\,2\pi\,\omega/\omega_s) \qquad (2a)$$
in which ω_s = the radian sampling frequency and ω = the radian frequency. The number 14 in Equation (2) is a typical value. Equation (2) is the reciprocal of a polynomial where the zeros formed for a given set of coefficient values (a_1, a_2, a_3, ..., a_14) determine the frequencies where T(ω) has its maximum values. The latter frequencies are the resonant frequencies or poles of the filter shown in FIG. 8 as ω_1, etc.
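To see how the coefficients set the resonant frequencies, the magnitude of an all-pole transfer function can be evaluated on a frequency grid. The sketch below is illustrative only: it assumes a 10 kHz sampling rate, a two-coefficient filter with a single pole pair placed near 1000 Hz, and the hypothetical helper name `transfer_magnitude`.

```python
import cmath
import math


def transfer_magnitude(a, freq_hz, fs_hz):
    """|T| = 1 / |1 - sum_k a_k z^-k| with z = exp(i * 2*pi * f / fs)."""
    z = cmath.exp(1j * 2 * math.pi * freq_hz / fs_hz)
    denominator = 1 - sum(a_k * z ** (-k) for k, a_k in enumerate(a, start=1))
    return 1 / abs(denominator)


fs = 10_000.0                               # assumed sampling rate in Hz
r, theta = 0.95, 2 * math.pi * 1000.0 / fs  # pole radius and angle (illustrative)
a = [2 * r * math.cos(theta), -r * r]       # coefficients giving poles at r*exp(+/- i*theta)

freqs = [10.0 * i for i in range(1, 500)]   # 10 Hz grid up to 4990 Hz
peak_hz = max(freqs, key=lambda f: transfer_magnitude(a, f, fs))
print(peak_hz)                              # lies close to the 1000 Hz pole frequency
```

A 14-coefficient filter would show several such peaks, one per resonance of the modeled vocal tract.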
If a driving function e(t) comprising a specified combination of periodic impulses and white noise is applied to such a filter, a speech signal s(t) will result. In sampled form, we denote s_n (n an integer) as the speech samples, and e_n as the driving signal samples. Then, as in Equation (1), since:
$$s_n = \sum_{k=1}^{14} a_k s_{n-k} + e_n \qquad (3)$$
(where s_n is the output of this filter and e_n is the input), the resulting s_n is the output of the vocal tract further defined by the a_k coefficients and being driven by e_n. Equation (3) is an application of the method of linear digital filtering, and it states that the present output value of s_n can be estimated from a weighted sum of past (or delayed) output values (s_{n-k}) plus the new input value e_n of the driving function.
Room Transfer Function H(ω) It is well known that an enclosure such as a room is a linear system. This means that the effect a room has on a signal such as speech is to cause numerous filtered delays of the signal in its travel to a stationary microphone, via many diverse path lengths. All the delayed signals are additively combined by a microphone placed in the enclosure. For example, two unit amplitude sinusoidal signals cos ω_1 t and cos ω_2 t launched in an enclosure will be recovered by a microphone, or perceived by a listener, with altered amplitudes b_1, b_2, and each will have been delayed by an amount expressible as respective phase angles φ_1 and φ_2. Thus:
$$\cos \omega_1 t \;\rightarrow\; b_1 \cos(\omega_1 t + \phi_1)$$
$$\cos \omega_2 t \;\rightarrow\; b_2 \cos(\omega_2 t + \phi_2)$$
where φ and b are functions of the frequency ω and the location of the microphone and loudspeaker, but not otherwise a function of time.
Thus, a room transfer function is an expression that in one respect describes how signals of various given frequencies will be relatively affected in amplitude and phase by being propagated in the room. FIG. 9 depicts a typical room transfer function. FIG. 9 illustrates the fact that propagated frequencies differing by as little as 2 Hz may differ in power by as much as 40 dB at a stationary point remote from the source. It can be seen that a room is a filter, and its transfer function is that of a filter.
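A toy calculation shows how the sum of delayed paths can produce such sharp level differences between nearby frequencies. The sketch assumes only one reflection added to the direct sound, with a hypothetical gain and delay chosen to make the effect obvious; a real room sums a very large number of reflections, and the names below are invented for this example.

```python
import cmath
import math


def two_path_response(freq_hz, reflection_gain, delay_s):
    """H = 1 + g * exp(-i * 2*pi*f * tau): direct sound plus one delayed reflection."""
    return 1 + reflection_gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)


def level_db(h):
    """Magnitude of a complex response expressed in decibels."""
    return 20 * math.log10(abs(h))


g, tau = 0.99, 0.25   # hypothetical reflection gain and delay in seconds
peak_db = level_db(two_path_response(1000.0, g, tau))   # the two paths add in phase
notch_db = level_db(two_path_response(1002.0, g, tau))  # the two paths nearly cancel
print(round(peak_db - notch_db, 1))
```

Here two tones only 2 Hz apart arrive with levels more than 40 dB apart, purely because of the relative phase of the two paths at the microphone.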
The problem of describing how an enclosure affects propagated acoustic wave signals may be approached analytically either in terms of the frequency response H(ω), as depicted in FIG. 9, or in terms of the impulse response h(t), where H is a complex number and h(t) is a real time-varying signal amplitude.
Given H(ω), one can, by a Fourier transform, derive the function h(t). Also, given the problem of a time-varying input signal of amplitude s(t) passing through a reverberant enclosure having a known impulse response h(t), the output signal w(t) may be predicted as:
$$w(t) = s(t) * h(t) \qquad (6)$$
where the symbol * denotes convolution. Likewise, given the enclosure input signal s(t) by its Fourier transform S(ω) and the frequency response H(ω) of the enclosure, the output frequency response W(ω) is:
$$W(\omega) = S(\omega)\,H(\omega)$$
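The equivalence between convolution in time and multiplication in frequency can be checked numerically on short sequences. The following sketch uses a direct convolution and a naive discrete Fourier transform; the sample values for `s` and `h` are arbitrary and the helper names are chosen for this example.

```python
import cmath


def convolve(s, h):
    """Discrete convolution w[n] = sum_m s[m] * h[n - m]."""
    w = [0.0] * (len(s) + len(h) - 1)
    for i, s_i in enumerate(s):
        for j, h_j in enumerate(h):
            w[i + j] += s_i * h_j
    return w


def dft(x, n):
    """Naive n-point DFT of x, zero-padded to length n."""
    padded = list(x) + [0.0] * (n - len(x))
    return [sum(padded[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


s = [1.0, 2.0, 3.0]        # arbitrary "dry" input samples
h = [1.0, 0.0, 0.5]        # arbitrary impulse response: direct path plus one echo
w = convolve(s, h)         # [1.0, 2.0, 3.5, 1.0, 1.5]

n = len(w)
W, S, H = dft(w, n), dft(s, n), dft(h, n)
max_error = max(abs(W[k] - S[k] * H[k]) for k in range(n))
print(max_error < 1e-9)    # True: the spectrum of w equals S times H
```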
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT Theory will now be applied by reference to a typical reverberant chamber shown as a four-sided room 10 in FIG. 1 wherein a speaker is speaking from location 11 and his speech is received by a microphone 12 remotely placed. In general, as the distance between a speaker and the receiving microphone is increased, beginning at a point typically beyond a few inches a rain barrel-like quality to the speech will be increasingly evident at the microphone location and hence, of course, also at any receiver connected thereto.
If the speaker at location 11 is characteristic of most male and female adults speaking English, his articulation times are commonly of the order of less than 1 second. For the typical reverberant enclosure such as room 10, the 60 dB energy decay time is less than one second; and room dimensions are at least times greater than the dimensions of a human vocal tract.
Summary of Process FIG. 1 depicts the continuous operation of the inventive process, in a sequence of stages. First, the values of the a_k terms of Equation (1) are calculated from w_n by the correlation computer 31 and the coefficient computer 30. The a_k terms, which number typically 14, constitute an estimate of the time-varying filter which approximates, in its transfer function T(ω), the speaker's vocal tract. Then, the reverberated signal w_n and the varying values of a_k are applied to inverse filter 32
Process Details The reverberant speech is forwarded via a transmission network 34 to the intended receiving point such as telephone 39. For simplicity, the network 34 is shown as separate from the speech processor; but obviously the processor could be located within network 34, such as in a central office. Similarly, telephone 39 includes a direct connection to network 34 and an indirect connection thereto via the processor, thus to indicate that the processor could be an add-on feature located at the telephone.
Advantageously, at or near the point where the processing is to occur, the reverberant speech w(t) is first low-pass filtered in filter 37. The latter is a 5 kHz filter, designed to the proposition that human speech information is sufficiently specified within the frequencies below 5 kHz. The low-pass filtered speech signal is then sampled in sampler 38 at a 10 kHz rate, in keeping with the Nyquist sampling theorem. The output of sampler 38 is a stepwise succession of voltages whose amplitudes are indicative of the low-pass filtered speech signal strength at times corresponding to the sampling times. This output is w_n, n being an integer analogous to time.
The sampler 38 output w_n is fed to correlation computer 31. On a continuous (every sample) or periodic (for example, every 6 ms) basis, computer 31 forms the following combinations of the input:
$$R_j(n) = \sum_{m} h_{n-m}\, w_m\, w_{m-j}, \qquad j = 1, 2, \ldots, P$$
where P is typically 14 and h_n is the impulse response of each of the filters 31c.
FIG. 2 depicts the structure of correlation computer 31. The sampled signal w_n is introduced through a shift register 31a consisting of p stages. The samples are delayed one sample value per stage, z^{-1} being notation signifying a delay of one sample. Thus, signals w_{n-1}, w_{n-2}, w_{n-3}, ..., w_{n-14} (if p = 14) are present at a given time as outputs of successive delay stages of register 31a. These signals are each multiplied separately in respective multipliers 31b by the present speech sample w_n, giving w_n w_{n-j}, j = 1, 2, ..., p. The outputs of multipliers 31b are fed respectively to low-pass filters 31c, which are for example Hz-filters and which average over the respective filter inputs with a weighting defined by their impulse response h_n. The outputs of filters 31c, designated R_1, R_2, ..., R_14, are fed to coefficient computer 30.
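The shift register, multiplier, and low-pass filter chain can be sketched as a running computation. As an assumption for illustration, each low-pass filter is modeled here as a one-pole leaky averager with smoothing constant `alpha`; the actual filter response h_n is not specified beyond being low-pass, and the function name is hypothetical.

```python
def running_correlations(w, p, alpha=0.01):
    """Track smoothed products w[n] * w[n-j] for j = 1..p.

    delay_line[j-1] holds w[n-j]; R[j-1] is the leaky average of w[n] * w[n-j].
    """
    delay_line = [0.0] * p
    R = [0.0] * p
    for w_n in w:
        for j in range(p):
            product = w_n * delay_line[j]
            R[j] += alpha * (product - R[j])     # one-pole low-pass (assumed form)
        delay_line = [w_n] + delay_line[:-1]     # shift by one sample
    return R


# An alternating +1/-1 signal is anti-correlated at lag 1 and correlated at lag 2.
alternating = [(-1.0) ** n for n in range(2000)]
R_alt = running_correlations(alternating, p=2)
print([round(r, 3) for r in R_alt])  # [-1.0, 1.0]
```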
Coefficient computer 30 sets up and solves a set of linear simultaneous equations for the values (a_1, a_2, a_3, ..., a_14) of a_k. These equations are:
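The equations themselves are not reproduced in this text. Purely as an illustration, the sketch below assumes the autocorrelation form of the linear-prediction normal equations, sum over k of a_k R_|j-k| = R_j for j = 1..p, which also requires a zero-lag value R_0 that the description above does not mention. The function name and the use of plain Gaussian elimination are choices made for this example, not the patent's circuit.

```python
def solve_normal_equations(R):
    """Given R = [R_0, R_1, ..., R_p], return [a_1, ..., a_p] satisfying
    sum_k a_k * R[|j - k|] = R[j] for j = 1..p (assumed autocorrelation form)."""
    p = len(R) - 1
    # Augmented matrix: Toeplitz system on the left, right-hand side in the last column.
    M = [[R[abs(j - k)] for k in range(p)] + [R[j + 1]] for j in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, p):
            factor = M[row][col] / M[col][col]
            for c in range(col, p + 1):
                M[row][c] -= factor * M[col][c]
    a = [0.0] * p
    for row in range(p - 1, -1, -1):             # back substitution
        acc = M[row][p] - sum(M[row][c] * a[c] for c in range(row + 1, p))
        a[row] = acc / M[row][row]
    return a


# Correlations R_k = 0.5**k belong to a first-order process with a_1 = 0.5.
coefficients = solve_normal_equations([1.0, 0.5, 0.25])
print(coefficients)  # approximately [0.5, 0.0]
```

For p = 14 a Toeplitz-specific solver such as the Levinson-Durbin recursion would be more economical, but the elimination above keeps the example short.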
FIG. 3 depicts the structure of inverse filter 32 as consisting of a delay line or shift register 32a of p stages, where p equals the number of stages, for example 14, of register 31a. The input to shift register 32a is the sampled signal w_n. As in shift register 31a, the samples are delayed one sample value per stage. The delayed samples are picked off successive stages of register 32a and respectively led to multipliers 32b. The other inputs to respective multipliers 32b are the values (a_1, a_2, a_3, ..., a_14) calculated in computer 30. The outputs of all multipliers 32b are combined in adder 35a; and from this sum, the value of sampled speech w_n is subtracted in subtractor 35b. The subtractor 35b output, e_n, is a driving function in sampled form of the original unreverberant speech signal s_n, reverberated by the effects of enclosure 10.
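The inverse filter can be sketched as the prediction residual. As an assumption, the sketch computes w_n minus the weighted sum of past samples, which is the sign convention that exactly undoes Equation (1); the subtractor described above takes the difference in the opposite order, which changes only the overall sign. The function name is hypothetical.

```python
def inverse_filter(a, w):
    """Residual e[n] = w[n] - sum_{k=1..p} a[k-1] * w[n-k], zero initial state."""
    p = len(a)
    e = []
    for n, w_n in enumerate(w):
        predicted = sum(a[k - 1] * w[n - k] for k in range(1, p + 1) if n - k >= 0)
        e.append(w_n - predicted)
    return e


# A geometric decay produced by a_1 = 0.5 collapses back to a single impulse.
residual = inverse_filter([0.5], [1.0, 0.5, 0.25, 0.125])
print(residual)  # [1.0, 0.0, 0.0, 0.0]
```

When the input is reverberant speech, the residual is not a clean pulse train or noise: the vocal tract resonances are removed, but the room's effect on the excitation remains.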
The next step in the inventive process involves dereverberating the excitation function e_n. The unit which performs this step is excitation analysis/synthesis unit 27, shown in FIG. 4. Its purpose is to synthesize, from a revamping of the driving function e_n, a clean driving function ē_n. Driving function e_n is first autocorrelated in correlation computer 20 to determine any dominant periodicities or lack thereof. This process involves the apparatus of FIG. 5, which computes the result:
$$R(\tau) = \sum_{m} y_{n-m}\, e_m\, e_{m-\tau}$$
where y is the impulse response of the low-pass filters 20c, and τ runs over the range of possible pitch periods, i.e., 3-13 ms.
As seen in FIG. 5, the sampled driving function e_n is introduced through a shift register 20a consisting of l stages, where l corresponds to delays of up to 13 ms in keeping with the largest pitch periods which may be encountered. The samples are delayed one sample value per stage. Thus, signals e_{n-1}, e_{n-2}, e_{n-3}, ..., e_{n-l} are present at a given time at the output of the successive stages of shift register 20a. These signals are each multiplied separately in respective multipliers 20b by the quantity e_n. The outputs of the respective multipliers 20b are low-pass filtered in respective filters 20c which are 20 Hz, for example, selected because of the inherent slowly varying nature of the correlation.
The outputs of the respective filters 20c for each sample are a set of numbers R(τ_1), R(τ_2), R(τ_3), ..., R(τ_l) which are a measure of the degree of correlation for delays τ_i. The delay τ_i corresponding to the maximum of the just-performed autocorrelation is now ascertained in peak picking selector 16 seen in FIG. 7. The maximum R(τ_i) value among R(τ_1), R(τ_2), ..., R(τ_l) is selected, and the delay τ_i associated with that largest value of R, denoted τ_max, is used as the pitch period parameter required by pulse generator 13.
Additionally, selector 16 includes a threshold detector 36, which inspects the values of each signal R(τ_i) to determine whether the driving function at that time is voiced or unvoiced.
The output of threshold detector 36 is a binary level signal which is fed to voicing switch 15. Also, the output τ_max of selector 16, which represents the pitch period, is fed to pulse generator 13. The latter can, for example, be an astable oscillator of variable period, well known in the art.
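Peak picking and the voiced/unvoiced decision can be sketched together. The example makes several assumptions that are not in the text: a block autocorrelation normalized by signal energy stands in for the low-pass filtered products, the threshold of 0.5 is arbitrary, lags are counted in samples at an assumed 10 kHz rate, and all names are invented.

```python
import random


def autocorrelation(e, lags):
    """Energy-normalized block autocorrelation of e at the given lags (in samples)."""
    energy = sum(x * x for x in e) or 1.0
    return [sum(e[n] * e[n - lag] for n in range(lag, len(e))) / energy for lag in lags]


def pick_pitch(R, lags, threshold=0.5):
    """Return (voiced, pitch_lag): the lag of the largest R if it exceeds the threshold."""
    best = max(range(len(R)), key=lambda i: R[i])
    voiced = R[best] > threshold
    return voiced, (lags[best] if voiced else None)


lags = list(range(30, 131))                       # 3-13 ms at an assumed 10 kHz rate
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]   # 5 ms pitch period
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]

voiced_result = pick_pitch(autocorrelation(pulses, lags), lags)
unvoiced_result = pick_pitch(autocorrelation(noise, lags), lags)
print(voiced_result, unvoiced_result)             # (True, 50) (False, None)
```

The pulse train gives a strong peak at its period, so it is classified as voiced with a 50-sample pitch period; the noise has no dominant lag and falls below the threshold.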
The third parameter derived in unit 27 is a gain factor, denoted G in FIG. 4, which amplitude modulates or multiplies the fixed amplitude source signal ẽ_n by an amount that makes the result ē_n identical in mean-squared (MS) level to the reverberant driving signal e_n. The gain factor is the quotient, calculated in divider 25, of a dividend MS(e_n) and a divisor MS(ẽ_n). The value MS(e_n) is generated by feeding the sampled signal e_n through squarer 21 and thence Hz low-pass filter 22. The value MS(ẽ_n) is generated by feeding the sampled signal ẽ_n through squarer 23 and 20 Hz low-pass filter 24.
The quotient of these two, namely gain factor G, is continuously applied to the signal ẽ_n through variable gain amplifier 26. The output of the latter is ē_n, which approximates the driving function of the original unreverberant speech s(t) in room 10. It remains now to synthesize the clean speech; and this is accomplished in all-pole filter 33.
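The source selection and level matching can be sketched as follows. The example assumes block averages in place of the low-pass filtered squares, and it applies the square root of the mean-squared ratio as the amplitude gain so that the scaled source has the same mean-squared level as the reverberant excitation; the text describes G as the quotient itself. Function names and the random seed are hypothetical.

```python
import math
import random


def fixed_amplitude_source(n_samples, voiced, period, seed=0):
    """Unit pulse train (voiced) or unit-variance white noise (unvoiced)."""
    if voiced:
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def mean_square(x):
    """Block mean-squared level (stands in for squarer plus low-pass filter)."""
    return sum(v * v for v in x) / len(x)


def match_level(source, reverberant_excitation):
    """Scale the source so its mean-squared level equals that of the excitation."""
    gain = math.sqrt(mean_square(reverberant_excitation) / mean_square(source))
    return [gain * v for v in source]


source = fixed_amplitude_source(400, voiced=True, period=50)
reverberant = [0.3] * 400                  # placeholder excitation with MS level 0.09
clean = match_level(source, reverberant)
print(round(mean_square(clean), 6))        # 0.09
```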
The all-pole filter 33, seen in FIG. 6, is a vocal tract model such as taught by Atal in his U.S. Pat. No. 3,624,302. Filter 33 consists of the delay line shift register 33a having, for example, 14 stages, each stage causing a delay z^{-1}; and a corresponding number of multipliers 33b connected between the respective stages. The coefficients a_1, a_2, a_3, ..., a_14 derived in coefficient computer 30 are supplied to the respective multipliers 33b. The combination of delay line shift register 33a and multipliers 33b, designated 29 in FIG. 6, is known in the art as a transversal delay line.
In transversal delay line 29, the terms a_k s_{n-k} are calculated. They are then summed by summer 28 along with the clean driving function ē_n, giving the output stated in Equation (1). The result is the synthesized speech signal s_n, free of reverberative effects.
The digital signal s_n at the output of all-pole filter 33 may be converted to an analog version by the conventional technique of low-pass filtering at half the sample frequency for use in driving the receiver of telephone 39, for example.
Multiple-Microphone Signal Pickup Although the invention has so far been described as operating with a single microphone 12, arrays of plural microphones can also be used to advantage. The benefit of microphone arrays is understood by recognizing that a better estimate of the parameters is attained through the availability of more data. For this case, each new microphone requires its own correlation computer 31. The new outputs from this computer, R'(τ_1), R'(τ_2), ..., R'(τ_l), are added to the other R(τ)s of other microphones, thus giving more accurate data.
It is to be understood that the embodiments described herein are merely illustrative of the principles of the invention. Various modifications may be made thereto by persons skilled in the art without departing from the spirit and scope of the invention.
What is claimed is: 1. Apparatus for synthesizing speech comprising: transducer means located within a reverberant enclosure remotely from a speaker therein, for receiving reverberated speech signals from said speaker;
means for continuously deriving, from said reverberated speech signals, first signals representative of the vocal tract transfer function of said speaker;
means for developing, from said reverberated speech signals and said first signals, second signals representing the reverberated excitation source of said speaker;
means for dereverberating said second signals; and
means for developing, from said first signals and said dereverberated second signals, synthetic speech signals substantially approximating said speaker's original speech.
2. Apparatus for constructing an undistorted replica of a speaker's original speech uttered in a reverberant enclosure comprising means for continuously extracting from the reverberant speech signal an approximation of the vocal tract transfer function of said speaker;
means for removing from said reverberant speech the effect of said vocal tract transfer function, the resulting residual signal being substantially the speaker's reverberated excitation function;
means for deriving, from said reverberated excitation function,
a first parameter denoting the voiced or unvoiced nature of said excitation function;
a second parameter denoting the pitch period of voiced portions of said excitation function; and
a third parameter denoting the mean-squared level of said excitation function; and
means for combining said first, second, and third parameters with said vocal tract transfer function to produce said undistorted replica.
3. Apparatus pursuant to claim 2 wherein said extracting means further comprises means for recurrently estimating a sequence of weighting coefficients a_k which constitute a unique estimate of a time-varying filter that approximates in its transfer function T(ω) said speaker's vocal tract.
4. Apparatus pursuant to claim 3 wherein said removing means comprises an inverse filter having as its inputs said weighting coefficients a_k and said reverberant speech signal.
5. Apparatus pursuant to claim 4 wherein said means for deriving said second parameter comprises means for autocorrelating said excitation function to determine a maximum value, and means for ascertaining a unique delay associated with said maximum value, said delay constituting the said pitch period parameter.
6. Apparatus pursuant to claim 5 wherein said means for deriving said first parameter comprises a fixed threshold detector for inspecting the level of each of said maximum values resulting from said autocorrelation of said excitation function, and means for producing a voiced-unvoiced decision based on whether a given said maximum value exceeds or falls below said fixed threshold.
7. Apparatus pursuant to claim 6 further comprising:
a white noise source,
a variable period pulse generator,
means for transmitting said pitch period parameter to said pulse generator to control the period of said pulses, and
a voicing switch responsive to the output of said threshold detector for selecting as its output either said white noise source or said pulse generator, thereby to produce a fixed amplitude source signal.
8. Apparatus pursuant to claim 7, wherein said means for generating said third parameter comprises means for multiplying said fixed amplitude source signal by an amount that renders the speaker's said reverberated excitation function identical in mean-squared level with said fixed amplitude source signal, the result being a synthetic unreverberated excitation function.
9. Apparatus pursuant to claim 8, wherein said combining means comprises an all-pole filter having as its inputs said weighting coefficients a_k and said unreverberated excitation function, the output of said combining means constituting the speaker's synthesized speech signal free of reverberative effects.
10. A speech dereverberation system for an enclosure characterized by a fixed transfer function H(ω) comprising:
a source of white noise;
a source of electrical impulses;
a recursive filter having a variable transfer function;
means for receiving a reverberative speech signal generated from within said enclosure;
means for deriving, from successive discrete sample sets of said speech signal, a specific said filter setting representing a currently valid vocal tract transfer function;
means for deriving an indicia of the pitch period and an indicia of the ratio of voiced-to-unvoiced intervals in said speech signal;
means for applying said pitch period indicia to said impulse source to control the impulse generation rate; a voiced-unvoiced interval ratio detector having a fixed threshold level; and means including said detector for applying said impulses to said recursive filter when said threshold is exceeded and otherwise for applying said noise to said filter.
Claims (10)
1. Apparatus for synthesizing speech comprising: transducer means located within a reverberant enclosure remotely from a speaker therein, for receiving reverberated speech signals from said speaker; means for continuously deriving, from said reverberated speech signals, first signals representative of the vocal tract transfer function of said speaker; means for developing, from said reverberated speech signals and said first signals, second signals representing the reverberated excitation source of said speaker; means for dereverberating said second signals; and means for developing, from said first signals and said dereverberated second signals, synthetic speech signals substantially approximating said speaker's original speech.
2. Apparatus for constructing an undistorted replica of a speaker's original speech uttered in a reverberant enclosure comprising means for continuously extracting from the reverberant speech signal an approximation of the vocal tract transfer function of said speaker; means for removing from said reverberant speech the effect of said vocal tract transfer function, the resulting residual signal being substantially the speaker's reverberated excitation function; means for deriving, from said reverberated excitation function, a first parameter denoting the voiced or unvoiced nature of said excitation function; a second parameter denoting the pitch period of voiced portions of said excitation function; and a third parameter denoting the mean-squared level of said excitation function; and means for combining said first, second, and third parameters with said vocal tract transfer function to produce said undistorted replica.
3. Apparatus pursuant to claim 2 wherein said extracting means further comprises means for recurrently estimating a sequence of weighting coefficients a_k which constitute a unique estimate of a time-varying filter that approximates in its transfer function T(ω) said speaker's vocal tract.
4. Apparatus pursuant to claim 3 wherein said removing means comprises an inverse filter having as its inputs said weighting coefficients a_k and said reverberant speech signal.
5. Apparatus pursuant to claim 4 wherein said means for deriving said second parameter comprises means for autocorrelating said excitation function to determine a maximum value, and means for ascertaining a unique delay associated with said maximum value, said delay constituting the said pitch period parameter.
6. Apparatus pursuant to claim 5 wherein said means for deriving said first parameter comprises a fixed threshold detector for inspecting the level of each of said maximum values resulting from said autocorrelation of said excitation function, and means for producing a voiced-unvoiced decision based on whether a given said maximum value exceeds or falls below said fixed threshold.
7. Apparatus pursuant to claim 6 further comprising: a white noise source, a variable period pulse generator, means for transmitting said pitch period parameter to said pulse generator to control the period of said pulses, and a voicing switch responsive to the output of said threshold detector for selecting as its output either said white noise source or said pulse generator, thereby to produce a fixed amplitude source signal.
8. Apparatus pursuant to claim 7, wherein said means for generating said third parameter comprises means for multiplying said fixed amplitude source signal by an amount that renders the speaker's said reverberated excitation function identical in mean-squared level with said fixed amplitude source signal, the result being a synthetic unreverberated excitation function.
9. Apparatus pursuant to claim 8, wherein said combining means comprises an all-pole filter having as its inputs said weighting coefficients a_k and said unreverberated excitation function, the output of said combining means constituting the speaker's synthesized speech signal free of reverberative effects.
10. A speech dereverberation system for an enclosure characterized by a fixed transfer function H(ω), comprising: a source of white noise; a source of electrical impulses; a recursive filter having a variable transfer function; means for receiving a reverberative speech signal generated from within said enclosure; means for deriving, from successive discrete sample sets of said speech signal, a specific said filter setting representing a currently valid vocal tract transfer function; means for deriving an indicia of the pitch period and an indicia of the ratio of voiced-to-unvoiced intervals in said speech signal; means for applying said pitch period indicia to said impulse source to control the impulse generation rate; a voiced-unvoiced interval ratio detector having a fixed threshold level; and means including said detector for applying said impulses to said recursive filter when said threshold is exceeded and otherwise for applying said noise to said filter.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US31173172A | 1972-12-07 | 1972-12-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
US3786188A (en) | 1974-01-15 |
Family
ID=23208208
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
US00311731A | Expired - Lifetime | US3786188A (en) | 1972-12-07 | 1972-12-07 | Synthesis of pure speech from a reverberant signal |
Country Status (1)
Country | Link |
---|---|
US (1) | US3786188A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3440350A (en) * | 1966-08-01 | 1969-04-22 | Bell Telephone Labor Inc | Reception of signals transmitted in a reverberant environment |
US3542954A (en) * | 1968-06-17 | 1970-11-24 | Bell Telephone Labor Inc | Dereverberation by spectral measurement |
US3662108A (en) * | 1970-06-08 | 1972-05-09 | Bell Telephone Labor Inc | Apparatus for reducing multipath distortion of signals utilizing cepstrum technique |
US3715512A (en) * | 1971-12-20 | 1973-02-06 | Bell Telephone Labor Inc | Adaptive predictive speech signal coding system |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2389280A1 (en) * | 1977-04-27 | 1978-11-24 | Western Electric Co | SIGNAL PROCESSING SYSTEM |
US4825384A (en) * | 1981-08-27 | 1989-04-25 | Canon Kabushiki Kaisha | Speech recognizer |
US4612414A (en) * | 1983-08-31 | 1986-09-16 | At&T Information Systems Inc. | Secure voice transmission |
US5150413A (en) * | 1984-03-23 | 1992-09-22 | Ricoh Company, Ltd. | Extraction of phonemic information |
US4833717A (en) * | 1985-11-21 | 1989-05-23 | Ricoh Company, Ltd. | Voice spectrum analyzing system and method |
US4961160A (en) * | 1987-04-30 | 1990-10-02 | Oki Electric Industry Co., Ltd. | Linear predictive coding analysing apparatus and bandlimiting circuit therefor |
EP0289285A3 (en) * | 1987-04-30 | 1989-11-29 | Oki Electric Industry Company, Limited | Linear predictive coding analysing apparatus and bandlimited circuit therefor |
EP0289285A2 (en) * | 1987-04-30 | 1988-11-02 | Oki Electric Industry Company, Limited | Linear predictive coding analysing apparatus and bandlimited circuit therefor |
US5014318A (en) * | 1988-02-25 | 1991-05-07 | Fraunhofer Gesellschaft Zur Forderung Der Angewandten Forschung E. V. | Apparatus for checking audio signal processing systems |
US5150414A (en) * | 1991-03-27 | 1992-09-22 | The United States Of America As Represented By The Secretary Of The Navy | Method and apparatus for signal prediction in a time-varying signal system |
US5528726A (en) * | 1992-01-27 | 1996-06-18 | The Board Of Trustees Of The Leland Stanford Junior University | Digital waveguide speech synthesis system and method |
US5949891A (en) * | 1993-11-24 | 1999-09-07 | Intel Corporation | Filtering audio signals from a combined microphone/speaker earpiece |
US5717768A (en) * | 1995-10-05 | 1998-02-10 | France Telecom | Process for reducing the pre-echoes or post-echoes affecting audio recordings |
US5774562A (en) * | 1996-03-25 | 1998-06-30 | Nippon Telegraph And Telephone Corp. | Method and apparatus for dereverberation |
US20040022394A1 (en) * | 2002-08-05 | 2004-02-05 | Michaelis Paul R. | Room acoustics echo meter for voice terminals |
US7171004B2 (en) * | 2002-08-05 | 2007-01-30 | Avaya Technology Corp. | Room acoustics echo meter for voice terminals |
US7340068B2 (en) * | 2003-02-19 | 2008-03-04 | Oticon A/S | Device and method for detecting wind noise |
US20040161120A1 (en) * | 2003-02-19 | 2004-08-19 | Petersen Kim Spetzler | Device and method for detecting wind noise |
WO2005062298A1 (en) * | 2003-12-01 | 2005-07-07 | Siemens Aktiengesellschaft | Method for suppressing the interference of audio signals |
US7852999B2 (en) * | 2005-04-27 | 2010-12-14 | Cisco Technology, Inc. | Classifying signals at a conference bridge |
US20060245565A1 (en) * | 2005-04-27 | 2006-11-02 | Cisco Technology, Inc. | Classifying signals at a conference bridge |
US7734034B1 (en) | 2005-06-21 | 2010-06-08 | Avaya Inc. | Remote party speaker phone detection |
US20080189107A1 (en) * | 2007-02-06 | 2008-08-07 | Oticon A/S | Estimating own-voice activity in a hearing-instrument system from direct-to-reverberant ratio |
EP1956589A1 (en) * | 2007-02-06 | 2008-08-13 | Oticon A/S | Estimating own-voice activity in a hearing-instrument system from direct-to-reverberant ratio |
AU2007221816B2 (en) * | 2007-02-06 | 2010-12-23 | Oticon A/S | Estimating own-voice activity in a hearing-instrument system from direct-to-reverberant ratio |
US20090281807A1 (en) * | 2007-05-14 | 2009-11-12 | Yoshifumi Hirose | Voice quality conversion device and voice quality conversion method |
US8898055B2 (en) * | 2007-05-14 | 2014-11-25 | Panasonic Intellectual Property Corporation Of America | Voice quality conversion device and voice quality conversion method for converting voice quality of an input speech using target vocal tract information and received vocal tract information corresponding to the input speech |
US9025779B2 (en) | 2011-08-08 | 2015-05-05 | Cisco Technology, Inc. | System and method for using endpoints to provide sound monitoring |
US9520140B2 (en) | 2013-04-10 | 2016-12-13 | Dolby Laboratories Licensing Corporation | Speech dereverberation methods, devices and systems |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US3786188A (en) | Synthesis of pure speech from a reverberant signal | |
Gillespie et al. | Speech dereverberation via maximum-kurtosis subband adaptive filtering | |
Schroeder | Vocoders: Analysis and synthesis of speech | |
CN109065067B (en) | Conference terminal voice noise reduction method based on neural network model | |
JP4567655B2 (en) | Method and apparatus for suppressing background noise in audio signals, and corresponding apparatus with echo cancellation | |
US4066842A (en) | Method and apparatus for cancelling room reverberation and noise pickup | |
CA1123955A (en) | Speech analysis and synthesis apparatus | |
Subramaniam et al. | Cepstrum-based deconvolution for speech dereverberation | |
US8218780B2 (en) | Methods and systems for blind dereverberation | |
JP2703405B2 (en) | Polyphonic coding | |
FI96247C (en) | Procedure for converting speech | |
JP3887028B2 (en) | Signal source characterization system | |
CN115579016B (en) | Method and system for eliminating acoustic echo | |
US4845753A (en) | Pitch detecting device | |
Kawamura et al. | A noise reduction method based on linear prediction analysis | |
Suzuki | Speech processing by splicing of autocorrelation function | |
JP2002064617A (en) | Echo suppression method and echo suppression equipment | |
JP2002062900A (en) | Sound collecting device and signal receiving device | |
JP3035939B2 (en) | Voice analysis and synthesis device | |
Muron et al. | Modelling of reverberations and audioconference rooms | |
Dufera et al. | Reverberated speech enhancement using neural networks | |
Schlang | An auditory based approach for echo compensation with modulation filtering. | |
JP3285178B2 (en) | Sound signal rising detection method | |
Hassan et al. | A Comparative Study between Pitch Detection Techniques on Reverberant Speech Signals | |
Wang et al. | An implementation of multi-microphone dereverberation approach as a preprocessor to the word recognition system |