To evaluate our technique, each of these phrases (in Part II) was used as a passphrase in our scheme: five recordings were used to generate the speaker's distinguishing features for a given phrase, and two recordings were used to simulate the speaker attempting to regenerate her key. Specifics about the chosen passphrases can be found in Table 1.
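For concreteness, the following is a minimal sketch of how distinguishing features, and the match counts discussed below, could be computed from such recordings. It assumes each utterance has already been reduced to a fixed-length feature descriptor and that each feature is binarized against a per-feature reference threshold; the descriptors, thresholds, and margin `k` are placeholders, not the exact parameters of our scheme.

```python
import numpy as np

def distinguishing_features(train, thresholds, k=1.0):
    """Derive a user's distinguishing features from training utterances.

    train: (n_utterances, n_features) array, one feature descriptor per
    training recording of the passphrase (five recordings in our setup).
    thresholds: per-feature reference values (e.g., population medians).
    A feature is treated as distinguishing when the training mean stays
    at least k standard deviations to one side of its threshold, i.e.,
    the corresponding bit is reliably 0 or reliably 1 for this speaker.
    """
    train = np.asarray(train, dtype=float)
    mu = train.mean(axis=0)
    sigma = train.std(axis=0, ddof=1)
    low = mu + k * sigma < thresholds    # reliably below threshold: bit 0
    high = mu - k * sigma > thresholds   # reliably above threshold: bit 1
    idx = np.flatnonzero(low | high)     # indices of distinguishing features
    bits = high.astype(int)              # bit value per feature
    return idx, bits

def matches(test, thresholds, idx, bits):
    """Count the distinguishing features a test descriptor agrees with."""
    test_bits = (np.asarray(test, dtype=float) > thresholds).astype(int)
    return int((test_bits[idx] == bits[idx]).sum())
```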
The right side of Figure 4 shows the resulting ``distinguishing features'' and ``true speakers'' curves. These curves are analogous to the curves with the same labels in the left side of Figure 4: the first characterizes the number of distinguishing features for this user based on the training utterances, and the second gives the average number of features on which a feature descriptor generated from a test utterance matched the distinguishing features. Looking at these curves, we again see that a reasonable false negative rate should be achievable (and is plausible by Section 4.3). Moreover, according to this data set, the number of distinguishing features is approaching a better range for security. On the other hand, the higher quality of these recordings may paint a more optimistic picture than would be realized in practice.

Part I of the Speaker-J data set is very rich; moreover, the research effort that generated this data set carefully annotated the speech, identifying the beginning, end, and midpoint of each phoneme it contains. Informally, phonemes are the basic units in the sound system of a language; English has roughly 40 phonemes. With these annotations, diphones can be extracted from the recorded speech. A diphone is a portion of speech beginning in the middle of one phoneme and ending in the middle of the next; a diphone is thus an example of how the user's speech transitions from one phoneme to another, and diphones reveal much about the voice patterns of the user who uttered them. Part I of the Speaker-J data set therefore provides the opportunity to attempt a different form of attack against our system, namely one that simulates an attacker who has a corpus of recordings of the user saying many things other than the passphrase itself. A question we attempt to answer is whether these recordings significantly assist the attacker in finding the user's key.
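To make the diphone segmentation concrete, here is a minimal sketch of how diphones might be cut from one recording given midpoint annotations of the kind Part I provides; the `Phoneme` record and sampling-rate handling are illustrative assumptions, not the data set's actual file format.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    label: str    # phoneme symbol, e.g. "ae"
    start: float  # annotated start time, seconds
    mid: float    # annotated midpoint, seconds
    end: float    # annotated end time, seconds

def extract_diphones(phonemes, samples, rate):
    """Cut diphones from one annotated recording.

    Each diphone spans from the midpoint of one phoneme to the midpoint
    of the next, capturing the transition between the two sounds.
    Returns a dict mapping (left, right) labels to waveform snippets.
    """
    diphones = {}
    for a, b in zip(phonemes, phonemes[1:]):
        lo, hi = int(a.mid * rate), int(b.mid * rate)
        diphones.setdefault((a.label, b.label), []).append(samples[lo:hi])
    return diphones
```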
Specifically, consider an attack in which the attacker wishes to test a candidate passphrase (in our case, selected from Part II) but does not know how the user speaks it. The attacker uses a text analysis module from a text-to-speech (TTS) system [28] to translate the text of the passphrase into a string of phonemes that realize the passphrase (i.e., a pronunciation for the text), along with other information that is typically used when synthesizing speech (e.g., the duration and the pitch contour for each phoneme). Of course, none of these predictions is guaranteed to match what the user actually says when she speaks the passphrase. For example, a given word can be pronounced in a number of different ways, so even given the correct passphrase as input, there is no guarantee that the text analysis module will yield a string of phonemes matching the way the user speaks the passphrase. Moreover, the duration and pitch predictions made by the text analysis might differ significantly from the real user's speech.
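The sketch below illustrates the shape of the text analysis output this attack relies on, using a toy pronunciation lexicon and constant prosody targets; a real TTS front end predicts durations and pitch from context and its lexicon may list several pronunciations per word, so every value here is a stand-in.

```python
# Toy stand-in for a TTS front end's text analysis step: map the
# passphrase text to a phoneme string plus crude prosody targets.
LEXICON = {  # hypothetical pronunciation lexicon (assumption)
    "open": ["ow", "p", "ah", "n"],
    "sesame": ["s", "eh", "s", "ah", "m", "iy"],
}

def analyze(text, dur=0.09, pitch=120.0):
    targets = []
    for word in text.lower().split():
        for ph in LEXICON[word]:  # one pronunciation; the user may use another
            targets.append({"phoneme": ph,
                            "duration": dur,    # predicted length, seconds
                            "pitch": pitch})    # predicted F0, Hz
    return targets

# analyze("open sesame") -> ten phoneme targets realizing the phrase
```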
Nevertheless, suppose the attacker possesses a corpus of recordings of the user speaking various phrases other than the passphrase (in our experiment, Part I of the Speaker-J data), annotated to identify phonemes and diphones. The attacker can then attempt to construct how the user would say the passphrase, using techniques derived from a concatenative text-to-speech synthesis system (e.g., [12]), in one of several ways; a naive version is sketched below.
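As a simple illustration of such construction, this sketch abuts corpus diphones along the phoneme string produced by the text analysis. A real unit-selection synthesizer would choose among candidate units to minimize target and join costs and would smooth the joins, so this is a deliberately simplified sketch, not our attack implementation.

```python
import numpy as np

def synthesize(phoneme_seq, diphones):
    """Approximate the user's passphrase by concatenating corpus diphones.

    phoneme_seq: phoneme labels from the text analysis step.
    diphones: (left, right) -> list of waveform snippets, as built
    from the attacker's annotated corpus of the user's speech.
    """
    pieces = []
    for pair in zip(phoneme_seq, phoneme_seq[1:]):
        candidates = diphones.get(pair, [])
        if not candidates:
            raise KeyError(f"corpus lacks diphone {pair}")
        pieces.append(candidates[0])  # naive choice: no unit-selection cost
    return np.concatenate(pieces) if pieces else np.zeros(0)
```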
We experimented with four types of cut-and-paste attacks and two types of TTS attacks. The results of these tests are shown in the right side of Figure 4. The curves labeled ``TTS imposter'' and ``Cut-and-paste imposter'' capture the best attacks of each type that we discovered. As the curves demonstrate, the two attack types performed similarly and outperformed random guessing in some cases. However, it appears that the attacks, as we conducted them, would fall short of breaking our scheme.
Though Part I of the Speaker-J data set comprises a large number of sentences, an attacker would not need to assemble a corpus of user recordings of this extent to attack a typical passphrase. Table 1 approximates the average number of sentences, and their cumulative duration, that the attacker would need to record to obtain the diphones in each of the five passphrases we examined. These numbers were obtained by randomly selecting sentences from Part I of the Speaker-J data set until the needed diphones were obtained; a sketch of this sampling procedure appears below.
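The following sketch shows the sampling procedure, assuming each corpus sentence has been reduced to the set of diphone labels it contains plus its duration; this representation and the trial count are illustrative, not the exact bookkeeping behind Table 1.

```python
import random

def sentences_needed(passphrase_diphones, corpus, trials=1000, seed=0):
    """Estimate how many randomly chosen sentences cover a passphrase.

    corpus: list of (diphone_label_set, duration_seconds) per sentence.
    Returns the average sentence count and average cumulative duration
    needed until every diphone of the passphrase has been obtained.
    """
    rng = random.Random(seed)
    counts, durations = [], []
    for _ in range(trials):
        order = rng.sample(range(len(corpus)), len(corpus))  # random permutation
        needed = set(passphrase_diphones)
        n = dur = 0
        for i in order:
            if not needed:
                break
            dset, d = corpus[i]
            needed -= dset
            n += 1
            dur += d
        counts.append(n)
        durations.append(dur)
    return sum(counts) / trials, sum(durations) / trials
```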
As speech synthesis technology improves, the size of the corpus of user recordings required to significantly narrow the search for the user's key will only decrease. However, TTS and cut-and-paste attacks of the types we performed require an annotated corpus, and achieving this annotation is a labor-intensive process that is typically conducted by speech experts: manually segmenting one minute of speech takes about one hour, so annotating the Speaker-J data set is estimated to have consumed a substantial number of expert-hours. This is already a significant barrier to an attacker wishing to utilize these avenues of attack. Though automatic labelers are available (e.g., [30]), their performance is poor, and we expect their use would substantially increase the error rates of the attacks outlined here. We do expect, however, that the success of such attacks will increase, even on our own data sets, as we explore in more detail ways to improve their effectiveness. In the full version of this paper we will provide a more detailed analysis of these threats on a per-passphrase basis. We hope that this analysis will be useful for designing effective countermeasures.