We are primarily concerned with measuring the ability of an attacker
to guess the password of a user. Given accurate values for
for each , a
measure that indicates this ability is the ``guessing
entropy'' [18] of passwords. Informally, guessing entropy
measures the expected number of guesses an attacker with perfect
knowledge of the probability distribution on passwords would need in
order to guess a password chosen from that distribution. If we
enumerate passwords , , in
non-increasing order of
, then
the guessing entropy is simply
The direct use of (6) to compute guessing entropy using the probabilities in (5) is problematic for two reasons. First, an attacker guessing passwords will be offered additional information when performing a guess, such as the set of available categories from which the next image can be chosen. For example, in Face, each image choice is taken from nine images that represent nine categories of images, chosen uniformly at random from the twelve categories. This additional information constrains the set of possible passwords, and the attacker would have this information when performing a guess in many scenarios. Second, we have found that the absolute probabilities yielded by (5) can be somewhat sensitive to the choice of , which introduces uncertainty into calculations that utilize these probabilities numerically.
To account for the second of these issues, we use the probabilities computed with (5) only to determine an enumeration of passwords in non-increasing order of probability (as computed with (5)). This enumeration is far less sensitive to variations in than the numeric probabilities are, and so we believe this to be a more robust use of (5). We use this sequence to conduct tests with our dataset in which we randomly select a small set of ``test'' passwords from our dataset (20% of the dataset), and use the remainder of the data to compute the enumeration .
We then guess passwords in order of until each test password is guessed. To account for the first issue identified above, namely the set of available categories during password selection, we first filter from the passwords that would have been invalid given the available categories when the test password was chosen, and obviously do not guess them. By repeating this test with non-overlapping test sets of passwords, we obtain a number of guesses per test password. We use to denote the average over all test passwords, and to denote the median over all test passwords. Finally, we use for to denote the number of guesses sufficient to guess percent of the test passwords. For example, if of the test passwords could be guessed in or fewer guesses, then .
We emphasize that by computing our measures in this fashion, they are intrinsically conservative given our dataset. That is, an attacker who was given 80% of our dataset and challenged to guess the remaining 20% would do at least as well as our measures suggest.