File systems address these problems with features such as desktop search and shortcuts. Browsers offer similar features such as history search, bookmarks, saved passwords, and autocompletion For example, a major feature in Mozilla Firefox 3 is its ``smart location bar" [5], a history search-based autocompletion Similarly, Google Chrome's ``New Tab" page [1] presents history and history search as its most important features.
History search and similar features rely on history metadata that the browser records as users browse the web. For example, all browsers record a history of URLs they have visited. Fundamentally, this metadata describes actions and their consequences; when the user navigates through a series of pages, enters a password, and downloads a file, the browser's history describes these events and how they are related. This metadata is provenance - it describes how the browser state came to be or, if properly stored and queried, precisely how a downloaded object came to be. To the best of our knowledge, the implications and benefits of characterizing browser history as provenance is an unresearched area.
In this paper we explore browser history as provenance. We present use cases in history search, web search, and download management that browser provenance can address. We identify common actions in modern browsers and the provenance those actions generate, useful provenance that browsers do not store, and useful provenance algorithms that browsers do not apply. Provenance helps a browser answer questions such as, ``Where did all this stuff come from?" by providing a natural way to store and query data lineage. Similarly, a browser can answer, ``Where did my stuff go?" by improving its history and web search with contextual and personal relationships extracted from its provenance store. In short, browsers can benefit a great deal by characterizing their history as provenance.
This paper is structured as follows. Section 2 presents four use cases in history search, web search, and download management that browser provenance can address. Section 3 dissects the architecture of a modern browser's history, relates it to provenance, and identifies some of the challenges of doing so. Section 4 describes our current and future work, and we conclude in section 5.
Currently: A browser with textual history search will return the web search page for rosebud, because that page contains the search term in both its title and URL. However, it will not return Citizen Kane, because it does not recognize there is a connection between rosebud and Citizen Kane.
With Provenance: Browser provenance would show that Citizen Kane descends from the search term rosebud. Therefore, a provenance-aware browser could evaluate and return Citizen Kane in its history search results. Shah et al. [13] implemented a provenance-based desktop search algorithm that is readily extensible to history search. Briefly, the algorithm performs a textual search and then reorders results by the relevance of their provenance neighbors. As a first-generation descendant of the rosebud web search page, Citizen Kane would receive substantial weight in such a search.
Currently: The browser knows that the user often visits pages containing the words ``flower", ``gardening", etc. in their title or URL. However, unless many of those descriptors contain both rosebud and flower together, the browser does not capture the user's connection between rosebud and flower.
With Provenance: As described in section 2.1, a provenance-aware browser could see not just textual, but contextual relationships between rosebud and flower pages. The browser would be much more capable of recognizing this connection, and if it did it could supplement a rosebud web search with flower as an additional search term. More generally, there are many advanced operators in modern search engines that are intended for power users [3]. A provenance-aware browser could leverage these operators automatically for regular users.
One exciting implication of this approach is that the browser could personalize search results without giving information about the user to the search engine. The search engine would only see a search for ``rosebud flower"; it would not know anything about the user's history. Conversely, existing web personalization techniques require the user to submit personal information to a third party and can only personalize services that share data with that party.
Currently: Neither history nor web search can help the user here. Although current web searches excel at delivering the user to a canonical and popular wine page, discovering the search terms to produce a specific page can be an arcane task. Furthermore, these terms can change as the search engine aggregates new content.
With Provenance: A browser could record provenance that captures the relationship between pages viewed within a similar time span. To users, these associations are relevant: a study by Blanc-Brude and Scapin [7] shows that users almost always recall events associated with documents. A history search for ``wine associated with plane tickets" is both natural to the user and likely to return the desired result. In fact, Gyllstrom and Soules [9] implemented a desktop search system on this premise.
Currently: Most browsers record downloads somewhere and can find the URL of a download; however, in many cases the URL is not informative. If the user does not recognize the URL, then she will ask, ``How did I get to this URL?" Similarly, an image file may have come from some image hosting site, but this is not useful for author attribution. Thus, the user is forced to recursively search her history and perform forensics until she finds a page that she recognizes.
With Provenance: What the user really wants is, starting from a known location, the sequence of actions that resulted in the download - that is, the lineage of the download. In a provenance-aware browser, the solution is a path query: ``Find the first ancestor of this file that the user is likely to recognize." ``Likely to recognize" can be defined in terms of history, e.g., the number of visits the user has made to the page.
Provenance path queries can also enable new functionality. For example, if the user decides a page is untrusted, she may then want to find all downloads descending from that page and check them for viruses. This might be difficult for a user doing forensics, but is a simple query: ``Find all descendants of this page that are downloads".
The fundamental objects of browser history are web pages. Every browser records visited pages and associates metadata with them, such as the frequency of visits. One type of metadata Firefox records is the HTTP referrer, the page that sent the browser to a particular page. The referrer is useful for many purposes, such as identifying and eliminating redirects from history search results.
The referrer establishes an implicit provenance relationship between the referring page and the target page. Other history metadata can establish similar relationships. For example, Firefox stores a table of ``transitions", the actions that load a particular page. Transitions are a superset of the referrer; they include actions such as the user clicking a bookmark or the relationship between top-level and embedded page content. Researchers such as Roussel et al. [12] have identified many other forms of history metadata that are useful to both users and developers of interfaces and data management systems.
Any browser's history can be represented as a graph in which pages are nodes, relationships are edges, and both nodes and edges can have attributes. This graph can be reasonably large; one author's history has accumulated more than 25,000 nodes over the past 79 days. Given the ubiquity of graph data in browsers, one might imagine that graph algorithms and queries would be similarly ubiquitous, but surprisingly this is not the case. To the best of our knowledge, there are no graph algorithms applied to the history in any modern browser.
However, this is not the only context in which the scientific community considers graphs of web pages and their relationships. Web search is often characterized as a graph problem, and many web search algorithms, such as Kleinberg's HITS [10], are graph algorithms that exploit the relationships between pages. However, browser history differs from a typical web graph in a number of important ways. First, the browser collects metadata that a web crawler cannot, such as which links are actually traversed by users, as opposed to those that simply exist. Secondly, the browser collects metadata from its user, as opposed to crawling web content. Therefore, features premised on browser history are inherently personalized. The browser can and should be better than web search at answering queries such as, ``Where is that page I visited last month?"
Finally, the metadata collected by the browser is provenance. Every relationship in the browser history corresponds to an action taken by the browser to obtain one set of data from another. When the user clicks a bookmark to obtain a page, or a top-level page loads some embedded content, the logs of these events are provenance records. By characterizing browser history as provenance, we open up a new field of solutions to browser data management problems.
We present the following taxonomy of browser provenance. This taxonomy is neither complete nor definitive; however, we have identified many interesting research questions and propose that it serve as a basis for future discussion.
Browsers often implicitly version pages by including time stamps in the metadata associated with page visits. In addition to solving the cycle problem, this facilitates time-based queries. However, the storage of a versioned graph is non-trivial and introduces interesting design decisions. For example, are time stamps a property of pages or links? Versioning nodes (pages) is a common cycle-breaking technique and is used by PASS. However, time stamping edges (links) can also break cycles by creating a traversal order among edges.
Are time stamps attributes of objects, or does each version create a new instance of an object? If they are attributes, then the object instances are semi-structured and more difficult to store. If each version creates a new instance of the object, then it is more difficult to make queries over all the objects that describe a given page or link. Firefox stores its time stamps as instances of link traversals, because in Firefox general page queries are more common than link queries. However, this can make it difficult to run link queries and by extension graph algorithms, because many records of a given link traversal may exist.
There has been a considerable amount of provenance research on efficient storage and query. For example, Chapman et. al [8] developed general factorization and inheritance-based methods that are almost certainly applicable to browser history. However, there are also many interesting properties of the history graph that may lead to unique storage methods. For example, if both pages and links are versioned as new instances, and only link relationships are considered, the result is a tree structure. There were a number of early efforts by researchers such as Ayers and Stasko [6] to develop an interface that used this propery to visualize recent history; we believe it could also be used for efficient storage and query. We are interested in history graph storage as both an enabler of more powerful history queries and a novel provenance storage problem.
Similarly, most browsers do not capture the time relationship between pages that are open simultaneously. Firefox time stamps page visits, but it does not time stamp other display-altering actions, such as page ``close". Consequently, it is impossible to determine whether two pages were open simultaneously; from the perspective of Firefox history, every page is always open. The simple addition of a corresponding close to each page visit enables queries on time relationships. These relationships can be used by contextual searches, as described in sections 3.1 through 3.3. Conceptually, time relationships are undirected; however, when necessary they can be directed with an arbitrary time-ordering rule such as, ``the first node opened in a time span points to later nodes."
Redirects and inner content are a special case; although they are link-like relationships, unlike other edges they are not generated as the result of a user action. They are relevant to many queries such as Data Lineage, but personalization algorithms may wish to exclude or otherwise ignore them. One approach such algorithms can take is to unify edges by ignoring nodes from which a redirect or inner content link occurs.
Search terms and form history are particularly useful provenance nodes. They are concise, conceptual, user-generated descriptors that are in the lineage of the page they generate and that page's descendants. When the user searches her history database, at the very least she expects it to return any page in her history that would also show up in a web search. Currently, this does not happen; but if search terms are stored as provenance, a contextual search can retrieve and use them as data descriptors. Furthermore, forms and form-generated pages are ``deep web" content that are considered difficult for traditional search engines to capture and index. But they are easy to capture from the browser; in fact, many browsers already capture form history to provide autocomplete features.
We implement ``Contextual History Search" as a graph neighborhood expansion algorithm, similar to web search algorithms such as Kleinberg's HITS [10]. ``Personalizing Web Search" performs term frequency analysis on the results of a contextual history search to find terms in user history associated with the search term. ``Time-Contextual History Search" is a query over time relationships, and ``Download Lineage" is a breadth-first search over a node's ancestors. These queries complete in less than 200ms in the majority of cases and can be bound to that time in the remaining cases. The challenges we encountered in implementing these queries and the details of our results will be published in future work.
Our initial results suggest that interesting graph algorithms on browser metadata are feasible for browsers to compute locally. However, there is still much work to be done. We must now develop more intelligent algorithms that can respond to our use case queries with high-quality results. There is a great deal of existing information retrieval research on web search on which we can build, but we also believe that our assumptions about browser provenance must be different from those of the web and that there are unique properties of browser provenance graphs we can exploit.
Another important issue for future work that we have not yet discussed is privacy. Browser history potentially contains a great deal of sensitive personal data. Techniques that aggregate this data at centralized locations and third parties can be powerful, but they must also answer difficult questions about the anonymity and privacy of their users. In our current and future work, the approach we take is to use browser provenance to increase user privacy by processing the data on the user's machine. We believe there is much more interesting research to be done with regards to provenance-based user-side personalization features.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -local_icons -no_navigation paper.tex
The translation was initiated by Daniel Margo on 2009-02-11