Suppose we have search query logs for May 2005 and June 2005, and we wish to find "add-on terms" that spike in June relative to May. An add-on term is a term used in conjunction with standard query phrases. For example, media coverage of the 2012 Olympics selection process may cause a spike in queries like "New York olympics" and "London olympics," with "olympics" being the add-on term. Similarly, the term "scientology" may suddenly co-occur with "Tom Cruise," "depression treatment," and other phrases.
The following Pig Latin program describes a dataflow for identifying June add-on terms (the details of the syntax are not important for this paper).
# load and clean May search logs
1. M = load '/logs/may05' as (user, query, time);
2. M = filter M by not isURL(query);
3. M = filter M by not isBot(user);
# determine frequent queries in May
4. M_groups = group M by query;
5. M_frequent = filter M_groups by COUNT(M) > 10^4;
# load and clean June search logs
6. J = load '/logs/june05' as (user, query, time);
7. J = filter J by not isURL(query);
8. J = filter J by not isBot(user);
# determine June add-ons to May frequent queries
9. J_sub = foreach J generate query,
       flatten(Subphrases(query)) as subphr;
10. eureka = join J_sub by subphr,
       M_frequent by query;
11. addons = foreach eureka generate
       Residual(J_sub::query, J_sub::subphr) as residual;
# count add-on occurrences, and filter by count
12. addon_groups = group addons by residual;
13. counts = foreach addon_groups generate residual,
       COUNT(addons) as count;
14. frequent_addons = filter counts by count > 10^5;
15. store frequent_addons into 'myoutput.txt';
Line 1 specifies the filename and schema of the May query log. Lines 2 and 3 filter out search queries that consist of URLs or are made by suspected "bots" (the filters are governed by the custom Boolean functions isURL and isBot, which have been manually ordered to optimize performance).
Lines 4-5 identify frequent queries in May.
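The group-then-filter pattern of Lines 4-5 can be sketched outside Pig as follows. This Python fragment is only an illustration and not part of the system described here: the helper name `frequent_queries`, the in-memory list of `(user, query, time)` tuples, and the tiny threshold in the usage lines are all assumptions made for the sketch.

```python
from collections import Counter

def frequent_queries(log, threshold):
    """Return the set of queries that occur more than `threshold` times.

    `log` is an iterable of (user, query, time) records, mirroring the
    schema declared in Line 1. Hypothetical helper, for illustration only.
    """
    # Group by query and count the records in each group.
    counts = Counter(query for _user, query, _time in log)
    # Keep only the groups whose count exceeds the threshold.
    return {query for query, n in counts.items() if n > threshold}

# Toy data; the program above applies a threshold of 10^4 to a month of logs.
may_log = [
    ("u1", "new york", 1),
    ("u2", "new york", 2),
    ("u3", "new york", 3),
    ("u1", "tom cruise", 4),
]
may_frequent = frequent_queries(may_log, threshold=2)
```

With this toy input only "new york" survives the filter, since it is the only query seen more than twice.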
Lines 6-8 are identical to Lines 1-3, but for the June log.
Lines 9-10 match sub-phrases in the June log (enumerated via a custom set-valued function Subphrases) against frequent May queries.
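The text does not define Subphrases beyond calling it set-valued, so the following Python sketch makes two assumptions: a sub-phrase is any contiguous sequence of whitespace-separated words, and the full query counts as one of its own sub-phrases. The names `subphrases` and `match_subphrases` are hypothetical. The second function mimics the flatten in Line 9 and the equality join in Line 10.

```python
def subphrases(query):
    """Return every contiguous word sequence in `query`.

    Assumed semantics for the custom function Subphrases; the original
    definition is not given in the text.
    """
    words = query.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }

def match_subphrases(june_queries, may_frequent):
    """Yield (query, subphr) pairs whose sub-phrase is a frequent May query."""
    for query in june_queries:
        # Line 9: one output record per (query, sub-phrase) combination.
        for subphr in sorted(subphrases(query)):
            # Line 10: keep the record if subphr equals a frequent May query.
            if subphr in may_frequent:
                yield query, subphr

pairs = list(match_subphrases(["new york olympics", "weather"], {"new york"}))
```

Here "new york olympics" expands to six sub-phrases, of which only "new york" matches, and "weather" produces no match at all.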
Line 11 then extracts the add-on portion of the query using a custom function Residual (e.g., "olympics" is an add-on to the frequent May query "New York").
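One plausible reading of Residual is "the query with the matched sub-phrase removed." The Python sketch below assumes exactly that, working on whitespace-separated words and removing only the first occurrence; the function name `residual` and these details are assumptions, not the definition used by the authors.

```python
def residual(query, subphr):
    """Return the words of `query` left after removing the first
    occurrence of `subphr` (assumed semantics for Residual)."""
    words, sub = query.split(), subphr.split()
    for i in range(len(words) - len(sub) + 1):
        if words[i:i + len(sub)] == sub:
            # Splice out the matched words and rejoin what remains.
            return " ".join(words[:i] + words[i + len(sub):])
    return query  # sub-phrase not present: nothing to remove

addon = residual("new york olympics", "new york")
```

Under these assumptions the example from the text yields "olympics" as the add-on, and a query that exactly equals a frequent May query yields an empty residual, which a real implementation would probably want to discard.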
Lines 12-14 count the number of occurrences of each add-on term, and filter out add-ons that did not occur frequently.
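The counting and thresholding in Lines 12-14 follow the same group-and-filter shape as Lines 4-5. A minimal Python sketch, assuming the residuals are available as a plain list of strings and using the hypothetical helper name `frequent_addons`:

```python
from collections import Counter

def frequent_addons(residuals, threshold):
    """Count each add-on term and keep those seen more than `threshold` times."""
    counts = Counter(residuals)  # Lines 12-13: group by residual, COUNT per group
    # Line 14: filter on the count.
    return {term: n for term, n in counts.items() if n > threshold}

# Toy data; the program above uses a threshold of 10^5.
residuals = ["olympics", "olympics", "olympics", "scientology", "olympics"]
result = frequent_addons(residuals, threshold=3)
```

In this toy run "olympics" appears four times and is kept, while "scientology" appears once and is dropped.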
Line 15 specifies that the output should be written to a file called myoutput.txt.
In general, Pig Latin programs express acyclic dataflows in a step-by-step fashion using variable assignments (the variables on the left-hand side denote sets of records). Each step performs one of: (1) data input or output, e.g., Lines 1, 6, 15; (2) relational-algebra-style transformations, e.g., Lines 10, 14; (3) custom processing, e.g., Lines 9, 11, governed by