Friday, November 13, 2009

From Item Sets to Association Rules Using Chi-Square

In Data Analysis Using SQL and Excel, I introduce the chi-square metric for evaluating association rules. This posting extends that discussion, to explain how to use the chi-square metric for generating rules.

An association rule is of the form: left-hand-side --> right-hand-side (or, alternatively, LHS --> RHS). The left-hand side consists of one or more items, and the right-hand side consists of a single item. A typical example of an association rule is "graham crackers plus chocolate bars implies marshmallows", which many readers will recognize as the recipe for a childhood delight called s'mores.

Association rules are not only useful for retail analysis. They are also useful for web analysis, where we are trying to track which parts of a web site people visit. I have also seen them used in financial services and direct marketing.

The key to understanding how the chi-square metric fits in is to put the data into a contingency table. For this discussion, let's consider that we have a rule of the form LHS --> RHS, where each side consists of one item. In the following table, the numbers A, B, C, D represent counts:



                RHS present    RHS absent
LHS present          A              B
LHS absent           C              D

A is the count of records where the LHS and RHS items are both present. B has the LHS item but not the RHS item, and so on. Different rules have different contingency tables. We choose the best one using the chi-square metric (described in Chapter 3 of the above book), which tells us how unusual these counts are. In other words, the chi-square metric measures how unlikely it is that the observed counts are due to a random split of the data.
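For a 2x2 table like this one, the chi-square value has a convenient closed form (it is equivalent to summing (observed - expected)^2 / expected over the four cells):

    chi-square = N * (A*D - B*C)^2 / ((A+B) * (C+D) * (A+C) * (B+D)),  where N = A+B+C+D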

Once we get a contingency table, though, we still do not have a rule. A contingency table really has four different rules:
  • LHS --> RHS
  • LHS --> not RHS
  • not LHS --> RHS
  • not LHS --> not RHS
(Or, another way of saying this is that these rules all generate the same contingency table.) How can we choose which rule is the best one?

In this case, we'll choose the rule based on how much better it does than just guessing. This is called the lift or improvement of a rule. So, the rule LHS --> RHS is correct for A/(A+B) of the records: the LHS is true for A+B records, and for A of these, the rule is true.

Overall, simply guessing that RHS is true would be correct for (A+C)/(A+B+C+D) of the records. The ratio of these is the lift for the rule. A lift greater than 1 indicates that the rule does better than guessing; a lift less than 1 indicates that guessing is better.

The following are the lifts for the four possible rules in the table:
  • LHS --> RHS: (A/(A+B)) / ((A+C)/(A+B+C+D))
  • LHS --> not RHS: (B/(A+B)) / ((B+D)/(A+B+C+D))
  • not LHS --> RHS: (C/(C+D)) / ((A+C)/(A+B+C+D))
  • not LHS --> not RHS: (D/(C+D)) / ((B+D)/(A+B+C+D))
When choosing among these, choose the one with the highest lift.

The process for choosing rules, then, is first to choose item sets based on the highest chi-square value, and then to choose the rule within each item set based on the best lift.
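To make the process concrete, here is a sketch in SQL in the spirit of Data Analysis Using SQL and Excel. The table name orderitem(orderid, item) is an illustrative assumption, and only single-item LHS and RHS are considered. For every pair of items, the query computes the four counts A, B, C, D, the chi-square value (using the closed form above), and the lift of LHS --> RHS:

WITH totals AS (
      SELECT COUNT(DISTINCT orderid) as n FROM orderitem
),
item_counts AS (
      SELECT item, COUNT(DISTINCT orderid) as cnt
      FROM orderitem
      GROUP BY item
),
cells AS (
      SELECT l.item as lhs, r.item as rhs, t.n,
             COUNT(DISTINCT l.orderid) as a,                         -- both items present
             lc.cnt - COUNT(DISTINCT l.orderid) as b,                -- LHS only
             rc.cnt - COUNT(DISTINCT l.orderid) as c,                -- RHS only
             t.n - lc.cnt - rc.cnt + COUNT(DISTINCT l.orderid) as d  -- neither item
      FROM orderitem l JOIN
           orderitem r
           ON l.orderid = r.orderid AND l.item <> r.item JOIN
           item_counts lc ON lc.item = l.item JOIN
           item_counts rc ON rc.item = r.item CROSS JOIN
           totals t
      GROUP BY l.item, r.item, lc.cnt, rc.cnt, t.n
)
SELECT lhs, rhs, a, b, c, d,
       (1.0 * n * (1.0*a*d - 1.0*b*c) * (1.0*a*d - 1.0*b*c)) /
       (1.0 * (a + b) * (c + d) * (a + c) * (b + d)) as chi_square,   -- 2x2 chi-square
       (1.0 * a / (a + b)) / (1.0 * (a + c) / n) as lift              -- lift of lhs --> rhs
FROM cells

Item sets would then be ranked by chi_square, and the direction of the rule chosen by lift, as described above.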

This works well for rules with a single item on each side. What do we do for more complicated rules, particularly ones with more items on the left-hand side? One method would be to extend the chi-square test to multiple dimensions. I am not a fan of the multidimensional chi-square test, as I've explained in another blog post.

In this case, we just consider rules with a single item on the RHS. So, if an item set has four items, a, b, c, and d, then we would consider only the rules:
  • a+b+c --> d
  • a+b+d --> c
  • a+c+d --> b
  • b+c+d --> a
We are ignoring possibilities such as a+b-->c+d.

Each of these rules can now be evaluated using the chi-square metric, and then the best rule chosen using the lift of the rule.

Friday, November 6, 2009

Oversampling in General

Dear Data Miners,

I am trying to find out statistical reasons for balancing data sets when building models with binary targets, and nobody is able to intelligently describe why it is being done. In fact, there are mixed opinions on sampling when the response rate is low.

Based on literature and data mining professional opinions, here are a few versions (assume that the response rate is 1%):

1) As long as the number of responders is approximately equal to or greater than 10 times the number of variables included, no additional sampling is needed.

2) Oversample or undersample (based on the total number of observations) at least until the response rate = 10%.

3) Oversample or undersample (based on the total number of observations) until the response rate = 50%.

4) Undersampling is useful only for cutting down on processing time; really no good reason to do it statistically as long as the number of observations for responders is "sufficient" (% does not matter).


Having an advanced degree in mathematics but not being a statistician, I would like to understand whether there really is any statistical benefit in doing that.

I appreciate your time answering this.

Sincerely,

Your fellow data miner

Many years ago, I was doing a churn model for SK Telecom (in South Korea) using SAS Enterprise Miner. A friend of mine at SAS, Anne Milley, had suggested that having a 50% density for a binary response model would produce optimal models. Her reasoning was that with a 50% density of each target value, the contrast between the two values would be maximized, making it easier to pick out patterns in the data.

I spent some time testing decision trees with all sorts of different densities. To my surprise, the decision trees with more than 30% density performed better than trees with lower densities, regardless of the splitting criterion and other factors. This convinced me that 50% is not a bad idea.

There is a reason why decision trees perform better on balanced samples. The standard pruning algorithm for decision trees uses classification as the metric for choosing subtrees. That is, a leaf chooses its dominant class -- the one in excess of 50% for two classes. This works best when the classes are evenly distributed in the data. (Why data mining software implementing trees doesn't take the original density into account is beyond me.)

In addition, the splitting criteria may be more sensitive to deviations around 50% than around other values.

Standard statistical techniques are insensitive to the original density of the data. So, a logistic regression run on oversampled data should produce essentially the same model as on the original data. It turns out that the confidence intervals on the coefficients do vary, but the model remains basically the same.
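A standard result for this kind of case-control (oversampled) data, stated here from memory as a sketch of the reasoning rather than a citation: if responders are sampled at rate r1 and non-responders at rate r0, only the intercept of the logistic regression is affected, and it can be corrected back to the population scale by

    intercept_population = intercept_oversampled - ln(r1 / r0)

All the other coefficients are, in expectation, unchanged, which is the precise sense in which the model remains basically the same.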

Hmmm, as I think about it, I wonder if the oversampling rate would affect stepwise or forward selection of variables. I could imagine that, when testing each variable, the variance in results using a rare target would be larger than the variance using a balanced model set. This, in turn, might lead to a poorer choice of variables. But I don't know if this is the case.

For neural networks, the situation is more complicated. Oversampling does not necessarily improve the neural network -- there is no theoretical reason why. However, it does allow the network to run on a smaller set of data, which makes convergence faster. This, in turn, allows the modeler to experiment with different models. Faster convergence is a benefit in other ways.

Some other techniques such as k-means clustering and nearest neighbor approaches probably do benefit from oversampling. However, I have not investigated these situations in detail.

Because I am quite fond of decision trees, I prefer a simple rule, such as "oversample to 50%", since this works under the maximum number of circumstances.

In response to your specific questions, I don't think that 10% is a sufficient density. If you are going to oversample, you might as well go to 50% -- there is at least an elegant reason why (the contrast idea between the two response values). If you don't have enough data, then use weights instead of oversampling to get the same effect.
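As a concrete illustration, here is a sketch of building a 50% model set in SQL. The table customers and its 0/1 responder column are illustrative names, and NEWID() is SQL Server's trick for a random ordering (use your database's random function elsewhere). The query keeps all responders plus an equally sized random sample of non-responders:

SELECT c.*
FROM (SELECT c.*,
             ROW_NUMBER() OVER (PARTITION BY responder
                                ORDER BY NEWID()) as seqnum
      FROM customers c
     ) c
WHERE responder = 1 OR
      seqnum <= (SELECT COUNT(*) FROM customers WHERE responder = 1)

The weighting alternative amounts to keeping all the rows and giving every responder a weight of (number of non-responders)/(number of responders), which makes the weighted density 50% without discarding any data.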

In the end, though, if you have the data and you have the software, try out different oversampling rates and see what produces the best models!

Wednesday, November 4, 2009

Scoring Association Rules

At M2009 (SAS's data mining conference), I was approached with the question of scoring association rules for customers. This is not a topic I have thought about very much. More typically, association rules are used qualitatively or to understand products. I hadn't thought about assigning the "best" rule (or rules) back to customers.

As a reminder, association rules provide information about items that are purchased at the same time. For example, we might find that marshmallows and chocolate bars imply graham crackers. The "marshmallows" and "chocolate bars" are the left-hand side of the rule (LHS), and the graham crackers are the right-hand side (RHS). The presumption is that when marshmallows and chocolate bars are in a shopper's basket but graham crackers are missing, the graham crackers should probably be there.

Most data mining software, such as SAS Enterprise Miner, SQL Server Data Mining, and SPSS Clementine, can be used to generate association rules. I prefer to calculate the rules myself using database technology, using code similar to that in Data Analysis Using SQL and Excel.

However, data mining tools do not provide the ability to score association rules for individual customers, nor is this a topic that I discuss in my book. My goal here is to discuss scoring rules in databases. Scoring is computationally expensive, and because databases can take advantage of indexing and parallel processing, they offer scope to make the scoring more efficient.

Hmmm, what does scoring association rules even mean? Scoring is the process of finding the best rule that a customer matches, either for a single RHS or for all possible RHSs. In the former case, the result is one rule. In the latter, it is an array of rules, for each possible RHS.

An association rule is traditionally defined by three metrics: support, confidence, and lift (as well as a fourth, the chi-square metric, which I prefer). For the purposes of this discussion, the best rule is the one with the highest confidence.
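For reference, with N baskets these metrics are usually defined as:

    support(LHS --> RHS)    = count(LHS and RHS) / N
    confidence(LHS --> RHS) = count(LHS and RHS) / count(LHS)
    lift(LHS --> RHS)       = confidence(LHS --> RHS) / (count(RHS) / N)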

The simplistic way of doing such scoring is to consider each rule for each customer, to determine which rules apply to that customer, and then, from the set that do apply, to do some work to find the best one.

Imagine that we have a table, rules, with the following columns:
  • ruleid, identifying the rule;
  • numlhs, the number of LHS items (we assume there is one RHS item);
  • rhs, the RHS item;
  • lhs, the LHS items as a string: "item1;item2;..."; and,
  • confidence, the confidence of the rule.
There is another table, custitem, containing each customer and each item as a separate row.
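For concreteness, the two tables might be created as follows; the column types are illustrative assumptions, not a prescription:

CREATE TABLE rules (
    ruleid     INT PRIMARY KEY,
    numlhs     INT,             -- number of items on the left-hand side
    rhs        VARCHAR(50),     -- the single right-hand-side item
    lhs        VARCHAR(255),    -- LHS items as a string: 'item1;item2;...'
    confidence FLOAT            -- used to choose the best matching rule
);

CREATE TABLE custitem (
    customerid INT,
    item       VARCHAR(50)      -- one row per customer per item
);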

The following query finds all matching rules for each customer in the innermost subquery, by counting the number of LHS items matched. The middle query then computes, for each customer and RHS, the maximum confidence over the matching rules using a SQL window function, and the outer query keeps the rule that achieves it.

SELECT cr.*
FROM (SELECT matchrules.customerid, matchrules.rhs, matchrules.ruleid,
             r.confidence,
             MAX(r.confidence) OVER (PARTITION BY matchrules.customerid,
                                                  matchrules.rhs
                                    ) as maxconfidence
      FROM (SELECT ci.customerid, r.rhs, r.ruleid,
                   COUNT(*) as nummatches
            FROM custitem ci CROSS JOIN
                 rules r
            -- CHARINDEX() and + are SQL Server syntax; use POSITION() or
            -- INSTR() and || in other databases
            WHERE CHARINDEX(ci.item + ';', r.lhs) > 0
            GROUP BY ci.customerid, r.rhs, r.ruleid
            -- keep only customers that match every LHS item on the rule
            HAVING COUNT(*) = MAX(r.numlhs)
           ) matchrules JOIN
           rules r
           ON matchrules.ruleid = r.ruleid
     ) cr
WHERE cr.confidence = cr.maxconfidence

This query is expensive, as you might guess from the use of CROSS JOIN. And its running time grows particularly quickly as the number of rules gets larger (and presumably the number of customers is larger still).

It is possible to make it more efficient with a few tricks, such as:
  • If there are only a few distinct items, then the LHS can be encoded using bits, which eliminates the need for string matching.
  • The rules can be pruned, so only the rules with the highest confidence are kept.
And, although this cannot be done in SQL, the rules could be ordered by confidence (for each RHS) from highest to lowest. The first match would then stop the search.

An alternative method requires storing the rules in two tables. The first is rules, containing descriptive information about each rule, such as:
  • ruleid;
  • rhs;
  • numlhs; and,
  • confidence.
The second is ruleitem, which contains each item in the rules. Incidentally, this is more in keeping with the spirit of normalization in relational databases.
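As before, a concrete (and purely illustrative) version of the two tables might look like this:

CREATE TABLE rules (
    ruleid     INT PRIMARY KEY,
    rhs        VARCHAR(50),
    numlhs     INT,
    confidence FLOAT
);

CREATE TABLE ruleitem (
    ruleid INT,                 -- references rules.ruleid
    item   VARCHAR(50)          -- one left-hand-side item on the rule
);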

The subquery for the scoring now changes to a join. This is useful, because it means that we can use database mechanisms -- such as indexing and table partitioning -- to speed it up.

SELECT cr.*
FROM (SELECT matchrules.customerid, matchrules.rhs, matchrules.ruleid,
             r.confidence,
             MAX(r.confidence) OVER (PARTITION BY matchrules.customerid,
                                                  matchrules.rhs
                                    ) as maxconfidence
      FROM (SELECT ci.customerid, r.rhs, r.ruleid,
                   COUNT(*) as nummatches
            FROM custitem ci JOIN
                 ruleitem ri
                 ON ci.item = ri.item JOIN
                 rules r
                 ON ri.ruleid = r.ruleid
            GROUP BY ci.customerid, r.rhs, r.ruleid
            -- keep only customers that match every LHS item on the rule
            HAVING COUNT(*) = MAX(r.numlhs)
           ) matchrules JOIN
           rules r
           ON matchrules.ruleid = r.ruleid
     ) cr
WHERE cr.confidence = cr.maxconfidence

Of course, such an approach makes a difference only when you need to score many customers against many rules. The same approach can be used for looking at a single product on the RHS or at several at one time, although that would require summarizing the multiple products at the customer level in order to append the desired information to the customer record.

Friday, October 23, 2009

Counting Users From Unique Cookies

Counting people/unique visitors/users at web sites is a challenge, and is something that I've been working on for the past couple of months for the web site of a large media company. The goal is to count the number of distinct users over the course of a month. Counting distinct cookies is easy; the challenge is turning those counts into counts of human beings. The complications include:

  • Cookie deletions. A user may manually delete their cookies one or more times during the month.

  • Disallowing first party cookies. A user may allow session cookies (while the browser is running), but not allow the cookies to be committed to disk.

  • Multiple browsers. A single user may use multiple browsers on the same machine during the month. This is particularly true when the user upgrades his or her browser.

  • Multiple machines. A single user may use multiple machines during the month.

And, I have to admit, the data that I'm using has one more problem, which is probably not widespread: the cookies are actually hashed into four bytes. This means that it is theoretically possible for two "real" cookies to have the same hash value. Not only is it theoretically possible, but it happens (although not too frequently).

I came across a very good blog post by Angie Brown that lays out the assumptions in making the calculation, including a spreadsheet for varying the assumptions. One particularly interesting factoid from the post is that the number of cookies that appear only once during the month exceeds the number of unique visitors, even under quite reasonable assumptions. Where I am working, one camp believes that the number of unique visitors is approximated by the number of unique cookies.


A white paper by ComCast states that the average user has 2.5 unique cookies per month due to cookie deletion. The paper is here, and a PR note about it is here. This paper is widely cited, although it has some serious methodological problems because its data sources are limited to DoubleClick and Yahoo!.


In particular, Yahoo! is quite clear about its cookie expiration policies (two weeks for users clicking the "keep me logged in for 2 weeks" box and eight hours for Yahoo! mail). I do not believe that this policy has changed significantly in the last few years, although I am not 100% sure.


The white paper from ComCast does not mention these policies, which imply that most of the cookies a user accumulates are due to automatic expiration, not user behavior. How many distinct cookies does a user have due only to the user's behavior?

If I make the following assumptions:

  • The Yahoo! users have an average of 2.5 cookies per month.

  • ComCast used the main Yahoo! cookies, and not the Yahoo! mail cookies.

  • All Yahoo! users use the site consistently throughout the month.

  • All Yahoo! users have the "keep me logged in for 2 weeks" box checked.
Then I can estimate the number of cookies per user per machine per month. The average user would have 31/14 ≈ 2.2 cookies per month strictly due to automatic expiration. This leaves about 0.3 cookies per month due to manual deletion. Of course, the user starts with one cookie. So the average number of cookies per month per user per machine, due to user behavior alone, is about 1.3.
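Putting the arithmetic in one place:

    cookies from automatic expiration   = 31 days / 14 days per cookie ≈ 2.2
    cookies from manual deletion        ≈ 2.5 - 2.2 = 0.3
    cookies due to user behavior alone  = 1 (initial cookie) + 0.3 ≈ 1.3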

By the way, I find this number much more reasonable. I also think that it misses the larger source of overcounting: users who use more than one machine. Unfortunately, there is no single approach for handling this. In the case that I'm working on, we have the advantage that a minority of users are registered, so we can use them as a sample.




PAW conference, privacy issues, déjà vu

I attended the Predictive Analytics World conference this week and I thought it was very successful. Although the conference was fairly small, I heard several interesting presentations and ran into several interesting attendees. In other words, I think the density of interesting people on both sides of the podium was higher than at some larger conferences.

One of the high points for me was a panel discussion on consumer privacy issues. Normally, I find panel discussions a waste of time, but in this case the panel members had clearly given a lot of thought to the issues and had some interesting things to say. The panel consisted of Stephen Baker, a long-time Business Week writer and author of The Numerati (a book I haven't read, but which, I gather, suggests that people like me are using our data mining prowess to rule the world); Jules Polonetsky, currently of the Future of Privacy Forum, and previously Chief Privacy Officer and SVP for Consumer Advocacy at AOL, Chief Privacy Officer and Special Counsel at DoubleClick, and New York City Consumer Affairs Commissioner in the Giuliani administration; and Mikael Hagström, Executive Vice President, EMEA and Asia Pacific for SAS.

I was particularly taken by Jules's idea that companies that use personal information to provide services that would not otherwise be possible should agree on a universal symbol for "smart," kind of like the easily recognizable symbol for recycling. Instead of (well, I guess it would have to be in addition to) a privacy policy that no one reads and that is all about how little they know about you and how little use they will make of it, the smart symbol on a web site would be a brag about how well the service provider can leverage your profile to improve your experience. Clicking on it would lead you to the details of what they now know about you, how they plan to use it, and what's in it for you. You would also be offered an opportunity to fill in more blanks and make corrections. Of course, every "smart" site would also have a "dumb" version for users who don't choose to opt in.

This morning, as I was telling Gordon about all this in a phone call, we started discussing some of our own feelings about privacy issues, many of which revolve around the power relationship between us as individuals and the organization wishing to make use of information about us. If the supermarket wants to use my loyalty card data to print coupons for me, I really don't mind. If an insurance company wants to use that same loyalty card data to deny me insurance because I buy too much meat and alcohol, I mind a lot. As I gave that example, I had an overwhelming feeling of déjà vu. Or perhaps it was déjà lu? In fact, it was déjà écrit! I had posted a blog entry on this topic ten years ago, almost to the day. Only there weren't any blogs back then, so attention-seeking consultants wrote columns in magazines instead. This one, which appeared in the October 26, 1999 issue of Intelligent Enterprise, said what I was planning to write today pretty well.

Saturday, October 17, 2009

See you at a conference soon

It appears to be conference season. I'll be taking part in several over the next several weeks and would enjoy meeting any readers who will also be in attendance. These are all business-oriented conferences with a practical rather than academic focus.

First up is Predictive Analytics World in Alexandria, VA October 20-21. I will be speaking on our experience using survival analysis to forecast future subscriber levels for a variety of subscription-based businesses. This will be my first time attending this conference, but the previous iteration in San Francisco got rave reviews.

Next on my calendar is M2009 in Las Vegas, NV October 26-27. This conference is sponsored by SAS, but it is a general data mining conference, not a SAS users group or anything like that. I've been attending since the first one in 1998 (I think) and have watched it grow into the best-attended of the annual data mining conferences. Gordon Linoff and I will be debuting the latest version of our three-day class Data Mining Techniques Theory and Practice as part of the post-conference training October 28-30.  Too bad about the location, but I guess conference space is cheap in Las Vegas.

I only get to spend a partial weekend at home before leaving for TDWI World in Orlando, FL. This is not, strictly speaking, a data mining conference (the DW stands for "data warehouse"), but once people have all that data warehoused, they may want to use it for predictive modeling. I will be teaching a one-day class called Predictive Modeling for Non-Statisticians on Tuesday, November 3. That is election day where I live, so I had better get my absentee ballot soon.

The following week, it's off to San Jose, CA for the Business Analytics Summit November 12-13.  I believe this is the first annual running of this conference. I'll be on a panel discussing data mining and business analytics. Perhaps I will discover what the phrase  "business analytics" means! That would come in handy since I will be teaching a class on Marketing Analytics at Boston College's Carroll School of Management next semester. My guess: it means the same as data mining, but some people don't like that word so they've come up with a new one.

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.
There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is--Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)
SVM tries to find a hyperplane that separates your classes. When, (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with), there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occuring pattern of features will form a leaf and, taking the frquencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.