Tuesday, September 15, 2009

Adjusting for Oversampling

We recently received two similar questions about oversampling . . .

If you don´t mind, I would like to ask you a Question regarding Oversampling as you wrote in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, though don´t know how to do it for the confusion matrix.

Would you mind telling me how do I compute then the confusion matrix for the actual population (not the oversampled set)?

Thanks in advance for your reply and help.

Best,
Diego


Gentlemen-

I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model (not a decision tree so running the data through a test set is not an option to recalibrate) - like a neural network. I output the probabilities and they are out of whack - good for predicting the class or ranking but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Is this new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!
Brian


The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.

Regardless of the technique (neural network, decision trees, logistic regression, neearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability are more likey to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities, and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilties for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And, even if they do, why bother learning how. I can easily do the necessary calculations in Excel.

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
  • rare outcomes: 10% --> 50%
  • common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 25% means that there is 0.25 rare outcomes for 0.75 common ones. Let's undo the mapping above:
  • 0.80 / 5 = 0.16
  • 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.

This calculation can also be used for the cross-correlation matrix (or confusion matrix). In this case, you just have to divide each cell by the appropriate overampling rate. So, if the confusion matrix said:
  • 10 rows in the model set are rare and classified as rare
  • 5 rows in the model set are rare and classified as common
  • 3 rows in the model set are common and classified as rare
  • 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
  • 2=10/5 rows in the original data are rare and classified as rare
  • 1=5/5 rows in the original data are rare and classified as common
  • 5.4 = 3/(5/9) rows inthe original data are common and classified as rare
  • 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.

Thursday, September 10, 2009

TDWI Question: Consolidating SAS and SPSS Groups

Yesterday, I had the pleasure of being on a panel for a local TDWI event here in New York focused on advanced analytics (thank you Jon Deutsch). Mark Madsen of Third Nature gave an interesting, if rapid-fire, overview of data mining technologies. Of course, I was excited to see that Mark included Data Analysis Using SQL and Excel as one of the first steps in getting started in data mining -- even before meeting me. Besides myself, the panel included my dear friend Anne Milley from SAS, Ali Pasha from Teradata, and a gentleman from Information Builders whose name I missed.

I found one of the questions from the audience to be quite interesting. The person was from the IT department of a large media corporation. He has two analysis groups, one in Los Angeles that uses SPSS and the other in New York that uses SAS. His goal, of course, is to reduce costs. He prefers to have one vendor. And, undoubtedly, the groups are looking for servers to run their software.

This is a typical IT-type question, particularly in these days of reduced budgets. I am more used to encountering such problems in the charged atmosphere of a client. The more relaxed atmosphere of a TDWI meeting perhaps gives a different perspective.

The groups are doing the same thing from the perspective of an IT director. Diving in a bit futher, the two groups do very different things -- at least from my perspective. Of course, both are using software running on computers to analyze data. The group in Los Angeles is using SPSS to analyze survey data. The group in New York is doing modeling using SAS. I should mention that I don't know anyone in the groups, and only have the cursory information provided at the TDWI conference.

Conflict Alert! Neither group wants to change and both are going to put up a big fight. SPSS has a stronghold in the market for analyzing survey data, with specialized routines and procedures to handle this data. (SAS probably has equivalent functionality, but many people who analyze survey data gravitate to SPSS.) Similarly, the SAS programmers in New York are not going to take kindly to switching to SPSS, even if offers the same functionality.

Each group has the skills and software that they need. Each group has legacy code and methods, that are likely tied to their tools. The company in question is not a 20-person start-up. It is a multinational corporation. Although the IT department might see standarizing a tool as beneficial, in actual fact, the two groups are doing different things and the costs of switching are quite high -- and might involve losing skilled people.

This issue brings up the question of what do we want to standardize on. The value of advanced analytics comes in two forms. The first is the creative process of identifying new and interesting phenomena. The second is the communication process of spreading the information where it is needed.

Although people may not think of nerds as being creative, really, we are. It is important to realize that imposing standards or limiting resources may limit creativity, and hence the quality of the results. This does not mean that cost control is unnecessary. Instead, it means that there are intangible costs that may not show up in a standard cost-benefit analysis.

On the other hand, communicating results through an organization is an area where standards are quite useful. Sometimes the results might be captured as a simple email going to the right person. Other times, the communication must go to a broader audience. Whether byy setting up an internal Wiki, updating model scores in a database, or loading a BI tool, having standards is important in this case. Many people are going to be involved, and these people should not have to learn special tools for one-off analyses -- so, if you have standardized on a BI tool, make the resources available to put in new results. And, from the perspective of the analysts, having standard methods of communicating results simplifies the process of transforming smart analyses into business value.

Monday, September 7, 2009

Principal Components: Are They A Data Mining Technique?

Principal components have been mentioned in passing several times in previous posts. However, I have not ever talked specifically about them, and their relationship to data mining in general.

What are principal components? There are two common definitions that I do not find particularly insightful. I repeat them here, mostly to illustrate the distance from important mathematical ideas and their application. The first definition is that the principal components are the eigenvectors of the covariance matrix of the variables. The eigenwhats of the what? Knowing enough German to understand that "eigen" means something like "inherent" does not really help in understanding this. An explanation of this -- with lots of mathematical symbols -- is available on Wikipedia. (And, it is not surprising that the inventor of covariance Karl Pearson also invented principal component analysis.)

The second definition (which is equivalent to the first) starts by imagining the data as points in space. Off all the possible lines in the space, the first principal component is the line that maximizes the variance of the points projected on the line (and also goes through the centroid of the data points). Points, lines, projections, centroids, variance -- that also sounds a bit academic. (By the way, for the seriously mathematically inclined, here are pointers to how these defintions are the same.)

I prefer a third, less commonly touted definition, which also assumes that data is spread out as points in space. Of all possible lines in space, the first principal component is the one that minimizes the square of the distance from each data point to the line. Hey, you may be asking, "isn't this the same as the ordinary least squares regression line?" The reason why I like this approach is because it compares principal components to something that almost everyone is familiar with -- the best-fit line. And that provides an opportunity to compare and contrast and learn.

The first difference between the two is both subtle and important. The best-fit line only looks at the distance from each data point to the line along one dimension; that is, the line minimizes the sum of the squares of the differences along the target dimension ("y"). The first principal component is looking at the sum of the squares of the overall distance. The "distance" in this case is the length of the shortest vector that connects each point to the line. In general, the best fit line and the first principal component, are not the same (and I'm curious if the angle between them might be useful). A little known factoid about best fit lines is worth dropping in here. Given a set of data points (x, y), the best fit line that fits y = f(x) is different from the best fit line that fits x = f(y). And the first principal component fits "between" these lines in some sense.

There is a corollary to this. For a best-fit line, one dimension is special, the "y" dimension, because that is how the distance is measured. This is typically the target dimension for a model, the values we want to predict. For the first principal component, there is no special dimension. Hence, principal components are most useful when applied only to input variables without the target. A major difference from best-fit lines.

For me, it makes intuitive sense that the line that best fits input values would be useful for analysis. And, it makes intuitive sense in a way that the eigen-whatevers of some matrix do not intuitively say "useful" or even that the line that maximizes the variance does not say "useful". Even though all are doing the same thing, some ways of explaining the concept seem more intuitive and applicable to data analysis.

Another difference from the best fit line involves what statisticians call residuals -- that is, the difference from each of the original data points to the corresponding point on the line. For a best-fit line, the residuals are simply numbers, the difference between the original "y" and the "y" on the line. For the first principal component, the residuals are vectors -- the vectors that connect each point perpendicularly to the line. These vectors can be plotted in space. And, given a bunch of points in space, we can calculate the principal component for them. This is the second principal component. And these have residuals, and the process can keep going, for a while, yielding the third principal component, and so on.

The first principal component and the second principal component have a very particular property; they are orthogonal to each other, which means that they meet at a right angle. In fact, all principal components are orthogonal to each other, and orthogonality is a good thing when working with input values for data. So, it is tempting to replace the data with the first few principal components. It is not only tempting, but this is often a successful way to reduce the number of variables used for analysis.

By the way, there are not an infinite number of principal components. The number of principal components is the dimensionality of the original data points -- which is never more than the number of variables that define each point.

There is much more to say about principal components. The original question asked whether they are part of data mining. I have never been particularly proud of what is and what is not data mining -- I'm happy to include anything useful for data analysis under the heading. Unlike other techniques, though, principal components are not a fancy method for building predictive or descriptive models. Instead, they are part of the arsenal of tools available for managing and massaging input variables to maximize their utility.

Friday, August 28, 2009

Shazam, A Case Study in Memory Based Reasoning (MBR)

Many users are probably aware of Shazam, one of the few mobile applications that really seems to live up to the notion "Wow! It's Magic." When you are listening to music, you can run the application (presumably on your phone), click "tag it", and after a few tens of seconds of listening and processing, Shazam will tell you what you are listening to, the artist and other details.

Recently, I found an interesting article describing how it works. A presentation, with more pictures and less text, is available here. Kudos to the company and to the author, Avery Wang, for providing technical detail.

The paper does a very good job explaining the details of the algorithm. My goal here is to describe the algorithm from a higher perspective, because it is an interesting example of a memory-based reasoning algorithm. That is, an algorithm that combines information from "nearest neighbors" to arrive at a prediction.

Assume that we have a database of songs and an excerpt that we want to find in a database of millions of songs. A first approach might be to do an exhaustive search of the database to find a match. This would take a long time.

Alternatively, we can frame the problem as follows: for all songs in the database, what is the longest period of time where the excerpt overlaps part of a song. The nearest neighbors are the ones with the longest period, and, in general, we would choose the single one with the longest overlap.

Simple problem to describe. However, the real world hits quickly. The songs are probably quite clean acoustically. However, the excerpt is subject to numerous problems: background noise, loss of fidelity due to compression as the excerpt is transmitted, poor (or at least different) equipment for recording the excerpt, and so on.

Fortunately, the world of acoustics has something of a solution for this, called the "frequency domain". This is a map of all the sound frequencies, taken at a periodic interval -- say, one second. If "frequency domain" conjures up memories of things like Fourier Transforms, then you really do understand the subject.

However, for our purposes, it is enough to say that the frequency domain for a song produces a very, very bumpy curve for each second of the song -- each point is the strength of a particular acoustic frequency at that point in the song. The song can be thought of as a collection of all these curves. Taken together, these curves might resemble a map of a very hilly area. This would be a three-dimensional map of the song.

This map has peaks anologous to the tops of hills (or perhaps the tops of buildings in a city). These peaks are called a constellations, and they pretty much uniquely identify the song, regardless of all the problems mentioned above. That is, the constellations are resilitient to background noise, loss of fidelity, and so on.

Of course, we can do this for the songs in the database in advance. And, we can do this processing for a single excerpt pretty quickly.

So, the problem of finding the song with the longest overlap in seconds with the excerpt is now handled by finding consecutive seconds in a song where the frequency domain peaks match the frequency domain peaks from the excerpt. This is still a daunting problem, because there are so many peaks available. In other word, comparing one excerpt to millions of songs requires comparing hundreds of peaks in the excerpts to the many, many billions in the database -- very time consuming.

Shazam takes a very clever approach to this problem. The algorithm treats each peak as an anchor, and creates peak-pairs with other peaks "close" to the anchor. Here, "close" means that the other peaks are within a few seconds of the first and not too different in frequency. These peak-pairs are then calculated for both the song and the excerpt. The pairs are used to find sets of anchors that match between each song and the excerpt. Because the algorithm is looking for exact matches, it can use some programming tricks to make things even faster (these are described in the paper).

In the end, there is a set of anchors for each song matched by a given excerpt. For each song, these are scanned to find consecutive seconds where the anchors in each second overlap. The longest period of overlap is the distance between the excerpt and the song.

The algorithm is quite clever on several different levels. I do think that understanding it at a high level is valuable, especially since it can provide guidance to other very difficult recognition problems. On the other hand, when I use Shazam and it identifies a song, I still think it's magic.

Tuesday, August 25, 2009

Neural Networks, Predicting Continuous Values

Hi!
Very good blog...
I'm doing some stuff with Clementine... and I have an issue...
My target for NN train dataset is a continuos value between 0 and 100... the problem is that is a normal/gaussian distribution and makes the NN predict bad...

How can I resolve the unbalancing data? split into classe with same frequency!?

Regards,
Pedro

Pedro,

I am not aware that neural networks have a problem with predicting values with normal distributions. In fact, if you randomize the weights in a neural network whose output layer has a linear transfer function, then the output is likely to follow a normal distribution -- just from the Central Limit Theorem of statistics.

So, you have a neural network that is not producing good results. There can be several causes.

The first thing to look for is too many inputs. Clementine has options to prune the input variables on a neural network. Be sure that you do not have too many inputs. I would recommend a variable reduction technique such as principal components, and advise you to avoid categorical variables that have many levels.

A similar problem can occur if your hidden layer is too large.

Whatever the network, it is worthwhile looking at the number of weights in the network (or a related measure called the degrees of freedom). Remember, you want to have lots of training data for each weight.

Another problem may be that the target is continuous, but bounded between 0 and 100. This could result in a neural network where the output layer uses a linear transfer function. Although not generally a bad idea, it may not work in this case because the range of a linear function is from minus infinity to positive infinity, which far exceeds the range of the data.

One simple solution would be to divide the output by 100 and treat it as a probability. The neural network should then be set up with a logistic function in the target layer.

Your idea of binning the results might also work, assuming that bins work for solving the business problem. Equal sized bins are reasonable, since they are readily understandable as quantiles.

Good luck.

Sunday, August 9, 2009

Pharmaceutical Data and Privacy

Today's New York Times has another misguided article on privacy in the medical world. This article seems to be designed to scare Americans into believing that health care privacy is endangered, and that such data is regularly and wantonly traded among companies.

My perspective is different, since I am coming from the side of analyzing data.

First, the pharmaceutical industry is different from virually every other industry in the United States. For the most part, it is illegal for pharmaceutical manufacturers to identify the users of their products. This is based originated with the Health Information Portability and Privacy Act (HIPAA), explained in more detail at this government site.

What is absurd about this situation is that pharmaceutical companies are, in theory, responsible for the health of the millions of people who use their products. To give an example of the dangers, imagine that you have a popular product that causes cardiac damage after several months of use. The cardiac damage, in turn, is sometimes fatal. How does the manufacturer connect the use of the product to death registries? The simple answer. They cannot.

This is not a made-up example. Millions of people used Cox-2 inhibitors, which were on the market until 2004, when Merck voluntarily took Vioxx off the market. This issue here is whether the industry could have known earlier that such dangers lurked in the use of the drug. My contention is that the manufactureres do not have a chance, because they could not do something that virutally every other company can do -- match their customer records to publicly available mortality records.

To be clear about the laws related to drugs. If someone has an adverse reaction while on a drug, then that must be reported to the pharmaceutical company. However, if the adverse reaction is detected a certain amount of time after the patient stops therapy (I believe two weeks), then there is not reporting requirement. Guess what. Cardiac damage caused by Cox-2 inhibitors does not necessarily kill patients right away. Nor is the damage necessarily detected while the patient is still on the therapy.

I have used deidentified records at pharmaceutical clients for various analyses that have ranged from the amusing (anniversary effects in the scripts for ED therapies) to the socially useful (do poor patients have less adherence due to copayments) to the actionable (what messages to give to prescribers). In all cases, we have had to do more work than necessary because of the de-identification requirements, and to make assumptions and work-arounds that may have hurt the analyses. And, contrary to what the New York Times article may lead you to believe, both IMS and Verispan take privacy very seriously. Were I inclined to try to identify particular records, it would be virtually impossible.

Every time a drug is used, there is perhaps an opportunity to learn about its effectiveness and interactions with other therapies. In many cases, these are questions that scientists do not even know to ask, and such exploratory data mining can be critical in establishing hypotheses. Questions such as:
  • Are the therapies equally effective, regardless of gender, age, race, and geography?
  • Do demographics affect adherence?
  • What interactions does a given therapy have with other therapies?
  • Does the use of a particular therapy have an effect on mortality?
Everytime patients purchase scripts and the data is shielded from the manufacturers, opportunities to better understand and improve health care outcomes are lost. Even worse, asa the New York Times article points out, HIPAA does not protect consumers from the actions of nefarious employees and adroit criminals. There has to be a better way.

Monday, July 27, 2009

Time to Event Models, When the Event Is Not Churn

Dear Data Miners,

I am trying to build a churn model to predict WHEN customers will become paying members.
Process:

1. Person comes to our web site.
2. They register for free to use the site.
3. If the want to have more access to the site and use more features they pay us.

What are the issues I should consider when I decide to set a cut date. The first step towards censoring the data.

For a classic churn model , we want to know when someone will stop paying us and leave our phone company. We censor those that we don’t know their final status pass our censor point.

I want to know when they will pay us and censor those I don’t know if they will pay us in the future.

Is the cut date choice arbitrary or is there some sampling rule?

Thank you;
Daryl

Daryl,

Your example is a time-to-event model that does not represent churn. There are many such examples in business (and this is something discussed in Data Analysis Using SQL and Excel in a bit of depth).

Think of your situation as two different time-to-event problems:

(1) A person visits the web site, what happens next? Does the person return to the web site or register? This is a time-to-event problem and analysis can provide information on customer registrations, particularly the lag between the initial visit and the registration.

(2) A person registers for free, how long until that person buys something? This can provide insight on paying visitors.

Once you have broken the problem into these pieces, imagining the customer signature is easier. For the first problem, the customer signature is a picture of customers when they initially visit (or for each pre-registration visit, for a time-to-next event problem). The "prediction" columns are the date of the registration (or for time-to-event, the date of the next visit and whether it involves a registration).

The second component is a picture of the customer when they first register, and the prediction columns are when (and whether) the customer every pays for anything. In this case, it is very important to treat this as a time-to-event problem, because older registrations have had more opportunity to pay for something and the analysis needs to take this into account.

As for the censor date, it is the most recent date of the data. So, if you have data through the end of yesterday, then that is the censor date. For instance, for the second component of the analysis, customers who registered before yesterday but never paid would have their outcomes censored (these customers have not paid yet but they may pay in the future).