Friday, October 23, 2009
PAW conference, privacy issues, déjà vu
One of the high points for me was a panel discussion on consumer privacy issues. Normally, I find panel discussions a waste of time, but in this case the panel members had clearly given a lot of thought to the issues and had some interesting things to say. The panel consisted of Stephen Baker, a long-time Business Week writer and author of The Numerati (a book I haven't read, but which, I gather, suggests that people like me are using our data mining prowess to rule the world); Jules Polonetsky, currently of the Future of Privacy Forum, and previously Chief Privacy Officer and SVP for Consumer Advocacy at AOL, Chief Privacy Officer and Special Counsel at DoubleClick, and New York City Consumer Affairs Commissioner in the Giuliani administration; and Mikael Hagström, Executive Vice President, EMEA and Asia Pacific for SAS.
I was particularly taken by Jules's idea that companies that use personal information to provide services that would not otherwise be possible should agree on a universal symbol for "smart," something like the easily recognizable symbol for recycling. Instead of (well, I guess it would have to be in addition to) a privacy policy that no one reads, and that is all about how little they know about you and how little use they will make of it, the smart symbol on a web site would be a brag about how well the service provider can leverage your profile to improve your experience. Clicking on it would lead you to the details of what they now know about you, how they plan to use it, and what's in it for you. You would also be offered an opportunity to fill in more blanks and make corrections. Of course, every "smart" site would also have a "dumb" version for users who choose not to opt in.
This morning, as I was telling Gordon about all this in a phone call, we started discussing some of our own feelings about privacy issues, many of which revolve around the power relationship between us as individuals and the organization wishing to make use of information about us. If the supermarket wants to use my loyalty card data to print coupons for me, I really don't mind. If an insurance company wants to use that same loyalty card data to deny me insurance because I buy too much meat and alcohol, I mind a lot. As I gave that example, I had an overwhelming feeling of déjà vu. Or perhaps it was déjà lu? In fact, it was déjà écrit! I had posted a blog entry on this topic ten years ago, almost to the day. Only there weren't any blogs back then, so attention-seeking consultants wrote columns in magazines instead. This one, which appeared in the October 26, 1999 issue of Intelligent Enterprise, said what I was planning to write today pretty well.
Friday, October 16, 2009
SVM with redundant cases
A reader writes:
I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.
I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (e.g., fewer than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.
A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.
I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?
Thanks for any insight. Will start on your archives.
There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach: list each distinct record only once along with its frequency, where the frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears.
The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me. An SVM tries to find a hyperplane that separates the classes, and when (as is very common with marketing response data, default, fraud, or pretty much any data I ever work with) there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative to consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occurring pattern of features will form a leaf and, taking the frequencies into account, the proportion of each class in the leaf provides an estimate of the probability of each class given that pattern.
Since I am not an expert on support vector machines, though, I forwarded the question to someone who is -- Lutz Hamel, author of Knowledge Discovery with Support Vector Machines. Here is his reply:
I have some fundamental questions about the appropriateness of SVM for this classification problem. Identical observation feature vectors produce different classification outcomes. If this is truly meaningful, then we are asking the SVM to construct a decision plane through a point with some of the examples at that point classified as positive and some as negative. This is not possible. It means one of two things: (a) we have a sampling problem, where different observations are mapped onto the same feature vectors, or (b) we have a representation problem, where the feature vector is not powerful enough to distinguish observations that should be distinguished. It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding, and perhaps of coming up with derived features that would make this a problem suitable for decision-plane-based classification algorithms such as SVMs. (Is assigning the majority label to points that carry multiple observations an option?)
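For readers who want to try the frequency idea, here is a minimal sketch using NumPy and scikit-learn (modern tools, not anything mentioned in the original exchange; the data and parameters are made up): collapse identical rows into one record plus a count, pass the counts as sample weights to a linear SVM, and compare with a decision tree whose weighted leaf proportions serve as class-probability estimates.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a few discrete features, so many rows are exact duplicates,
# and identical feature vectors sometimes carry different labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100_000, 5)).astype(float)
y = (X[:, 0] + rng.random(100_000) > 2.2).astype(int)   # noisy target

# Collapse identical (features, label) rows into one record plus a frequency.
Xy = np.column_stack([X, y])
rows, counts = np.unique(Xy, axis=0, return_counts=True)
X_u, y_u = rows[:, :-1], rows[:, -1].astype(int)
print(f"{len(X):,} rows reduced to {len(X_u):,} distinct rows")

# Linear SVM on the reduced data, with the frequencies as sample weights.
svm = LinearSVC(C=1.0, dual=False)
svm.fit(X_u, y_u, sample_weight=counts)

# Decision tree on the same reduced data; predict_proba returns the
# frequency-weighted class proportions in each leaf, i.e. an estimate of
# P(class | feature pattern), which is what the conflicting labels call for.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_u, y_u, sample_weight=counts)
print(tree.predict_proba(X_u[:5]))
```

The point of the sketch is only that the frequencies can be carried along as weights once the duplicates are collapsed; it does not settle the deeper question Lutz raises about whether the feature representation is rich enough.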
Monday, June 8, 2009
Confidence in Logistic Regression Coefficients
I never encountered this and was wondering what to do with these effects: should I kick them out of the model or not? I decided to keep them in, since they did have some business meaning, and concluded that they must have become insignificant because they apply only to a micro-segment of the entire population.
In your opinion, did I interpret this correctly? . . .
Many thanks in advance for your advice,
Wendy
Michael responds:
Hi Wendy,
This question has come up on the blog before. The short answer is that with a logistic regression model trained on a sample with one concentration of responders, it is a bit tricky to adjust the scores to reflect the actual probability of response in the true population. I suggest you look at some papers by Gary King on this topic; his work on rare-events logistic regression deals with exactly this kind of correction.
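To make the kind of adjustment involved concrete, here is a rough sketch of the "prior correction" described in King and Zeng's rare-events papers (this is not code from the original post, and the rates below are invented): the intercept of a model trained on an oversampled file is shifted by a constant that depends on the sample and population response rates.

```python
import numpy as np

def prior_corrected_probability(score_logit, sample_rate, true_rate):
    """Adjust a logistic score from an oversampled training set.

    score_logit : linear predictor (log-odds) from the fitted model
    sample_rate : proportion of responders in the training sample (e.g. 0.50)
    true_rate   : proportion of responders in the real population (e.g. 0.02)
    """
    # Prior correction: subtract a constant from the intercept (log-odds).
    offset = np.log((1 - true_rate) / true_rate * sample_rate / (1 - sample_rate))
    corrected_logit = score_logit - offset
    return 1.0 / (1.0 + np.exp(-corrected_logit))

# A score of 0 log-odds (50% in a balanced sample) maps back to about 2%
# when the true response rate is 2% and the sample was balanced 50/50.
print(prior_corrected_probability(0.0, sample_rate=0.5, true_rate=0.02))
```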
Gordon responds:
Wendy, I am not sure that Prof. King deals directly with your issue of changing confidence in the coefficient estimates. To be honest, I have never considered this issue. Now that you bring it up, though, I am not surprised that it happens.
My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!
Second, if I do a regression on a set of data, and then duplicate the data (to make it twice as big) and run the regression again, I'll get the same estimates as on the original data. However, the confidence in the coefficients will increase. I suspect that something similar is happening with your data.
If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner, and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the rarer events and to an appropriate value less than one for the more common events. I do this as a matter of habit, because it works best for decision trees. As you have pointed out, the confidence in the coefficients is also affected by the frequencies, so it is a good habit with regressions as well.
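To see the effect Gordon describes, here is a small sketch in Python with statsmodels (not the SAS tools he mentions; the data are simulated): fitting the same logistic regression on a data set and on the same data duplicated leaves the coefficients unchanged but shrinks the reported standard errors by roughly the square root of two, which is exactly the spurious gain in confidence that fractional frequencies are meant to avoid.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 * x - 1.0)))   # true model: logit = 0.5*x - 1
y = rng.binomial(1, p)
X = sm.add_constant(x)

# Fit once on the original data.
fit1 = sm.Logit(y, X).fit(disp=0)

# Fit again on the data duplicated (every row appears twice).
fit2 = sm.Logit(np.tile(y, 2), np.tile(X, (2, 1))).fit(disp=0)

print("coefficients:", fit1.params, fit2.params)   # essentially identical
print("std errors:  ", fit1.bse, fit2.bse)         # roughly 1/sqrt(2) smaller
```

As far as I know, statsmodels also accepts frequency weights directly (the freq_weights argument to sm.GLM), which is the closer analogue of the FREQ option Gordon has in mind; the point is only that weights, like duplication, flow through to the reported standard errors.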
Monday, April 13, 2009
Customer-Centric Forecasting White Paper Available
Wednesday, April 8, 2009
MapReduce, Hadoop, Everything Old Is New Again
What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seems a bit underwhelming. You see, I literally grew up with the parallel programming language APL. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developed APL, and I learned to program in that language at the age of 12. In 1982 I went to Analogic Corporation to work on an array-processor implementation of APL. In 1986, while still at Analogic, I read Danny Hillis's book The Connection Machine and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL.
The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL or an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them. In APL and its descendants, such as J, reduction is just one of a rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available here.)
My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamoring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on. As you can tell, I still have regrets. If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.
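For what it's worth, here is a toy sketch in Python (not APL, and nothing like the Connection Machine) of that point about gluing: the partition-and-apply step stays the same whether the per-partition results are combined by reduction, as in MapReduce, or simply kept, concatenated, or recombined some other way; only the glue function changes.

```python
from functools import reduce
from operator import add

def apply_to_partitions(data, n_parts, fn, glue):
    """Split data into chunks, apply fn to each chunk, then glue the results."""
    size = (len(data) + n_parts - 1) // n_parts
    partial = [fn(data[i:i + size]) for i in range(0, len(data), size)]
    return glue(partial)

numbers = list(range(1, 101))

# MapReduce-style: glue by reduction (here, summation).
total = apply_to_partitions(numbers, 4, sum, lambda parts: reduce(add, parts))

# Same mapping step, different glue: keep the per-partition results.
partial_sums = apply_to_partitions(numbers, 4, sum, list)

print(total, partial_sums)   # 5050 [325, 950, 1575, 2200]
```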
Saturday, November 1, 2008
Should model scores be rescaled?
Here’s a quick question for your blog;
- background -
I work in a small team of data miners for a telecommunications company. We usually do 'typical' customer churn and mobile (cell-phone) related analysis using call detail records (CDRs).
We often use neural nets to create a score in the decimal range between zero and one (0.0 - 1.0), where 0.0 means no churn and 1.0 means the highest likelihood of churn. Another department then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers.
- problem -
We have differing preferences about the distribution of our churn prediction score. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good fare-paying customers). So 98% of customers have an actual outcome of 0.0 and 2% have an actual outcome of 1.0.
When I build my predictive model I try to mimic this distribution. My view is that most of the churn prediction scores should be skewed toward 0.1 or 0.2 (say, 95% of all scored customers), with scores from 0.3 to 1.0 applying to maybe 5% of the customer base.
Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout.
- question -
What are your views/preferences on this?
I see no reason to rescale the scores. Of course, if the only use of the scores is to mail the top 5% of the list it makes no difference since the transformation preserves the ordering, but for other applications you want the score to be an estimate of the actual probability of cancellation.
In general, scores that represent the probability of an event are more useful than scores that only order a list in descending order by probability of the event. For example, in a campaign response model, you can multiply the probability that a particular prospect will respond by the value of that response to get the expected value of making the offer. If the expected value is greater than the cost of making the offer, the offer is worth making; if it is less, it is not. Gordon and I discuss this and related issues in our book Mastering Data Mining.
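As a concrete illustration of that arithmetic (all numbers invented):

```python
# Hypothetical campaign numbers, purely to illustrate the expected-value test.
p_response = 0.031        # model's estimated probability that this prospect responds
value_of_response = 40.0  # margin earned if the prospect responds
cost_of_offer = 1.50      # cost of making the offer

expected_value = p_response * value_of_response   # 0.031 * 40.0 = 1.24
make_offer = expected_value > cost_of_offer       # 1.24 < 1.50, so skip this prospect
print(expected_value, make_offer)
```

Note that the calculation only makes sense because the score is a probability; a score rescaled to spread customers evenly would make the multiplication meaningless, which is the point.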
This issue often comes up when stratified sampling is used to create a balanced model set of 50% responders and 50% non-responders. For some modeling techniques--notably, decision trees--a balanced model set produces more and better rules. However, the proportion of responders at each leaf is then no longer an estimate of the actual probability of response. The solution is simple: apply the model to a test set that has the correct distribution of responders and use the response rates observed there as the estimates of the response probability.
-Michael
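As a footnote, here is a sketch of that last step in Python with scikit-learn (illustrative only; the data, model, and parameters are made up): train a tree on a balanced sample, then push a test set with the true response rate through it and use each leaf's observed response rate, rather than the leaf's balanced-sample proportion, as the probability estimate.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

def make_population(n):
    """Simulated customers with a roughly 2% response rate."""
    X = rng.normal(size=(n, 4))
    p = 1.0 / (1.0 + np.exp(-(X[:, 0] - 4.5)))
    return X, rng.binomial(1, p)

X_pop, y_pop = make_population(200_000)

# Balanced model set: all responders plus an equal number of non-responders.
resp = np.flatnonzero(y_pop == 1)
nonresp = rng.choice(np.flatnonzero(y_pop == 0), size=len(resp), replace=False)
train = np.concatenate([resp, nonresp])
tree = DecisionTreeClassifier(min_samples_leaf=200, random_state=0)
tree.fit(X_pop[train], y_pop[train])

# Score a test set drawn with the true response rate and measure the actual
# response rate within each leaf; that, not the leaf's training proportion,
# is the estimate of the probability of response.
X_test, y_test = make_population(100_000)
leaves = tree.apply(X_test)
leaf_rate = {leaf: y_test[leaves == leaf].mean() for leaf in np.unique(leaves)}

naive = tree.predict_proba(X_test[:5])[:, 1]                   # balanced-sample proportions
adjusted = np.array([leaf_rate[leaf] for leaf in leaves[:5]])  # true-mix response rates
print(np.round(naive, 3), np.round(adjusted, 3))
```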