Saturday, January 9, 2010

Hadoop and Parallel Dataflow Programming

Over the past three months, I have been teaching myself enough Hadoop to get comfortable with using the environment for analytic purposes.

There has been a lot of commentary about Hadoop/MapReduce versus relational databases (such as the articles referenced in my previous post on the subject). I actually think this discussion is misplaced because comparing open-source software with commercial software aligns people on "religious" grounds. Some people will like anything that is open-source. Some people will attack anything that is open-source (especially people who work for commercial software vendors). And, the merits of real differences get lost. Both Hadoop and relational databases are powerful systems for analyzing data, and each has its own distinct set of advantages and disadvantages.

Instead, I think that Hadoop should be compared to a parallel dataflow style of programming. What is a dataflow style of programming? It is a style where we watch the data flow through different operations, forking and combining along the way, to achieve the desired goal. Not only is a dataflow a good way to understand relational databases (which is why I introduce it in Chapter 1 of Data Analysis Using SQL and Excel), but the underlying engines that run SQL queries are dataflow engines.

Parallel dataflows extend dataflow processing to grid computing. To my knowledge, the first commercial tool that implements parallel dataflows was developed by Ab Initio. This company was a spin-off from a bleeding edge parallel supercomputer vendor called Thinking Machines that went bankrupt in 1994. As a matter of full disclosure: Ab Initio was actually formed from the group that I worked for at Thinking Machines. Although they are very, very, very resistant to sharing information about their technology, I am rather familiar it. I believe that the only publicly available information about them (including screen shots) is published in our book Mastering Data Mining: The Art and Science of Customer Relationship Management.

I am confident that Apache has at least one dataflow project, since when I google "dataflow apache" I get a pointer to the Dapper project. My wish, however, is that Hadoop were the parallel dataflow project.

Much of what Hadoop does goes unheralded by the typical MapReduce user. On a massively parallel system, Hadoop keeps track of the different parts of an HDFS file and, when the file is being used for processing, Hadoop does its darndest to keep the processing local to each file part being processed. This is great, since data locality is key to achieving good performance.

Hadoop also keeps track of which processors and disk systems are working. When there is a failure, Hadoop tries again, insulating the user from sporadic hardware faults.

Hadoop also does a pretty good job of shuffling data around, between the map and reduce operations. The shuffling method -- sorting, send, and sort again -- may not be the most efficient but it is quite general.

Alas, there are several things that Hadoop does not do, at least when accessed through the MapReduce interface. Supporting these features would allow it move beyond the MapReduce paradigm, giving it the power to support more general parallel dataflow constructs.

The first thing that bothers me about Hadoop is that I cannot easily take a text file and just copy it with the Map/Reduce primitives. Copying a file seems like something that should be easy. The problem is that a key gets generated during the map processing. The original data gets output with a key prepended, unless I do a lot of work to parse out the first field and use it as a key.

Could the context.write() function be overloaded with a version that does not output a key? Perhaps this would only be possible in the reduce phase, since I understand the importance of the key for going from map to reduce.

A performance issue with Hadoop is the shuffle phase between the map and the reduce. As I mentioned earlier, the sort-send-sort process is quite general. Alas, though, it requires a lot of work. An alternative that often works well is simply hashing. To maintain the semantics of map-reduce, I think this would be hash-send-combine or hash-send-sort. The beauty of using hashing is that the data can be sent to its destination while the map is still processing it. This allows concurrent use of the processing and network during this operation.

And, speaking of performance, why does the key have to go before the data? Why can't I just point to a sequence of bytes and use that for the key? This would enable a programming style that doesn't spend so much time parsing keys and duplicating information between values and keys.

Perhaps the most frustrating aspect of Hadoop is the MapReduce framework itself. The current version allows processing like (M+)(R)(M*). What this notation means is that the processing starts with one or more map jobs, goes to a reduce, and continues with zero or more map jobs.

THIS IS NOT GENERAL ENOUGH! I would like to have an arbitrary number of maps and reduces connected however I like. So, one map could feed two different reduces, each having different keys. At the same time, one of the reduces could feed another reduce without having to go through an intermediate map phase.

This would be a big step toward parallel dataflow parallel programming, since Map and Reduce are two very powerful primitives for this purpose.

There are some other primitives that might be useful. One would be broadcast. This would take the output from one processing node during one phase and send it to all the other nodes (in the next phase). Let's just say that using broadcast, it would be much easier to send variables around for processing. No more defining weird variables using "set" in the main program, and then parsing them in setup() functions. No more setting up temporary storage space, shared by all the processors. No more using HDFS to store small serial files, local to only one node. Just send data through a broadcast, and it goes everywhere. (If the broadcast is running on more than one node, then the results would be concatenated together, everywhere.)

And, if I had a broadcast, then my two-pass row number code (here) would only require one pass.

I think Hadoop already supports having multiple different input files into one reduce operator. This is quite powerful, and a much superior way of handling join processing.

It would also be nice to have a final sort operator. In the real world, people often do want sorted results.

In conclusion, parallel dataflows are a very powerful, expressive, and efficient way of implementing complex data processing tasks. Relational databases use dataflow engines for their processing. Using non-procedural languages such as SQL, the power of dataflows are hidden from the user -- and, some relatively simple dataflow constructs can be quite difficult to express in SQL.

Hadoop is a powerful system that emulates parallel dataflow programming. Any step in a dataflow can be implemented using a MapReduce pass -- but this requires reading, writing, sorting, and sending the data multiple times. With a few more features, Hadoop could efficiently implement parallel dataflows. I feel this would be a big boost to both performance and utility, and it would leverage the power already provided by the Hadoop framework.

Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, MapReduce a Flexible Data Processing Tool, explains the utility of MapReduce by two Google fellows -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, MapReduce and Parallel DBMSs: Friend or Foe, is written by a team of authors, with Michael Stonebraker listed as the first author. I am uncomfortable with this article, because the article purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as the CTO and Co-Founder on Vertica's web site. For this reason, I believe that this article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study which compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these databases were chosen for this purpose. Configuration issues for the other database and Hadoop are mentioned. The configuration and installation of Vertica -- by the absence of problems -- one assumes is easy and smooth. I have not (yet) read the paper cited, which describes the work in more detail.

Also, the paper never describes costs for the different system, which is a primary driver of MapReduce. The software is free and runs on cheap clusters of computers, rather than expensive servers and hardware. For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.

The paper never describes issues in the loading of data. I assume this is a significant cost for the databases. Loading the data for Hadoop is much simpler . . . since it just reads text files, which is a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second tasks is a GROUP BY, and I suspect that the data is hash partitioned by the GROUP BY clause.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)." This is not true in several respects. First, MapReduce does have associated projects for supporting declarative languages. Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. Is that syntax declarative?

More importantly, though, is that the computing model really is not exactly the same. Well, with SQL extensions such as GROUPING SETs and window functions, the functionality does come close. But, consider the ways that you can add a row number to data (assuming that you have no row number function built-in) using MapReduce versus traditional SQL. Using MapReduce you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi-self join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses java, an established programming language with many libraries.

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is the support of column partitioning. Each column is stored separate, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, Paracel takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes. The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.

Saturday, January 2, 2010

Hadoop and MapReduce: Normalizing Data Structures

To set out to learn Hadoop and Map/Reduce, I tackled several different problems. The last of these problems is the challenge of normalizing data, a concept from the world of relational databases. The earlier problems were adding sequential row numbers and characterizing values in the data.

This posting describes data normalization, explains how I accomplished it in Hadoop/MapReduce, and some tricks in the code. I should emphasize here that the code is really "demonstration" code, meaning that I have not worked hard on being sure that it always works. My purpose is to demonstrate the idea of using Hadoop to do normalization, rather than producing 100% working code.


What is Normalization and Why Do Want To Do It?

Data normalization is the process of extracting values from a single column and placing them in a reference table. The data used by Hadoop is typically unnormalized, meaning that data used in processing is in a single record, so there is no need to join in reference tables. In fact, doing a join is not obvious using the MapReduce primitives, although my understanding is that Hive and Pig -- two higher level languages based on MapReduce -- do incorporate this functionality.

Why would we want to normalize data? (This is a good place to plug my book Data Analysis Using SQL and Excel, which explains this concept in more detail in the first chapter.) In the relational world, the reason is something called "relational integrity", meaning that any particular value is stored in one, and only one, place. For instance, if the state of California were to its name, we would not want to update every record from California. Instead, we'd rather go to the reference table and just change the name to the new name, and the data field contains a state id rather than the state name itself. Relational integrity is particularly important when data is being updated.

Why would we want to normalize data used by Hadoop? There are two reasons. The first is that we may be using Hadoop processing to load a relational database -- one that is already designed with appropriate reference tables. This is entirely reasonable, relational databases are an attractive way to "publish" results from complex data processing since they are better for creating end-user reports and building interactive GUI interfaces.

The second reason is performance. Extracting long strings and putting them in a separate reference table can significantly reduce the storage requirements for the data files. By far, most of the space taken up in typical log files, for instance, consists of long URIs (what I used to call URLs). When processing the log files, we might want to extract some features from the URIs, but keeping the entire string just occupies a lot of space -- even in a compressed file.


The Process of Normalizing Data

Normalizing data starts with data structures. The input records are assumed to be in a delimited format, with the column names in the first row (or provided separately, although I haven't tested that portion of the code yet). In addition, there is a "master" id file that contains the following columns:
  • id -- a unique id for every value by column.
  • column name -- the name of the column.
  • value -- the id in the column.
  • count -- the total number of times the value as so far occurred.
This is a rudimentary reference file. I could imagine, for instance, having more information than just the count as summary information -- perhaps the first and last date when the value occurs, for instance.

What happens when we normalize data? Basically, we look through the data file to find new values in each column being normalized. We append these new values into the master id file, and then go back to the original data and replace the values with the ids.

Hadoop is a good platform for this for several reasons. First, because the data is often stored as text files, the values and the ids have the same type -- text strings. This means that the file structures remain the same. Second, Hadoop can process multiple columns at the same time. Third, Hadoop can use inexpensive clusters and free software for this task, rather than relying on databases and tools, which are often more expensive.

How To Normalize Data Using Hadoop/MapReduce

The normalization process has six steps. Most of these correspond to a single Map-Reduce pass.

Step 1: Extract the column value pairs from the original data.

This step explodes the data, by creating a new data set with multiple rows for each row in the original data. Each output row contains a column, a value, and the number of times the value appears in the data. Only columns being normalized are included in the output.

This step also saves the column names for the data file in a temporary file. I'll return to why this is needed in Step 6.

Step 2: Extract column-value Pairs Not In Master ID File

This step compares the column-value pairs produced in the first step with those in the master id file. This step is interesting, because it reads data from two different data source formats -- the master id file and the results from Step 1. Both sets of data files use the GenericRecord format.

To identify the master file, the map function looks at the original data to see whether "/master" appears in the path. Alternative methods would be to look at the GenericRecord that is created or to use MultipleInputs (which I didn't use because of a warning on Cloudera's web site).


Step 3: Calculate the Maximum ID for Each Column in the Master File

This is a very simple Map-Reduce step that simply gets the maximum id for each column. New ids that are assigned will be assigned one more than this value.

This is an instance where I would very much like to have two different reduces following a map step. If this were possible, then I could combine this step with step 2.


Step 4: Calculate a New ID for the Unmatched Values

This is a two step process that follows the mechanism for adding row numbers discussed in one of my earlier posts, with one small modification. The final result has the maximum id value from Step 3 added onto it, so the result is a new id rather than just a row number.


Step 5: Merge the New Ids with the Existing Master IDs

This step merges in the results from Step 4 with the existing master id file. Currently, the results are placed into another directly. Eventually, they could simply override the master id file.

Because of the structure of the Hadoop file system, the merge could be as simple as copying the file with the new ids into the appropriate master id data space. However, this would result in an unbalanced master id file, which is probably not desirable for longer term processing.


Step 6: Replace the Values in the Original Data with IDs

This final step replaces the values with ids -- the actual normalization step. This is a two part process. The map phase of the first part takes both the original data and the master key file. All the column value pairs are exploded from the original data, as in Step 1, with the output consisting of:
  • key: :
  • value: <"expect"|"nomaster">, ,
The first part ("expect" or "nomaster") is an indicator of whether this column should be normalized (that is, whether or not to expect a master id). The second field identifies the original data record, which is uniquely identified by the partition id and row number within that partition. The third is the column number in the row.

The master records are placed in the format:
  • key: :
  • value: "master",
The reduce then reads through all the records for a given column-value combination. If one of them is a master, then it outputs the id for all records. Otherwise, it outputs the original value.

The last phase simply puts the records back together again, from their exploded form. The one trick here is that the metadata is read from a local file.


Tricks Used In This Code

The code is available in these files: Normalize.java, GenericRecordInputFormat.java, GenericRecord.java, and GenericRecordMetadata.java. This code uses several tricks along the way.

One trick that I use in Step 4, for the phase 1 map, makes the code more efficient. This phase of the computation extracts the maximum row number for each column. Instead of passing all the row numbers to a combine or reduce function, it saves them in a local hash-map data structure. I then use the cleanup() routine in the map function to output the maximum values.

Often the master code needs to pass variables to the map/reduce jobs. The best way to accomplish this is by using the "set" mechanism in the Configuration object. This allows variables to be assigned a string name. The names of all the variables that I use are stored in constants that start with PARAMETER_, defined at the beginning of the Normalize class.

In some cases, I need to pass arrays in, for instance, when passing in the list of column that are to be normalized. In this case, one variable gives the number of values ("normalize.usecolumns.numvals"). Then each value is stored in a variable such as "normalize.usecolumns.0" and "normalize.usecolumns.1" and so on.

Some of the important processing actually takes place in the master loop, where results are gathered and then passed to subsequent steps using this environment mechanism.

The idea behind the GenericRecord class is pretty powerful, with the column names at the top of the file. GenericRecords make it possible to read multiple types of input in the same map class, for instance, which is critical functionality for combining data from two different input streams.

However, the Map-Reduce framework does not really recognize these column names as being different, once generic records are placed in a sequence file. The metadata has to be passed somehow.

When the code itself generates the metadata, this is simple enough. A function is used to create the metadata, and this function is used in both the map and reduce phases.

A bigger problem arises with the original data. In particular, Step 6 of the above framework re-creates the original records, but it has lost the column names, which poses a conundrum. The solution is to save the original metadata in Step 1, which first reads the records. This metadata is then passed into Step 6.

In this code, this is handled by simply using a file. The first map partition of Step 1 writes this file (this partition is used to guarantee that the file is written exactly once). The last reduce in Step 6 then reads this file.

This mechanism works, but is not actually the preferred mechanism, because all the reduce tasks in Step 6 are competing to read the same file -- a bottleneck.

A better mechanism is for the master program to read the file and to place the contents in variables in the jar file passed to the map reduce tasks. Although I do this for other variables, I don't bother to do this for the file.

Monday, December 28, 2009

Differential Response or Uplift Modeling

Some time before the holidays, we received the following inquiry from a reader:

Dear Data Miners,



I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it. I have responses from a direct mailing with a treatment group and a control group. Now what? Without data mining, I can calculate the uplift between the two groups but not for individual responses. With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group. How is uplift modeling implemented in general, and how could it be done in R or Weka?



[1] http://www.stochasticsolutions.com/pdf/CrossSell.pdf

I first heard the term "uplift modeling" from Nick Radcliffe, then of Quadstone. I think he may have invented it. In our book, Data Mining Techniques, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.

Here, the quantity  to measure is the difference in response rate or average order size between treated and untreated groups of otherwise similar customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group?"

Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response?  That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female responders.

A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree where the splitting criteria is maximizing the difference in incremental response. This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift.  That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.

Sunday, December 27, 2009

Hadoop and MapReduce: Characterizing Data

This posting describes using Hadoop and MapReduce to characterize data -- that is, to summarize the values in various columns to learn about the values in each column.

This post describes how to solve this problem using Hadoop. It also explains why Hadoop is better for this particular problem than SQL.

The code discussed in this post is available in these files: GenericRecordMetadata.java, GenericRecord.java, GenericRecordInputFormat.java, and Characterize.java. This work builds on the classes introduced in my previous post Hadoop and MapReduce: Method for Reading and Writing General Record Structures (the versions here fix some bugs in the earlier versions).

What Does This Code Do?

The purpose of this code is to provide summaries for data in a data file. Being Hadoop, the data is stored in a delimited text format, with one record per line, and the code uses GenericRecord to handle the specific data. The generic record classes are things that I wrote to handle this situation; the Apache java libraries apparently have other approaches to solving this problem.

The specific summaries for each column are:
  • Number of records.
  • Number of values.
  • Minimum and maximum values for string variables, along with the number of times the minimum and maximum values appear in the data.
  • Minimum and maximum lengths for string variables, along with the number of times these appear and an example of the value.
  • First, second, and third most common string values.
  • Number of times the column appears to be an integer.
  • Minimum and maximum values when treating the values as integers, along with the number of times that these appear.
  • Number of times the column appears to contain a real number.
  • Minimum and maximum values when treating the values as doubles, along with the number of times that these appear.
  • Count of negative, zero, and positive values.
  • Average value.
These summaries are arbitrary. The code should be readily extensible to other types and other summaries.

My ultimate intention is to use this code to easily characterize input and result files that I create in the process of writing Hadoop code.


Overview of the Code

The characterize problem is solved in two steps. The first creates a histogram of all the values in all the columns, and the second summarizes the histogram of values, which is handled by two passes of map reduce.

The histogram step takes files with the following format:
  • Key: undetermined
  • Values: text values separated by a delimited (by default a tab)
(This is the GenericRecord format.)
The Map phase produces a file of the format:
  • Key: column name and column value, separated by a colon
  • Value: "1"
Combine and Reduce then add up the "1"s, producing a file of the format:
  • Key: column name
  • Value: column value separated by tab
Using a tab as a separator is a convenience, because this is also the default separator for the key.

The second phase of the Map/Reduce job takes the previous output and uses the reduce function to summarize all the different values in the histogram. This code is quite specific to the particular summaries. The GenericRecord format is quite useful because I can simply add new summaries in the code, without worrying about the layout of the records.

The code makes use of exception processing to handle particular data types. For instance, the following code block handles the integer summaries:

try {
....long intval = Long.parseLong(valstr);
....hasinteger = true;
....intvaluecount++;
....intrecordcount += Long.parseLong(val.get("count"));
}
catch (Exception NumberFormatException) {
....// we don't have to do anything here
}

This block tries to convert the value to an integer (actually to a long). When this works, then the code updates the various variables that characterize integer values. When this fails, the code continues working.

There is a similar block for real numbers, and I could imagine adding more such blocks for other formats, such as dates and times.

Why MapReduce Is Better Than SQL For This Task

Characterizing data is the process of summarizing data along each column, to get an idea of what is in the data. Normally, I think about data processing in terms of SQL (after all, my most recent book is Data Analysis Using SQL and Excel). SQL, however, is particularly poor for this purpose.

First, SQL has precious few functions for this task -- basically MIN(), MAX(), AVG() and judicious use of the CASE statement. Second, SQL generally has lousy support for string functions and inconsistent definitions for date and time functions across different databases.

Worse, though, is that traditional SQL can only summarize one column at a time. The traditional SQL approach would be to summarize each column individually in a query and then connect them using UNION ALL statements. The result is that the database has to do a full-table scan for each column.

Although not supported in all databases, SQL syntax does now support the GROUPING SETS keyword which helps potentially alleviate this problem. However, GROUPING SETS is messy, since the key columns each have to be in separate columns. That is, I want the results in the format "column name, column value". With GROUPING SETS, I get "column1, column2 ... columnN", with NULLs for all unused columns, except for the one with a value.

The final problem with SQL occurs when the data starts out in text files. Much of the problem of characterizing and understanding the data happens outside the database during the load process.

Tuesday, December 22, 2009

Interview with Eric Siegel

This is the first of what may become an occasional series of interviews with people in the data mining field. Eric Siegel is the organizer of the popular  Predictive Analytics World conference series. I asked him a little bit about himself and gave him a chance to plug his conference.  A propos, readers of this blog can get a 15% discount on a two-day conference pass by pasting the code DATAMINER010 into the Promotional Code box on the conference registration page.

Q: Not many kids (one of mine is perhaps the exception that proves the rule) have the thought "when I grow up, I want to be a data miner!"  How did you fall into this line of work?

To many laypeople, the word "data" sounds dry, arcane, meaningless - boring! And number-crunching on it doubly so. But this is actually the whole point. Data is the uninterpreted mass of things that've happened.  Extracting what's up, the means behind the madness, and in so doing modeling and learning about human behavior... well, I feel nothing in science or engineering is more interesting.
In my "previous life" as an academic researcher, I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as I grew up from childhood, that space travel would in fact be a tremendous, grueling pain in the neck (not fun like "Star Wars"), nothing in science has ever seemed nearly as exciting.


In my current 9-year career as a commercial practitioner, I've found that indeed the ability to analytically "learn" and apply what's been learned turns out to provide plenty of business value, as I imagined back in the lab.  Research science is fun in that you have the luxury of abstraction and are often fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although some less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and deliver an irrefutable impact.


Q: Most conferences happen once a year.  Why does PAW come around so much more frequently?

In fact, many commercial conferences focused the industrial deployment of technology occur multiple times per year, in contrast to research conferences, which usually take place annually.  There's an increasing demand for a more frequent commercial event as predictive analytics continues to "cross chasms" towards more widescale penetration. There's just too much to cover - too many brand-name case studies and too many hot topics - to wait a year before each event.


Q: You use the phrase "predictive analytics" for what I've always called "data mining." Do the terms mean something different, or is it just that fashions change with the times?


"Data mining" is indeed often used synonymously with "predictive analytics", but not always. Data mining's definitions usually entail the discovery of non-trivial, useful patterns/knowledge/insights from data -- if you "dig" enough, you get a "nugget." This is a fairly abstract definition and therefore envelops a wide range of analytical techniques. On the other hand, predictive analytics is basically the commerical deployment of predictive modeling specifically (that is, in academic jargon, supervised learning, i.e., optimizing a statitistical model over labeled/historical cases). In business applications, this basically translates to a model that produces a score for each customer, prospect, or other unit of interest (business/outlet location, SKU, etc), which is roughly the working definition we posted on the Predictive Analytics World website. This would seem to potentially exclude related data mining methods such as forecasting, association mining and clustering (unsupervised learning), but, naturally, we include some sessions at the conference on these topics as well, such as your extremely-well-received session on forecasting October 2009 in DC.



Q: How do you split your time between conference organizing and analytical consulting work?  (That's my polite way of trying to rephrase a question I was once asked: "What's the split between spewing and doing?")

When one starts spewing a lot, there becomes much less time for doing. In the last 2 years, as my 2-day seminar on predictive analytics has become more frequent (both as public and customized on-site training sessions - see http://www.businessprediction.com), and I helped launch Predictive Analytics World, my work in services has become less than half my time, and I now spend very little time doing hands-on, playing a more advisory and supervisory role for clients, alongside other senior consultants who do more hands-on for Prediction Impact services engagements.


Q: I can't help noticing that you have a Ph.D.  As someone without any advanced degrees, I'm pretty good at rationalizing away their importance, but I want to give you a chance to explain what competitive advantage it gives you.

The doctorate is a research-oriented degree, and the Ph.D. dissertation is in a sense a "hazing" process. However, it's become clear to me that the degree is very much net positive for my commercial career. People know it entails a certain degree of discipline and aptitude. And, even if I'm not conducting academic research most of the time, every time one applies analytics there there is an experimental component to the task. On the other hand, many of the best data miners - the "rock star" consultants such as yourself - did not need a doctorate program in order to become great at data mining.



Q: Moving away from the personal, how do you think the move of data and computing power into the cloud is going to change data mining?

I'd say there's a lot of potential in making parallelized deployment more readily available to any and all data miners.  But, of all the hot topics in analytics, I feel this is the one into which I have the least visibility. It does, after all, pertain more to infrastucture and support than to the content, meaning and insights gained from analysis.

But, turning to the relevant experts, be sure to check out Feb PAW's upcoming session, "In-database Vs. In-cloud Analytics: Implications for Deployment" - see http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-7


Q: Can you give examples of problems that once seemed like hot analytical challenges that have now become commoditized?

Great question. Hmm... common core analyical methods such as decision trees and logistic regression may be the only true commodities to date in our field. What do you think?

Q: There are some tasks that we used to get hired for 10 or 15 years ago that no one comes to us for these days. Direct mail response models is an example. I think people feel like they know how to do those themselves. Or maybe that is something the data vendors pretty much give away with the data.

Which of today's hot topics in data mining do you see as ripe for commiditization?

UPLIFT (incremental lift) modeling is branching out, with applications going beyond response and churn modeling (see http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-2).

Expanding traditional data sets with SOCIAL DATA is continuing to gain traction across a growing range of verticals as analytics pracitioners find great value (read: tremendous increases in model lift) leveraging the simple fact that people behave similarly to those to whom they're socially connected. Just as the healthcare industry has discovered that quitting smoking is "contagious" and that the risk of obesity dramatically increases if you have an obese friend, telecommunications, online social networks and other industries find that "birds of a feather" churn and even commit fraud "together". Is this more because people influence one-another, or because they befriend others more like themselves?  Either way, social connections are hugely predictive of the customer behaviors that matter to business.



Q: There have been several articles in the popular press recently, like this one in the NY Times,  saying that statistics and data mining are the hottest fields a young person could enter right now.  Do you agree?

Well, for the subjective reasons in my answer to your first question above, I would heartily agree. If I recall, that NY Times article focused on the demand for data miners as the career's central appeal. Indeed, it is a very marketable skill these days, which certainly doesn't hurt.

Friday, December 18, 2009

Hadoop and MapReduce: Method for Reading and Writing General Record Structures

I'm finally getting more comfortable with Hadoop and java, and I've decided to write a program that will characterize data in parallel files.

To be honest, I find that I am spending a lot of time writing new Writable and InputFormat classes, every time I want to do something. Every time I introduce a new data structure used by the Hadoop framework, I have to define two classes. Yucch!

So, I put together a simple class called GenericRecord that can store a set of column names (as string) and a corresponding set of column values (as strings). These are stored in delimited files, and the various classes understand how to parse these files. In particular, the code can read any tab delimited file that has column names on the first row (and changing the delimitor should be easy). One nice aspect is the ability to use the GenericRecord as the output of a reduce function, which means that the number and names of the output can be specified in the code -- rather than in additional files with additional classes.

I wouldn't be surprised if similar code already exists with more functionality than the code I have here. This effort is also about my learning Hadoop.

This posting provides the code and explains important features on how it works. The code is available in these files GenericRecord.java, GenericRecordMetadata.java, GenericRecordInputFormat.java, and GenericRecordTester.java.

What This Code Does

This code is analogous to the word count code, that must be familiar to anyone starting to learn MapReduce (since it seems to be the first example in all the documentation I've seen). Instead of counting words, this code counts the occurrence of values in the columns.

The code reads input files and produces output records with three columns:
  • A column name in the original data.
  • A value in the column.
  • The number of times the value appears.
Do note that for data with many unique values in many columns, the number of output records is likely to far exceed the number of input records. So, the output file can be bigger than the input file.

The input records are assumed to be in a text file with one record per row. The first row contains the names of the columns, delimited by a tab (although this could easily be changed to another delimiter). The rest of the rows contain values. Note that this assumes that the input files are all read from the beginning; that is, that a single input file is not split among multiple map tasks.

One irony of this code and the Hadoop framework is that the input files do not have to be in the same format. So, I could upload a bunch of different files, with different numbers of columns, and different column names, and run them all in parallel. I would have to be careful that the column names are all different, for this to work well.

Examples of such files are available on the companion page for my book Data Analysis Using SQL and Excel. These are small files by the standards of Hadoop (measures in megabytes) but quite sufficient for testing and demonstrating code.


Overview of Approach

There are four classes defined for this code:
  • GenericRecordMetadata stores the metadata (column names) for a record.
  • GenericRecord stores the values for a particular record.
  • GenericRecordInputFormat provides the interface for reading the data into Hadoop.
  • GenericRecordTester provides the functions for the MapReduce framework.
The metadata consists of the names of the columns, which can be accessed either by a column index or by a column name. The metadata has functions to translate a column name into a column index. Because it uses a HashMap, the functions should run quite fast, although they are not optimal in memory space. This is okay, because the metadata is stored only once, rather than once per row.

The generic record itself stores the data as an array of strings. It also contains a pointer to the metadata object, in order to fetch the names. The array of strings minimizes both memory overhead and time, but does require access using an integer. The other two classes are needed for the Hadoop framework.

One small challenge is getting this to work without repeating the metadata information for each row of data. This is handled by including the column names as the first row in any file created by the Hadoop framework, and not by putting the column names in the output for each row.


Setting Up The Metadata When Reading

The class GenericRecordInputFormat basically does all of its work in a private class called GenericRecordRecordReader. This function has two important functions: initialize() and nextKeyValue().

The initialize() function sets up the metadata, either by reading environment variables in the context object or by parsing the first line of the input file (depending on whether or not the environment variable genericrecord.numcolumns is defined). I haven't tested passing in the metadata using environment variables, because setting up the environment variables poses a challenge. These variables have to be set in the master routine in the configuration before the map function is called.

The nextKeyValue() function reads a line of the text file, parses it using the function split(), and sets the values in the line. The verification on the number of items read matching the number of expected items is handled in the function lineValue.set(), which raises an exception (currently unhandled) when there is a mismatch.


Setting Up The Metadata When Writing

Perhaps more interesting is the ability to set up the metadata dynamically when writing. This is handled mostly in the setup() function of the SplitReduce class, which sets up the metadata using various function calls.

Writing the column names out at the beginning of the results file uses a couple of tricks. First, this does not happen in the setup() function but rather in the reduce() function itself, for the simple reason that the latter handles IOException.

The second trick is that the metadata is written out by putting it into the values of a GenericRecord. This works because the values are all strings, and the record itself does not care if these are actually for the column names.

The third trick is to be very careful with the function GenericRecord.toString(). Each column is separated by a tab character, because the tab is used to separate the key from the value in the Hadoop framework. In the reduce output files, the key appears first (the name of the column in the original data), followed by a tab -- as put there by the Hadoop framework. Then, toString() adds the values separated by tabs. The result is a tab-delimited file that looks like column names and values, although the particular pieces are put there through different mechanisms. I imagine that there is a way to tell Hadoop to use a different character to separate the key and value, but I haven't researched this point.

The final trick is to be careful about the ordering of the columns. The code iterates through the values of the GenericRecord table manually using an index rather than a for-in loop. This is quite intentional, because it allows the code to control the order in which the columns appear -- which is presumably the original ordered in which they were defined. Using the for-in is also perfectly valid, but the columns may appear in a different order (which is fine, because the column names also appear in the same order).

The result of all this machinery is that the reduce function can now return values in a GenericRecord. And, I can specify these in the reduce function itself, without having to mess around with other classes. This is likely to be a big benefit as I attempt to develop more code using Hadoop.