On the precision rate of data mining models…

There is no question I hate more than ‘What is the overall precision (or goodness) of your data mining model?’. Most of the people I have met tend to think that a model far from 100% is worth nothing, or at least that it is not ready for the market. On the other hand, people often think that a model close to 100% is very good and needs no improvement at all. Both beliefs can be very misleading. Using precision as the single, common measure for data mining models leads to really bad strategies; it is a common mistake made even by qualified engineers and managers. To avoid a bad interpretation of my answer, I like to explain how our data mining model differs from others, with pros and cons, including precision rates. This sometimes looks as if I were dodging the question, so here I try to answer it as completely as possible and list the points I think are worth analyzing before any decision is made.

First of all, it is necessary to define what ‘precision’ means. While a great many definitions are known from the literature, it is commonly stated that precision is the overall (or average) rate of good guesses or decisions on a data set. That is, if we have 72 correct guesses and 11 wrong ones, then the precision rate is 72 / (72 + 11) = 86.75%.
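
The arithmetic above can be sketched in a few lines of Python. The function name `precision_rate` is only an illustrative choice of mine, not part of any library, and it implements nothing more than the simple "good guesses over all guesses" definition used in this post.

```python
def precision_rate(good_guesses: int, wrong_guesses: int) -> float:
    """Share of good guesses among all guesses (the definition used in this post)."""
    total = good_guesses + wrong_guesses
    if total == 0:
        raise ValueError("at least one guess is needed")
    return good_guesses / total


if __name__ == "__main__":
    # The example from the text: 72 good guesses and 11 wrong ones.
    print(f"{precision_rate(72, 11):.2%}")
```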

Besides the good guesses, one should take into account the following properties of data mining models, since they are very closely related to precision rates, and sometimes it is more important, or even crucial, that they meet at least a certain level.

  • General precision. If we have a very promising algorithm, it is worth testing it on international benchmark data sets. It is good to know that every task and every real-life situation is different; there is no algorithm that fits every problem (unless you are writing a dissertation). Analyze in depth why your algorithm adapts well or badly to a problem, and hey, never give up! There is no reason to think that the best known algorithm for some problem is the best for the problem we are working on now. In 2007, an international challenge was held to analyze the generalization behavior of algorithms. The results were no surprise: no universal solution was found, and even tricky combinations of algorithms failed on some problems. Have you ever met or written a marketing brochure claiming that the solutions embodied in a product are far better in general than those of competitors? Did you ever believe or prove that?
  • Relative, data-dependent precision. In my view, determining the baseline precision rate is the most important task of a data mining project. To tell you the truth, I prefer this value to the precision rate itself. It is worth setting the baseline to the best previously known precision rate, if one is available. In general, when customers ask for enhancements to a previously built model, it clearly shows that they deeply understand how, e.g., a 1% improvement in precision boosts their business. Moreover, they usually know how to market the results.
  • Simplicity. The precision rate of a data mining model depends on the number of input parameters, i.e. on the information base it works with. It is analogous to Occam’s razor: we prefer solutions (businesses, people) that end up with the same result (profit, information) from less input (money, communication). That kind of ‘intellectual superiority’ can easily be converted into money, since a simpler model is faster and needs fewer resources and workarounds. While it seems natural to ask about this, I have never met customers asking me about the operational costs of a model. In one case, the model with the best known precision rate cost more to run than it was supposed to save the company.
  • Precision in time. Many solutions that follow the ‘more information, more precise model’ motto are based on false logic, or at least their precision rates degrade fast. Let me explain this with an example. Imagine that we have 8 apples with 2 distinctive properties, e.g. volume and color, and our task is to form 4 groups of them. It is easy to see that partitioning different apples based on key properties is not a hard task. In fact, if each apple belongs to exactly one of the four groups, one can find a partitioning algorithm that reproduces the same groups. In machine learning and data mining models, ‘distinctive’ usually means linearly independent. In general, if we have M unique items and T distinctive properties, and M / T is less than or equal to 2, then the problem can be learned perfectly, just like in the case of the apples. Now, note that the number of records in an international data set does not change over time, so M is constant. That is, precision can easily be improved by increasing T. At what cost? Well, we can have the most impressive precision rate, we can publish it anywhere, or we can get a Ph.D. for it. Nevertheless, the number of data points in real life is not limited, so the hour of truth will come as M / T grows far beyond 2. The same happens when students learn reference books by heart for exams: the knowledge they have seems thrilling until graduation and is (obviously) useless at work. On the other hand, wise data miners should know that they learn the problem alongside their data mining models, and this human learning must not be built into the model during the process. In the apple case, we gave the data mining model a hint about the number of data points, i.e. we simply cheated.
  • On the relevance of information. Usually, more available information leads to a more precise model, assuming the pieces of information have similar relevance or quality. What happens if more information means that more indirectly related sources are used? Algorithms treat all information equally: no data mining algorithm supports prejudice or racism, and none distinguishes soil from diamond. In other words: garbage in, garbage out. If information quality differs significantly, then more irrelevant information increases the noise in the data and consequently decreases the precision rate. At this point you may say: “Hey, that is why data cleaning and preprocessing are so important, so what is the point here?” In preprocessing, data is transformed from the initial space S1 into another space SA in order to find correlations in the data. However, when a data mining model is built, S1 is transformed into SM, and in most cases SM and SA are not the same. Therefore preprocessing should rather be done in the SM space, which is not the case in data mining practice; otherwise it may not calculate the proper relevance of the data. Good data mining models pay attention to information quality; they discover important information elements and/or latent correlations between them. In some cases, e.g. in cancer diagnostics, it is more important to find the signs and causes of cancer than to build a more precise model.
  • Critical errors, sequences. A good face recognition solution has a precision rate of ca. 97-98%, which does not actually find terrorists, but it is able to recognize employees who have no twins. In reality, no image-based face recognition algorithm is that good, so don’t worry if you are not there. If you have an algorithm close to 85%, and you know that a 2-3 second video contains many (12-75) somewhat different (ca. 8-10 independent) face images, then the many recognition attempts converge to a specific result. It is easy to prove that if the error rate depends on nuances of video quality, then this 85% algorithm can be turned into a solution with a 97% recognition rate (see e.g. voting strategies). That is, the precision rate can easily be raised if you know more about the usage.
  • Behind the curtains. The most common passenger counters also claim 97-98% precision rates. Datasheets do not mention how this rate should be interpreted. Note that it is hard to say how to calculate precision in a public transport vehicle: is the error rate calculated after every stop, or is it weighted over uniform time periods? A good performance on the former implies that the algorithm is possibly more precise in rush hours, since nearly empty vehicles generate a lower error rate than vehicles with many people. A good performance on the latter indicates the opposite, since rush hours form a relatively small portion of the day. In either case the error rate may have a large deviation, i.e. sometimes it might reach even 10%. Assume that the precision rate is calculated on a daily basis by comparing the results to manual counts; then a solution optimized for this rate focuses on general attributes of the passengers, e.g. their height and weight. That is, if it works in Europe, it should not be sold to China.
  • When 99% is still bad. Some time ago I read a scientific-looking leaflet about a “magic gadget” that promises to tell me whether I have a specific disease or not. It stated that, e.g., meningitis is recognized with 99.2% precision (with no guarantee). Wow! The only tiny little problem is that the a priori chance of having meningitis is 5 in 1,000,000. At this point I suddenly realized that I could produce in no time a much more precise “machine”, with a 99.9995% precision rate. My solution would simply repeat “you have no meningitis”. I believe in probabilities. Nevertheless, I think we should rather go to a doctor.
  • Precision and quantities. Good optical character recognizers (OCR) have a 98-99% precision rate for English. I have often heard from members of SMEs that digitizing documents is cheap and simple, and that 1-2% is low enough not to worry about. But what does 2% mean in this case? It means that, on average, 2 out of 100 letters are not recognized. Since an English sentence generally consists of 7-10 words, and a word of 5-7 letters, this means about 1 error per sentence on average. As a result we get a digitized but poor-quality text.
  • When 55% is very good. Predicting the future is a hard task, especially for stock exchange problems, since data elements constantly change their relative importance. From an observer's point of view a stock exchange problem is rather a random or stochastic process (like gambling). For example, guessing whether tomorrow's closing price for Google will be higher or lower than today's is very similar to tossing a coin. In order to build a winning strategy we have to shift the uniform a priori probabilities of the two sides. Assuming 1% transaction charges, a winning strategy can easily be found based on a 55% predictor and a large enough amount of money. You just need to guarantee that the average loss is compensated by the average gain (i.e. use stop-loss).
  • As a consequence, I would say that “What is the precision of your data mining model?” and “How good is your model?” are ill-posed questions. Simple answers to these questions are bound to be misleading at some point, or they could feed illusions about a certain application. Nevertheless, the precision rate is important; we just need a more general (or more complex) measure that helps stakeholders properly understand the risks they take by using a data mining model.
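
Several of the numbers quoted in the list above can be checked with a few lines of Python. This is only a rough sketch under my own assumptions: the function names are illustrative, the voting model assumes a yes/no decision with fully independent guesses and an odd number of votes, and the OCR estimate assumes every letter fails independently with the same probability. Real systems will violate these assumptions to some degree.

```python
from math import comb


def always_negative_accuracy(prevalence: float) -> float:
    """Precision rate of a 'machine' that always answers 'you have no disease'.

    It is wrong only for the people who actually have the disease.
    """
    return 1.0 - prevalence


def expected_errors_per_sentence(char_error_rate: float,
                                 words_per_sentence: int,
                                 letters_per_word: int) -> float:
    """Average number of misread letters in one sentence (independent errors assumed)."""
    return char_error_rate * words_per_sentence * letters_per_word


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent yes/no guesses is right.

    Each single guess is right with probability p; n must be odd to avoid ties.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    needed = n // 2 + 1  # smallest number of right guesses that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


if __name__ == "__main__":
    # Meningitis: a priori chance of 5 in 1,000,000 versus the gadget's 99.2%.
    print(f"always 'no': {always_negative_accuracy(5 / 1_000_000):.4%}")
    # OCR: 2% letter error rate, shortest and longest sentence from the text.
    print(f"errors per short sentence: {expected_errors_per_sentence(0.02, 7, 5):.2f}")
    print(f"errors per long sentence:  {expected_errors_per_sentence(0.02, 10, 7):.2f}")
    # Face recognition: an 85% recognizer voting over 9 independent images.
    print(f"majority of 9 votes: {majority_vote_accuracy(0.85, 9):.2%}")
```

Under these simplified assumptions the trivial "always no" answer beats the 99.2% gadget, a 2% letter error rate gives roughly one error per sentence, and majority voting over nine independent images lifts an 85% recognizer above the 97% mark.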