Hey there. Today we are going to talk a little bit about Machine Learning and Knowledge Discovery in Databases. Why? Well, why not.
These are interesting topics that have gained more and more attention over the past few years, even though the theory has been around since the 1980s and 1990s (it just wasn't widely applied until recently). Please keep in mind that this post is written in a very, very basic and "easy-to-understand" way, with no claim to academic purity. Things are actually way more complex and technical, and you can't learn in 5 minutes something that someone has spent a year or two on.
So, what the fuck is this thing called "Machine Learning" and why did it suddenly get so "popular"?
Machine Learning is part of the field of Computer Science, and more precisely of Artificial Intelligence (although some might debate whether it is a part of AI or a separate field). It aims to achieve "pattern recognition" and thus classify data into meaningful sets, discovering previously unknown relationships and facts about the collected data.
How and why?
Currently we are living in a world where data storage is very cheap and there is too much data about almost anything (Big Data). There are databases with pictures, banking transactions and temperature recordings, but we don't know the general meaning of the values in them. This leads to the paradox of databases being described as "rich in data, but poor in information".
Imagine we are living in a world where some of the data about hospital patients is already being saved on the hospital computers. If there is one thing all of us know from Mathematics (regardless of whether you love it or hate it), it is that you can create a model of most things in the world, and then find relations between these models. Based on all the statistical data we have in our database, we can estimate the chance that, if you have two diseases (X and Y) and get a drug (Z) which treats X, Z will cause a side effect on Y.
Or, imagine the classic case: mall databases. People buy products (computers, clothes, food, games). Using clusterization (finding groups of records that share the same relations) we can actually find previously "hidden" groups/clusters in the data that we never knew existed.
For instance, by analyzing the data we can discover that there is a cluster of people who buy Apple computers and then, on the same day, buy at least 2 shirts from H&M. Or we find a person who, after buying a game (let's say Call of Duty: Modern Warfare 9001), always goes and buys something to eat.
Using that knowledge, companies can not only "pair" products and shop locations to best accommodate such users, but also come up with better "promotional pairing" offers or entire marketing campaigns.
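To make the hospital example a bit more concrete, here is a minimal sketch of how such a chance could be estimated by plain counting. The records, the field names and the `side_effect_rate` helper are all made up for illustration; real medical statistics need far more care than this.

```python
# Hypothetical patient records: diseases, drugs taken, and whether disease Y got worse.
records = [
    {"diseases": {"X", "Y"}, "drugs": {"Z"}, "y_worsened": True},
    {"diseases": {"X", "Y"}, "drugs": {"Z"}, "y_worsened": False},
    {"diseases": {"X", "Y"}, "drugs": {"Z"}, "y_worsened": False},
    {"diseases": {"X", "Y"}, "drugs": set(), "y_worsened": False},
    {"diseases": {"X"}, "drugs": {"Z"}, "y_worsened": False},
]

def side_effect_rate(records, drug="Z", diseases=frozenset({"X", "Y"})):
    """Share of patients who have all the given diseases, took the drug,
    and saw disease Y get worse. Returns None if nobody matches."""
    relevant = [r for r in records
                if diseases <= r["diseases"] and drug in r["drugs"]]
    if not relevant:
        return None
    return sum(r["y_worsened"] for r in relevant) / len(relevant)

rate = side_effect_rate(records)
print(rate)
```

With only five fake records the number means nothing, of course; the point is just that the "chance" is a ratio of counts over the matching patients.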
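The simplest way to spot such "products that go together" is to count how often two products land in the same basket. The sketch below does just that on a handful of invented baskets; the product names and the `pair_counts` function are assumptions for illustration, not any real mall's data or tooling.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets: one set of product categories per customer visit.
baskets = [
    {"apple_computer", "shirt"},
    {"apple_computer", "shirt", "food"},
    {"game", "food"},
    {"game", "food", "shirt"},
    {"apple_computer", "shirt"},
]

def pair_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
print(counts.most_common(3))
```

Pairs with unusually high counts are the candidates for "promotional pairing".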
So how exactly does it all work? Most programmers might tell you it's complicated. Scratch that. It's not, if you can explain it simply (Hello, Einstein!).
Basically you have a set of data called a "Training set", with which you can "teach" a computer/machine/soulless programmer to "learn" a rule and/or draw a conclusion from the data that is valuable for the business.
Let's imagine we have a database with 1 million people, together with their age, income and sex. We can run an algorithm which can tell us which people are the richest (let's say males aged between 30 and 50 have the highest average income).
We can also discover clusters of hidden social groups. The algorithm can tell us there is a specific cluster of data which we didn't know about: the highest percentage of people with low incomes is actually found among those who are male, unmarried and either over 60, or under 20 without listed parents.
Now that we have "learned" from our "training set", we can do a cool experiment: let's go outside and just ask random people about their age, sex and marital status. Using the "learned model" from our big "training set" we can start "predicting" the likely income of the people we interview. And then things get really interesting, when we apply "error correction" methods to our data. Let's say that out of 100 people we interviewed, we predicted the income of only 60 correctly, with 30 predictions being close and 10 being way off.
We can add parameters for the percentage of interviewees who might have lied (let's say 5%) and thereby brought "noise" into our measurements, so that we account for that "noise" in our calculations.
There are three main types of machine learning, two of which are the most popular: Supervised (Guided) and Unsupervised (Unguided) "learning". The third is Reinforcement learning, and there are also some semi-guided (semi-supervised) methods.
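As a rough illustration of "which people are the richest", here is a sketch that groups a few invented records by sex and age band and averages the income per group. The records, the age cut-offs and the function names are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical records: (age, sex, yearly income).
people = [
    (25, "M", 20000), (35, "M", 60000), (45, "M", 80000),
    (35, "F", 50000), (65, "M", 15000), (28, "F", 30000),
]

def age_band(age):
    """Bucket an age into a coarse band; the cut-offs are arbitrary."""
    if age < 30:
        return "under 30"
    if age <= 50:
        return "30-50"
    return "over 50"

def average_income_by_group(people):
    totals = defaultdict(lambda: [0, 0])  # group -> [income sum, head count]
    for age, sex, income in people:
        key = (sex, age_band(age))
        totals[key][0] += income
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

averages = average_income_by_group(people)
richest_group = max(averages, key=averages.get)
print(richest_group, averages[richest_group])
```

The dictionary of averages is about the simplest "learned model" you can get from a training set.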
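Here is a small sketch of that street experiment: predict an income from a previously learned table of group averages, then grade each prediction as a hit, close or a miss. The model values, the fallback, the thresholds and the interview data are all hypothetical.

```python
# Hypothetical "learned model": average income per (sex, age band) group.
model = {("M", "30-50"): 70000, ("F", "30-50"): 50000, ("M", "under 30"): 20000}
FALLBACK = 40000  # assumed overall average for groups the model never saw

def predict(sex, band):
    return model.get((sex, band), FALLBACK)

def grade(predicted, actual, hit=0.10, close=0.30):
    """Label a prediction by its relative error; the thresholds are arbitrary."""
    error = abs(predicted - actual) / actual
    if error <= hit:
        return "hit"
    if error <= close:
        return "close"
    return "miss"

# Hypothetical street interviews: (sex, age band, income they told us).
interviews = [
    ("M", "30-50", 72000),
    ("F", "30-50", 62000),
    ("M", "under 30", 45000),
    ("F", "over 50", 41000),
]

grades = [grade(predict(sex, band), income) for sex, band, income in interviews]
print(grades)
```

Counting the labels over 100 interviews gives exactly the kind of 60/30/10 split described above, and loosening the thresholds is one crude way of allowing for people who didn't tell the truth.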
In Supervised (Guided) learning there is a "Tutor" ("Teacher") who teaches us some facts and how we can use them to classify data and predict missing data attributes based on already known properties. One interesting application of it is human face recognition (let's say we want to train our program to group people into two categories, young and old, so we can show them different services). We start training our program by first showing it, say, 1000 pictures of young faces, so it can learn what the eyes, the skin tone, the hair and the lips of young people look like, and save models of those. Then we train it with 1000 pictures of old people, whose skin is not smooth, with less hair (even white hair), wrinkles and so on. Then we show a random face to our program, hoping that, based on the models we've shown it, it will be able to recognize whether this person is elderly or young. Let's take a training set for "Young faces":
From the above training set we can notice the following: young people generally have tight and bright skin, healthy hair and an overall fresh look. Now let us see what an old person's face looks like.
As we can see, the primary differences are that old faces have a lot more wrinkles, a slightly darker coloring and way more artifacts around the eyes, together with some white hair. Good, but unfortunately computers cannot understand those definitions. We need to describe the human face as a set of models and regions which we can analyze visually as color readings and geometric proportions.
We need to add some "areas of interest" which are easy to process, universal for all people and also rich in aging information. If you ask someone "what do you see first in a person" (if it's not the breasts) the answer would be "the eyes". Some say this goes all the way back to genetics (remember the old saying that you can read someone's personality from their face?). It also makes sense from a signal processing point of view, because in shape and coloring the eyes contrast strongly with the other facial features. So it's fairly easy to write a code snippet which looks for a pair of nearly identical regions whose coloring and shape are very different from everything else (primarily white, combined with blue, brown or green). The eyes are symmetric, so we are basically searching for two symmetric image areas. When we find them, we combine them and define a rectangle that extends to where the differing color ends (it is also good to add a 5-10% buffer for the area around the eyes). Side note: according to one Bulgarian professor (Dimo Dimov) at the Bulgarian Academy of Sciences, human ears are even more unique (from a biometric point of view) than the eyes. He actually has software which does this; to anyone interested, I recommend finding him at Sofia University's FMI.
Now things get interesting: we have used some hard (and somewhat error-prone) computations in order to locate the eyes. Locating the mouth can be done a little bit more easily. Have you noticed that for most people the distance between the eye line and the mouth follows a fairly specific proportion? Facial proportions are often said to follow the golden ratio. So once we have the eyes we can simply calculate roughly where the mouth should be, and look there for an area which has a line of dark pixels with somewhat pink coloring around it. Locating the nose can be done without any color analysis at this point: if we draw a triangle from the geometric center of the mouth to the bottom corners of the eye rectangle, we will automatically locate the nose with pretty high accuracy. Then, based on the figures we have so far, we can draw a rectangle which defines the entire face (in blue). In just under 2 minutes we have defined 4 very informative regions in the face (this is very, very basic and is not based on any existing algorithm; it's one I came up with myself). Now we can start recording their coloring and fuzziness and try to tell an old face from a young one by color histograms.
What is the idea? Basically, for each of our regions we can come up with a "how old does this area look" factor. Then we sum the values and get an "aging" factor for the entire face. We can then compare this value with the averaged value (or whatever crazy math we come up with) from our training set and decide whether this person is old or not. We can set a very low threshold, which will perhaps rate only teenagers as youngsters, or a very high one for the opposite effect. Or we can put two faces side by side and our algorithm will tell us who is older (now, that's cool).
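The geometry part is simple enough to sketch. The function below takes an eye rectangle and guesses where the mouth and nose should be. Treat everything in it as an assumption: the function name, the pixel coordinates and especially the use of the golden ratio as the eye-span-to-mouth-distance proportion, which is a guess to be tuned on real faces rather than an established constant.

```python
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # about 1.618

def estimate_mouth_and_nose(eye_rect, ratio=GOLDEN_RATIO):
    """eye_rect is (left, top, right, bottom) in pixels, with y growing downwards.

    Assumption: the mouth centre sits below the eye rectangle at a distance
    of (eye span / ratio). The ratio is a tunable guess.
    """
    left, top, right, bottom = eye_rect
    eye_span = right - left
    center_x = (left + right) / 2
    mouth = (center_x, bottom + eye_span / ratio)
    # Nose: centroid of the triangle formed by the mouth centre and the
    # two bottom corners of the eye rectangle.
    nose = ((left + right + mouth[0]) / 3, (bottom + bottom + mouth[1]) / 3)
    return mouth, nose

mouth, nose = estimate_mouth_and_nose((100, 100, 200, 120))
print(mouth, nose)
```

The estimates are only starting points: a real detector would then search around them for the dark line of the mouth, as described above.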
Let's start analyzing the differences in our models, shall we?
Eyes: It seems that, due to aging, older people have a little bit more red in their eyes because of burst veins and enduring much drama (lol). But this is not a very reliable classifier, so we give it a low statistical importance value. However, the wrinkles mean that we should expect slightly darker colors around the eyes.
Mouth: Again, the wrinkles indicate a little bit darker general coloring.
Nose: It is apparent that old people love getting their nose painted with a lot of dark spots and past boxing memories. We add those to our classifier as well. Also, note that the eye wrinkles are located here.
Cheeks, chin and forehead: This is where our gold mine is. All signs of aging are abundantly visible here. However, this area is too big and complex to be analyzed as a whole. In Software Engineering it is never good practice to work with big chunks of data, so to get better detail (and performance) we will divide the face into even more detailed regions: Cheeks, Chin and Forehead. This is very easy to do, since we already have the locations of the eyes and the mouth and can just create 3 more rectangles based on them.
Cheeks: We first start by tracking pixel contrast: how much does each pixel contrast with its 1st/2nd/3rd/N-th neighbouring pixel? For young people this should be a very low value, since the skin is bright and smooth. Next, we need to start searching for wrinkles. Real wrinkles. We search for dark pixels and, once we find one, we start measuring the length of the curve it defines. For young people it is generally acceptable to have at most 2 curves around the cheeks (with most of them not having any yet). For old people we expect to find 2, 4, 6 and so on (since for most people wrinkles in the cheek region appear on both sides; however, the amount of people mutated by Coca Cola and Facebook is getting serious!). This area should get the highest statistical importance value.
Chin: The chin is a tricky one. Women don't have beards (I hope so, ladies!), however with the new generation of lumberjack geeks (God, what the fuck is wrong with this planet) a lot of young people also have beards. In any case, our classifier here will know that if someone has a beard (with high confidence) he is: a male, perhaps older than 20 and an axe owner (must be verified). For our case, we can add a classifier which increases the "age" factor slightly if we detect a non-white beard, and dramatically if the beard is white.
Forehead: The forehead can also be misleading: a lot of old people have a considerably smooth forehead (but still not as smooth as youngsters). Even if wrinkles are absent, the neighbouring-pixel contrast will give away some aging information.
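The "how much does each pixel contrast with its neighbour" measure could look roughly like this. It is a minimal sketch on a grayscale image stored as a list of rows; the function name, the `step` parameter and the two tiny sample patches are invented for illustration.

```python
def neighbour_contrast(gray, step=1):
    """Mean absolute brightness difference between each pixel and the pixel
    `step` columns to its right and `step` rows below it."""
    height, width = len(gray), len(gray[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            if x + step < width:
                total += abs(gray[y][x] - gray[y][x + step])
                count += 1
            if y + step < height:
                total += abs(gray[y][x] - gray[y + step][x])
                count += 1
    return total / count if count else 0.0

# Hypothetical 3x3 cheek patches (0 = black, 255 = white).
smooth = [[200, 201, 200], [201, 200, 201], [200, 201, 200]]
wrinkled = [[200, 90, 200], [200, 85, 200], [200, 95, 200]]  # dark vertical crease

print(neighbour_contrast(smooth), neighbour_contrast(wrinkled))
```

A low value suggests smooth skin and a high value suggests creases; raising `step` compares pixels that are further apart.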
Now, let's calculate the "age factor" for all our regions (the graphic ignores the eyes and nose, but those are included also):
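In code, combining the regions might look like the sketch below: each region gets a score between 0 and 1 and a statistical importance weight, and the face's "age factor" is the weighted average. All the numbers, region names and the 0.5 threshold are made up for illustration; in practice they would come out of the training set.

```python
# Hypothetical importance weights per region (cheeks rated highest, eyes lowest).
WEIGHTS = {"eyes": 0.10, "mouth": 0.15, "nose": 0.15,
           "cheeks": 0.30, "chin": 0.15, "forehead": 0.15}

def age_factor(region_scores, weights=WEIGHTS):
    """Weighted average of per-region "how old does it look" scores in [0, 1]."""
    total_weight = sum(weights[r] for r in region_scores)
    return sum(region_scores[r] * weights[r] for r in region_scores) / total_weight

def looks_old(region_scores, threshold=0.5):
    """The threshold is an assumed value; it would be tuned on training data."""
    return age_factor(region_scores) >= threshold

young_face = {"eyes": 0.2, "mouth": 0.1, "nose": 0.1,
              "cheeks": 0.1, "chin": 0.0, "forehead": 0.2}
old_face = {"eyes": 0.6, "mouth": 0.7, "nose": 0.8,
            "cheeks": 0.9, "chin": 0.9, "forehead": 0.5}

print(age_factor(young_face), age_factor(old_face))
print(looks_old(young_face), looks_old(old_face))
```

Comparing `age_factor` for two faces is also all it takes to say which of them looks older.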
Now, this algorithm is very basic, but yeah, this is basically how some face recognition is actually done. There are various approaches and I do not claim mine to be the best. Below is another example: a simple Python program I wrote a while ago which, after training on a set of hand-drawn digits, attempts to guess which digit we draw (sort of like OCR, Optical Character Recognition).
Here are some live tests:
As you can see, the program, based on its training, attempts to "guess" by rating how close the input is to each of the learned sets of digits. Of course, there are errors. Since the training data is ridiculously small (10 examples per digit), if you draw a number a little bit differently it might actually fail. Providing more training data will reduce this error.
Here the digit 6 is being confused with 5 due to the extra curve added. Based on the training set the top curve is very characteristic for the digit 5. Here is the training set which I have used.
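The program itself isn't reproduced in this post, so as a rough stand-in here is a minimal sketch of the "rate the closeness to each learned digit" idea, using a nearest-example comparison on tiny 3x5 bitmaps. This is not the program used for the tests above; the bitmaps, the `rate` function and the closeness measure are all assumptions for illustration.

```python
# Tiny hypothetical training set: 3x5 bitmaps flattened row by row (1 = ink).
TRAINING = {
    "0": ["111101101101111"],
    "1": ["010110010010111"],
    "7": ["111001010010010"],
}

def distance(a, b):
    """Hamming distance: number of pixels where the two bitmaps differ."""
    return sum(pa != pb for pa, pb in zip(a, b))

def rate(drawing):
    """Return (digit, closeness) pairs, best guess first.

    Closeness is the share of pixels matching the nearest training example.
    """
    ratings = []
    for digit, examples in TRAINING.items():
        best = min(distance(drawing, ex) for ex in examples)
        ratings.append((digit, 1 - best / len(drawing)))
    return sorted(ratings, key=lambda r: r[1], reverse=True)

sloppy_seven = "111001010010100"  # the last stroke drifts to the left
print(rate(sloppy_seven))
```

With one example per digit, a slightly different stroke can easily tip the ratings towards the wrong digit, which is the same kind of failure as the 6-versus-5 mix-up; adding more examples per digit to `TRAINING` makes that less likely.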
And this is actually what is going on in the world at the moment. Big Data companies are farming more and more databases in order to "train" their programs for better recognition and results. Sometimes even we, as humans, without knowing are being used for training computers. How?
Well, each time (or at least sometimes) when you fill in a CAPTCHA code by Google, you are actually acting as their "tutor". Basically, they give you images/text which they could not reliably decode, and based on the averaged answers typed by you and other users they decode the text/image with a higher chance of success. Now, this is not a widespread practice, nor is it officially confirmed (AFAIK) but... yeah, it happens as we speak. We are sometimes helping digitize books and images via CAPTCHAs. :)
With Unsupervised (unguided) learning things get a little bit more complex. We have no teachers, no tutors, no friends, only our beloved algorithms, some unexplained set of data and our hopes. If we reuse the above example, our process might be modified as follows: we start showing different pictures to our program: faces of people, cows, horses, cats, murlocs, dogs! Based on the fact that human faces look similar (let's say most people have hair and bright to dark orange-brown skin), and that most cows have huge mouths with very small blackish eyes, we could expect our program to start dividing the data into groups, and to learn by itself whether an image is the face of a human or an animal.
Unsupervised learning is a big part of what is called "Data mining" or "Knowledge discovery in databases". Another example of unsupervised learning is finding similarities between groups. For example, let us say we have a database of people with their IQ points and yearly salary. We would like to know whether there is a correlation between intelligence and the money actually earned. And, if so, what kind of social groups might there be?
In order to get things straight, first we need to define some terms. Covariance is a statistical measure of how much two "random variables" change together. Correlation shows the same thing, but rescaled so that it always falls in the range -1 to +1 (with 0 meaning there is no linear relationship).
The process of discovering groups of data is called "Clusterization" (more commonly "clustering") and there are many popular algorithms for it, with some of the best known being DBSCAN (and no, it does not mean Database Scan) and K-Means. So here is what our raw data set looks like (each dot is a person or even a group of similar people):
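Both terms are easy to compute by hand. The sketch below uses the population covariance and the Pearson correlation coefficient on a few invented IQ and salary values; the data is hypothetical and only there to show the mechanics.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled into the range [-1, +1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: IQ points and yearly salary for five people.
iq = [90, 100, 110, 120, 130]
salary = [20000, 28000, 35000, 50000, 52000]

print(covariance(iq, salary), correlation(iq, salary))
```

The covariance depends on the units (IQ points times currency), which is why the correlation, being unit-free, is the easier number to read.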
Bunch of dots. Even without any math calculation we can visually see that the set looks like a pointed arrow, which means yes, there is a positive correlation between IQ and money. Now, this is not that great. We would be more interested in discovering things like what kind of groups exist in our data? The Hollywood culture of the rich and famous? Is there a "middle class" at all? Are there categories in the middle class?
Most clusterization algorithms require input parameters in order to produce a result (like how many clusters of data we would like to identify).
Let's start with 3 clusters:
We see that there are three main social groups. This was quite predictable, eh? We have the cluster of rich and smart people, the cluster of the not-so-smart and poor, and the cluster of basically everyone else. Notice that we also have some points in the right corner which are far too remote from everyone else. With this parameter (n=3), the tool I used (k-means; I won't go into detail on how it works right now) has marked that data as "noise" and has not added it to any cluster, because it doesn't match any of the groups' characteristics. (Strictly speaking, textbook k-means assigns every point to some cluster; leaving outliers unassigned is what density-based algorithms such as DBSCAN do.)
Let's try with more clusters, say 5:
Now we see that the big cluster of "Average people" has been divided into sub-clusters. Notice that the "noise" has been formed into a new cluster of people who are very intelligent but broke (such people are very rare). We have just gained so much knowledge about our user groups from a single input parameter: how many clusters we would like to be formed. However, this is also the main drawback of the algorithm used (k-means). Basically, the user can specify any number and the algorithm will try to discover that many clusters. If the number is too high, noise data (values which are just statistical variations or pure errors) will be added to clusters or formed into clusters of its own. If the number is too small, different clusters might be merged together, losing information and detail. This excessive dependence on "correct input parameters" is the main drawback of k-means (together with the computational cost of the algorithm itself). There are way better algorithms than k-means which actually learn intelligently to avoid noise and to use computation power optimally. You can actually see that k-means works with a pretty "brute force" mindset, almost reminding me of some starship battles. :)
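For the curious, here is a minimal sketch of the textbook form of k-means (Lloyd's algorithm) in plain Python. It is not the tool used for the plots in this post; the sample points, the choice of the first k points as starting centres and the iteration limit are all assumptions for illustration. Note that in this form every point ends up in some cluster, whatever k you pick.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on 2-D points.

    Assumption: the first k points serve as the starting centres.
    """
    centres = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centres[c][0]) ** 2
                                        + (p[1] - centres[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its points.
        new_centres = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centres == centres:  # nothing moved, so we have converged
            break
        centres = new_centres
    return centres, clusters

# Hypothetical (IQ, yearly salary in thousands) points.
points = [(90, 20), (130, 120), (95, 22), (92, 25), (128, 110), (132, 115)]
centres, clusters = kmeans(points, k=2)
print(centres)
print(clusters)
```

Running it again with a different `k` shows the parameter dependence directly: the algorithm happily returns however many clusters you ask for, whether or not the data supports them.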
K-Means algorithm with N=4 and starting point from the top-most:
To close, we will finish with a strange paradox called the AI Effect, which I discovered only recently:
"As soon as AI successfully solves a problem, the problem is no longer a part of AI. Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"