BayMiner Method
1. General Information
The BayMiner analysis process can be divided into three distinct phases: pre-processing, processing, and representation. Roughly described, the pre-processing phase covers uploading the data and preparing it for analysis, the in-depth analysis is carried out in the processing phase, and its results are presented to the user through the BayMiner interface in the representation phase.
During one data analysis session this three-phase cycle can be repeated several times in its entirety. Every uploaded data set and the latest state of the corresponding session are automatically stored on the server, so the user may continue the session later.
2. First Phase: Pre-processing
When the data are uploaded to the BayMiner server, the first task is to decide which parts of them are to be analysed further. The user guides this pre-processing. All the headers of the data columns are shown together with some sample data taken from the current data set, to illustrate the data as BayMiner sees them.
At this point, the central terms of the BayMiner pre-processing scheme need to be introduced. Ignoring a data column means that it is not used when the predictive joint probability distribution is calculated in the processing phase; otherwise, every value of every column is taken into consideration simultaneously when the data set is analysed. This means that the distributions of the different columns (the values of the variables) are not handled separately but in concert. This is one of the most noteworthy and differentiating features of the BayMiner method (see Section 3. About Bayesian Networks).
Nominalizing a data column means using its values as distinct alternatives; the opposite choice is to let BayMiner discretize the values, i.e., group nearby values together so that these intervals are used in the processing phase instead of the original numbers. Discretization is feasible with numeric values only: alphanumeric strings are always handled as nominal values, and the same holds for missing values. If a column contains both numeric and alphanumeric values, the default is still to discretize the numeric ones; such a column is called mixed. During the course of the analysis, the user can use other tools to combine both discrete and numeric values into subgroups.
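The distinction between numeric, nominal, and mixed columns can be illustrated with a small sketch. This is not BayMiner's own code: the function names and the rule that any string accepted by Python's float() counts as numeric are assumptions made for illustration only.

```python
def is_number(value):
    """Return True if the value parses as a floating-point number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def classify_column(values, missing=("", None)):
    """Classify a column as 'numeric', 'nominal', or 'mixed'.

    Missing values are skipped here; the text above says they are
    always treated as nominal, so they do not affect the column type.
    """
    present = [v for v in values if v not in missing]
    numeric = [v for v in present if is_number(v)]
    if present and len(numeric) == len(present):
        return "numeric"   # candidate for discretization by default
    if numeric:
        return "mixed"     # numeric part discretized, the rest nominal
    return "nominal"       # every value is its own alternative
```

For example, classify_column(["12", "n/a", "7.5"]) yields "mixed", because "n/a" cannot be read as a number while the other two values can.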
Each data column has a tick-box that lets the user choose whether to ignore the column during the processing phase. In addition, there are two special situations in which a column is ignored automatically: when all the values in the column are identical, and when the column contains too many different values. Under these circumstances it would be futile, even counterproductive, to try to analyse the values, so they are omitted. The limit on the number of different values depends on the data but is about 100.
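The two automatic-ignore conditions can be sketched as follows. The function name and the fixed default of 100 are assumptions; the text only says the real limit depends on the data and is about 100.

```python
def auto_ignored(values, max_distinct=100):
    """Return True if a column would be ignored automatically.

    A column is dropped when it carries no information (all values
    identical) or when it has more distinct values than the limit.
    max_distinct is a hypothetical fixed stand-in for BayMiner's
    data-dependent limit.
    """
    distinct = set(values)
    return len(distinct) <= 1 or len(distinct) > max_distinct
```

A column of unique row identifiers in a data set of several hundred rows would be ignored under this rule, as would a column in which every row reads "yes".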
If a column with valuable information is ignored automatically because it has too many different values, the natural way to get this information analysed is to re-edit the contents of the column, classifying them into wider categories that are fewer in number than the original values. The original column may either be preserved alongside its newly formed counterpart or replaced by it; in either case, the valuable information has been recovered for analysis. In the representation phase, all the ignored columns are available and can be shown on the screen. Ignoring variables at pre-processing time means only that these variables have no effect on the formation of the joint probability distribution that is central to the analysis.
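Re-editing a column into wider categories is done by the user before upload, outside BayMiner. A minimal sketch of such a recoding is shown below; the function name, the example town and region names, and the "other" fallback label are all hypothetical.

```python
def recode(values, mapping, other="other"):
    """Map detailed values to wider categories.

    Values that have no entry in the mapping fall into a catch-all
    category, so the recoded column never has more distinct values
    than the mapping defines plus one.
    """
    return [mapping.get(v, other) for v in values]


# Hypothetical example: towns grouped into regions.
regions = {"Espoo": "south", "Helsinki": "south", "Oulu": "north"}
towns = ["Espoo", "Helsinki", "Oulu", "Turku"]
region_column = recode(towns, regions)
```

The recoded column can then be added next to the original one or can replace it, as described above.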
The choices made in the pre-processing phase determine how the data are seen during the processing phase, so each different set of choices gives a somewhat different result. As the data have been gathered for a purpose, it is natural to start by analysing them in their entirety. The possibility of excluding some variables, e.g. those with a strong influence, can be used to increase the sensitivity to variables with a weak influence. Alternatively, the user might first ignore all but the most prominent variables, look at the results, and then bring more variables into the analysis for as long as the results seem to become more useful; this is entirely up to the user's needs and considerations. With the latter approach, the same data set can be uploaded several times and pre-processed differently, and the different pictures may be shown in different windows, even simultaneously.
Numeric (or mixed) columns are best turned nominal when some qualitative measurement is coded numerically, as with an ID value, or when there is some other reason not to let neighbouring values be joined together. The rationale behind discretization is that it both improves the quality of the results and speeds up the processing when the numeric values can be seen as coming from a continuous numerical range; that is why discretization is the default. In that case, dividing the observed range of numerical values into several sub-ranges is appropriate, since the values within any one sub-range convey approximately the same information. For numerically encoded qualitative information, on the other hand, the preconditions for discretization are not met. BayMiner knows nothing about how particular numeric values are to be interpreted, so an informed choice about discretization is strongly encouraged, especially when the number of data rows is low compared to the number of variables (columns).
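To make the idea of sub-ranges concrete, the sketch below uses equal-width binning. This is an assumption chosen for simplicity: the text does not say which discretization rule BayMiner applies, and the function name and the default number of bins are hypothetical.

```python
def equal_width_bins(values, n_bins=4):
    """Assign each numeric value the index of its sub-range.

    The observed range [min, max] is cut into n_bins intervals of
    equal width; the maximum value is placed in the last interval.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a single value, a single bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

With measurements such as ages, neighbouring values in one bin mean roughly the same thing. With numerically coded identifiers, the same binning would join unrelated codes, which is why such columns are better nominalized.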
Two further choices given to the user in the pre-processing phase are worth mentioning: determining the focus of the model, and changing the sample size. Focuses are discussed below (Section 6. Focused Models and Predicting), so only the sample size option is considered here. The choice is made by optionally altering the contents of an input field surrounded by the texts "Use a sample of 1 <=" and "<= #", where "#" denotes the number of rows in the current data set. This choice has no effect during the processing phase: all of the rows are taken into consideration regardless of it. Instead, it directs the representation phase by allowing just a (pseudo-randomly chosen) subset of the data rows to be visualized, which saves time with large data sets (over 500 rows or so). The dots that represent the sampled rows will be in approximately the same relative places in which they would have appeared had all the other rows been visualized as well, so the result is comparable to that of a full visualization. If the user needs to identify individual cases (rows), either all of them must be included or the analysis must be done in phases, with only part of the data in each model.
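The sample size option amounts to drawing a pseudo-random subset of row indices for display while the model itself still uses every row. A minimal sketch follows, assuming a seeded generator; the function name and the seed parameter are hypothetical.

```python
import random


def sample_rows(n_rows, sample_size, seed=0):
    """Pick the row indices to visualize.

    The bounds mirror the input field "Use a sample of 1 <= ... <= #":
    the sample size must lie between 1 and the number of rows.
    """
    if not 1 <= sample_size <= n_rows:
        raise ValueError("sample size must be between 1 and n_rows")
    rng = random.Random(seed)  # seeded, so the choice is repeatable
    return sorted(rng.sample(range(n_rows), sample_size))
```

Only the visualization is restricted to the returned indices; the probabilistic distances would still be computed from the complete data set.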
3. About Bayesian Networks
The BayMiner technology is rooted in the theory and application of Bayesian networks. These are computational tools that apply the mathematical calculus of probability to various modelling tasks. It is widely accepted that applying probability theory is a theoretically sound approach to modelling tasks, and developments in the theory of Bayesian networks, on the one hand, and huge improvements in the computational capacity of modern hardware, on the other, have turned these theoretically valid ideas into practical reality.
Among the most important things to know about Bayesian networks is that they represent joint probability models over given variables (i.e., columns in the input file). This means that every variable is represented by a correspondingly named node in the network. The (assumed) direct dependences between the (values of the) variables are often visualized as (directed) arcs between the corresponding nodes, and the conditional probabilities are stored in tables attached to the dependent nodes, giving the distribution of the values of the variable for each possible value combination of its immediate predecessors in the network. With this approach, information about one variable's observed value is propagated through the network to influence the assessments of the most likely distributions over the possible values of other, not directly observed, variables. Using Bayes' theorem, these influences are also identified "backwards", from dependent variables to their predecessors; hence the name Bayesian networks. Thus, the values of the variables are assessed not in isolation but in concert.
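The "backwards" use of Bayes' theorem can be shown on the smallest possible network: one predecessor node with an arc to one dependent node. The sketch below is a generic textbook illustration, not BayMiner's implementation; the variable names and the probabilities are invented for the example.

```python
def posterior(prior, cpt, observed):
    """Infer the predecessor's distribution from the dependent's value.

    prior    : {parent_value: P(parent_value)}
    cpt      : {parent_value: {child_value: P(child_value | parent_value)}}
    observed : the child value that has been observed
    """
    # Joint probability of each parent value with the observation.
    joint = {p: prior[p] * cpt[p][observed] for p in prior}
    total = sum(joint.values())  # P(observed), the normalizing constant
    return {p: joint[p] / total for p in joint}


# Hypothetical two-node network: weather -> grass.
prior = {"rain": 0.2, "dry": 0.8}
cpt = {
    "rain": {"wet": 0.9, "not wet": 0.1},
    "dry": {"wet": 0.1, "not wet": 0.9},
}
belief = posterior(prior, cpt, "wet")
```

Here observing wet grass raises the probability of rain from 0.2 to 0.18 / 0.26, which is about 0.69. Larger networks chain many such tables together, so that one observation influences the assessments of all connected variables.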
BayMiner constructs the Bayesian network automatically from the contents of the input file. The construction process is deterministic, so repeated uploads of the same input always lead to identical Bayesian networks (provided that the discretizing/nominalizing and ignoring of variables are identical, too). The operation of any Bayesian network is likewise deterministic, so the results obtained with this technology are self-consistent, as they should be. In the BayMiner implementation, however, the visual representation may vary slightly.
There are many different subclasses of Bayesian networks, but a general feature of them all is that they combine two mathematically well-defined, widely known and accepted formalisms, graph theory and probability theory, to obtain a semantically understandable and computationally implementable way of modelling, analysing, and predicting real-world phenomena.
All in all, the data analysing methods of BayMiner are ultimately based on the generally acknowledged mathematical theory of probability. The analysis is made using BayesIT's proprietary computer implementations of Bayesian nets, which enable a straightforward application of probability calculus to different kinds of data. The data may even have a multitude of missing values without leading BayMiner methods astray: the missing values are then assessed from what is present.
4. Second Phase: Processing
The main task of the BayMiner data processing phase is to assess the probabilistic distances between the different rows of the data. These distances are the very basis of the visualization of the data, as they determine its shape: the information learned from the data is (approximately; see Section 5. Third Phase: Representation) expressed as the distances between the dots representing the data rows.
The computation of the probabilistic distances is quite complex and not described in detail here. Interested readers are referred to scientific papers produced by the CoSCo research group.
The method produces more than just one probabilistic distance between any two data rows: each column has its own probabilistic distance values for each pair of data rows. There is also one overall probabilistic distance, focused not on any one column but on the general picture. It is determined using all the focused probabilistic distances together. All of these different models can be shown during the representation phase (Section 6. Focused Models and Predicting).
So, the way BayMiner handles questions of the differences between any two rows is not based on any ad hoc heuristics, such as just counting the numbers of similar column values, but on a thorough analysis of the entire data set and mathematically valid calculations concerning its joint probability distribution. This principled approach makes BayMiner a proper data analysis tool even for demanding applications.
5. Third Phase: Representation
From the probabilistic distances between data rows (calculated in the processing phase), three-dimensional distances between the corresponding dots (shown on screen) can be computed. As the data are multidimensional (each analysed column counting as one more dimension), not all of the distances can be expressed exactly in a single picture drawn in a lower-dimensional space. BayMiner iteratively seeks a good three-dimensional approximation of the relative probabilistic distances between the data rows in order to show a highly intuitive, visual representation of the data set on the user's screen.
Initially, every row in the visualization set is given a default position as a dot in a three-dimensional representation space. The differences between the distances separating these dots and the probabilistic distances between the corresponding data rows are first calculated, then squared and summed to form an error term (this is possible, since both distance measures are plain numbers). Thereafter, each dot position is visited in turn and adjusted in order to reduce the error. After every dot position has been visited, the error term is recalculated, and if it has improved by more than one thousandth (1/1000) compared with the previous one, the adjustment cycle is repeated. During this iterative phase, a text pane on the black area in the middle of the screen states: "Fetching visualization".
After the computation has sufficiently converged, a corresponding "scatter plot" is painted against a black background in the middle area of the screen; the three-dimensionality of the representation can be demonstrated by rotating it. Each dot in the picture represents a data row, and the three-dimensional distances between each pair of dots on the screen are approximations of the probabilistic distances between the corresponding data rows. The dot clusters that are formed are therefore genuine properties of the analysed data sets, seen from a probabilistic perspective.
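The adjustment cycle described above can be sketched as follows. The error term and the 1/1000 stopping rule follow the text. How each dot is adjusted is not specified, so the gradient step with step halving, the helix-shaped default positions, and all function names and parameters are assumptions of this sketch.

```python
import math


def total_error(pos, target):
    """Sum of squared differences between dot and target distances."""
    n = len(pos)
    return sum((math.dist(pos[i], pos[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def local_error(i, point, pos, target):
    """The part of the error that involves dot i placed at 'point'."""
    return sum((math.dist(point, pos[j]) - target[i][j]) ** 2
               for j in range(len(pos)) if j != i)


def adjust_dot(i, pos, target, step):
    """Move dot i downhill; halve the step until the error shrinks."""
    grad = [0.0, 0.0, 0.0]
    for j in range(len(pos)):
        d = math.dist(pos[i], pos[j])
        if j == i or d == 0.0:
            continue
        coeff = 2.0 * (d - target[i][j]) / d
        for k in range(3):
            grad[k] += coeff * (pos[i][k] - pos[j][k])
    before = local_error(i, pos[i], pos, target)
    while step > 1e-9:
        trial = [pos[i][k] - step * grad[k] for k in range(3)]
        if local_error(i, trial, pos, target) < before:
            pos[i] = trial  # accept only moves that reduce the error
            return
        step /= 2.0


def layout(target, step=0.05, tol=1e-3, max_cycles=500):
    """Seek 3-D positions whose distances approximate 'target'."""
    n = len(target)
    # Assumed default positions: distinct points on a helix.
    pos = [[math.cos(i), math.sin(i), i / n] for i in range(n)]
    error = total_error(pos, target)
    for _ in range(max_cycles):
        for i in range(n):
            adjust_dot(i, pos, target, step)
        new_error = total_error(pos, target)
        improved = error - new_error > tol * error  # more than 1/1000
        error = new_error
        if not improved:
            break
    return pos, error
```

Because a move is accepted only when it lowers the part of the error that involves the moved dot, the total error never increases from one cycle to the next in this sketch.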
Thus, the picture has no axes, at least not in any semantically meaningful sense. The dot cloud seen on the screen can be rotated into any position, and none of them is more correct than any of the others. It is the relative positions of the dots that matter, not the place on the screen where they are seen.
The relative dot positions remain intact regardless of the rotations, shifts or zooms that the dot cloud may have gone through (though these operations might render some or even all of the dots temporarily off screen). The colours of the dots, too, remain intact when the dot cloud is moved. The colour of each dot represents the value of a variable (column) in the row corresponding to that particular dot (the current colour scale with its legend is shown on the left). The variable whose values are shown can be changed from the list of choices on the left. In that case, the colours of the dots (as well as the contents of the legend) change in order to convey the desired information to the user.
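That rotating or shifting the cloud leaves the distances between dots unchanged can be checked with a few lines of code. The sketch below rotates about one axis only, and its function names are hypothetical; any rotation in three dimensions behaves in the same way.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3-D point about the z axis by 'angle' radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def shift(point, offset):
    """Translate a 3-D point by 'offset'."""
    return tuple(p + o for p, o in zip(point, offset))


a, b = (1.0, 2.0, 3.0), (-2.0, 0.5, 4.0)
before = math.dist(a, b)
moved_a = shift(rotate_z(a, 0.7), (5.0, -1.0, 2.0))
moved_b = shift(rotate_z(b, 0.7), (5.0, -1.0, 2.0))
after = math.dist(moved_a, moved_b)  # equal to 'before' up to rounding
```

Zooming multiplies every distance by the same factor, so the ratios between distances, and with them the shape of the cloud, are preserved as well.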
6. Focused Models and Predicting
BayMiner produces two different kinds of models: models with a specific variable as focus, and a model with no specific focus. A focused model is produced when the rows are considered with regard to one particular variable; in this case, the relative dot positions express the probabilities that the corresponding data rows have similar values in this particular column, based on their other values. There is one focused model for each variable that has been taken into consideration during the processing phase (i.e., has not been ignored in the pre-processing phase), and each of them is usually different. There is only one unfocused model for the whole data set; it has been built from all the focused models and thus reflects the overall probabilistic distances between the pairs of data rows. The model with no specific focus is shown by default at the beginning of the representation phase; this default can be changed in the pre-processing phase from the "Model focus" field just below the sample size choice.
Any of these models may be chosen from the list of choices near the upper left corner of the display during the representation phase. Each focused model has a percentage value in parentheses after its name; this percentage indicates the relative improvement obtained by using the BayMiner focused model for the respective prediction, compared with predicting the most common value in every case. By convention, this value is never negative, so a zero value means that no improvement was made, regardless of whether the result was as good as or worse than its trivial counterpart.
Focused modelling enables prediction, too. This is done by choosing the known values of a "new" data row whose expected location in the focused model is to be predicted (i.e., its approximate focused probabilistic distances to the existing rows). The user may set the value of just one variable, of all the variables, or of any combination of variables from the profile bar on the right, and then click the "Predict" button to let BayMiner show the predicted relative position as a sight figure on the screen. If no value is set but a prediction is still requested, the sight will show an average location in the middle of the "dot cloud". The current prediction settings may be changed and a new prediction executed, or the prediction may be erased by clicking the "Clear" button.
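One way such a percentage could be computed is sketched below. The text does not give the exact formula, so the definition used here, the share of the baseline's errors that the model removes, is an assumption, as is the function name. Only the comparison against the most common value and the clipping at zero come from the description above.

```python
from collections import Counter


def relative_improvement(actual, predicted):
    """Percentage of majority-class errors removed by the model.

    The baseline predicts the most common actual value for every row.
    Results below zero are reported as zero, following the convention
    that the value is never negative.
    """
    majority_hits = Counter(actual).most_common(1)[0][1]
    baseline_errors = len(actual) - majority_hits
    model_errors = sum(a != p for a, p in zip(actual, predicted))
    if baseline_errors == 0:
        return 0.0  # the baseline is already perfect
    gain = 100.0 * (baseline_errors - model_errors) / baseline_errors
    return max(0.0, gain)
```

With actual values a, a, a, b, b the baseline makes two errors; a model that makes one error scores 50 under this definition, and a model that makes three errors scores 0 rather than a negative number.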
7. Cluster Formation and Analysis
In the context of BayMiner cluster analysis, clusters are sets of dots chosen by the user. In a cluster that depicts something typical of the domain of the data, the dots are located near each other and, ideally, apart from the remaining dots. The rationale for using clusters of this kind is that they represent probabilistically natural subgroups within the analysed data. This is achieved, again, thanks to the joint probability distribution approach that is basic to the BayMiner methodology.
If distinctive clusters do form on the display, it is an indication of the existence of corresponding substructures within the domain of the data, so analysing the properties of different clusters enables the user to identify hidden, often co-occurring phenomena leading to the formation of the clusters.
A randomly generated data set is typically seen as a uniform cloud of dots on the screen: every dot tends to be at an approximately equal distance from its nearest neighbours, the overall shape of the cloud is approximately round, without salient features, and the distribution of different values within the cloud (i.e., the colours of the dots) is haphazard, regardless of the focus in question (including the choice of no specific focus) and of the variable currently shown.
The BayMiner user is given several cluster editing options, e.g., the possibility to choose all the dots of a certain colour (thus getting a whole defined class of data for further analysis) and to add new dots to the current cluster or remove some old ones from it. Together, these tools enable the user to form any kind of cluster at will; nevertheless, the more natural ones are probably the most useful to analyse further. BayMiner lets the user name clusters at will, and all the named clusters are automatically stored with the session. All of the dots belong to the predefined supercluster named All.
The cluster naming and saving option offered below the main display and the profile bar on the right are the primary tools for cluster analysis in BayMiner. One of the named clusters can be chosen for closer analysis and one for comparison; the "All" supercluster is the default for the latter role. The profile bar simultaneously shows all the distinctive (as well as less distinctive) relative features of the selected cluster (which is shown coloured on the central pane), compared against those of the comparison cluster. The degree of distinctiveness is indicated by the amount of red colour under the baseline. These two clusters may also have elements in common.
Looking through the profile bar gives the user a general view of the overall differences and similarities between the compared clusters, making it easier to grasp what really matters about their differences. Clicking any of the profile components opens a magnified version of it in a separate window for closer inspection. These windows have a set of functions that can be chosen by clicking the icons in the top right corner. The first transforms the selection into a list format, which displays long names more readably. The second, in the middle, changes the order of the values. The third hides the values that are not present in the selection. Several variable windows can be open at the same time.
The current cluster, shown coloured on the screen, is compared with the comparison cluster even if it has not yet been named. The default name of the current cluster (perhaps still under construction) is Untitled. It starts out identical to the "All" supercluster (all the dots are included), but every time a cluster editing operation changes the contents of the current cluster, it is the new contents that are referred to as Untitled. The centre of each named cluster is shown together with a name legend when the "Show labels" option is ticked.
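The kind of comparison a profile component conveys can be approximated by contrasting the relative frequencies of one variable's values in the two clusters. The sketch below is only an illustration: the text does not define how BayMiner measures distinctiveness, so the plain difference of relative frequencies, the function name, and the example values are assumptions.

```python
from collections import Counter


def profile(selected, comparison):
    """Difference in relative frequency, per value, for one variable.

    Both arguments are non-empty lists holding the variable's values
    in the rows of the selected and the comparison cluster. A positive
    result means the value is over-represented in the selected cluster.
    """
    sel, comp = Counter(selected), Counter(comparison)
    values = sorted(set(sel) | set(comp))
    return {v: sel[v] / len(selected) - comp[v] / len(comparison)
            for v in values}


# Hypothetical variable with values x, y, z.
diff = profile(["x", "x", "y"], ["x", "y", "y", "z"])
```

In this example the value x is more common in the selected cluster (2/3 against 1/4), while z occurs only in the comparison cluster.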
8. Summary
In short, the BayMiner data analysis facilities enable the user to actually watch the current data from different angles and come to grips with it. The user works with a dynamic model of the domain rather than with a static picture. Thus the user is not left alone with predefined forms of printed analysis reports on the distributions of separate variables, nor with abstract descriptions of hypothetical factors behind them. Instead, the data set as a whole is first optionally tailored by the user's choices in the pre-processing phase, then analysed probabilistically and multi-dimensionally during the processing phase, and last, but not least, visualized three-dimensionally in the representation phase, so that it is readily available for further elaboration by the user: easily, interactively, and in an intuitively natural way.