Frequently asked questions about using BayMiner

Q: How does BayMiner use weighted mean?
A: BayMiner does not use any form of formal weighting; it determines a kind of information-technical weight by itself. It does not use any method to produce means either. From the distribution diagrams you can easily identify a mean that is significant for you. A diagram is much more valuable for you, because it is a probability distribution over all available values. You must interpret it, but the end result is much more real information than a cryptic overall figure whose components you can neither see nor understand.

Q: Can I estimate the distribution of a mass or density function of my observed sample data?
A: Yes, you can estimate the distribution of your observed sample data. Just check the visualization of the distributions.

Q: What sampling scheme is used to select a data sample?
A: BayMiner uses simple random sampling to select the data sample.
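
As an illustration only, simple random sampling without replacement can be sketched in Python as follows. The function name `simple_random_sample` and the `seed` parameter are assumptions made for this example; this is not BayMiner's own code.

```python
import random

def simple_random_sample(rows, k, seed=None):
    """Draw k rows without replacement; every row is equally likely."""
    rng = random.Random(seed)  # seed is optional, only for repeatable runs
    k = min(k, len(rows))      # never ask for more rows than exist
    return rng.sample(rows, k)

# Hypothetical data: 1000 observations, sample of 500
rows = list(range(1000))
sample = simple_random_sample(rows, 500, seed=1)
```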

Q: What types of data can BayMiner read?
A: BayMiner requires data to be in a simple table and accepts the tab-delimited text format only. This means it can use data sets exported from a variety of database management systems (DBMS) as well as from most spreadsheets. BayMiner's data read engine is very tolerant.
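
If your data lives in a program rather than a spreadsheet, a tab-delimited table can be written with Python's standard `csv` module. This is a minimal sketch; the function name `to_tab_delimited` and the column names are made up for the example.

```python
import csv
import io

def to_tab_delimited(header, rows):
    """Return the table as tab-delimited text with one header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)   # first line: variable names
    writer.writerows(rows)    # one line per observation
    return buf.getvalue()

# Hypothetical table with three variables and two observations
text = to_tab_delimited(
    ["id", "colour", "weight"],
    [[1, "red", 2.5], [2, "blue", 3.0]],
)
```

To save the result, write `text` to a file with a `.txt` extension and upload that file.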

Q: Why does BayMiner not show a log of the calculations it takes to produce a model?
A: BayMiner does not use passes in the calculation. Nor does it reduce the number of passes that the analytical engine of a conventional statistical tool needs to get through the data within a sensible time frame. Therefore, after the calculation, BayMiner holds no metadata with summary statistics about how it handled the various types of numeric or categorical variables.

Q: How long does it take to calculate a model?
A: Calculation times are very difficult to predict. The time depends on the predictability of the data. Some general estimates: small models (tens of Kbytes) take tens of seconds, medium-sized ones (hundreds of Kbytes) take minutes, and big ones (Mbytes) take tens of minutes. The biggest factor that increases processing time is the number of variables, the second most significant is the number of values of those variables, and the third is the number of observations.

Q: Are "ignored" variables still identifiable during the analysis phase?
A: Yes, "ignored" variables are excluded from the model building process, but they are still identifiable in the analysis phase.

Q: Why is BayMiner setting my date variable to "ignored"?
A: BayMiner sets variables that are associated with a date format to "ignore", usually because there are too many different values. Nominal variables are set to "ignore" when the number of variable values is greater than approximately 100. This takes place only if you force the type of a numeric variable to nominal.

Q: When I upload a table, why is one of the variables set to "ignore"?
A: When a variable has only one value, the variable is set to "ignore" because a constant value cannot help predict the target variable. The same applies when every value is different. In either case the data does not provide any useful information for the model building.
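
You can check your table for such columns before uploading it. The sketch below is an illustration under our own assumptions, not BayMiner's internal logic: it flags columns with a single value and columns where every value differs. The "every value differs" check is meant for nominal columns such as IDs or names. The function name `uninformative_columns` is hypothetical.

```python
def uninformative_columns(header, rows):
    """Map column name -> reason, for columns that carry no information."""
    flagged = {}
    for i, name in enumerate(header):
        values = [row[i] for row in rows]
        distinct = len(set(values))
        if distinct <= 1:
            flagged[name] = "constant"           # one value only
        elif distinct == len(values):
            flagged[name] = "all values differ"  # e.g. an ID column
    return flagged

# Hypothetical table: "id" is unique per row, "country" never changes
header = ["id", "country", "group"]
rows = [["1", "FI", "a"], ["2", "FI", "b"], ["3", "FI", "a"]]
report = uninformative_columns(header, rows)
```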

Q: Why does BayMiner not use rule induction to improve classification accuracy?
A: The rule induction method usually produces a tree model. Since a medium-sized tree cannot be calculated considering all possible influencing variables within a sensible time frame, it makes no sense to limit the number of combinations artificially by forcing a simplified classification either. Neither does BayMiner remove any variables from the training data set and set them aside. BayMiner calculates a probabilistic model, so it does not need to distort information to speed up calculations. Besides, if you have very rare event data and need to improve its classification, it is not possible to increase information by reducing data. It may be possible to improve the performance of a model, but it is much safer that a human with domain knowledge does the exclusion.

Q: Is there a way to display a listing of supervised models in BayMiner?
A: Yes, just look at the "Focus on" feature (drop-down list).

Q: How can I compute values for density, mass, or a cumulative distribution function of certain types of theoretical distributions?
A: You cannot do it in BayMiner. BayMiner does not ask you to guess which distribution a phenomenon follows; it automatically approximates even very complex distributions, which is important because real-life data seldom follow theoretical distributions. Use BayMiner to identify the phenomenon and let it help you set the parameters correctly in a conventional statistics program. After that, repeat the analysis using the conventional program and use it to calculate the values you need. This is, by the way, a very fast method to achieve sensible results when a phenomenon is obscure and difficult to identify with conventional methods.
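
For simple cases the conventional program can be a few lines of Python. The sketch below uses only the standard library and assumes a standard normal distribution and a binomial distribution purely as examples; the parameters are not taken from any BayMiner model, and `binomial_pmf` is a helper written for this illustration.

```python
from math import comb
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability mass of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

normal = NormalDist(mu=0.0, sigma=1.0)    # assumed example parameters
density_at_zero = normal.pdf(0.0)         # density function value
cumulative_at_zero = normal.cdf(0.0)      # cumulative distribution value
mass_two_of_four = binomial_pmf(2, 4, 0.5)  # mass function value
```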

Q: How do I modify a data set in place?
A: To modify a data set in place, use the "ignore" and "nominalization" commands on the "new visualization set" page on the user interface.

Q: How do I collapse a BayMiner data set into one observation?
A: Why should you? You cannot collapse a set of observations by a command. If BayMiner identifies similar instances, it will collapse them automatically. To learn how to interpret visual phenomena, see Quick Guide to Interpretation and Interpretation Guide.

Q: In BayMiner, how can I make sure that the browser window always has focus or is on top?
A: This depends on the browser and its settings on the local PC, so BayMiner cannot make the window behave in a certain way by default.

Q: In BayMiner, how can I resize the browser window?
A: Use the conventional methods provided by the browser.

Q: How can I calculate the P-value in BayMiner?
A: You cannot do it, because the concept of a P-value does not exist in the Bayesian world. If you need a P-value, collect sample data and calculate the appropriate test statistic for the test you are performing. It is, however, worth using BayMiner to find the result rapidly and then using a conventional statistical tool to compute the P-value from the data set.
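
As a hedged example of the conventional route, the sketch below computes a two-sided P-value for a one-sample z-test. The choice of test, the known standard deviation `sigma`, and the function name `two_sided_z_test` are all assumptions for the illustration; pick the test that fits your own data.

```python
from math import erfc, sqrt

def two_sided_z_test(sample, mu0, sigma):
    """Return (z, p) for H0: population mean equals mu0, sigma known."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / sqrt(n))  # test statistic
    p = erfc(abs(z) / sqrt(2))            # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical sample of four measurements
z, p = two_sided_z_test([5.1, 4.9, 5.3, 5.5], mu0=5.0, sigma=0.2)
```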

Q: In BayMiner, can I sort on columns by clicking on the columns headers like in Excel?
A: No, it is not possible.

Q: How can I see the number of missing values and patterns of missing values in my data file?
A: A data set frequently has missing values, i.e. "holes" in it. Some statistical procedures, such as regression analysis, will not work as well, or at all, on a data set with missing values. For a statistical procedure to produce meaningful results, the observations with missing values have to be either deleted or the missing values have to be substituted. This is why you may want to know the number of missing values and their distribution. BayMiner can handle the modelling even if the data is VERY incomplete. Of course the quality of the model is not at its best if there are many missing values.
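
If you want those counts before uploading, a small script outside BayMiner can produce them. The sketch below is only an illustration: it assumes a tab-delimited table with a header row and assumes that a missing value appears as an empty field, which may not match your export. The function name `missing_value_report` is hypothetical.

```python
import csv
import io
from collections import Counter

def missing_value_report(text, missing_markers=("",)):
    """Count holes per column and count each pattern of missing columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    per_column = Counter()
    patterns = Counter()
    for row in reader:
        # Pad short rows so that trailing holes are counted too
        row = row + [""] * (len(header) - len(row))
        holes = tuple(
            name for name, value in zip(header, row)
            if value.strip() in missing_markers
        )
        per_column.update(holes)
        patterns[holes] += 1   # the empty tuple means a complete row
    return per_column, patterns

# Hypothetical three-row table with holes in "age" and "city"
table = "id\tage\tcity\n1\t34\tOulu\n2\t\tEspoo\n3\t\t\n"
per_column, patterns = missing_value_report(table)
```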

Q: How can I change a string variable into a numeric variable?
A: The problem occurs if you have a data set with a variable that appears to be a numeric variable but is really a string variable. Because you cannot perform most statistical operations on a string variable, you may want to turn the string variable into a numeric variable. BayMiner can handle such a variable either as a string or as a numeric. Select the mode according to the problem you want to solve. For example, if the number identifies a failed part and you want to trace it, it is better to handle it as a string, at least at the end of the process. It may be favourable to handle it as a numeric in the beginning to identify the overall correlations.

Q: How can I do a scatterplot with regression line as in SPSS?
A: You cannot draw regression lines in the BayMiner scatter plot as in SPSS. One way to circumvent this is to reduce the number of variables concerned. The strength of BayMiner is just that you do not need to guess what variables you can omit safely. Anyway, there are no tools in BayMiner to draw regression lines.
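
If you need the line itself, you can compute it outside BayMiner and plot it in the tool of your choice. The sketch below is an ordinary least-squares fit for one predictor, written for illustration; the function name `least_squares_line` and the sample numbers are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # requires at least two distinct x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```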

Q: How can I graph two (or more) groups using different symbols?
A: You can add a variable with the names of the groups to the final table you upload to BayMiner. Select the variable with the "highlight variable" function and either use its values as such, or create groups of them using the "select these" function. Name the groups using the "named selections" (selection/comparison) function and compare them to each other.

Q: How do I interpret the results from crosstabs?
A: BayMiner does not produce crosstabs. The crosstab, in the sense that it indicates anticipated relationship between two (or some few more) variables, is a very limited view on complex data. It is also tricky to interpret the results from crosstabs in many tools. The fundamental problem is there is a risk that the data contains co-occurrences that are not identified with crosstabs. Another risk is that the researcher when selecting what to include in the calculation makes judgment errors that obviously influence the results. With BayMiner these problems do not complicate your analysis.

Q: How can I analyse my data by categories?
A: Use the "highlight variable" function to visualize the categories deterministically. If you wish to identify complex categories, use the cluster view to identify categories. If your data consist of tens of thousands of observations you might need to do it in phases. One way that is frequently recommended is to split the data file into different data files and conduct the same analyses on the two (or more) data sets. However, that is cumbersome and error prone. Therefore use sampling as far as possible.

Q: How can I test contrasts and interaction contrasts?
A: It can be very tricky to carry out the tests in conventional statistical packages. When there is a slightly higher number of interactions (e.g. three-way interactions, four-way interactions, etc.) it is still possible, but the result becomes very error prone. When there are tens of them it is virtually impossible. In BayMiner you do not need to do these tests!

Q: Can I have several models from same data open simultaneously?
A: Yes. It is a highly recommended practice that power users enjoy much. BayMiner allows you to work with numerous data files at once.

Q: How do I look at the distributions of variable values?
A: Double-click the distribution icons you want to look at. To gain more information out of your data, you may manipulate the visualization of the distributions using the commands whose icons are in the upper right corner. There are commands to change to list view, to change the value ordering and to hide values.

Q: How do I look at the average variable values?
A: Look at the distributions and draw your conclusions directly.

Q: How can I select only certain cases for analysis?
A: Choose the class variable using the "highlight variable" command and select from the list of values those you want. You may add them to a subset (the menu opens with the right mouse button when the cursor is on any of the coloured squares) and save it, e.g. for comparison with another subset.

Q: How do I collapse interval variables into categories?
A: BayMiner creates intervals of numeric variables by default so you get categories automatically, unless you specify a numeric variable to be handled as a set of discrete values.
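
To get a feel for what turning a numeric variable into intervals means, here is a generic equal-width binning sketch. It is deliberately simple and is not BayMiner's method; BayMiner chooses its intervals automatically with its own algorithm. The function name `equal_width_bins` and the bin count are assumptions for the example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0..n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]       # all values identical: one category
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical values 0..10 split into five intervals of width 2
categories = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)
```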

Q: When would use of the Filter Outliers function be beneficial?
A: There is no Filter Outliers function in BayMiner. It separates and visualizes the outliers automatically so the user may remove them from the original data set if he wishes. Since they do not usually cause a deterioration of the model this is a much more practical way to handle outliers. In most other methods the outliers have a significant effect on model parameters but as BayMiner does not use model parameters it does not matter in BayMiner.

Q: On the "Focus on" list in the BayMiner, why aren't all of my variables displayed?
A: BayMiner considers only active variables when creating the general model; the list displays only those variables that influence the general model (i.e. are not ignored).

Q: In the BayMiner, is the entire data set used or just the data sample?
A: By default the whole data set is used, but the visualisation uses a sample of 500 by default. You can choose to visualize the entire data set if you have a license covering your need and the data set is not too big. However, if you select the entire data set, be aware that it will be uploaded to the BayesIT server, which may take time depending on the bandwidth of your connection.

Q: How can I place a list and the dot cloud graph side-by-side in BayMiner?
A: You can open several browser windows and position them as you like. It is a good idea, for example, to place an "output" variable window in list format next to the worktable and watch how changes to a particular set of criteria change the content of a short list. The enlarged distribution opens in its own window, so this is easy to do.

Q: How does a high value of kurtosis influence the BayMiner modelling?
A: BayMiner's proprietary Bayesian Networks engine includes a very advanced algorithm to optimise the discretization of continuous values. If the kurtosis value is high, it usually means that the number of intervals produced by the algorithm is smaller. But the end result also depends on how many observations are processed, the shape of the function and how strongly the values correlate with each other.
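
If you want to check the kurtosis of a variable yourself, it can be computed outside BayMiner. The sketch below uses the population excess kurtosis (fourth central moment divided by the squared variance, minus 3). That definition is our assumption for the example; other tools may use a bias-corrected variant. The function name `excess_kurtosis` is hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2**2 - 3                          # requires non-constant data

# Hypothetical evenly spread values give a negative (flat) kurtosis
flat = excess_kurtosis([1, 2, 3, 4, 5])
```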

Q: What should I do if I have systematic blocks of missing data?
A: BayMiner's data modelling (processing) method is very tolerant of missing data, but if you have one big block, the data set might be problematic to analyse. Probably the best alternative is to test both one combined data set that includes the block, and then two parts, of which one includes the block and the other does not. If the combined data set clusters so that the block with missing data separates itself visually, it is probably better to continue the analysis with the part that does not include the block. One of BayMiner's strengths is rapid visual testing; use it! Generally speaking, the common characteristics of different data blocks bring more value to the model than their differences damage it!

Q: Why do the bar heights representing a cluster sometimes exceed those of the total data set?
A: They do not. The most likely explanation is that you have forgotten to change the comparison set in the "named variables" function. Another alternative is that some bars are hidden. Use the hide/expose function in the distribution window (rightmost icon) to expose them.

Q: Why do I sometimes get a dot cloud that does not show any colour shapes?
A: The most likely reason is that the first column in your table contains a variable without information, such as an ID number or a name. Use the "highlight variable" function to check whether the phenomenon repeats for all other variables. If it does, you have a fully random data set.

Q: Why are all the squares sometimes not framed although I have all dots selected in the cloud?
A: The most likely reason is that the visualization sampling has not included all instances of a particular value from the data set. A thick frame signifies that all instances are included in the selection. A narrow frame signifies that some instances are included in the selection. No frame signifies that no instances are included in the selection.

 
Copyright © Bayes Information Technology Oy 2005. All rights reserved. See Legal Notice.