- What's defect prediction?
- How to predict?
- Why transfer?
- How to transfer?
Human programmers are clever, but not flawless. Coding adds functionality, but also defects. Since programming inherently introduces defects into programs, it is important to test them before release.
For software developers (the bug creators), there are several tools to help you get rid of some bugs:
For the quality assurance (QA) team, software assessment budgets are finite, while assessment effort increases exponentially with assessment effectiveness. For example, with black-box testing methods, a linear increase in the confidence C of finding defects can require exponentially more effort. So the standard practice is to focus the available resources on the code sections that seem most critical (most bug-prone).
We need a way to predict which files, modules, or classes are (probably) bug-prone before testing. This is where defect predictors come in!
Menzies, T.; Greenwald, J.; Frank, A., "Data Mining Static Code Attributes to Learn Defect Predictors," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2-13, Jan. 2007.
What does prediction mean? Use historical data as training data to train (fit) a data mining algorithm. When new testing data comes in, we pass it to the learner (predictor) to get an estimated label (defective or non-defective) for each instance.
Here's an example data set from the ivy project.
Example (see the sketch after this list):
- Training data set: ivy-1.1
- Learner: CART, random forests, logistic regression, and so on.
- Predicting data set: ivy-1.4
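To make the workflow concrete, here is a minimal sketch in Python with scikit-learn. The CSV file names and the `bug` label column are illustrative assumptions, and any of the learners above could be swapped in.

```python
# Minimal within-project defect prediction sketch (file names and the
# "bug" label column are hypothetical placeholders).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # a CART-style learner

train = pd.read_csv("ivy-1.1.csv")   # historical release: training data
test = pd.read_csv("ivy-1.4.csv")    # new release: data to predict

X_train, y_train = train.drop(columns=["bug"]), train["bug"]
X_test, y_test = test.drop(columns=["bug"]), test["bug"]

learner = DecisionTreeClassifier()   # or RandomForestClassifier, LogisticRegression, ...
learner.fit(X_train, y_train)        # train (fit) on the historical release
predicted = learner.predict(X_test)  # estimated labels for the new release
print("accuracy:", (predicted == y_test).mean())
```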
Such defect predictors are easy to use, widely used, and useful.
- Easy to use: static code attributes can be automatically collected, even for very large systems.
- Widely used: researchers and industrial practitioners (e.g., at NASA) use static code attributes to guide software quality predictors.
- Useful: defect predictors often find the location of 70% (or more) of the defects in the code.
Q: What are the problems/limitations of this paradigm? The data? The attributes?
The above paradigm works when the training and testing (prediction) data sets are available within the same project; this is called Within-Project Defect Prediction.
What if we want to predict defects in a new project with little or no historical information? How do we get training data sets? Can we use:
- data sets from different projects (within the same organization) with the same attributes,
- data sets from different projects (within the same organization) with different attributes,
- or data sets from a different organization?
In short, can we use data from other sources as training data?
Q: For data mining, what is the relationship between training and testing data?
Nam, Jaechang, and Sunghun Kim. "Heterogeneous defect prediction." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015.
Key idea: Synonym discovery
Given a target (testing) data set, we have to find an appropriate training set to build the learner. Here "appropriate" means the distribution of the source (training) set should be the most similar to that of the target (testing) data set.
Assumption: Training and testing data sets are from different projects with different attributes.
Data set:
Steps:
- Metric (attribute) selection: apply a metric selection technique to the source.
- Feature selection is a common method in data mining for selecting a subset of features by removing redundant and irrelevant ones.
- E.g., gain ratio, chi-square, and relief-F methods.
- The top 15% of metrics are selected (see the selection sketch after these steps).
- Metric (attribute) matching
- The key idea is to compute matching scores for all pairs of source and target metrics.
- Metrics are matched based on their similarity, such as the distribution of, or correlation between, source and target metrics:
- Percentile-based matching
- Kolmogorov-Smirnov test based matching
- Spearman's correlation based matching
- Maximum weighted bipartite matching is then used to select the group of matched metrics whose sum of matching scores is highest.
- Prediction: once we have the best-matched source and target metric sets, we can build a learner with the source data set and predict the labels of the target data set.
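As a sketch of the metric-selection step, the snippet below keeps the top 15% of source metrics by chi-square score (gain ratio and relief-F have no scikit-learn implementation, so chi-square stands in here); the `source.csv` file and `bug` column are assumptions.

```python
# Metric (attribute) selection sketch: keep the top 15% of source metrics
# by a univariate score. chi2 needs non-negative values, which holds for
# typical static code metrics (counts, sizes, complexities).
import pandas as pd
from sklearn.feature_selection import SelectPercentile, chi2

source = pd.read_csv("source.csv")                # hypothetical source project
X, y = source.drop(columns=["bug"]), source["bug"]

selector = SelectPercentile(chi2, percentile=15)  # top 15% of metrics
selector.fit(X, y)
print("selected metrics:", list(X.columns[selector.get_support()]))
```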
Details:
- The KS test is a non-parametric two-sample test that is applicable when we are not sure about the normality of the two samples. In defect prediction data sets, some features have exponential distributions and others are unknown. Using the KS test, we can find the best-matched source-target attributes (see the sketch below).
- For each target data set, we compare it against all the source data sets except itself, and the source data set with the highest score is selected as the training data for this testing data.
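Putting the matching pieces together, here is a minimal sketch: the matching score of each (source, target) metric pair is the p-value of a two-sample KS test, and SciPy's `linear_sum_assignment` computes the maximum weighted bipartite matching. The DataFrame inputs and the 0.05 cutoff are illustrative assumptions.

```python
# KS-test based metric matching + maximum weighted bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import ks_2samp

def match_metrics(src, tgt, cutoff=0.05):
    """src, tgt: pandas DataFrames (rows = instances, columns = metrics)."""
    # Matching score for every (source, target) metric pair.
    scores = np.zeros((src.shape[1], tgt.shape[1]))
    for i, s in enumerate(src.columns):
        for j, t in enumerate(tgt.columns):
            scores[i, j] = ks_2samp(src[s], tgt[t]).pvalue
    # linear_sum_assignment minimizes total cost, so negate to maximize.
    rows, cols = linear_sum_assignment(-scores)
    # Keep only pairs whose matching score clears the cutoff.
    return [(src.columns[i], tgt.columns[j], scores[i, j])
            for i, j in zip(rows, cols) if scores[i, j] > cutoff]
```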
Copyright © 2015 Tim Menzies.
This is free and unencumbered software released into the public domain.
For more details, see the license.