Presented at ESCOM 6, Rolduc 1995
Chris Schofield and Martin Shepperd, Department of Computing
Bournemouth University, Talbot Campus
Poole, BH12 5BB, UK
email: cschofie@bournemouth.ac.uk
Abstract
Software project costs are notoriously difficult to estimate in advance. To date most work has focused upon algorithmic cost models such as COCOMO and Function Points. These suffer from the disadvantage that the model must be calibrated to each individual measurement environment. An alternative approach is to use analogy for estimation. This has shown considerable promise (see, for example, Atkinson and Shepperd [2], where the method was shown to outperform traditional algorithmic methods for at least some datasets).
A disadvantage of estimation by analogy is that it requires a considerable amount of computation. This paper describes an automated environment that supports the collection, storage and identification of the most analogous projects in order to estimate the effort for a new project. The system is based upon the minimisation of Euclidean distance in n-dimensional space. The software is flexible and can deal with differing datasets both in terms of the number of observations (projects) and in the variables collected. This is demonstrated by using the system to analyse three distinct datasets drawn from three different environments.
It is widely accepted that effective cost estimation demands more than one technique. We have shown that costing by analogy is a candidate technique and that, with the aid of an automated environment, it is an eminently practical one.
Keywords: cost estimation, analogy, automation, software tool
1. Introduction
Effective methods for software project cost estimation have been a research goal for more than 20 years. A variety of approaches have been advocated, such as expert judgement (including bottom up methods), structured expert methods (for example the Delphi technique [3]), algorithmic models (COCOMO and Function Points) and analogy based estimation. The last of these has received comparatively little attention, in part due to the difficulties associated with finding suitable analogies within a software development organisation. Nevertheless the prospect of analogy based estimation is attractive because the approach is self calibrating -- unlike the algorithmic models -- and because it can be deployed as a complementary technique acting as a "sanity" check for estimates derived by other means.
This paper presents our technique for estimating by analogy and then goes on to describe an automated tool that supports data collection and estimation by analogy. The approach is demonstrated using three different datasets. The paper concludes with an analysis of the strengths and weaknesses of estimation by analogy and suggestions for further work.
2. Estimation by Analogy
The basis of estimation by analogy is to characterise (in terms of a number of variables) the project for which the estimate is to be made and then to use this characterisation to find other similar projects that have already been completed. The known effort values for these completed projects can then be utilised to construct an estimate for the new project.
Although the concept of estimating by analogy is relatively straightforward, there are a number of problems that must be addressed. First, we have to determine how best to characterise projects. Possibilities include the application domain, the number of inputs, the number of screens and so forth. Note that different variables are measured on different scales -- application domain type is a categorical variable and so is measured on a nominal scale, whilst the number of inputs is a count and so is measured on an absolute scale. Scale is important as it impacts upon the manner in which we can handle a variable.
The second problem is, once we have characterised projects, how do we determine similarity and, related to this, how much confidence can we place in the analogies? A further related problem is knowing how many analogies to search for: too few might lead to maverick projects being used; too many might dilute the effect of the closest analogies.
The third, and final, problem is how do we use the known effort values from the analogous projects to derive an estimate for the new project? Possibilities include means and weighted means (giving more influence to the closer analogies).
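The difference between the two options can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and inverse-distance weighting is an assumed scheme chosen as one common way of giving closer analogies more influence, not one prescribed here.

```python
def unweighted_mean(efforts):
    """Every analogy contributes equally to the estimate."""
    return sum(efforts) / len(efforts)


def inverse_distance_weighted_mean(efforts, distances, eps=1e-9):
    """Closer analogies (smaller distance) receive larger weights.

    Assumed scheme: weight = 1 / distance. The small eps avoids division
    by zero when an analogy coincides exactly with the new project.
    """
    weights = [1.0 / (d + eps) for d in distances]
    weighted_sum = sum(w * e for w, e in zip(weights, efforts))
    return weighted_sum / sum(weights)


# Two analogies with known efforts; the first is three times closer.
efforts = [200.0, 260.0]
distances = [1.0, 3.0]
print(unweighted_mean(efforts))                            # 230.0
print(inverse_distance_weighted_mean(efforts, distances))  # approximately 215
```

With the weighted form the estimate is pulled towards the effort of the closer analogy.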
Our approach is to be flexible in terms of the variables available to characterise software projects. In general, more variables are better than fewer; in practice, however, one is constrained to use the data that is available. Analogies are found by measuring Euclidean distance in n-dimensional space where each dimension corresponds to a variable. Values are standardised so that each dimension contributes equal weight to the process of finding analogies. The only limitation we impose is that we cannot handle categorical data other than binary valued variables (e.g. small-large or realtime-information system). We also offer some flexibility in the number of analogies that we search for, ranging between one and three. However, our experience with an industrial dataset suggests that two is the most effective [2]. It is likely, though, that different datasets will exhibit different characteristics. Having found the analogous projects we use an unweighted mean of their known effort values as the predicted value.
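The procedure just described can be sketched in a few lines. This is an illustrative sketch rather than the ANGEL implementation: the function names and the sample data are hypothetical, and min-max rescaling to the range 0 to 1 is an assumed form of standardisation, since the text above does not say which form is used.

```python
import math


def make_scaler(projects):
    """Build a min-max scaler from the completed projects (assumed standardisation)."""
    n_vars = len(projects[0])
    lows = [min(p[i] for p in projects) for i in range(n_vars)]
    highs = [max(p[i] for p in projects) for i in range(n_vars)]

    def scale(p):
        # A variable with no spread carries no information, so map it to 0.
        return [
            (p[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in range(n_vars)
        ]

    return scale


def estimate_by_analogy(completed, efforts, target, k=2):
    """Unweighted mean effort of the k completed projects closest to target."""
    scale = make_scaler(completed)
    scaled_target = scale(target)
    ranked = []
    for features, effort in zip(completed, efforts):
        scaled = scale(features)
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(scaled, scaled_target)))
        ranked.append((distance, effort))
    ranked.sort(key=lambda pair: pair[0])
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical data: each project is (number of inputs, number of screens).
completed = [[10, 5], [20, 8], [100, 40], [30, 9]]
efforts = [100, 200, 1000, 260]  # known effort of each completed project
print(estimate_by_analogy(completed, efforts, [22, 8], k=2))  # 230.0
```

Here the two closest analogies are the second and fourth projects, so the estimate is the mean of 200 and 260.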
3. The ANGEL Software Tool
Finding analogies using the approach described in the previous section can be both time consuming and error prone, particularly if there are many projects or many variables. For this reason we decided to automate the process and provide an environment where data can be stored, analogies found and estimates generated. A prototype has been developed using Visual Basic to run under Windows on a PC, and is christened ANaloGy softwarE tooL (ANGEL). It is a prototype in two senses. First, it is presently limited in the number of variables it can handle for determining the optimum combination of variables for finding analogies (maximum 10) and second, we have yet to provide a context sensitive help facility. In all other respects ANGEL is fully functional.
Figures 1 to 4 illustrate ANGEL in operation. In Figure 1 we see part of a template for recording project data. Templates can be configured to suit the individual data collection environment of an organisation. All variables and variable names are user determined other than project number, status and actual effort which are mandatory. In this instance, the projects displayed are all completed, and we see the different values stored for each characteristic coupled with actual effort utilised. Note that minus one signifies data not available.
The next step (Figure 2) is to select the variables that will be used to find analogies. These can be all, or just a subset, of the variables stored in the project template. ANGEL can also automatically determine the best combination of variables to be used for finding analogies for a particular dataset. At present this relies upon a brute force, or exhaustive, search of all possible combinations of variables, hence we are restricted to a maximum of 10 variables. In addition the user can select between various estimation methods including the average of the two closest analogies.
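An exhaustive search of this kind might look like the sketch below. It is not ANGEL's code: all names are hypothetical, min-max scaling is an assumed standardisation, and ranking the subsets by a leave-one-out MMRE is an assumption, since the text does not state the criterion used to decide which combination is best. With n variables there are 2^n - 1 non-empty subsets to evaluate (1023 for n = 10), which is why the search is only feasible for a small number of variables.

```python
import math
from itertools import combinations


def estimate(train_x, train_y, target, k=2):
    """Mean effort of the k nearest projects (min-max scaled Euclidean distance)."""
    n = len(target)
    lows = [min(r[i] for r in train_x) for i in range(n)]
    # Use a span of 1.0 for constant variables to avoid dividing by zero.
    spans = [(max(r[i] for r in train_x) - lows[i]) or 1.0 for i in range(n)]

    def scale(r):
        return [(r[i] - lows[i]) / spans[i] for i in range(n)]

    t = scale(target)
    ranked = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(scale(r), t))), y)
        for r, y in zip(train_x, train_y)
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def jackknife_mmre(rows, efforts, columns, k=2):
    """Leave-one-out MMRE using only the given columns (efforts must be > 0)."""
    errors = []
    for i in range(len(rows)):
        train_x = [[r[c] for c in columns] for j, r in enumerate(rows) if j != i]
        train_y = [y for j, y in enumerate(efforts) if j != i]
        target = [rows[i][c] for c in columns]
        predicted = estimate(train_x, train_y, target, k)
        errors.append(abs(efforts[i] - predicted) / efforts[i])
    return sum(errors) / len(errors)


def best_subset(rows, efforts, k=2):
    """Try every non-empty subset of variables and keep the lowest-MMRE one."""
    n_vars = len(rows[0])
    candidates = (
        cols
        for size in range(1, n_vars + 1)
        for cols in combinations(range(n_vars), size)
    )
    return min(candidates, key=lambda cols: jackknife_mmre(rows, efforts, cols, k))


# Hypothetical dataset with two variables per project.
rows = [[1, 9], [2, 1], [3, 8], [4, 2], [5, 7], [6, 3]]
efforts = [10, 20, 30, 40, 50, 60]
print(best_subset(rows, efforts, k=1))  # tuple of the selected column indices
```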
The third step (Figure 3) is determining the confidence we can have in using analogy on any given dataset. Clearly, where we only have a few projects or the projects differ very widely in nature, we will not obtain such accurate results. ANGEL has a facility to find the mean magnitude of relative error (MMRE) and the Pred (25) for any dataset by means of jackknifing. This involves successively removing each project from the dataset and using the remaining projects to derive an estimate for it. The estimate is then compared with the actual value and the absolute percentage error computed. The mean of all absolute percentage errors (i.e. MMRE) together with the percentage of projects whose estimates lie within 25% of the actual value (Pred (25)) are indicators of the likely accuracy we can obtain from using the dataset for future projects.
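The two accuracy indicators and the jackknife loop can be sketched as follows. The function names are hypothetical, and the estimator used in the example (the mean effort of the remaining projects) is only a placeholder so that the arithmetic is easy to follow; in practice the analogy based estimator would be passed in instead. Values are returned as fractions, so 0.25 corresponds to 25%.

```python
def jackknife_predictions(projects, efforts, estimator):
    """Estimate each project in turn from all of the other projects."""
    predictions = []
    for i in range(len(projects)):
        rest_x = projects[:i] + projects[i + 1:]
        rest_y = efforts[:i] + efforts[i + 1:]
        predictions.append(estimator(rest_x, rest_y, projects[i]))
    return predictions


def mmre(actuals, predictions):
    """Mean magnitude of relative error: mean of |actual - predicted| / actual."""
    return sum(abs(a - p) / a for a, p in zip(actuals, predictions)) / len(actuals)


def pred(actuals, predictions, level=0.25):
    """Fraction of projects whose estimate lies within `level` of the actual."""
    hits = sum(1 for a, p in zip(actuals, predictions) if abs(a - p) / a <= level)
    return hits / len(actuals)


def mean_of_rest(rest_x, rest_y, target):
    """Placeholder estimator (an assumption): ignore features, average the rest."""
    return sum(rest_y) / len(rest_y)


# Hypothetical dataset of four completed projects.
projects = [[1], [2], [3], [4]]
efforts = [100, 120, 80, 200]
predicted = jackknife_predictions(projects, efforts, mean_of_rest)
print(round(mmre(efforts, predicted), 4))  # 0.4097
print(pred(efforts, predicted))            # 0.25
```

In this example only the second project is estimated to within 25% of its actual effort, giving a Pred (25) of one in four.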
The final step (Figure 4) involves predicting effort for a selected project, in this case Project 20, using the completed projects from the dataset. Here we see a predicted value of 586.25 person days. The confidence that we can have in the estimate is automatically provided in the form of the MMRE and Pred (25) values as described for Figure 3.
4. Empirical Results
We now demonstrate the flexibility of ANGEL by using it on three different datasets (Albrecht [1], Atkinson [2] and Kemerer [4]), each with differing project attributes and numbers of projects. The results are summarised in Tables 1 and 2, where Table 1 shows the MMRE values and Table 2 the Pred (25) values.
Dataset     Analogy   Regression
Albrecht    62%       90%
Atkinson    39%       99%
Kemerer     62%       106%
Table 1: MMRE Values for Effort Estimation
Dataset     Analogy   Regression
Albrecht    33%       30%
Atkinson    38%       22%
Kemerer     40%       13%
Table 2: Pred (25) Values for Effort Estimation
For each dataset we see that the analogy based approach outperforms the algorithmic approach, in which linear regression is used to calibrate the model to the particular dataset: analogy yields a lower MMRE and a higher Pred (25) in every case. These results are particularly gratifying as none of the datasets were collected with estimation by analogy in mind. It is likely that one could improve upon these results if data were collected specifically to characterise projects.
5. Discussion
In this paper we have shown that software project effort estimation using analogies is a viable estimation method, particularly when given the necessary tool support. Apart from its superior accuracy for the three datasets we studied, the analogy based approach has the benefit of being self calibrating, whereas other research [5] has shown calibration to be essential for algorithmic methods. Likely weaknesses of analogy based estimation are its inability to cope with very small datasets (i.e. a shortage of analogies) and with circumstances of great variation (i.e. a shortage of appropriate analogies). Although it would be unreasonable to conclude that it is always superior to linear regression methods, our analysis shows that it has much to commend it, and at the very least it could be considered as a complementary estimation method.
Future work will need to examine additional datasets to generate greater confidence in the technique and to extend the method to deal with categorical data.
References:
[1] Albrecht, A.J. and J.R. Gaffney, 'Software function, source lines of code, and development effort prediction: a software science validation', IEEE Trans. on Softw. Eng., 9(6), pp639-648, 1983.
[2] Atkinson, K. and M.J. Shepperd, 'The use of function points to find cost analogies', in Proc. European Software Cost Modelling Meeting, Ivrea, Italy, 1994.
[3] Boehm, B.W., Software Engineering Economics. Prentice-Hall: Englewood Cliffs, N.J., 1981.
[4] Kemerer, C.F., 'An empirical validation of software cost estimation models', CACM, 30(5), pp416-429, 1987.
[5] Low, G.C. and D.R. Jeffery, 'Calibrating estimation tools for software development', Softw. Eng. J., 5(4), pp215-221, 1990.