The article explains the rationale for developing standards for evaluation practice, the process followed in developing those standards, and how those standards inform the quality assessment of evaluations. Quality assessment of evaluations is conducted as a routine activity of the South African National Evaluation System (NES). The importance of quality assessment for improving the state of evaluation practice in South Africa is illustrated by presenting results from the quality assessments undertaken to date. The paper concludes by discussing progress on the development of a public Evaluations Repository to manage and provide access to completed evaluations and their quality assessment results, and by offering some concluding analytical remarks.
The Department of Planning, Monitoring and Evaluation (DPME) in the South African Presidency is the custodian of the national monitoring and evaluation (M&E) system. Prior to 2011, work on evaluation in South Africa was sporadic: there was no established national evaluation system, no common approach, and no set standards being applied. In November 2011 a National Evaluation Policy Framework (NEPF) was approved by Cabinet and DPME began implementing the National Evaluation System (NES). Part of this work has been to facilitate national priority evaluations as stipulated in the National Evaluation Plan (NEP). The NEP is approved by Cabinet, and all reports are submitted to Cabinet. This is discussed further in the article on the NES.
Other work involved setting up the elements of the NES that apply not just to NEP evaluations but to evaluations across government, such as standards and competences for evaluations. One of these elements has been a quality assessment system for government evaluations, informed by the standards and competences developed by government. All NEP evaluations completed to date have been subjected to quality assessment.
In 2012, DPME commissioned a research paper that explored the development of evaluation standards to support the implementation of the NES. The paper aimed to generate a basis for discussion and shared understanding of evaluation standards, and of the related evaluation competences needed to effectively undertake, commission and use evaluation within the South African government (King & Podems).
The qualitative exploratory research relied on a desk review of published evaluation standards and in-depth interviews with South African government officials, civil society members and academics. In addition, English-speaking evaluators in other countries (Australia, New Zealand, Canada, Great Britain and the United States) who had experience with developing evaluation standards, guidelines or evaluator competences were consulted. Informed by this research, and further guided by the principles and values stated in the NEPF, a draft evaluation standards document was produced in June 2012 (Podems).
The eventual draft Evaluation Standards drew heavily from several sources, including the Joint Committee on Standards for Educational Evaluation (JCSEE) Program Evaluation Standards (JCSEE 1994), referenced by both the Canadian Evaluation Society (CES 2012) and the American Evaluation Association (AEA); the African Evaluation Association (AfrEA) standards; the standards developed by the German Evaluation Society (DeGEval); and the Swiss Evaluation Society (SEVAL) standards.
The JCSEE, a coalition of major professional associations concerned with the quality of evaluation, developed a set of standards for the evaluation of educational programmes based on the concepts of utility, feasibility, propriety and accuracy. The AEA recognised the programme evaluation standards developed and refined by the JCSEE in 1981, 1994 and 2011. In 2011 the JCSEE added a group of standards on accountability. Interviews suggest that this was a tumultuous process and that the additional standards had many detractors (Podems).
The AfrEA Evaluation Guidelines (AfrEA) were adapted from the JCSEE model to suit African contexts.
The 25 DeGEval standards (DeGEval) are grouped into four categories: utility, feasibility, fairness and accuracy. The Swiss Evaluation Society (SEVAL) standards follow a similar model.
Another interesting aspect of the SEVAL standards is the use of a ‘Functional Overview’ that maps out which evaluation activity requires which evaluation standards. Functional areas include the decision to conduct an evaluation, defining and planning the evaluation, collecting and analysing the information, evaluation reporting and budgeting, concluding an evaluation contract, managing the evaluation, and personnel and evaluation. This approach is echoed in the structure of the standards endorsed in February 2010 by the Development Assistance Committee (DAC) of the Organisation for Economic Co-operation and Development (OECD). The aim of these standards is ‘… to improve quality and ultimately to strengthen the contribution of evaluation to improving development outcomes’ (DAC).
One standard not discussed or mentioned in the literature reviewed was a standard addressing equity, which is of interest given the emphasis on gender-responsive and equity-focused evaluations by international groups such as EvalPartners. In addition, whilst there is substantial discussion of the need for evaluation in Africa to be more contextually embedded (Tarsilla), this concern was not explicitly reflected in the standards reviewed.
The standards review demonstrated that there were two main models to draw from: the quality-criteria-based approach of the JCSEE and its derivatives, or the phase-based approach used by the DAC and suggested in the functional overview of SEVAL. The JCSEE standards are based on years of specialised and focused discussions and have been referenced and applied for more than a decade, and the AfrEA standards were adapted from this model (pre-2011). Alternatively, the SEVAL standards suggested a useful approach of presenting standards by grouping them into functional categories, whilst recognising that standards need to be adapted to each situation and that not all standards carry the same weight in any given situation. In addition, the DAC standards offer a different model that also appears relevant, given their functional phased approach and cross-cutting considerations, and touch on many issues core to South Africa.
Based on a review of these different approaches, it was decided in July 2012 at a workshop with DPME and the South African Monitoring and Evaluation Association (SAMEA) that the most useful framework for South Africa was the DAC approach because of its functional application across phases, and provision for overarching considerations that could be interwoven throughout an evaluation. South African standards were drafted and first published in August 2012.
Later the same year, DPME facilitated several participatory public forums to gather feedback and refine the standards document. The first forum presented the standards to key stakeholders, mainly from government and academia, with some representation from civil society. This initial gathering achieved consensus on the structure and content of the standards document. DPME then circulated the document through various forums to gain wider stakeholder feedback, partnering with SAMEA to facilitate three national and provincial workshops. DPME used this feedback to refine the standards document and published it on its website for general use (DPME).
In the final draft of the South African standards (DPME), seven overarching considerations were identified that cut across the whole evaluation process.
DPME then elected to break the quality standards down into four phases, with the overarching considerations blended across them. The four phases are listed below:

1. Planning and design.
2. Implementation.
3. Report.
4. Follow-up, use and learning.
Within each of these four phases is a set of evaluation standard items, totalling 74 standards across all phases. The overarching considerations, phases and standards therefore served as the basis for developing a quality assessment tool. This tool, along with an accompanying framework, was then developed into an online quality assessment system.
The assessment tool developed for the quality assessment of government evaluations, the Evaluation Quality Assessment Tool (EQAT), has followed the structure and evolving content of the South African evaluation standards described above.
Each of the standards is scored using a Likert-type rating scale, treated as an interval scale, ranging from very poor (1), inadequate (2) and adequate (3) to good (4) and excellent (5). Following the first two rounds of quality assessments, the need was identified to enhance the reliability with which assessors apply the scale, and a set of standard-level definitions for each of the five rating levels is in development.
In rare instances where an evaluation standard does not apply to a given evaluation, a not applicable (N/A) rating is given. However, N/A is not a rating in the true sense: it designates that the evaluation standard is omitted entirely from the composite measures for the criterion and phase.
In the case of the seven overarching considerations introduced earlier, these cross-cutting assessment principles are applied over the four phases, being reflected in standards within each phase that align directly to the overarching consideration. Thus, the standards aggregated within a phase produce a phase score, but some of the same standards also align across phases to produce a score for each of the overarching considerations. In this way, a group of evaluation standard items from across all four phases can be combined differently to provide a measure of an overarching consideration.
When calculating an overall quality rating for the evaluation, it is recognised that the phases differ in their significance to the overall evaluation. Thus, in producing an overall composite measure of all the evaluation standard items within the four phases (the quality assessment score), each phase is given a differential weighting reflecting this significance, as shown in the table below.
Weighting of evaluation phases for historical evaluations.
| Phase of evaluation | Weighting round 1 (%) | Weighting round 2 (%) |
|---|---|---|
| 1. Planning and design | 10 | 20 |
| 2. Implementation | 30 | 20 |
| 3. Report | 50 | 40 |
| 4. Follow-up, use and learning | 10 | 20 |
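To make the scoring arithmetic concrete, the sketch below illustrates how phase composites, the weighted overall score and an overarching-consideration score could be computed. It is a minimal illustration rather than the actual EQAT implementation: the use of a simple mean, the function and variable names, and the re-normalisation of weights when an entire phase is rated N/A are assumptions; only the round 2 phase weights come from the table above, and the exclusion of N/A ratings follows the description given earlier.

```python
# Minimal sketch (not the actual EQAT code) of the composite scoring
# arithmetic described above. N/A ratings are represented as None and
# are omitted entirely from composites, as described earlier.

# Round 2 phase weights from the table, expressed as fractions of 1.
PHASE_WEIGHTS_ROUND_2 = {
    "planning_and_design": 0.20,
    "implementation": 0.20,
    "report": 0.40,
    "follow_up_use_learning": 0.20,
}

def composite(ratings):
    """Mean of 1-5 ratings, with N/A (None) entries excluded."""
    scored = [r for r in ratings if r is not None]
    return sum(scored) / len(scored) if scored else None

def overall_score(ratings_by_phase, weights=PHASE_WEIGHTS_ROUND_2):
    """Combine phase composites into one weighted overall score."""
    total = used = 0.0
    for phase, ratings in ratings_by_phase.items():
        phase_score = composite(ratings)
        if phase_score is not None:
            total += weights[phase] * phase_score
            used += weights[phase]
    # Re-normalise in the rare case that an entire phase was N/A
    # (an assumption; the EQAT's handling of this case is not stated).
    return total / used if used else None

def consideration_score(rating_by_standard, aligned_standards):
    """Composite for one overarching consideration: the same standard
    ratings, drawn from across all four phases, combined differently."""
    return composite([rating_by_standard[s] for s in aligned_standards])
```

For example, an evaluation with phase composites of 3.2, 3.4, 3.8 and 3.0 would obtain an overall score of 0.2 × 3.2 + 0.2 × 3.4 + 0.4 × 3.8 + 0.2 × 3.0 = 3.44 under the round 2 weights, whereas the round 1 weights would yield 3.54.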
An audit exercise of existing government evaluations by the European Union-funded Programme to Support Pro-poor Policy Development (PSPPD) initially identified 135 evaluations conducted between 2005 and 2011. As DPME constituted its evaluation panel in 2012, a further 34 possible evaluations undertaken by panel members were identified. However, on closer scrutiny many of the reports appeared to be classic surveys, general research, or compliance and performance audits, rather than having a distinctly evaluative approach. In the end, using evaluation report availability as a condition, along with a set of criteria for determining the evaluative nature of the report, a set of 83 evaluations was included for quality assessment (DPME).
In the second round, the anticipated set of 70 evaluations for quality assessment (40 national and 30 provincial evaluations) has not been forthcoming. As a result, the sample analysed here comprises only the 25 evaluations that have been quality-assessed to date: 5 NEP evaluations, 14 national evaluations conducted outside of the NEP, and 6 provincial evaluations. These were sampled on the basis of availability for quality assessment at this time.
Assessors review all available evaluation documentation at the outset of a quality assessment. The following minimum documentation is generally sought and reviewed:
Terms of Reference (ToR).
Inception Report.
Data Collection Tools or Instruments.
Evaluation Report.
In addition, any other supporting documentation relevant to the evaluation process is also considered where available, including, but not limited to, the proposal, meeting minutes, progress and draft reports, and presentations. ToRs and evaluation reports are consistently available, whilst the availability of inception reports and data collection instruments is variable for non-NEP evaluations.
Documents are considered as evidence relevant to the different phases of the evaluation (e.g. ToR to the planning and design phase, evaluation report to the reporting phase, etc.) and serve as an evidence base, in combination with qualitative data obtained by interview.
Assessors attempt to engage a minimum of three role-players for each evaluation, covering the three types of respondents referred to below.
Stakeholder interviews are conducted using a semi-structured interview guideline that reflects the phases, criteria and standards applied in the EQAT, with an emphasis on obtaining information for those phases and overarching considerations not covered by the document review, such as the follow-up, use and learning phase. Questions are differentiated between the three types of respondents so as to triangulate perspectives and deepen the information available to inform the rating.
Assessors (e.g. evaluation consultants, academics and government staff with significant evaluation experience) are responsible for using the available evaluation documents and the qualitative data obtained from stakeholder interviews to synthesise and judge each government evaluation against the 74 standard items on the 5-point scale (the number of standards increased from 67 in the first round and is currently under review). Assessors must provide supporting commentary justifying the score for every standard, based on the available evidence.
An online web-based platform was developed by the consulting team as part of the first round to facilitate quality assessment scoring, commenting, capturing, analysis and document management (DPME).
Once all standards have been scored and composite measures generated for each phase and overarching consideration, these scores form the analytical basis for writing quality assessment summaries that pronounce on the overall quality of the evaluation (DPME).
All quality assessments are moderated prior to finalising reporting. Moderation entails a review of the consistency, completeness and rigour of the quality assessment across all 74 standard items, based on the evaluation details, motivating commentary, scoring, overall summary, and the documentation and respondents drawn upon. This seeks to ensure that the approach of different assessors is generally consistent and to ensure inter-assessor reliability (DPME).
A quality assessment report is then generated and shared with the key evaluation stakeholders (e.g. the DPME evaluation project manager, the participating department and the evaluation team). A window of three weeks is provided for stakeholder comment. If comment is received, the quality assessment goes back to the assessor for revision in light of the feedback and any evidence received. Once final revisions are made or the three-week window passes, the quality assessment is considered final (DPME).
The overall quality assessment summary, supported by the ratings and commentary for each standard, together with the categorisation information about the evaluation and references for all source documentation and interviews, comprises the reporting content for each of the 25 government evaluations quality-assessed in this round.
The quality assessment methodology has some limitations, not least that the tool itself adopts a ‘one-size-fits-all’ approach to applying standards to the six different types of evaluations identified in the NEPF. This does, however, provide comparable measures for evaluations of varying degrees of sophistication (e.g. quasi-experimental impact evaluations versus formative design or implementation evaluations).
Subjecting newly completed evaluations to quality assessment has also rendered some standards developed in the first round inappropriate, notably those that assumed significant time had elapsed in which to demonstrate use. The time between completion of an evaluation and its quality assessment has been significantly shortened, making it too soon to meaningfully pass judgement on the evaluation in terms of the follow-up, use and learning phase. This has, however, prompted investigation of a possible later follow-up, use and learning assessment.
Inter-assessor reliability of scoring has been largely managed through moderation. However, given an ongoing need to provide greater guidance on applying the rating scale, standard-level definitions for each of the five levels are being developed.
Lastly, the sample of evaluations is relatively small and unevenly spread across government, and is therefore not necessarily representative of all evaluations conducted in the period. Extrapolation of the results must be treated with caution.
Applying the minimum rating of 3 (deemed to represent, on average, an adequate standard of evaluation) as the cut-off point for acceptable quality, 13 evaluations in the first round were assessed as falling below this standard.
Ranked total scores of each evaluation with contributing components.
Although the nature and type of evaluations varied considerably, evaluations rated well overall. The majority of those under quality assessment in round 2 exceeded the minimum threshold of 3 (see the table below).
Distribution of total scores in round 1 and round 2.
| Score range† | Number of evaluations round 1 | % of round 1 total | Number of evaluations round 2 | % of round 2 total |
|---|---|---|---|---|
| < 2.0 | 0 | 0 | 0 | 0 |
| < 2.5 | 1 | 1 | 0 | 0 |
| < 3.0 | 12 | 14 | 5 | 20 |
| < 3.5 | 14 | 17 | 5 | 20 |
| < 4.0 | 40 | 48 | 11 | 44 |
| < 4.5 | 15 | 18 | 4 | 16 |
| < 5.0 | 1 | 1 | 0 | 0 |
| Total | 83 | 100 | 25 | 100 |
†, The score range reflects the upper limit of each category, for ease of reading. The lower limit of each category is the preceding upper limit. For example, the range < 3.5 includes all evaluations that scored less than 3.5 but at least 3, the preceding upper limit.
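Read as code, this banding convention amounts to a series of half-open intervals. The short sketch below (with hypothetical names, and band edges taken from the table) illustrates the assignment of a total score to its band.

```python
# Each band in the table is the half-open interval
# [previous upper limit, upper limit), labelled '< upper limit'.
UPPER_LIMITS = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

def band(score):
    """Return the band label a total score falls into."""
    for upper in UPPER_LIMITS:
        if score < upper:
            return f"< {upper}"
    return "5.0"  # a perfect score of 5 sits outside every '< upper' band

# band(3.2) -> '< 3.5': at least 3.0 (the preceding upper limit) but below 3.5
```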
In the first round the average quality assessment rating was a relatively high 3.57. In round 2 this average declined slightly to 3.50. The distribution of the total scores achieved by the round 1 and round 2 evaluations is presented in the table above.
The distribution of scores in both rounds is skewed above the mid-point of 3 on the rating scale. The spread of the data for round 2 indicates that 10 evaluations (40% of the sample) scored below 3.5, whilst only 4 evaluations (16%) stand out as exemplifying particularly high quality, with scores of 4 or above.
Distribution of evaluations by type.
| Type | Number round 1 | Number round 2 |
|---|---|---|
| Diagnostic | 8 | 4 |
| Design | 1 | 1 |
| Implementation | 38 | 16 |
| Impact | 37 | 7 |
| Economic | 6 | 1 |
| Evaluation synthesis | 6 | 0 |
| Other | 3 | 1 |
Although implementation evaluations still predominate in round 2, there is a shift in the distribution away from evaluations considered impact assessments. Although the sample is limited, it reflects greater emphasis on formative assessments in round 2.
The totals in both rounds are higher than the number of evaluations subjected to assessment, because some evaluations were listed as multiple types. In round 2, 5 of the 25 evaluations were listed as multiple types, and so the column total exceeds the number of evaluations assessed.
The graph below presents the average score achieved by each type of evaluation.
Average score of evaluation types.
The graph below illustrates the spread of ratings for all evaluations against each of the evaluation standards within the first phase: planning and design. In the case of ‘not applicable’ ratings, the bar chart is coloured white, creating the appearance of gaps in the distributions. However, as could be expected for the planning and design phase, given the relatively small proportion of recent evaluations, there is a significant proportion of ‘not applicable’ ratings carried over from the first round.
The colour coding of the five-point scale highlights those standards that fared poorly, especially in terms of the frequency with which they scored 1s and 2s. In this regard, standard
Similarly, standards
Distribution of ratings per standard: Planning and Design Phase.
The graph below presents the ratings for the standards covering the implementation phase. Of note are the three new standards introduced for the implementation phase in round 2. They account for the comparatively smaller stacks under standards
Within this phase, standard
Many of the evaluations are also examples of good practice, as is the case for standards
Distribution of ratings per standard: Implementation Phase.
The graph below presents the ratings for all of the standards within the reporting phase. In this phase there are also three new standards, indicated by the three comparatively short stacks. Comparatively few standards were rated ‘not applicable’ in this phase, as the availability of the evaluation report as evidence is a prerequisite for undertaking the quality assessment. The standards could therefore be said to be biased in favour of applicability, given their historical application to the primary document on which the quality assessment is based.
There is one case within this phase where an evaluation standard was not applied by the quality assessor nearly half of the time, namely, for standard
Standard
Distribution of ratings per standard: Reporting Phase.
In terms of an evaluation standard that rated well across all the evaluations, standard
The graph below presents ratings for each standard in the fourth evaluation phase: follow-up, use and learning. Because these standards are rated on information obtained mostly after completion of the evaluation report, they rely heavily on interviews with evaluation role-players and are therefore subjective and perception-based. Further, because round 2 occurred in close proximity to the completion of the evaluation reports, there has not always been sufficient time to allow for utilisation of the evaluation results.
It is clear that the standard
Other standards that had some low ratings within this phase included
The standard that is consistently rated highly within this phase is
Distribution of ratings per standard: Follow-up, Use and Learning Phase.
The round 2 quality assessments have highlighted some of the challenges of applying standards for this fourth phase within such a short time of completing the evaluation, particularly for NEP evaluations that must go for quality assessment before being considered for approval by Cabinet. This has prompted further investigation of a more appropriate tool and approach for assessing utilisation at a later stage.
The graph below presents average scores for six of the seven overarching considerations, based on an aggregate measure of the standards aligned to each consideration. An average score for project management could not be produced at this time because this consideration was introduced only after the start of the round 2 quality assessments.
Average score of overarching considerations.
The overarching consideration of a free and open evaluation process is rated highly at 4.23. However, closer scrutiny of the ratings and weighting system reveals that this consideration in particular is distorted by the historical application of fewer standard items, including public access to an evaluation report, which biases it towards a higher rating. There is also a self-selection bias, in that the evaluations in the sample were those which departments were willing to make available, and these departments were therefore more likely to be open about the evaluation process. Since the alignment of standard items was adjusted between rounds 1 and 2, there has been a decline of 0.21 points, in part because the measure is now more representative of a broader set of related standards.
Core to DPME's approach to evaluation is ensuring quality, and an evaluation repository was developed to make available the findings of all evaluations undertaken within national and provincial departments. This is aimed at facilitating informed decisions about programmes being implemented across all spheres of government. The repository can be accessed at: http://evaluations.dpme.gov.za/sites/EvaluationsHome/SitePages/Home.aspx
For the meta-evaluations conducted on the respective samples of 83 and 25 evaluations, the quality assessments found that evaluation practice is generally rated well, although the second round saw a slight decline in quality based on overall scores. This decline should, however, be seen in the context of an expanded set of standards, as well as the fact that in round 2 much more information was available about the process, and not just the product, providing a fuller evidence base.
Implementation and impact evaluations still constitute the majority of evaluations undertaken to date, but impact evaluations have dropped as a proportion of the overall total. Data from the second round suggest a shift towards more formative assessments, a large contributor being that the data needed for impact evaluations often do not exist, as data collection was not built into programme design. This is a significant issue moving forward.
The body of all evaluations assessed has generally shown above satisfactory levels of methodological appropriateness for relevant standards in the implementation and reporting phases and, in particular, has employed appropriate data gathering techniques consistent with the type of evaluation and its objectives. Background reviews of legislative, policy or programme contexts together with literature reviews have generally been executed to an above adequate standard.
However, a number of specific shortcomings in evaluation practice continue. In many cases, programme intervention logic or a theory of change is not explicitly referred to, either in the evaluation design or in the drawing of conclusions. Clarity on the intervention logic or theory of change, and regular reference to it, is critical to good evaluation practice, hence DPME's recent requirement that this occur as part of all NEP evaluations.
New shortcomings identified in round 2 include a lack of preparatory review, whether through peer review or the testing of data collection instruments, as required by specific standards in the implementation phase. Further, limited analytical rigour and a failure to explore alternative interpretations of evaluation findings when deliberating on conclusions are two shortcomings persisting across both rounds in the reporting phase; these require urgent attention through support and guidelines.
Round 2 quality assessments also posed problems for assessing the follow-up, use and learning phase, given how soon after completion of the evaluations the assessments took place.
When considering the overarching considerations, capacity development stands out as an area in need of concerted improvement in evaluation practice. The lack of planned and well-executed capacity development is a particular concern, and DPME now uses this as one of the criteria in assessing evaluation proposals. Unless addressed, this lack of capacity poses a risk to the state's capability to effectively oversee, manage and utilise evaluations in the future.
The quality assessment practice, including the methodology and tools applied, has produced a useful analysis of the quality of evaluations to date. A quality assessment tool has now been developed, with an electronic platform, that can be applied to future evaluations. The system is currently being expanded to offer evaluation planning, management and a document repository function, in addition to quality assessment. The online platform would thereby facilitate an evaluation lifespan tracking and monitoring system, complete with quality assessment. A subsequent rapid utilisation assessment, conducted two years after the conclusion of an evaluation to revisit use, learning and improvement, is also being investigated.
The benefit of undertaking quality assessments of evaluations is that they open up the potential to learn and to improve evaluations, through improved standards and support, so that evaluations are increasingly on par with best practice. Between the first two rounds these standards were expanded to cater for all elements of evaluation practice. Going forward, it will be necessary to consolidate those standards most critical to determining evaluation quality and to refine the tool to produce the most credible quality assessment possible.
In addition to the authors, we wish to acknowledge the team of quality assessors: Fatima Rawat, Kevin Foster, Meagan Jooste, Katie Gull, Tim Mosdell, Nana Davies, Cathy Chames, Wilma Wessels, Robin Richards, Raymond Basson, Stephen Rule, Thandeka Mhlantla, Chiweni Chimbwete, Lewis Ndlovu and Kevin Kelly. Sean Walsh assisted in the development of software systems. Rosina Maphalla and Nazreen Kola also contributed to an earlier article on the quality assessment system.
The authors declare that they have no financial or personal relationship(s) that may have inappropriately influenced them in writing this article.
This article is the scholarly culmination of various pieces of evaluation related work commissioned by the DPME. I.G., C.J. and M.E. contributed to the background, South African standards, and national evaluation repository sections. M.L. and N.M. wrote the sections on the quality assessment system and results. D.P. contributed to the background section and T.B. to the audit of government evaluations.