A demonstration of the application of the new paradigm for the evaluation of forensic evidence under conditions reflecting those of a real forensic-voice-comparison case
Introduction
In Daubert v Merrell Dow Pharmaceuticals [1993, 509 US 579] the United States Supreme Court instructed judges to consider several factors in determining the admissibility of forensic evidence, including whether the methodology applied is scientifically valid and whether it has been empirically tested and found to have an acceptable error rate. Saks and Koehler [1] described a paradigm shift in forensic science which they proposed was in part driven by the Daubert ruling and in part by the shift already having occurred for DNA evidence. They “envision[ed] a paradigm shift in the traditional forensic identification sciences in which untested assumptions and semi-informed guesswork are replaced by a sound scientific foundation and justifiable protocols.” (p. 895). They also proposed that “the time is ripe for the traditional forensic sciences to replace antiquated assumptions of uniqueness and perfection with a more defensible empirical and probabilistic foundation.” (p. 895). The 2009 National Research Council (NRC) report to the U.S. Congress [2] was highly critical of contemporary practice across a broad range of forensic science disciplines. Their recommendations included that procedures be adopted which include “quantifiable measures of the reliability and accuracy of forensic analyses” (p. 23), “the reporting of a measurement with an interval that has a high probability of containing the true value” (p. 121), and “the conducting of validation studies of the performance of a forensic procedure” (p. 121). In response to the R v T ruling by the Court of Appeal of England & Wales (R v T [2010] EWCA Crim 2439, [2011] 1 Cr App R 9), a large number of individuals and organisations have affirmed or reaffirmed that the likelihood-ratio framework is the logically correct framework for the evaluation of forensic evidence [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] (see also [14], [15], [16], [17]). 
The need for transparency was also a major theme in the R v T ruling itself and in several of the responses.
Drawing on the ongoing changes and calls for change in forensic science, Morrison and colleagues have formulated a description of a new paradigm for the evaluation of forensic evidence which includes the following key elements:
- use of the likelihood-ratio framework for the evaluation of the strength of forensic evidence
- use of approaches based on relevant data, quantitative measurements, and statistical models (relevant data is representative of the relevant population)
- empirical testing of the validity and reliability of the forensic analysis system under conditions reflecting those of the case under investigation.

For the first time here we propose the promotion of a fourth concern to be an explicit member of this list of key elements:

- transparent reporting of choices made and procedures employed.
An early formulation of Morrison and colleagues' conception of the new paradigm, and a description of the history of the paradigm shift in forensic voice comparison appeared in Morrison [18]. Another early formulation appeared in Morrison [19], and later formulations in Morrison [9], Morrison [20], and Morrison and Stoel [21]. Morrison et al. [22] focussed particularly on the selection of the relevant population for the defence hypothesis, and Morrison [23] on procedures for empirically testing validity and reliability within the likelihood-ratio framework.
The following is a description of general procedures for performing a source-level forensic comparison within the new paradigm (it is based on the description in Morrison and Stoel [21]):
First, the forensic scientist must define and communicate the prosecution and defence hypotheses as they understand them. A forensic likelihood ratio is the answer to a specific question,1 and to make sense of the likelihood ratio both the forensic scientist and the trier of fact need to understand that question. The question is specified by two hypotheses: the prosecution hypothesis, which pertains to the numerator of the likelihood ratio, and the defence hypothesis, which pertains to the denominator. A typical prosecution hypothesis is that the sample of questioned origin comes from the same source as the sample of known origin. A typical defence hypothesis is that the sample of questioned origin does not come from the same source as the sample of known origin, but from some other source in the relevant population. The relevant population is specific to the particular case under investigation (see, for example, Curran et al. [24], on glass and Kerkhoff et al. [25], on firearms). In most jurisdictions, it is not common for the court to provide the forensic scientist with explicit hypotheses to test prior to the forensic scientist beginning their analysis. In such circumstances, the forensic scientist must therefore use their own judgement and adopt hypotheses which they believe will be of interest to the trier of fact. Analysis cannot proceed unless both a prosecution and defence hypothesis are either provided to or adopted by the forensic scientist. A legitimate question to debate before the trier of fact would be whether the alternative hypothesis adopted by the forensic scientist is appropriate. That is, does it lead to a likelihood ratio which answers the question that the trier of fact wants to have answered.2 By making their adopted hypotheses explicit, the forensic scientist facilitates consideration of this important question.
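The numerator-versus-denominator structure described above can be illustrated with a purely hypothetical numerical sketch. All values below are invented for illustration and have no connection to the case; real systems model many measurements jointly, not a single number:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

# Hypothetical single measurement from the sample of questioned origin.
x = 1.2

# Numerator: likelihood of x under a model reflecting the prosecution
# hypothesis (same source as the known-origin sample); parameters invented.
p_same = gaussian_pdf(x, mean=1.0, sd=0.5)

# Denominator: likelihood of x under a model trained on a sample from
# the relevant population (defence hypothesis); parameters invented.
p_diff = gaussian_pdf(x, mean=0.0, sd=1.0)

likelihood_ratio = p_same / p_diff
# likelihood_ratio is about 3.8 here: the measurement is modestly more
# probable under the prosecution hypothesis than the defence hypothesis.
print(likelihood_ratio)
```

The same two-hypothesis structure carries through however complex the underlying models become: the output is always the ratio of the two likelihoods.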
Next, the forensic scientist must obtain a sample from the relevant population. This sample is to be used to train the model which will calculate the denominator of the likelihood ratio. A legitimate issue to debate before the trier of fact would be whether the sample is sufficiently representative of the relevant population (see Hancock et al. [27], Morrison [9]).
The forensic scientist must make measurements which quantify the properties of the sample of known origin (suspect sample), the sample of questioned origin (offender sample), and each item in the sample representative of the relevant population. These measurements constitute relevant data.
Next, the forensic scientist must choose the statistical models that they will use to calculate the likelihood ratio. Part of the expertise of the forensic scientist is to select a model which they expect will give a reasonable approximation of the distribution of the population without overfitting the model to the particular training data. They can conduct tests using development data to help them select a model which gives what they themselves consider to be sufficiently acceptable performance under the conditions of the case under investigation.3 The models should be trained and optimised using data which reflect the conditions of the case under investigation. In a forensic-voice-comparison case this would include recording and transmission channel (e.g., landline or mobile telephone, compression algorithms), background noise, reverberation, speaking style (conversation, formal speech), etc. To avoid condition-dependent bias in the calculation of the denominator versus the numerator of the likelihood ratio, the data used to train the model for the denominator should be in the same condition as the known-origin data which is used to train the model for the numerator. Ideally the statistical models would also incorporate techniques which attempt to compensate for mismatches between the conditions of samples of known and questioned origin. The description of the conditions also forms part of the specific question which is to be answered by the likelihood ratio. For example: What is the probability of getting the properties of the distorted partial latent mark if it were produced by the same finger as made the high-quality suspect fingerprint versus if it were made by a finger of someone else from the relevant population? The forensic scientist should communicate to the trier of fact the conditions of the case as they understand them, and how they form part of the specific question to be answered by the likelihood ratio.
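One modelling step widely used in likelihood-ratio-based comparison systems is logistic-regression calibration, which maps raw comparison scores to interpretable likelihood ratios. The sketch below is illustrative only (not the method of the case); the scores are invented, and with equal numbers of same-source and different-source training pairs the fitted log-odds can be read as a log likelihood ratio:

```python
import math

# Hypothetical raw comparison scores from same-speaker and
# different-speaker training pairs (invented values for illustration).
same_scores = [2.1, 1.8, 2.5, 1.2, 2.9]
diff_scores = [-1.5, -0.8, -2.2, 1.4, -1.1]

# Fit the calibration mapping score -> log-odds = a*score + b by plain
# gradient descent on the logistic-regression cross-entropy loss.
a, b = 1.0, 0.0
xs = same_scores + diff_scores
ys = [1.0] * len(same_scores) + [0.0] * len(diff_scores)
for _ in range(5000):
    grad_a = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a * x + b)))
        grad_a += (p - y) * x
        grad_b += (p - y)
    a -= 0.01 * grad_a
    b -= 0.01 * grad_b

def calibrated_log10_lr(score):
    """Map a raw score to a calibrated base-10 log likelihood ratio
    (valid as an LR because the training classes are balanced)."""
    return (a * score + b) / math.log(10.0)
```

After fitting, a strongly positive raw score maps to a log likelihood ratio above zero (support for the same-source hypothesis) and a strongly negative score maps below zero.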
Once relevant training data have been selected and a model has been chosen, trained, and optimised to the conditions of the case under investigation, the system should be frozen, i.e., no other changes are allowed. Then the system should be tested using new pairs of sample items drawn from the relevant population and reflecting the conditions of the actual samples of known and questioned origin from the case under investigation. In this way the forensic scientist obtains an indication of how well the system is expected to perform on previously unseen data from the relevant population under these conditions. Testing using samples from some other population or under different conditions will not be informative as to how well the system is expected to perform on the actual samples of known and questioned origin from the case under investigation. Testing using some other population and/or under some other conditions could potentially be highly misleading with respect to the performance of the system in the particular case under investigation. An issue for debate would be whether the conditions of the training and test data adequately reflect the conditions of the samples of known and questioned origin.
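Empirical testing of this kind is commonly summarised with the log-likelihood-ratio cost, Cllr, the validity metric reported later in this paper. A minimal sketch of its standard definition (Brümmer and du Preez), with invented test-pair likelihood ratios:

```python
import math

def cllr(same_lrs, diff_lrs):
    """Log-likelihood-ratio cost: penalises same-source likelihood
    ratios below 1 and different-source likelihood ratios above 1.
    0 is perfect; 1 corresponds to an uninformative system."""
    pen_same = sum(math.log2(1.0 + 1.0 / lr) for lr in same_lrs) / len(same_lrs)
    pen_diff = sum(math.log2(1.0 + lr) for lr in diff_lrs) / len(diff_lrs)
    return 0.5 * (pen_same + pen_diff)

# Hypothetical likelihood ratios from test pairs (invented for illustration).
same_speaker_lrs = [12.0, 3.5, 0.8, 40.0]
diff_speaker_lrs = [0.05, 0.2, 1.5, 0.01]
print(cllr(same_speaker_lrs, diff_speaker_lrs))
```

Note that Cllr rewards well-calibrated magnitudes, not just correct ordering: a same-speaker pair with an LR of 0.8 and a different-speaker pair with an LR of 1.5 both contribute penalty even though they are only mildly misleading.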
If the judge at an admissibility hearing or the trier of fact at trial is satisfied that the samples adequately reflect the relevant population and conditions specific to the case, and is satisfied that the model is answering a question which is relevant to the trier of fact, then they should consider whether the empirically demonstrated degree of validity and reliability of the system is sufficient for the output to be of use to the trier of fact. If they are not satisfied on any of these points, then the output of the system will be of little or no value to them. It is therefore essential that the forensic scientist be transparent as to what they have done, and that they present the results of validity and reliability testing.
After the performance of the system has been empirically tested, the system and the test results are frozen, i.e., no other changes are allowed to the system, and the test data cannot be changed and new tests cannot be run. The last thing the forensic scientist does as part of the analysis is to calculate a likelihood ratio for the actual samples of known and questioned origin from the case.
In a review of forensic-speech-science research literature published between 2010 and 2013, Morrison and Enzinger [28] found that, in contrast to earlier years, the majority of studies used data, quantitative measurements, and statistical models to calculate likelihood ratios, and empirically tested the performance of the system. It therefore appears that the majority of research studies in the field now attempt to operate within the new paradigm. Many studies in the review, however, suffered from problems including poorly defined hypotheses, small databases, the use of data not representative of casework conditions, and training and testing on the same data. Few, or none, of the published studies were conducted in a way which attempted to satisfy all elements of the new paradigm under conditions reflecting those of real forensic cases. Also, in a survey of practitioners by Gold and French [29], only 4 of 36 respondents said they reported strength of evidence as a numeric likelihood ratio. Thus, although we may have reached an inflection point in the paradigm shift in the context of research, there is clearly still a long way to go, and even further to go with respect to the implementation of the new paradigm in casework. The aim of the present paper is to demonstrate that forensic-voice-comparison casework can be, and has been, performed in a manner consistent with the new paradigm.
The present paper describes the implementation of all key elements of the new paradigm under conditions reflecting those of a real case, a case on which we actually worked. One previously published study described the implementation of the new paradigm under conditions reflecting those of a different real case [30]. The circumstances of the latter case were not very typical, whereas the circumstances of the case described in the present paper are much more typical: the recording of the speaker of questioned identity (offender recording) is a recording of a telephone conversation recorded by a device attached to a telephone system, and the recording of the speaker of known identity (suspect recording) is a recording of a police interview with a suspect made in a police station interview room. In the research study we have replicated the analyses we conducted for the actual case. Details of the recording conditions and other factors such as the durations of the recordings are taken from the recordings in the actual case, but the acoustic and statistical analyses in the research study are of recordings of speakers in a research database rather than of the recordings of the speakers in the actual case. Nothing we say with respect to the particular values of the strength of the evidence of the recordings analysed in this research report should be interpreted as relating to the specific values of the strength of the evidence of the recordings analysed in the original case. The research report has been streamlined, omitting some details of the actual case which are peripheral to the research issues. The research also expands on the casework analyses by addressing additional research questions which were not appropriate to address within the constraints of performing the actual casework. The primary expansion is the investigation of different techniques for dealing with mismatches in recording conditions between the suspect and offender recording.
Below we describe:
1. how we chose the relevant hypotheses, and hence the relevant population;
2. how we sampled from the relevant population;
3. how we simulated the conditions of the suspect and offender recordings;
4. how we measured acoustic properties of the recordings;
5. how we built statistical models to calculate likelihood ratios which addressed the relevant hypotheses on the basis of these measurements;
6. how we empirically tested the degree of validity and reliability of our system under conditions reflecting those of the case;
7. and finally how we reported the strength of the evidence for the comparison of the suspect and offender recordings.
The conditions of the present case are that a telephone call was made from a landline telephone to a call centre. A recording was made at the call centre. This is the recording of the voice of questioned identity, the offender recording. It includes background office noise (multi-speaker babble and typing noises). It was saved in a compressed format. Some time later a suspect was interviewed at a police station. A recording was made of this interview. This is the recording of the voice of known identity, the suspect recording. There was substantial room reverberation, the recording included background noise from a ventilation system, and it was saved in a different compressed format. Mismatches in recording conditions can severely degrade the performance of forensic-voice-comparison systems (see, for example, Zhang et al. [31]). A major component of the present paper is an investigation of three different techniques to compensate for differences in the conditions between the suspect and offender recordings. To simplify exposition, and to illustrate the importance of applying compensation techniques under the conditions of the present case, we first present a forensic analysis system which does not include any compensation techniques. We then describe three techniques, add them to the forensic analysis system, choose the technique (or combination of techniques) which gives best performance under the conditions of this case, and retest the suspect and offender recordings using a system which includes this technique.4
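Condition simulation of the sort just described (background babble at a given signal-to-noise ratio, room reverberation) can be sketched as below. The signals, the SNR value, and the synthetic impulse response are invented stand-ins for illustration, not the case materials or the procedures actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is snr_db,
    then mix it into the speech signal."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def add_reverb(speech, impulse_response):
    """Simulate room reverberation by convolving the signal with a
    room impulse response, truncated to the original length."""
    return np.convolve(speech, impulse_response)[: len(speech)]

# Toy signals standing in for a clean recording, office babble, and a
# room impulse response (exponentially decaying noise burst).
clean = rng.standard_normal(16000)
babble = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0.0, 8.0, 2000)) * rng.standard_normal(2000)

offender_like = add_noise_at_snr(clean, babble, snr_db=10.0)  # noisy channel
suspect_like = add_reverb(clean, rir)                         # reverberant room
```

In practice the degraded versions would also be passed through the relevant transmission channels and compression codecs before any acoustic measurement is made.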
In general we would not expect to be able to control the recording conditions for offender recordings, but in theory we should be able to obtain reasonably good quality recordings of the suspect. The quality of the suspect recording in this case was quite poor. To give an idea of what system performance could be like if better quality recordings were made of police interviews, the penultimate section of the paper retests the system using higher quality audio recordings for the suspect condition.
Definition of hypotheses
Based on the circumstances of the case, as described above, we adopted the following two competing hypotheses:
- Prosecution hypothesis: The voice on the offender recording was produced by the suspect.
- Defence hypothesis: The voice on the offender recording was not produced by the suspect, but by some other speaker from the relevant population.
In our analysis we instantiated the prosecution and defence hypotheses as the numerator and denominator of the likelihood ratio being answers to the
Testing of validity and reliability and evaluation of the likelihood ratio
After finalising development of the forensic-voice-comparison system and before it was applied to the offender and suspect samples, its validity and reliability were tested on data from a separate set of speakers (the test data set). Every speaker's Session 1 offender-condition recording was compared with their own Session 2 suspect-condition recording, and with their Session 3 suspect-condition recording if one was available. These were same-speaker comparisons. Every speaker's Session 1
Recording-condition mismatch compensation
Mismatches in recording conditions in the present case included differences in background noise, room reverberation, and transmission and recording systems. Filtering and additive noise corrupt MFCC features. The following exposition is based on Pelecanos and Sridharan [54]. Assuming that the speech signal xs[i] and the background noise xn[i] are uncorrelated and the linear filtering effect Hk is consistent over the frequency range of the filterbank, the log filterbank energies log(Ek) can be
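The feature-warping component of the compensation techniques considered here (after Pelecanos and Sridharan [54]) can be sketched as a rank-based mapping: within a sliding window, each coefficient is replaced by the standard-normal quantile of its rank, so the short-term feature distribution becomes N(0, 1) regardless of linear filtering or monotonic distortion. The window length below is our own illustrative choice:

```python
import numpy as np
from statistics import NormalDist

def feature_warp(features, win=301):
    """Warp each coefficient of a (frames x coeffs) feature matrix to a
    standard-normal distribution over a sliding window, by mapping the
    rank of the current frame's value to a normal quantile."""
    n_frames, n_coeffs = features.shape
    half = win // 2
    warped = np.empty_like(features, dtype=float)
    nd = NormalDist()
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        window = features[lo:hi]
        for c in range(n_coeffs):
            # Rank of the current frame's value within the window;
            # the +0.5 keeps the quantile strictly inside (0, 1).
            rank = np.sum(window[:, c] < features[t, c]) + 0.5
            warped[t, c] = nd.inv_cdf(rank / window.shape[0])
    return warped
```

Because the mapping depends only on ranks, any order-preserving channel distortion (e.g., a positive affine transformation of the features) leaves the warped output unchanged, which is the property exploited for mismatch compensation.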
Testing of validity and reliability and evaluation of the likelihood ratio: System incorporating mismatch compensation
Here we repeat the testing of validity and reliability, but now on a system incorporating mismatch compensation in the form of combined feature warping and probabilistic feature mapping. The resulting system had a Cllr-mean of 0.344, a 95% CI of ± 0.95 orders of magnitude, and a Cllr-pooled of 0.423 (a 98% credible interval estimate was ± 1.13 orders of magnitude). A Tippett plot of results is provided in Fig. 7b. The likelihood-ratio value calculated for the suspect and offender recordings was
Effect of the recording condition of the suspect on forensic-voice-comparison performance
In forensic casework, there is typically a mismatch in conditions between suspect and offender recordings. We would not normally expect to be able to control the recording conditions for the offender recording, but in theory we should be able to control the recording conditions for the suspect recording when it is a recording of a police interview. The conditions of the suspect recording in the present case were quite poor. In 2012, the U.S. National Institute of Justice released draft
Conclusion
We have demonstrated the evaluation of forensic evidence under conditions reflecting those of an actual forensic-voice-comparison case. This includes consideration of the relevant prosecution and defence hypotheses to address in this case, selection of data reflecting the adopted defence hypothesis, simulation of recording conditions reflecting those of the suspect and offender recordings in the case, quantitative measurement and statistical modelling to calculate a likelihood ratio given the
Acknowledgements
This research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project LP100200142. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. Opinions expressed are those of the
References (67)
- et al., Evidence evaluation: a response to the Court of Appeal judgment in R v T, Sci. Justice (2011)
- Is forensic science the last bastion of resistance against statistics?, Sci. Justice (2013)
- Standards for the formulation of evaluative forensic science expert opinion, Sci. Justice (2009)
- Forensic voice comparison and the paradigm shift, Sci. Justice (2009)
- Distinguishing between forensic science and forensic pseudoscience: testing of validity and reliability, and approaches to forensic voice comparison, Sci. Justice (2014)
- Measuring the validity and reliability of forensic likelihood-ratio systems, Sci. Justice (2011)
- et al., The interpretation of shoeprint comparison class correspondences, Sci. Justice (2012)
- et al., Mismatched distances from speakers to telephone in a forensic-voice-comparison case, Speech Comm. (2015)
- et al., Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison — female voices, Speech Comm. (2013)
- et al., Speaker verification using adapted Gaussian mixture models, Digital Signal Process. (2000)
- Probabilistic feature-based transformation for speaker verification over telephone networks, Neurocomputing
- The coming paradigm shift in forensic identification science, Science
- Strengthening Forensic Science in the United States: A Path Forward
- Expressing evaluative opinions: a position statement, Sci. Justice
- Improve statistics in court, Nature
- Forensic science evidence in question, Crim. Law Rev.
- Extending the confusion about Bayes, Mod. Law Rev.
- How to assign a likelihood ratio in a footwear mark case: an analysis and discussion in the light of R v T, Law Prob. Risk
- The likelihood-ratio framework and forensic evidence in court: a response to R v T, Int. J. Evid. Proof
- The likelihood ratio as value of evidence: more than a question of numbers, Law Prob. Risk
- How clear is transparent? Reporting expert reasoning in legal cases, Law Prob. Risk
- Bad cases make bad law: reactions to R v T, Law Prob. Risk
- Expert working group on human factors in latent print analysis, Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach, Technical Report, NIST
- Response to Draft Australian Standard: DR AS 5388.3 Forensic Analysis — Part 3 — Interpretation
- ENFSI Guideline for Evaluative Reporting in Forensic Science, Technical Report
- Forensic voice comparison
- Forensic strength of evidence statements should preferably be likelihood ratios calculated using relevant data, quantitative measurements, and statistical models — a response to Lennard (2013) Fingerprint identification: how far have we come?, Aust. J. Forensic Sci.
- Database selection for forensic voice comparison
- Forensic Interpretation of Glass Evidence
- The likelihood ratio approach in cartridge case and bullet comparison, J. Assoc. Firearm Toolmark Examiners
- Forensic Speaker Identification
- Forensic speech science — review: 2010–2013
- International practices in forensic speaker comparison, Int. J. Speech Lang. Law