Science & Justice

Volume 56, Issue 1, January 2016, Pages 42-57

A demonstration of the application of the new paradigm for the evaluation of forensic evidence under conditions reflecting those of a real forensic-voice-comparison case

https://doi.org/10.1016/j.scijus.2015.06.005

Highlights

  • Evaluation of strength of evidence under conditions reflecting those of a real forensic-voice-comparison case

  • Use of relevant data, quantitative measurements, and statistical models to calculate likelihood ratios

  • Empirical testing of validity and reliability under conditions reflecting those of the case under investigation

  • Exploration of different techniques to compensate for mismatched recording conditions

Abstract

The new paradigm for the evaluation of the strength of forensic evidence includes: use of the likelihood-ratio framework; use of relevant data, quantitative measurements, and statistical models; empirical testing of validity and reliability under conditions reflecting those of the case under investigation; and transparency as to decisions made and procedures employed. The present paper illustrates the application of the new paradigm to the evaluation of strength of evidence under conditions reflecting those of a real forensic-voice-comparison case. The offender recording was from a landline telephone system, had background office noise, and was saved in a compressed format. The suspect recording included substantial reverberation and ventilation-system noise, and was saved in a different compressed format. The paper describes the selection of the relevant hypotheses, the sampling of data from the relevant population, the simulation of the suspect and offender recording conditions, and the acoustic-measurement and statistical-modelling procedures. It also explores different techniques to compensate for the mismatch in recording conditions, and examines how system performance would have differed had the suspect recording been of better quality.

Introduction

In Daubert v Merrell Dow Pharmaceuticals [1993, 509 US 579] the United States Supreme Court instructed judges to consider several factors in determining the admissibility of forensic evidence, including whether the methodology applied is scientifically valid and whether it has been empirically tested and found to have an acceptable error rate. Saks and Koehler [1] described a paradigm shift in forensic science which they proposed was in part driven by the Daubert ruling and in part by the shift already having occurred for DNA evidence. They “envision[ed] a paradigm shift in the traditional forensic identification sciences in which untested assumptions and semi-informed guesswork are replaced by a sound scientific foundation and justifiable protocols.” (p. 895). They also proposed that “the time is ripe for the traditional forensic sciences to replace antiquated assumptions of uniqueness and perfection with a more defensible empirical and probabilistic foundation.” (p. 895). The 2009 National Research Council (NRC) report to the U.S. Congress [2] was highly critical of contemporary practice across a broad range of forensic science disciplines. Their recommendations included that procedures be adopted which include “quantifiable measures of the reliability and accuracy of forensic analyses” (p. 23), “the reporting of a measurement with an interval that has a high probability of containing the true value” (p. 121), and “the conducting of validation studies of the performance of a forensic procedure” (p. 121). In response to the R v T ruling by the Court of Appeal of England & Wales (R v T [2010] EWCA Crim 2439, [2011] 1 Cr App R 9), a large number of individuals and organisations have affirmed or reaffirmed that the likelihood-ratio framework is the logically correct framework for the evaluation of forensic evidence [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] (see also [14], [15], [16], [17]). The need for transparency was also a major theme in the R v T ruling itself and in several of the responses.

Drawing on the ongoing changes and calls for change in forensic science, Morrison and colleagues have formulated a description of a new paradigm for the evaluation of forensic evidence which includes the following key elements:

  • use of the likelihood-ratio framework for the evaluation of the strength of forensic evidence

  • use of approaches based on relevant data, quantitative measurements, and statistical models (relevant data are representative of the relevant population)

  • empirical testing of the validity and reliability of the forensic analysis system under conditions reflecting those of the case under investigation.

Here, for the first time, we propose promoting a fourth concern to be an explicit member of this list of key elements:

  • transparent reporting of choices made and procedures employed.

An early formulation of Morrison and colleagues' conception of the new paradigm, and a description of the history of the paradigm shift in forensic voice comparison appeared in Morrison [18]. Another early formulation appeared in Morrison [19], and later formulations in Morrison [9], Morrison [20], and Morrison and Stoel [21]. Morrison et al. [22] focussed particularly on the selection of the relevant population for the defence hypothesis, and Morrison [23] on procedures for empirically testing validity and reliability within the likelihood-ratio framework.

The following is a description of general procedures for performing a source-level forensic comparison within the new paradigm (it is based on the description in Morrison and Stoel [21]):

First, the forensic scientist must define and communicate the prosecution and defence hypotheses as they understand them. A forensic likelihood ratio is the answer to a specific question,1 and to make sense of the likelihood ratio both the forensic scientist and the trier of fact need to understand that question. The question is specified by two hypotheses: the prosecution hypothesis, which pertains to the numerator of the likelihood ratio, and the defence hypothesis, which pertains to the denominator. A typical prosecution hypothesis is that the sample of questioned origin comes from the same source as the sample of known origin. A typical defence hypothesis is that the sample of questioned origin does not come from the same source as the sample of known origin, but from some other source in the relevant population. The relevant population is specific to the particular case under investigation (see, for example, Curran et al. [24] on glass, and Kerkhoff et al. [25] on firearms). In most jurisdictions, it is not common for the court to provide the forensic scientist with explicit hypotheses to test prior to the forensic scientist beginning their analysis. In such circumstances, the forensic scientist must therefore use their own judgement and adopt hypotheses which they believe will be of interest to the trier of fact. Analysis cannot proceed unless both a prosecution and defence hypothesis are either provided to or adopted by the forensic scientist. A legitimate question to debate before the trier of fact would be whether the alternative hypothesis adopted by the forensic scientist is appropriate. That is, does it lead to a likelihood ratio which answers the question that the trier of fact wants to have answered?2 By making their adopted hypotheses explicit, the forensic scientist facilitates consideration of this important question.
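In symbols (a standard statement of the likelihood-ratio framework, included here for reference rather than quoted from the paper), the likelihood ratio is

$$\mathrm{LR} = \frac{p(E \mid H_p)}{p(E \mid H_d)}$$

where $E$ denotes the measured properties of the samples, $H_p$ the prosecution (same-source) hypothesis, and $H_d$ the defence (different-source) hypothesis. Values above 1 support $H_p$ over $H_d$, and values below 1 support $H_d$ over $H_p$.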

Next, the forensic scientist must obtain a sample from the relevant population. This sample is to be used to train the model which will calculate the denominator of the likelihood ratio. A legitimate issue to debate before the trier of fact would be whether the sample is sufficiently representative of the relevant population (see Hancock et al. [27], Morrison [9]).

The forensic scientist must make measurements which quantify the properties of the sample of known origin (suspect sample), the sample of questioned origin (offender sample), and each item in the sample representative of the relevant population. These measurements constitute relevant data.
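As an illustration of such measurements, the following is a minimal sketch of extracting mel-frequency cepstral coefficients (MFCCs, the acoustic features discussed later in this paper) from a recording. The use of the librosa library and all parameter values are our assumptions for illustration, not the protocol used in the case.

```python
# Minimal illustrative sketch: extracting MFCC features from a recording.
# Library choice (librosa) and parameter values are assumptions for
# illustration only, not the protocol used in the actual case.
import librosa

def extract_mfccs(path, n_mfcc=14):
    # Load the audio at 8 kHz, typical of telephone-bandwidth speech.
    signal, sr = librosa.load(path, sr=8000)
    # 20 ms frames with a 10 ms shift are common choices in this field.
    return librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.020 * sr), hop_length=int(0.010 * sr),
    )  # shape: (n_mfcc, n_frames)
```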

Next, the forensic scientist must choose the statistical models that they will use to calculate the likelihood ratio. Part of the expertise of the forensic scientist is to select a model which they expect will give a reasonable approximation of the distribution of the population without overfitting the model to the particular training data. They can conduct tests using development data to help them select a model which gives what they themselves consider to be sufficiently acceptable performance under the conditions of the case under investigation.3 The models should be trained and optimised using data which reflect the conditions of the case under investigation. In a forensic-voice-comparison case this would include recording and transmission channel (e.g., landline or mobile telephone, compression algorithms), background noise, reverberation, speaking style (conversation, formal speech), etc. To avoid condition-dependent bias in the calculation of the denominator versus the numerator of the likelihood ratio, the data used to train the model for the denominator should be in the same condition as the known-origin data which is used to train the model for the numerator. Ideally the statistical models would also incorporate techniques which attempt to compensate for mismatches between the conditions of samples of known and questioned origin. The description of the conditions also forms part of the specific question which is to be answered by the likelihood ratio. For example: What is the probability of getting the properties of the distorted partial latent mark if it were produced by the same finger as made the high-quality suspect fingerprint versus if it were made by a finger of someone else from the relevant population? The forensic scientist should communicate to the trier of fact the conditions of the case as they understand them, and how they form part of the specific question to be answered by the likelihood ratio.
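To make the numerator/denominator structure concrete, here is a deliberately simplified sketch in which the suspect model and the relevant-population model are each a single multivariate Gaussian. Real forensic-voice-comparison systems use far richer models (e.g., GMM-UBM based), so this is illustrative only, and every name in it is ours rather than the paper's.

```python
# Deliberately simplified sketch of a likelihood-ratio calculation in which
# the suspect model and the relevant-population model are each a single
# multivariate Gaussian. Real systems use far richer models (e.g. GMM-UBM);
# all names here are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian(features):
    # features: (n_frames, n_dims) array of acoustic measurements
    return multivariate_normal(mean=features.mean(axis=0),
                               cov=np.cov(features, rowvar=False))

def log10_likelihood_ratio(offender_features, suspect_model, population_model):
    # Numerator: density of the offender data under the same-speaker
    # (suspect) model; denominator: under the model trained on the sample
    # from the relevant population. Summing log densities over frames
    # naively assumes frame independence (a simplification).
    log_num = suspect_model.logpdf(offender_features).sum()
    log_den = population_model.logpdf(offender_features).sum()
    return (log_num - log_den) / np.log(10)
```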

Once relevant training data have been selected and a model has been chosen, trained, and optimised to the conditions of the case under investigation, the system should be frozen, i.e., no other changes are allowed. Then the system should be tested using new pairs of sample items drawn from the relevant population and reflecting the conditions of the actual samples of known and questioned origin from the case under investigation. In this way the forensic scientist obtains an indication of how well the system is expected to perform on previously unseen data from the relevant population under these conditions. Testing using samples from some other population or under different conditions will not be informative as to how well the system is expected to perform on the actual samples of known and questioned origin from the case under investigation. Testing using some other population and/or under some other conditions could potentially be highly misleading with respect to the performance of the system in the particular case under investigation. An issue for debate would be whether the conditions of the training and test data adequately reflect the conditions of the samples of known and questioned origin.

If the judge at an admissibility hearing or the trier of fact at trial is satisfied that the samples adequately reflect the relevant population and conditions specific to the case, and is satisfied that the model is answering a question which is relevant to the trier of fact, then they should consider whether the empirically demonstrated degree of validity and reliability of the system is sufficient for the output to be of use to the trier of fact. If they are not satisfied on any of these points, then the output of the system will be of little or no value to them. It is therefore essential that the forensic scientist be transparent as to what they have done, and that they present the results of validity and reliability testing.

After the performance of the system has been empirically tested, the system and the test results are frozen, i.e., no other changes are allowed to the system, and the test data cannot be changed and new tests cannot be run. The last thing the forensic scientist does as part of the analysis is to calculate a likelihood ratio for the actual samples of known and questioned origin from the case.

In a review of forensic-speech-science research literature published between 2010 and 2013, Morrison and Enzinger [28] found that, in contrast to earlier years, the majority of studies used data, quantitative measurements, and statistical models to calculate likelihood ratios, and empirically tested the performance of the system. It therefore appears that the majority of research studies in the field now attempt to operate within the new paradigm. Many studies in the review, however, suffered from problems including poorly defined hypotheses, small databases, the use of data not representative of casework conditions, and training and testing on the same data. Few, if any, of the published studies were conducted in a way which attempted to satisfy all elements of the new paradigm under conditions reflecting those of real forensic cases. Also, in a survey of practitioners by Gold and French [29], only 4 of 36 respondents said they reported strength of evidence as a numeric likelihood ratio. Thus, although we may have reached an inflection point in the paradigm shift in the context of research, there is clearly still a long way to go, and even further with respect to the implementation of the new paradigm in casework. The aim of the present paper is to demonstrate that forensic-voice-comparison casework can be, and has been, performed in a manner consistent with the new paradigm.

The present paper describes the implementation of all key elements of the new paradigm under conditions reflecting those of a real case, a case on which we actually worked. One previously published study described the implementation of the new paradigm under conditions reflecting those of a different real case [30]. The circumstances of the latter case were not very typical, whereas the circumstances of the case described in the present paper are much more typical: the recording of the speaker of questioned identity (offender recording) is a recording of a telephone conversation recorded by a device attached to a telephone system, and the recording of the speaker of known identity (suspect recording) is a recording of a police interview with a suspect made in a police station interview room. In the research study we have replicated the analyses we conducted for the actual case. Details of the recording conditions and other factors such as the durations of the recordings are taken from the recordings in the actual case, but the acoustic and statistical analyses in the research study are of recordings of speakers in a research database rather than of the recordings of the speakers in the actual case. Nothing we say with respect to the particular values of the strength of the evidence of the recordings analysed in this research report should be interpreted as relating to the specific values of the strength of the evidence of the recordings analysed in the original case. The research report has been streamlined, omitting some details of the actual case which are peripheral to the research issues. The research also expands on the casework analyses by addressing additional research questions which were not appropriate to address within the constraints of performing the actual casework. The primary expansion is the investigation of different techniques for dealing with mismatches in recording conditions between the suspect and offender recording.

Below we describe:

  1. how we chose the relevant hypotheses, and hence the relevant population;

  2. how we sampled from the relevant population;

  3. how we simulated the conditions of the suspect and offender recordings;

  4. how we measured acoustic properties of the recordings;

  5. how we built statistical models to calculate likelihood ratios which addressed the relevant hypotheses on the basis of these measurements;

  6. how we empirically tested the degree of validity and reliability of our system under conditions reflecting those of the case;

  7. and finally how we reported the strength of the evidence for the comparison of the suspect and offender recordings.

The conditions of the present case are that a telephone call was made from a landline telephone to a call centre. A recording was made at the call centre. This is the recording of the voice of questioned identity, the offender recording. It includes background office noise (multi-speaker babble and typing noises). It was saved in a compressed format. Some time later a suspect was interviewed at a police station. A recording was made of this interview. This is the recording of the voice of known identity, the suspect recording. There was substantial room reverberation, the recording included background noise from a ventilation system, and it was saved in a different compressed format. Mismatches in recording conditions can severely degrade the performance of forensic-voice-comparison systems (see, for example, Zhang et al. [31]). A major component of the present paper is an investigation of three different techniques to compensate for differences in the conditions between the suspect and offender recordings. To simplify exposition, and to illustrate the importance of applying compensation techniques under the conditions of the present case, we first present a forensic analysis system which does not include any compensation techniques. We then describe three techniques, add them to the forensic analysis system, choose the technique (or combination of techniques) which gives best performance under the conditions of this case, and retest the suspect and offender recordings using a system which includes this technique.4

In general we would not expect to be able to control the recording conditions for offender recordings, but in theory we should be able to obtain reasonably good quality recordings of the suspect. The quality of the suspect recording in this case was quite poor. To give an idea of what system performance could be like if better quality recordings were made of police interviews, the penultimate section of the paper retests the system using higher quality audio recordings for the suspect condition.

Section snippets

Definition of hypotheses

Based on the circumstances of the case, as described above, we adopted the following two competing hypotheses:

  • Prosecution hypothesis: The voice on the offender recording was produced by the suspect.

  • Defence hypothesis: The voice on the offender recording was not produced by the suspect, but by some other speaker from the relevant population.

In our analysis we instantiated the prosecution and defence hypotheses as the numerator and denominator of the likelihood ratio being answers to the

Testing of validity and reliability and evaluation of the likelihood ratio

After finalising development of the forensic-voice-comparison system and before it was applied to the offender and suspect samples, its validity and reliability were tested on data from a separate set of speakers (the test data set). Every speaker's Session 1 offender-condition recording was compared with their own Session 2 suspect-condition recording, and with their Session 3 suspect-condition recording if one was available. These were same-speaker comparisons. Every speaker's Session 1
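The validity metric reported in this and the following snippets is the log-likelihood-ratio cost, Cllr. For reference, here is a minimal sketch of the standard Cllr calculation over sets of same-speaker and different-speaker test likelihood ratios; the function and variable names are ours.

```python
# Minimal sketch of the log-likelihood-ratio cost (Cllr), the standard
# validity metric in likelihood-ratio-based forensic comparison. Inputs
# are linear likelihood ratios from known same-speaker and known
# different-speaker test comparisons; names are illustrative.
import numpy as np

def cllr(same_speaker_lrs, diff_speaker_lrs):
    ss = np.asarray(same_speaker_lrs, dtype=float)
    ds = np.asarray(diff_speaker_lrs, dtype=float)
    # Same-speaker trials are penalised for low LRs, different-speaker
    # trials for high LRs; a perfect system approaches Cllr = 0 and an
    # uninformative system gives Cllr = 1.
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / ss)) +
                  np.mean(np.log2(1.0 + ds)))
```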

Recording-condition mismatch compensation

Mismatches in recording conditions in the present case included differences in background noise, room reverberation, and transmission and recording systems. Filtering and additive noise corrupt MFCC features. The following exposition is based on Pelecanos and Sridharan [54]. Assuming that the speech signal x_s[i] and the background noise x_n[i] are uncorrelated and the linear filtering effect H_k is consistent over the frequency range of the filterbank, the log filterbank energies log(E_k) can be
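One of the compensation techniques evaluated below is feature warping, in which each feature stream is mapped, over a sliding window, to a standard-normal target distribution via its rank within the window. The following is a minimal sketch; the 3-second window (301 frames at a 10 ms frame shift) is our assumption based on common practice following Pelecanos and Sridharan [54], not a confirmed detail of this system.

```python
# Illustrative sketch of feature warping (after Pelecanos and Sridharan):
# each cepstral coefficient stream is mapped over a sliding window to a
# standard-normal distribution via its rank within the window. Window
# length and frame shift are assumptions based on common practice.
import numpy as np
from scipy.stats import norm

def feature_warp(features, win=301):
    # features: (n_frames, n_dims); win: odd window length in frames
    # (301 frames is roughly 3 s at a 10 ms frame shift)
    n, d = features.shape
    half = win // 2
    warped = np.empty_like(features, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = features[lo:hi]
        m = len(window)
        # Rank of the centre frame within the window, per dimension,
        # converted to an empirical quantile and then a normal deviate.
        rank = (window < features[t]).sum(axis=0) + 1
        warped[t] = norm.ppf((rank - 0.5) / m)
    return warped
```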

Testing of validity and reliability and evaluation of the likelihood ratio: System incorporating mismatch compensation

Here we repeat the testing of validity and reliability, but now on a system incorporating mismatch compensation in the form of combined feature warping and probabilistic feature mapping. The resulting system had a Cllr-mean of 0.344, a 95% CI of ± 0.95 orders of magnitude, and a Cllr-pooled of 0.423 (a 98% credible-interval estimate was ± 1.13 orders of magnitude). A Tippett plot of results is provided in Fig. 7b. The likelihood-ratio value calculated for the suspect and offender recordings was
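Fig. 7b is a Tippett plot of the test results. As background, here is a minimal sketch of how such a plot can be constructed under one common convention (cumulative proportion of same-speaker results at or below, and of different-speaker results at or above, each log-likelihood-ratio value); the matplotlib usage and all names are our assumptions, and conventions for Tippett plots vary.

```python
# Illustrative sketch of a Tippett plot: cumulative proportions of
# same-speaker and different-speaker test results as a function of the
# log10 likelihood ratio, under one common plotting convention.
import numpy as np
import matplotlib.pyplot as plt

def tippett_plot(ss_log10lr, ds_log10lr):
    ss = np.sort(np.asarray(ss_log10lr, dtype=float))
    ds = np.sort(np.asarray(ds_log10lr, dtype=float))
    # Proportion of same-speaker results at or below each value, and of
    # different-speaker results at or above each value.
    plt.plot(ss, np.arange(1, len(ss) + 1) / len(ss), label='same speaker')
    plt.plot(ds, 1 - np.arange(len(ds)) / len(ds), label='different speaker')
    plt.axvline(0, linestyle=':', linewidth=0.8)  # log10 LR = 1
    plt.xlabel('log10 likelihood ratio')
    plt.ylabel('cumulative proportion')
    plt.legend()
    plt.show()
```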

Effect of the recording condition of the suspect on forensic-voice-comparison performance

In forensic casework, there is typically a mismatch in conditions between suspect and offender recordings. We would not normally expect to be able to control the recording conditions for the offender recording, but in theory we should be able to control the recording conditions for the suspect recording when it is a recording of a police interview. The conditions of the suspect recording in the present case were quite poor. In 2012, the U.S. National Institute of Justice released draft

Conclusion

We have demonstrated the evaluation of forensic evidence under conditions reflecting those of an actual forensic-voice-comparison case. This includes consideration of the relevant prosecution and defence hypotheses to address in this case, selection of data reflecting the adopted defence hypothesis, simulation of recording conditions reflecting those of the suspect and offender recordings in the case, quantitative measurement and statistical modelling to calculate a likelihood ratio given the

Acknowledgements

This research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project LP100200142. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. Opinions expressed are those of the

References (67)

  • M.-W. Mak et al., Probabilistic feature-based transformation for speaker verification over telephone networks, Neurocomputing (2007)

  • M.J. Saks et al., The coming paradigm shift in forensic identification science, Science (2005)

  • U.S. National Research Council (NRC), Strengthening Forensic Science in the United States: A Path Forward (2009)

  • I.W. Evett et al., Expressing evaluative opinions: a position statement, Sci. Justice (2011)

  • N.E. Fenton, Improve statistics in court, Nature (2011)

  • M. Redmayne et al., Forensic science evidence in question, Crim. Law Rev. (2011)

  • B. Robertson et al., Extending the confusion about Bayes, Mod. Law Rev. (2011)

  • A. Biedermann et al., How to assign a likelihood ratio in a footwear mark case: an analysis and discussion in the light of R v T, Law Prob. Risk (2012)

  • G.S. Morrison, The likelihood-ratio framework and forensic evidence in court: a response to R v T, Int. J. Evid. Proof (2012)

  • A. Nordgaard et al., The likelihood ratio as value of evidence: more than a question of numbers, Law Prob. Risk (2012)

  • M.J. Sjerps et al., How clear is transparent? Reporting expert reasoning in legal cases, Law Prob. Risk (2012)

  • W.C. Thompson, Bad cases make bad law: reactions to R v T, Law Prob. Risk (2012)

  • Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach, Technical Report, NIST (2012)

  • G.S. Morrison et al., Response to Draft Australian Standard: DR AS 5388.3 Forensic Analysis — Part 3 — Interpretation (2012)

  • S.M. Willis et al., ENFSI Guideline for Evaluative Reporting in Forensic Science, Technical Report (2015)

  • G.S. Morrison, Forensic voice comparison

  • G.S. Morrison et al., Forensic strength of evidence statements should preferably be likelihood ratios calculated using relevant data, quantitative measurements, and statistical models — a response to Lennard (2013) Fingerprint identification: how far have we come?, Aust. J. Forensic Sci. (2014)

  • G.S. Morrison et al., Database selection for forensic voice comparison

  • J.M. Curran et al., Forensic Interpretation of Glass Evidence (2000)

  • W. Kerkhoff et al., The likelihood ratio approach in cartridge case and bullet comparison, J. Assoc. Firearm Toolmark Examiners (2013)

  • P. Rose, Forensic Speaker Identification (2002)

  • G.S. Morrison et al., Forensic speech science — review: 2010–2013

  • E. Gold et al., International practices in forensic speaker comparison, Int. J. Speech Lang. Law (2011)