Better Data and Smarter Data Policy for a Smarter Criminal Justice System

Professor of Law, Santa Clara University; fellow, Stanford Computational Policy Lab

Founder, Justice Codes; legal affairs writer, ABA Journal (Mr. Tashea does not speak for the American Bar Association or the ABA Journal.)

If the promise of artificial intelligence is to make systems smarter and more efficient, there may be no better candidate than the US criminal justice system. On any given day, close to half a million people, many of whose charges will ultimately be dropped, sit in jail awaiting trial, at an estimated cost to taxpayers of $14 billion per year [1]. One in three adults in the United States, or 74 million people, has a criminal record [2]. The majority of these records, however, comprise non-violent misdemeanors or charges that never led to a conviction [3, 4]. For those who serve time, failure to rehabilitate is the norm in this country: just shy of 77 percent of state prisoners are rearrested within five years of their release [5].

These burdens fall disproportionately on communities of color and the poor. African Americans are incarcerated in state prisons at five times the rate of white Americans [6]. Black people are four times more likely to be arrested than whites for marijuana offenses, despite similar rates of use [7]. Money bail locks people up primarily for their inability to purchase their freedom, rather than their risk to society.

Thankfully, “tough on crime” policies are giving way to “smart on crime” approaches. Nationwide, efforts to reform money bail could avoid much of the human and financial toll associated with pretrial incarceration, about $38 million per day, though the results so far are mixed [8-11]. In Georgia, the expansion of drug, mental health, and veteran courts has led to a decrease in crime and incarceration across the state, including a 36 percent decline in youth imprisonment [12]. In 2013, New York City ended stop and frisk, a police practice of temporarily detaining and searching disproportionately black and brown people on the streets. Recorded stops fell from over 685,000 in 2011 to 12,000 in 2016 (a 98 percent drop), and crime continued to decline [13].

Data, automation, and a culture of experimentation can further hasten reforms, making our criminal justice system smarter, fairer, and more just. However, careful attention must be paid to the infrastructure at the heart of every artificial intelligence system: the data and its algorithms, the human beings that use it, and accountability.

AI is the easy part: we need better data and data policy to end mass incarceration

It’s our belief that this starts with policymakers, who need to pay more attention to the foundational issues of data collection and standardization, the training data used to build artificial intelligence systems, and data sharing, as well as the oversight needed to ensure that automated processes yield the desired outcomes in practice, not merely in theory. As it stands, deficiencies in these foundations are already presenting challenges in three major areas: pretrial risk assessment, reentry, and second chances.


Pretrial Risk Assessment

Across the country, courts are using profile-based risk assessment tools to make decisions about pretrial detention. The tools are built on aggregated data about past defendants to identify factors that correlate with committing a subsequent crime or missing a trial date. They are used to score individuals and predict if pretrial incarceration is necessary.
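To make the mechanics concrete, here is a minimal Python sketch of how a points-based score can be derived from aggregated records of past cases. Everything in it is hypothetical: the two factors, the six records, and the one-point-per-factor rule are illustrative assumptions, not the method of any tool discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastCase:
    prior_conviction: bool      # hypothetical factor
    under_23_at_arrest: bool    # hypothetical factor
    missed_court_date: bool     # outcome recorded for the past case


def outcome_rate(cases, factor, present):
    """Share of past cases with the given factor value that missed a court date."""
    group = [c for c in cases if getattr(c, factor) == present]
    if not group:
        return 0.0
    return sum(c.missed_court_date for c in group) / len(group)


def derive_points(cases, factors):
    """Assign one point to each factor associated with a higher outcome rate."""
    points = {}
    for factor in factors:
        with_factor = outcome_rate(cases, factor, True)
        without_factor = outcome_rate(cases, factor, False)
        points[factor] = 1 if with_factor > without_factor else 0
    return points


def score(person, points):
    """Sum the points for every factor present for this person."""
    return sum(value for factor, value in points.items() if person[factor])


FACTORS = ["prior_conviction", "under_23_at_arrest"]

# Invented records for illustration only.
history = [
    PastCase(True, False, True),
    PastCase(True, True, True),
    PastCase(True, False, False),
    PastCase(False, True, False),
    PastCase(False, True, False),
    PastCase(False, False, True),
]

points = derive_points(history, FACTORS)
defendant = {"prior_conviction": True, "under_23_at_arrest": True}
print(points, score(defendant, points))
```

Because the points come entirely from the historical records, any pattern in how past cases were recorded carries straight through into the score a new defendant receives.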

Each risk assessment tool available on the market relies on different factors. The Public Safety Assessment (PSA) tool, developed by the Laura and John Arnold Foundation and deployed in 40-plus jurisdictions, uses nine factors, including prior criminal convictions and the defendant’s age at the time of arrest, to determine scoring [14]. Equivant’s COMPAS Classification software uses six factors in risk assessment and over 100 factors to carry out needs assessments that determine what services a person needs [15, 16]. Despite their differences, because these tools are built on historical data, they run a real risk of reinforcing the past practices that have led to mass incarceration, like the over-incarceration of poor and minority people [17]. COMPAS and the PSA have each been challenged in court, thus far unsuccessfully, regarding their accuracy, transparency, or impact on a defendant’s due process rights.

Data-Driven Recidivism Reduction

There are similar concerns about the application of evidence-based tools at the opposite end of the carceral cycle, affecting the 640,000 prisoners who reenter society each year. The President has supported the First Step Act, made reentry a priority, and promoted evidence-based recidivism reduction in the federal prison system [18-20].

But building successful, data-driven programs requires shoring up the underlying criminal justice data, which is notoriously messy and siloed. As a recent report by the White House Council of Economic Advisers concluded, investments in better evidence and assessment tools and carefully designed empirical evaluations are needed to determine what does and doesn’t work to close prison’s revolving doors [21]. The Act would require the DOJ to implement a risk and needs assessment system to determine how to assign programming and provide incentives and rewards to inmates. But concerns remain that, because the tool is likely to be built using historical data and will be implemented at the attorney general’s discretion (although with the input of an Independent Review Committee), it may amplify existing racial and other biases.

Second Chances

Also in the realm of reentry, waves of “second chance” reforms have been enacted across the country. These policies increase the eligibility of individuals for early release, clear their criminal records, or help them regain the right to vote. But while much attention has been paid to the increasing availability of second chance opportunities, less is known about their uptake and impact.

Recent research conducted by one of us defines and documents the “second chance gap” between eligibility for and receipt of second chance relief, in the form of re-sentencing, records clearing, and re-enfranchisement [22]. It finds that although tens of millions of Americans could clear their records, only a fraction of them have, leaving a lower-bound estimate of 25 to 30 million people living with records that could, under current law, be cleaned up, lessening the damage that a criminal record does to employment, housing, and a host of other prospects. The large number of individuals in the gap stems from a variety of causes, including a lack of awareness of eligibility, prohibitive costs, fines and fees, and cumbersome application processes. The potential gains from closing this “second chance gap” (including decreased incarceration costs, restored dignity among former prisoners, public safety, and employment) are too valuable to ignore. Machine automation can help remove the red tape, not steel bars, that holds individuals back, as demonstrated in California, Maryland, and Pennsylvania [23-25]. The devil, as usual, is in the data and design details, with the reach of clearance and its cost, from pennies to thousands, depending on how it’s implemented [26].

What all three contexts (pretrial detention, recidivism, and second chances) have in common is that the extent of potential improvements, and accountability for their delivery, depend on the quality of the underlying data and algorithms in use, as well as access to the resulting outcome data.

Machine automation and machine learning require machine-readable, structured, and, in the case of supervised learning, labeled data. The algorithms derived from this data need to be evaluated and benchmarked for their performance. Once deployed, novel interventions and their impacts need to be validated. These steps, each challenging in its own right, can happen. Proactively attending to the vital issues of data collection, sharing, and oversight will make it much more likely that they do.


Data Collection and Standardization

Despite progress in recent years, data about the criminal justice system remains notoriously messy, complex, and hard to come by in standardized formats. Information is often locked up in public and private data silos and paper files, and in employment, prison, and court records. As a result, getting permission to collect and clean data from disparate sources consumes a disproportionate amount of time, often putting it out of the reach of the very reformers who are trying to develop and test novel insights and study the impacts of their implementation.

Paying attention to data collection and standardization at the outset can avoid these data deficits. A new Florida law shows one way [27]. It requires counties to publicly release 25 percent more data than they currently do into a public database, providing for a centralized process for the regular collection, compilation, and management of data about individuals, processes, and outcomes in the criminal justice system [28]. In this case, all data must be submitted in useable (machine-readable, disaggregated, privacy-respecting) form, with a single, unique identifier for information collected about an individual across criminal justice agencies, like courts, corrections, and police. The output should support innovation and community-based policy development, implementation and refinement, and, in the process, accountability and trust.
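As a rough illustration of why a single, unique identifier matters, the following Python sketch links records from separate agencies into one view per person. The field names (`person_id`, `event`, `date`), the identifiers, and the records themselves are invented for this example and are not drawn from the Florida statute or any real database schema.

```python
from collections import defaultdict


def link_by_person(agency_records):
    """Group rows from every agency under the shared person identifier."""
    linked = defaultdict(dict)
    for agency, rows in agency_records.items():
        for row in rows:
            # Keep everything except the identifier, which becomes the key.
            details = {k: v for k, v in row.items() if k != "person_id"}
            linked[row["person_id"]].setdefault(agency, []).append(details)
    return dict(linked)


# Hypothetical records; the opaque IDs stand in for a privacy-respecting key.
agency_records = {
    "police": [
        {"person_id": "P-0001", "event": "arrest", "date": "2018-01-05"},
    ],
    "courts": [
        {"person_id": "P-0001", "event": "charge dismissed", "date": "2018-03-12"},
        {"person_id": "P-0002", "event": "conviction", "date": "2018-02-20"},
    ],
    "corrections": [
        {"person_id": "P-0002", "event": "release", "date": "2018-09-01"},
    ],
}

linked = link_by_person(agency_records)
for person_id, by_agency in linked.items():
    print(person_id, sorted(by_agency))
```

Without a shared key, the same join would require matching on names and birth dates, which is slower, more error-prone, and more invasive of privacy.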

Training and Test Data and Data Sharing

Once system data is collected, additional time must be spent preparing it for research. As Fei-Fei Li has said, “The thankless work of making a dataset is at the core of AI research” [29]. The ImageNet training dataset Li helped create and shared with the world has become the foundation for powering and measuring advances in image recognition [30]. Such datasets can be used to “train” software to recognize and correctly label images, as well as provide a “test” or benchmark for evaluating the relative accuracy of different artificial intelligence algorithms.
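The train-and-test idea can be sketched in a few lines of Python. This is an illustrative toy, not a description of how ImageNet is used: the labeled data, the two candidate "algorithms," and the 25 percent holdout are all assumptions made for the example.

```python
import random
from collections import Counter


def split(records, test_fraction, seed):
    """Shuffle reproducibly, then hold out a fraction as the shared test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_set):
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(model(x) == label for x, label in test_set)
    return correct / len(test_set)


def fit_majority_baseline(train_set):
    """'Train' a trivial model that always predicts the most common label."""
    majority_label = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda x: majority_label


def threshold_rule(x):
    """A second candidate; it matches how the toy labels were generated."""
    return x >= 10


# Toy labeled data: (feature, label) pairs.
labeled = [(x, x >= 10) for x in range(20)]
train_set, test_set = split(labeled, test_fraction=0.25, seed=0)

baseline = fit_majority_baseline(train_set)
results = {
    "majority_baseline": accuracy(baseline, test_set),
    "threshold_rule": accuracy(threshold_rule, test_set),
}
print(results)
```

Scoring every candidate against the same held-out examples is what makes the comparison fair, and it is only possible when the labeled data is shared.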

The sharing of criminal justice data, whether in the form of labeled training or holdout datasets or other means, would accelerate progress. Research-ready data should be prioritized from the start, with designated interfaces already designed for use by both computers and humans to facilitate the secure, privacy-respecting sharing of data.


Algorithmic Oversight

While bolstering data collection and data sharing, federal and local governments also need to get serious about AI oversight. There is a transparency issue regarding existing tools that is in direct conflict with an open and transparent court process. For example, Equivant refuses to make public the details of the COMPAS algorithm, and the training data for neither COMPAS nor the PSA has been publicly released, though there may be privacy protections in tension with doing so. Once a system has been implemented, auditing it to determine whether it is performing as intended is also difficult to carry out.
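One form such an audit could take, assuming an agency has access to both a tool's predictions and the eventual outcomes, is a comparison of error rates across groups. The Python sketch below computes a false positive rate per group. The metric is only one of several an auditor might choose, and the field names and records are invented for illustration.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per group: share of people who did not reoffend but were flagged high risk.

    Groups with no non-reoffending records are omitted from the result.
    """
    flagged = defaultdict(int)
    negatives = defaultdict(int)
    for r in records:
        if not r["reoffended"]:
            negatives[r["group"]] += 1
            if r["flagged_high_risk"]:
                flagged[r["group"]] += 1
    return {group: flagged[group] / n for group, n in negatives.items()}


# Hypothetical outcome records: a prediction paired with what later happened.
records = [
    {"group": "A", "flagged_high_risk": True, "reoffended": False},
    {"group": "A", "flagged_high_risk": False, "reoffended": False},
    {"group": "A", "flagged_high_risk": True, "reoffended": True},
    {"group": "A", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": True, "reoffended": False},
    {"group": "B", "flagged_high_risk": True, "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": True},
]

rates = false_positive_rates(records)
print(rates)
```

Even a check this simple cannot be run unless predictions and outcome data are both retained and made available to whoever performs the audit.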

But rather than make case-by-case determinations, proactive policies that support public trust and good science should be put in place. New York City passed a law to study the issue of transparency in algorithms used by governments. However, the United States on the whole is a laggard in data and algorithmic regulation. The lack of transparency and oversight both diminishes a tool’s potential for improvement and risks curtailing a defendant’s due process rights.

While jurisdictions, whether local, statewide, or national, are likely to advance standards tailored to their individual needs, bedrock tenets of data governance should be kept in mind. Tool providers should be required to disclose their inputs and processes, and agencies and courts should be required to explain how they are using the tools and how they are performing. Certainly not everyone should have access to all the data and corresponding algorithms—this could undermine privacy, invite game playing, and discourage innovation—but access needs to be robust enough to ensure accountability, advance scientific understanding and iteration, and build public trust.

Artificial intelligence offers great potential to turn the tide on mass incarceration in the US. However, it will not be as simple as using the right tool or finding the right dataset. If criminal justice reformers and policymakers are serious about a smarter criminal justice system, enhanced in part by AI, they must prioritize creating a smart and strong foundation, based on solid data and solid data policy, on which to support it.


  1. Pretrial Justice Institute. "Pretrial justice: How much does it cost?" Washington, D.C (2017) available at

  2. Next Generation Identification (NGI) Monthly Fact Sheet (2016), available at

  3. Alexandra Natapoff, "Misdemeanors," 85 Southern California Law Review 101 (2012); Loyola-LA Legal Studies Paper No. 2012-08.

  4. Gary Fields and John R. Emshwiller, "As Arrest Records Rise, Americans Find Consequences Can Last a Lifetime," The Wall Street Journal Aug. 18, 2014

  5. Matthew R. Durose, Alexia D. Cooper, and Howard N. Snyder, "Recidivism of Prisoners Released in 30 States in 2005: Patterns from 2005 to 2010," Bureau of Justice Statistics Special Report (2014)

  6. Ashley Nellis, "The Color of Justice: Racial and Ethnic Disparity in State Prisons," The Sentencing Project (2016) available at

  7. American Civil Liberties Union (ACLU) "Marijuana Arrests by the Numbers", available at

  8. Udi Ofer, "We Can’t End Mass Incarceration Without Ending Money Bail," December 11, 2017

  9. Pretrial Justice Institute. "Pretrial justice: How much does it cost?," Washington, D.C (2017) available at

  10. Megan Stevenson and Jennifer Doleac, "The Roadblock To Reform," American Constitution Society (2018) available at

  11. Jason Tashea, "Battling Bail: The bail industry is fighting back against reforms that threaten its livelihood," ABA Journal (2018)

  12. The Associated Press, "Georgia’s top judge applauds criminal justice reform success," Seattle Times, February 22, 2018

  13. Al Baker, "Police Evaluations Should Focus on Lawfulness of Stops, Monitor Says," New York Times, October 20, 2017

  14. Laura and John Arnold Foundation, "Public Safety Assessment: A risk tool that promotes safety, equity, and justice," (2017)

  15. COMPAS Classification software,

  16. Equivant, "Official Response to Science Advances," January 17, 2018

  17. John Koepke and David Robinson, "Danger Ahead: Risk Assessment and the Future of Bail Reform," Washington Law Review, Forthcoming (2018).

  18. Committee on the Judiciary, "The First Step Act", available at

  19. C.J. Ciaramella, "Trump Says In SOTU That Administration Will Pursue Prison Reforms," Reason, January 30, 2018

  20. Law & Justice, "President Donald J. Trump Supports Legislative Action to Reduce Recidivism in Our Prison System," May 18, 2018

  21. Law & Justice, "CEA Report: Returns on Investments in Recidivism-Reducing Programs," May 18, 2018

  22. Colleen Chien, "The Second Chance Gap," Santa Clara Univ. Legal Studies Research Paper (2018). Available at SSRN: or

  23. Jason Tashea, "San Francisco district attorney to use algorithm to aid marijuana expungements," ABA Journal (2018)

  24. Nation, "Here’s why many Americans don’t clear their criminal records," PBS SoCal, January 8, 2016

  25. J.D. Prose, "Pennsylvania becomes first state with ‘clean slate’ law for nonviolent criminal records," The Times, June 28 2018,

  26. Colleen Chien, "Presentation to the “Harnessing Technology to Close the ‘Second Chance Gap’” Workshop," Center for American Progress, November 16, 2018 available at

  27. Florida Senate Bill 1392 (2018), available at

  28. Issie Lapowsky, "Florida Could Start a Criminal-Justice Data Revolution," Wired, March 13, 2018 available at

  29. Dave Gershgorn, "It's Not About The Algorithm: The data that transformed AI research—and possibly the world," Quartz, July 26 2017, available at

  30. ImageNet,
