Follow the Data! Algorithmic Transparency Starts with Data Transparency
Assistant professor, Computer Science and Engineering, Tandon School of Engineering; Center for Data Science, New York University; and appointed member of the New York City Automated Decision Systems Task Force
Associate professor, Information School; and director, Urbanalytics Lab, University of Washington
The data revolution that is transforming every sector of science and industry has been slow to reach the local and municipal governments and NGOs that deliver vital human services in health, housing, and mobility [1-2]. Urbanization has made the issue acute: in 2016, more than half of North Americans lived in cities with at least 500,000 inhabitants [3]. The opportunities that Big Data presents in this context have long been recognized, as evidenced by the remarkable progress around Open Data and open access to it [4]; the digitization of government records and processes [5]; and, perhaps most visibly, smart city efforts that emphasize using sensors to optimize city processes [6].
Despite this progress, the public sector has been slow to adopt predictive analytics because of its mandate for responsibility: any decision made by algorithms must be open to scrutiny by the individuals and organizations it affects, just as taxpayers must be able to verify that resources are being distributed equitably.
Recent reports on data-driven decision-making underscore that fairness and equitable treatment of individuals and groups are difficult to achieve [7], and that transparency and accountability of algorithmic processes are indispensable but rarely enacted [8-9]. As a society, we cannot afford the status quo: algorithmic bias in administrative processes limits access to resources for those who need these resources most, and amplifies the effects of systemic historical discrimination. Lack of transparency and accountability threatens the democratic process itself.
In response to these threats, New York City recently passed a law requiring that a task force be put in place to survey the current use of “automated decision systems,” defined as “computerized implementations of algorithms, including those derived from machine learning or other data processing or artificial intelligence techniques, which are used to make or assist in making decisions” in City agencies [10]. The task force will develop a set of recommendations for enacting algorithmic transparency by the agencies, and will propose procedures for:
interrogating automated decision systems for bias and discrimination against members of legally protected groups, and addressing instances in which a person is harmed based on membership in such groups (Sections 3 (c) and (d));
requesting and receiving an explanation of an algorithmic decision affecting an individual (Section 3 (b));
assessing how automated decision systems function and are used, and archiving the systems together with the data they use (Sections 3 (e) and (f)).
New York is the first US city to pass an algorithmic transparency law, and we expect other municipalities to follow with similar legal frameworks or recommendations. Of utmost importance as this happens is recognizing the central role of data transparency in any algorithmic transparency framework. Meaningful transparency of algorithmic processes cannot be achieved without transparency of data.
What Is Data Transparency?
In applications involving predictive analytics, data is used to customize generic algorithms for specific situations—that is to say that algorithms are trained using data. The same algorithm may exhibit radically different behavior—make different predictions; make a different number of mistakes and even different kinds of mistakes—when trained on two different data sets. In other words, without access to the training data, it is impossible to know how an algorithm would actually behave.
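The dependence on training data described above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the works cited here: the threshold learner, the function names, and both toy data sets are hypothetical. The same training procedure is run on two data sets and ends up classifying the same person differently.

```python
from statistics import mean

def train_threshold(examples):
    """Learn a cutoff halfway between the mean feature value of each class.

    examples: list of (feature_value, label) pairs with label in {0, 1}.
    """
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    return (mean(positives) + mean(negatives)) / 2

def predict(threshold, x):
    """Flag x as high-risk (1) when it exceeds the learned cutoff."""
    return 1 if x > threshold else 0

# Two hypothetical training sets fed to the same algorithm.
data_a = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
data_b = [(1, 0), (2, 0), (3, 0), (3, 1), (4, 1), (5, 1)]

threshold_a = train_threshold(data_a)  # cutoff learned from data_a
threshold_b = train_threshold(data_b)  # cutoff learned from data_b

# The same individual (feature value 4) gets a different outcome
# depending on which data set the algorithm was trained on.
outcome_a = predict(threshold_a, 4)
outcome_b = predict(threshold_b, 4)
```

Inspecting the code of `train_threshold` alone does not reveal which outcome a person will receive; that depends on the cutoff, which is determined entirely by the data.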
Algorithms and corresponding training data are used, for example, in predictive policing applications to target areas or people deemed to be high-risk. But as has been shown extensively, when the data used to train these algorithms reflects the systemic historical bias toward poor and predominantly African-American neighborhoods, the predictions will simply reinforce the status quo rather than provide any new insight into crime patterns [11]. The transparency of the algorithm is neither necessary nor sufficient to understand and counteract these particular errors. Rather, the conditions under which the data was collected must be retained and made available to make the decision-making process transparent.
Even those decision-making applications that do not explicitly attempt to predict future behavior based on past behavior are still heavily influenced by the properties of the underlying data. For example, the VI-SPDAT risk assessment tool [12], used to prioritize homeless individuals for receiving services, does not involve machine learning, but still assigns a risk score based on survey responses, and that score cannot be interpreted without understanding the conditions under which the data was collected. As another example, matchmaking methods such as those used by the Department of Education to assign children to spots in public schools are designed and validated using data sets; if these data sets are not made available, the matchmaking method itself cannot be considered transparent.
What is data transparency, and how can we achieve it? One immediate interpretation of this term is “making the training and validation data sets publicly available.” However, while data should be made open whenever possible, much of it is sensitive and cannot be shared directly. That is, data transparency is in tension with the privacy of individuals who are included in the data set. In light of this, we offer an alternative interpretation of data transparency:
In addition to releasing training and validation data sets whenever possible, agencies shall make publicly available summaries of relevant statistical properties of the data sets that can aid in interpreting the decisions made using the data, while applying state-of-the-art methods to preserve the privacy of individuals.
When appropriate, privacy-preserving synthetic data sets can be released in lieu of real data sets to expose certain features of the data, if real data sets are sensitive and cannot be released to the public.
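One way such a privacy-preserving summary might be produced is to release per-category counts with calibrated random noise added. The sketch below uses the Laplace mechanism from differential privacy. That choice is our assumption for illustration, since the interpretation above does not prescribe a technique. The function names, the category list, and the value of `epsilon` are hypothetical, and the sketch assumes each person contributes exactly one record.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_histogram(values, categories, epsilon, rng):
    """Return per-category counts with Laplace noise of scale 1/epsilon.

    The category list is fixed in advance, so the set of released keys
    does not depend on who is in the data. Smaller epsilon means more noise.
    """
    counts = Counter(values)
    return {c: counts.get(c, 0) + laplace_noise(1 / epsilon, rng) for c in categories}

# Hypothetical service categories and records.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
service_types = ["emergency", "rehousing", "supportive"]
records = ["emergency", "emergency", "rehousing", "supportive", "emergency"]
released = noisy_histogram(records, service_types, epsilon=0.5, rng=rng)
```

The released counts are approximate by design: an agency would pick `epsilon` to balance how useful the summary is against how much it could reveal about any one person.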
An important aspect of data transparency is interpretability—surfacing the statistical properties of a data set, the methodology that was used to produce it, and, ultimately, substantiating its “fitness for use” in the context of a specific automated decision system or task. This consideration of a specific use is particularly important because data sets are increasingly used outside the original context for which they were intended. This compels us to augment our interpretation of data transparency in the public sector to include:
Agencies shall make publicly available information about the data collection and preprocessing methodology, in terms of assumptions, inclusion criteria, known sources of bias, and data quality.
Data transparency is important both when an automated decision system is interrogated for systematic bias and discrimination, and when it is asked to explain an algorithmic decision that affects an individual. For example, suppose that a system scores and ranks individuals for access to a service. If an individual enters her data and receives the result—say, a score of 42—this number alone provides no information about why she was scored in this way, how she compares to others, and what she can do to potentially improve her outcome.
To facilitate transparency, the explanation given to an individual should be interpretable, insightful, and actionable. As part of the result, data that pertains to other individuals, or a summary of such data, may need to be released—for example, to explain which other individuals or groups of individuals receive higher scores or more favorable outcomes. This functionality requires data transparency mechanisms discussed in our alternative interpretation above.
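To make the score-of-42 example concrete, the following sketch shows one possible shape for such an explanation. It is a hypothetical illustration: `explain_score`, the eligibility `cutoff`, and the sample scores are all assumptions, and it presumes that higher scores lead to more favorable outcomes. Only aggregate facts about other individuals are used.

```python
def explain_score(score, all_scores, cutoff):
    """Put one individual's score in context using only aggregate facts."""
    below = sum(1 for s in all_scores if s < score)
    return {
        "score": score,
        # Share of scored individuals who received a strictly lower score.
        "percent_scoring_lower": round(100 * below / len(all_scores), 1),
        # Hypothetical threshold at which the service is granted.
        "cutoff": cutoff,
        "points_to_cutoff": max(0, cutoff - score),
    }

# Hypothetical population of scores, including the individual's own.
population = [10, 25, 42, 42, 55, 60, 71, 80, 90, 95]
explanation = explain_score(42, population, cutoff=60)
```

A fuller explanation would also say which inputs the individual could change to move her score. In a real deployment the aggregates themselves may need privacy protection before release.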
Toward Data Transparency by Design
Enacting algorithmic and data transparency challenges the state of the art in data science research and practice, and will require significant technological effort on the part of agencies. It will require careful planning, financial resources, and time.
Two recent public actions of a similar nature illustrate the timescales involved: the French Digital Republic Act came into effect in October 2016, following a yearlong process [13], while the EU General Data Protection Regulation (GDPR) was adopted in April 2016 and became enforceable in May 2018, more than two years later [14].
How can we enable data transparency in complex data-driven administrative processes? The research community is actively working on methods for enabling fairness, accountability, and transparency (FAT) of specific algorithms and their outputs [15-21]. While important, these approaches focus solely on the final step of the data science lifecycle (called “analysis and validation” in Figure 1), and are limited by the assumption that input data sets are clean and reliable.
Figure 1: The data usage lifecycle
In challenging this assumption, we observe that additional information and intervention methods are available if we consider the upstream process that generated the input data [22]. Appropriately annotating data sets when they are shared, and maintaining information about how data sets are acquired and manipulated, allows us to provide data transparency: to explain statistical properties of the data sets, uncover any sources of bias, and make statements about data quality and fitness for use. Put another way: if we have no information about how a data set was generated and acquired, we cannot convincingly argue that it is appropriate for use by an automated decision system.
To achieve algorithmic transparency, we need to develop generalizable data transparency methodologies for all stages of the data lifecycle [23], and to build tools that implement these methodologies [24-25]. Such tools should be placed in the hands of data practitioners in the public sector. Importantly, the requirement of data transparency cannot be handled as an afterthought, but must be provisioned for at design time.
To make this discussion concrete, let’s consider an example. The growing homelessness crisis is a deeply complex challenge to urban communities. A variety of services is available to homeless citizens, including emergency shelter, temporary rehousing, and permanent supportive housing. The goal is to enable an individual to transition into stable housing after an episode of homelessness.
Social service agencies are beginning to collect, share, and analyze data in an effort to provide better targeted interventions. Broadly speaking, these agencies perform two categories of data analysis. The first category is personalized prediction and recommendation of services. For example, previously incarcerated citizens may benefit more from supportive housing, while families with a history of substance abuse may be directed to harm-reduction programs. Data can also be used to predict frequent service users, recommend treatment for sufferers of substance abuse and of other mental-health issues, and provide protection for victims of domestic violence. The second category is measurement and evaluation of the effectiveness of specific interventions, and of the overall system of homeless assistance. Both kinds of analysis are done using complex data-driven models, and rest on the availability, interoperability, and statistical validity of data collected from numerous local communities.
Communities use Homeless Management Information Systems (HMIS) to collect data [26]. Data sets produced by an HMIS are typically “weakly structured”: rectangular, with rows and columns, but otherwise with no guaranteed properties. For example, it is often the case that columns contain data of mixed types, that missing values are abundant, and that column names are not meaningful. HMIS data is anonymized and then shared, and it must be post-processed in various ways to make it appropriate for analysis.
An analyst’s set of candidate weakly structured data sets is formed from a number of sources: open data portals, queries against other agencies’ APIs, and locally derived data sets. In this context, relevant data sets are identified, repaired, restructured, and aligned—so-called “data wrangling.” A crucial dimension of data acquisition and curation that is often overlooked, and for which hardly any technical support exists in current systems, concerns statistical properties of the data. For example, removing records with missing values or joining between two data sets may introduce bias. This bias should be tracked and carried with the data set to inform downstream analysis. In some cases, relevant properties can be computed directly (e.g., geographic coverage). In other cases, the data set must be explicitly annotated (e.g., missing records due to system outage or to rules that prevent disclosure).
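As a rough sketch of what tracking this kind of bias could look like, the Python below drops records with missing values and records how each group's share of the data shifted as a result. The field names, the sample records, and the shape of the annotation are hypothetical, and the sketch does not reproduce any particular tool.

```python
from collections import Counter

def group_shares(records, key):
    """Fraction of records that fall into each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}

def drop_missing_with_report(records, required, group_key):
    """Drop records lacking any required field; report the shift in group shares."""
    kept = [r for r in records if all(r.get(f) is not None for f in required)]
    before = group_shares(records, group_key)
    after = group_shares(kept, group_key)
    annotation = {
        "step": "drop_missing",
        "required": list(required),
        "rows_before": len(records),
        "rows_after": len(kept),
        # Positive values mean the group is over-represented after cleaning.
        "share_shift": {g: after.get(g, 0.0) - s for g, s in before.items()},
    }
    return kept, annotation

# Hypothetical records in which missing ages are concentrated in one borough.
records = [
    {"borough": "Queens", "age": 34},
    {"borough": "Queens", "age": None},
    {"borough": "Queens", "age": None},
    {"borough": "Brooklyn", "age": 51},
    {"borough": "Brooklyn", "age": 29},
    {"borough": "Brooklyn", "age": 40},
]
cleaned, note = drop_missing_with_report(records, ["age"], "borough")
```

Here Queens falls from half of the records to a quarter. The annotation travels with `cleaned`, so a downstream analyst can see that the cleaning step changed the geographic mix.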
A data set derived in this manner may be further filtered, scored, and ranked to prioritize analysis. Filtering and ranking operations may introduce further bias, and must be tracked to explain properties of the data set they produce. For example, it may be required that the filtered data set contain homelessness data for Queens, Brooklyn, and Manhattan, and that it have representation of age and gender categories that agrees with a given population model (e.g., with what is expected based on the census). Further, if data is returned in sorted order, it must be guaranteed that no single ethnic group dominates the top ranks of the list. Restating the filtering and ranking tasks to capture the data analyst’s intent, while at the same time ensuring that several possibly competing objectives hold over the result, is difficult and requires support from the system to be done effectively. The result of this stage is a data set that is used as input for the data analysis stage.
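The two requirements in this example can be expressed as simple checks on a filtered, ranked result. The sketch below is illustrative only: the group labels, the expected shares standing in for a population model, and the choice of `k` are hypothetical, and what counts as an acceptable gap or an acceptable top-rank share is a policy decision the code does not make.

```python
from collections import Counter

def representation_gaps(records, key, expected_shares):
    """Observed share minus expected share for each group in the population model."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {g: counts.get(g, 0) / total - share for g, share in expected_shares.items()}

def max_share_in_top_k(ranked, key, k):
    """Return the most frequent group among the top k ranked records and its share."""
    top = ranked[:k]
    group, n = Counter(r[key] for r in top).most_common(1)[0]
    return group, n / len(top)

# Hypothetical ranked output, best-ranked record first.
ranked = [
    {"id": 1, "group": "A"},
    {"id": 2, "group": "A"},
    {"id": 3, "group": "B"},
    {"id": 4, "group": "A"},
    {"id": 5, "group": "B"},
    {"id": 6, "group": "B"},
]
gaps = representation_gaps(ranked, "group", {"A": 0.4, "B": 0.6})
top_group, top_share = max_share_in_top_k(ranked, "group", k=4)
```

In this toy result the two groups are equally represented overall, yet group A holds three of the top four ranks. That is the kind of property that has to be surfaced to the analyst rather than discovered after the fact.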
During data analysis, a predictive model is learned based on the data, or an available model is invoked to make predictions. Data analysis is often coupled with validation, where confidences or error rates are produced alongside predictions. Prior research shows that data analytics can be instrumented to quantify accuracy, confidence, and even fairness at the level of sub-population [27-29]. Feedback from the data analytics stage, such as a high error rate on a specific sub-population, may be used to state additional objectives, and to iteratively refine the process upstream.
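A minimal sketch of validation at the level of sub-population follows. The group names and the labeled predictions are hypothetical, and the cited tools [27-29] go well beyond this kind of per-group error count.

```python
from collections import defaultdict

def error_rate_by_group(rows):
    """Compute the misclassification rate per group.

    rows: iterable of (group, true_label, predicted_label) tuples.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, truth, predicted in rows:
        totals[group] += 1
        errors[group] += int(truth != predicted)
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical validation results for two age groups.
validation = [
    ("under_25", 1, 0), ("under_25", 1, 1), ("under_25", 0, 1), ("under_25", 0, 0),
    ("25_plus", 1, 1), ("25_plus", 0, 0), ("25_plus", 1, 0), ("25_plus", 0, 0),
]
rates = error_rate_by_group(validation)
```

An overall error rate of 37.5 percent would hide the fact that the model is wrong twice as often for the younger group. That gap is the feedback that could trigger additional objectives upstream.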
Given the close link that exists between algorithms and the data on which they are trained, data transparency is the next frontier. That does not, as we have noted, mean releasing raw data sets—which are often unnecessary, and usually insufficient, to quantify fitness for use in the context of a particular automated decision system or task.
Enacting algorithmic and data transparency will require a significant shift in culture on the part of relevant agencies. It will require careful planning, financial resources, and time. Equally important, algorithmic and data transparency will require a paradigm shift in the way we think about data-driven algorithmic decision making in the public sector.
First, we must accept that “accuracy” and “utility” cannot be the only primary objectives. They must be balanced with equitable treatment of members of historically disadvantaged groups, and with accountability and transparency to the individuals affected by algorithmic decisions and to the general public. Second, we must recognize that automated decision systems cannot be “patched” to become transparent and accountable. Rather, we must provision for transparency and accountability at design time, which clearly impacts how we build and procure software systems for agency use. Perhaps less obviously, provisioning for data transparency impacts how municipalities structure their Open Data efforts. It is no longer sufficient to publish a data set on a city’s open data portal. Rather, the public must be informed about the data set’s composition and about the methodology used to produce it, as well as its fitness for any particular use. It cannot be reiterated enough that data transparency is a property not only of the data itself, but also of how it is deployed in any particular context.
1. Stephen Goldsmith and Susan Crawford, The Responsive City: Engaging Communities through Data-Smart Governance (San Francisco: John Wiley & Sons, 2014).
2. Marcus R Wigan and Roger Clarke, “Big Data’s Big Unintended Consequences,” Computer 46, no. 6 (2013):46–53, http://doi.ieeecomputersociety.org/10.1109/MC.2013.195.
3. United Nations, “The World’s Cities in 2016,” 2016, http://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf.
4. Stefan Baack, “Datafication and Empowerment: How the Open Data Movement Rearticulates Notions of Democracy, Participation, and Journalism,” Big Data & Society 2, no. 2 (2015).
5. Annalisa Cocchia, “Smart and Digital City: A Systematic Literature Review,” in Smart City, eds. Renata Paola Dameri and Camille Rosenthal-Sabroux (New York: Springer International Publishing, 2014), 13–43.
6. Ibrahim Abaker Targio Hashem et al., “The Role of Big Data in Smart City,” International Journal of Information Management 36, no. 5 (2016):748–758.
7. MetroLab Network, “First, Do No Harm: Ethical Guidelines for Applying Predictive Tools within Human Services,” 2017, https://metrolabnetwork.org/data-science-and-human-services-lab/.
8. Robert Brauneis and Ellen P. Goodman, “Algorithmic Transparency for the Smart City,” Yale Journal of Law & Technology 20, no. 103 (2018), http://dx.doi.org/10.2139/ssrn.3012499.
9. Julia Angwin et al., “Machine Bias: Risk Assessments in Criminal Sentencing,” ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
10. The New York City Council, “Int. No. 1696-A: A Local Law in Relation to Automated Decision Systems Used by Agencies,” 2017, https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3137815&GUID=437A6A6D-62E1-47E2-9C42-461253F9C6D0.
11. Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Publishing Group, 2016).
12. Partners Ending Homelessness, “Vulnerability Index—Service Prioritization Decision Assistance Tool (VI-SPDAT),” http://pehgc.org/.
13. La République Numérique, “The Digital Republic Bill—Overview,” https://www.republique-numerique.fr/pages/in-english.
14. The European Union, “Regulation (EU) 2016/679: General Data Protection Regulation (GDPR),” https://gdpr-info.eu/.
15. Cynthia Dwork et al., “Fairness through Awareness,” Innovations in Theoretical Computer Science, Cambridge, Massachusetts, January 8–10, 2012.
16. Michael Feldman et al., “Certifying and Removing Disparate Impact,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 10–13, 2015.
17. Sara Hajian and Josep Domingo-Ferrer, “A Methodology for Direct and Indirect Discrimination Prevention in Data Mining,” IEEE Transactions on Knowledge and Data Engineering 25, no. 7 (2013):1445–1459, https://ieeexplore.ieee.org/document/6175897?reload=true.
18. Faisal Kamiran, Indre Zliobaite, and Toon Calders, “Quantifying Explainable Discrimination and Removing Illegal Discrimination in Automated Decision Making,” Knowledge and Information Systems 35, no. 3 (2013):613–644.
19. Andrea Romei and Salvatore Ruggieri, “A Multidisciplinary Survey on Discrimination Analysis,” The Knowledge Engineering Review 29, no. 5 (2014):582–638, https://doi.org/10.1017/S0269888913000039.
20. Richard S. Zemel et al., “Learning Fair Representations,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, 2013, 325–333, http://proceedings.mlr.press/v28/zemel13.pdf.
21. Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” 8th Innovations in Theoretical Computer Science Conference, January 9–11, 2017, Berkeley, California, https://doi.org/10.4230/LIPIcs.ITCS.2017.43.
22. Keith Kirkpatrick, “It’s Not the Algorithm, It’s the Data,” Communications of the ACM 60, no. 2 (2017): 21–23, https://cacm.acm.org/magazines/2017/2/212422-its-not-the-algorithm-its-the-data/abstract.
23. Julia Stoyanovich et al., “Fides: Towards a Platform for Responsible Data Science,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, Illinois, June 27–29, 2017, 26:1–26:6.
24. Haoyue Ping, Julia Stoyanovich, and Bill Howe, “Datasynthesizer: Privacy-Preserving Synthetic Datasets,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, Illinois, June 27–29, 2017, 42:1–42:5.
25. Ke Yang et al., “A Nutritional Label for Rankings,” in ACM SIGMOD 2018.
26. HUD Exchange, “HMIS Data and Technical Standards,” https://www.hudexchange.info/programs/hmis/hmis-data-and-technical-standards/.
27. Florian Tramèr et al., “FairTest: Discovering Unwarranted Associations in Data-Driven Applications,” 2015, available at https://arxiv.org/abs/1510.02377.
28. Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou, “Fairness Testing: Testing Software for Discrimination,” in Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Engineering, Paderborn, Germany, September 4–8, 2017, https://doi.org/10.1145/3106237.3106277.
29. Anupam Datta, Shayak Sen, and Yair Zick, “Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems,” 2016 IEEE Symposium on Security and Privacy, May 22–26, 2016, San Jose, California, available at https://www.andrew.cmu.edu/user/danupam/datta-sen-zick-oakland16.pdf.