Recently Governor Gavin Newsom announced 1 that his administration plans to pursue a “data dividend” law in which businesses would be required to make payments to the state or consumers when their data is sold. In an effort to ensure the long-term viability and cultural adoption of such a scheme, I’m going to explore current approaches to ethically sourced data certification systems, whether and how to implement such a system to support widespread demand for non-extractive data practices, and what format such a scheme would need to take to avoid the pitfalls of lax standards.
To put this proposal in the context of other sustainability certification programs, I will give a brief overview of USDA Organic and LEED, explore where certifications have gone wrong in the past, and then explore how to design a system specific to data and the attention economy. I am turning to these two programs partly because there is little history to point to in the way of attention economy regulation other than moralistic decency censorship, but also because framing data in ecological terms brings the wider connected system of human attention and its financialization (intentionally addictive software design, for example) into view. Human attention is the soil in which we farm, and data is the saleable produce which grows from that soil. How we tend now to the long-term viability of that soil will determine what can be grown in it in the future.
Jenny Odell, in her recent book How to Do Nothing: Resisting the Attention Economy 2, lays out the ecological sustainability comparison thusly:
Capitalism, colonialist thinking, loneliness, and an abusive stance toward the environment all coproduce one another. It’s important because of the parallels between what the economy does to an ecological system and what the attention economy does to our attention. In both cases, there’s a tendency toward an aggressive monoculture, where those components that are seen as “not useful” and which cannot be appropriated (by loggers or by Facebook) are the first to go.
Where did organic certification come from?
To start, let’s take a look at the United States Department of Agriculture (USDA) Organic certification program. In 1990 Congress passed the Organic Foods Production Act to create a national standard for organic food and fiber. This law gave us the National Organic Standards Board, which got to decide what substances farmers could and could not use if they wanted the certification. It took a lot of work, but in 2002 the current rules, as well as the certification and verification system, went into action. 3
But all this rulemaking and categorical border-drawing happened after organic farming had been a thing basically since the invention of agriculture, at the dawn of the Neolithic age some 9000 years ago. Blind to our inheritance of indigenous land practices, we didn’t start thinking of organic as separate from conventional chemically-industrialized farming until around 1924, when bio-dynamic farming emphasized “soil fertility, plant growth, and livestock care as ecologically interrelated tasks.” 4 In the US, it took a bunch of intrepid researchers and farmers in the 30’s, 40’s and 50’s to build the knowledge and momentum required for the 60’s to see the banning of DDT, and it took until the 70’s to sell the first packaged organic products in supermarkets.
So it turns out labeling didn’t magically appear; it was the result of decades of labor. Today the market for this regulated product is thriving. The rise of ethical consumerism (or at least of the intention to buy sustainable stuff; I will cover the reality in the next section) has meant that people value, and will pay more for, things they are assured are sourced in ways good for the humans who made them and the environment overall. A 2008 study done by Sununtar Setboonsarng and Anil Markandya for the Asian Development Bank 5 even showed that in the right contexts those premium prices lead to decreases in poverty and hunger for food producers.
But what counts as organic?
I wanted you to have the history and effects of USDA labeling in mind before we break down what the labeling actually means. USDA organic is a binary, yes or no, certification that they boil down this way:
Organic means that farmers abstain from using a specific list of chemicals and genetically modified organisms. It doesn’t mean everybody in the supply chain is being paid a living wage. It doesn’t indicate safe working conditions. It doesn’t promise actual sustainability. Does LEED do any better?
What is LEED and how does it work?
LEED, or Leadership in Energy and Environmental Design, is a green architecture certification system popular worldwide. It has a rating system for the ways buildings are designed, built, operated and maintained, and awards points to actions, materials, and techniques deemed better for the environment (e.g., being more energy efficient), human health (e.g., providing better indoor air quality), or the overall experience of using the building (e.g., access to daylight). The points add up to different levels of certification, ranging from Certified to Platinum.
So does all this points awarding and badge giving actually lead to provable changes in energy use, health effects, and human satisfaction? It’s a mixed bag. As one Pomona student scholar pointed out in 2010, “the public often assumes that LEED certified buildings are completely sustainable or even net-zero with regards to greenhouse gas emissions, but in actuality buildings certified under the most popular version of LEED are only required to be 15% more energy efficient than required by most state building codes – a far cry from the energy usage cuts needed to stave off global warming.” 6 On air quality, one 2016 study in the International Journal of Environmental Research and Public Health did show 50% lower concentrations of air pollutants when comparing two LEED buildings with non-LEED buildings 7. And when it comes to user experience, a 2017 study didn’t find any evidence of higher satisfaction for those working or living in certified buildings 8.
But is LEED effective in its stated goal of “spurring growth in sustainable building…worldwide” 9? Unfortunately this is a big no. The U.S. Department of Energy’s 2008 report titled Energy Efficiency Trends in Residential and Commercial Buildings 10 stated that:
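To make the points-to-levels mechanic concrete, here is a minimal sketch of how a total score maps to a badge. The point bands are the ones published for LEED v4 (40, 50, 60 and 80 points), which come from the rating system itself rather than from anything cited in this piece; the function name is my own.

```python
# Point bands as published for LEED v4; these are not taken from this article.
LEED_LEVELS = [
    (80, "Platinum"),
    (60, "Gold"),
    (50, "Silver"),
    (40, "Certified"),
]


def leed_level(points: int) -> str:
    """Return the certification level earned by a total point score."""
    # Bands are listed from highest to lowest, so the first match wins.
    for minimum, level in LEED_LEVELS:
        if points >= minimum:
            return level
    return "Not certified"
```

Note what the sketch makes visible: the badge only records that a threshold was crossed, not which credits were earned to get there.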
The relatively recent and small penetration of LEED certifications, and even of ENERGY STAR-certified buildings, may — in combination with other factors — help explain why commercial building energy intensities have trended upward over time.
Despite LEED’s efforts, buildings are getting more energy intensive to build and maintain, not less.
So certification just leads to pretend sustainability while actually being terrible?
It’s a fair assumption based on our findings thus far. It’s not as though certification is a cure-all. In May of 2018 the NGO Changing Markets Foundation published research suggesting 11 that certification schemes in textiles, palm oil, and fisheries are in fact contributing to the very environmental degradation they claim to prevent. Certification organizations that prioritize gaining corporate partners will, of course, sacrifice effective standards in favor of being able to wave the flag of brand participation success. This eventuality is in the short-term best interest of those companies seeking certifications, since we know people will pay higher prices for products they perceive to be ethically sourced. We also know people will see a certification label and buy the product without doing any research into its underlying merit. This guarantees that unscrupulous certifications, and even look-alike logo designers, will try to trick people. And while some might deride trusting the label as lazy, lowering cognitive load is the whole point of these programs. The system needs to be designed to encourage better choices in people who don’t have time to research every product they buy.
Now let’s look at two current approaches to data certification.
Datasheets for Datasets
Datasheets for Datasets is a proposal by a group of researchers 12 to require metadata documents to be included with all “public datasets, commercial APIs, and pretrained models” 13 with the stated goal of increasing “transparency and accountability” 14 for both the users and creators of AI systems. They make the comparison to industry standard datasheets for computer hardware which specify “standard operating characteristics, test results, recommended usage, and other information.” Their prototype datasheets “focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects’ demographics and consent as applicable.” 15 Their questions include:
- For what purpose was the dataset created?
- Who created this dataset and on whose behalf?
- How was its creation funded?
- What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
- Is this dataset a subset of a larger set? If so what was the sampling strategy?
- Does the dataset identify any sub-populations (e.g., by age, gender)?
- What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?
- Was there an ethical review?
- Are there tasks for which the dataset should not be used?
- Who is supporting/hosting/maintaining the dataset?
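One way to picture a datasheet is as a structured record that travels with the dataset. The sketch below mirrors the ten questions listed above as fields; the class name, field names, and the `unanswered` helper are my own illustration and are not part of the researchers’ proposal.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Datasheet:
    """Hypothetical record with one field per question in the list above."""
    purpose: Optional[str] = None                # For what purpose was it created?
    creators: Optional[str] = None               # Who created it, on whose behalf?
    funding: Optional[str] = None                # How was its creation funded?
    instance_types: Optional[str] = None         # What do the instances represent?
    sampling_strategy: Optional[str] = None      # Subset of a larger set? Sampled how?
    subpopulations: Optional[str] = None         # Sub-populations identified?
    collection_mechanisms: Optional[str] = None  # How was the data collected?
    ethical_review: Optional[str] = None         # Was there an ethical review?
    prohibited_uses: Optional[str] = None        # Tasks it should not be used for?
    maintainer: Optional[str] = None             # Who supports/hosts/maintains it?


def unanswered(sheet: Datasheet) -> List[str]:
    """List the questions a dataset publisher has left blank."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name)]
```

A certifier could refuse to accept a dataset until `unanswered` returns an empty list, which is the kind of cheap, checkable gate a labeling program needs.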
Good Work Code
Good Work Code, a project begun in 2015 by the National Domestic Workers Alliance, aims to create a broad framework for protecting all forms of on-demand work. Companies that sign on to their program are expected to uphold their core values: “safety, stability and flexibility, transparency, shared prosperity, fair pay, inclusion and input, support and connection, and growth and development.” 16 One example of a company that participates in their program is DoorDash, an urban food delivery service. Whether or not DoorDash fulfills all of these values to a reasonable standard is hard to tell, as Good Work Code’s enforcement mechanisms are limited and not public. We do know that “DoorDash carries a commercial auto insurance policy that covers up to $1 million in bodily injury/and or property damage of its Dashers” 17 which is an unusual and not legally required level of protection for on-demand workers these days.
Given everything we’ve learned from all these sustainability certification programs, let’s now explore a speculative design which focuses on the attention and data economy.
A Speculative Plan for Sustainably-Sourced Data
We can imagine, for example, Amy Tong, California’s Chief Information Officer and Director of the Department of Technology, developing a new program in which companies are awarded gold, silver, and bronze level data certifications intended to give current offenders a ramp toward compliance and create an industry-wide standard for data acquisition techniques. But before we go through each of the levels I need to define a few terms.
Lineage-linked: data which has had metadata on its origins maintained using blockchain or another verified identity maintenance technology. This means that every time you are forced to label grainy images of buses or crosswalks by a reCAPTCHA when you are just trying to log back into your Patreon account, to take an annoying example, or contribute to an on-demand task via one of the microtasking platforms you work on, that labor would be linked to your universal verified ID, allowing dividends from its use to easily flow back to you.
Collectivized: data which is created and owned by a group of people or the state, distributing risk and negotiation load, as well as preventing race-to-the-bottom economic forces. This means that instead of you having to decide what happens to every tiny instance of your data, you join one or multiple community groups/unions/data corporations and collectively negotiate the value, allowed uses, and labor conditions of that work.
Dividend: over-time payments made to data providers in exchange for licensing their data for a specific use. This means that when profits are made from the use of your data you get paid, and not just once but, like an actor receiving residuals when their show goes into syndication, in pre-negotiated percentages over time.
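To show how lineage-linking and dividends fit together, here is a minimal sketch of a residual-style payout. Everything in it is an assumption of mine for illustration: the function names, the idea of weighting each provider by the number of lineage-linked records they supplied, and the choice to floor payouts to whole cents.

```python
from typing import Dict, List


def split_dividend(profit_cents: int, rate_percent: int,
                   contributions: Dict[str, int]) -> Dict[str, int]:
    """Split a pre-negotiated percentage of one period's profit among providers.

    `contributions` maps a verified provider ID to the number of
    lineage-linked records that provider supplied (a hypothetical weighting).
    """
    total = sum(contributions.values())
    if total == 0:
        return {provider: 0 for provider in contributions}
    pool = profit_cents * rate_percent // 100
    # Floor each share to whole cents; any remainder stays undistributed here.
    return {provider: pool * count // total
            for provider, count in contributions.items()}


def residuals(profits_by_period: List[int], rate_percent: int,
              contributions: Dict[str, int]) -> List[Dict[str, int]]:
    """Repeat the split for every reporting period, like syndication residuals."""
    return [split_dividend(profit, rate_percent, contributions)
            for profit in profits_by_period]
```

For instance, `split_dividend(100000, 5, {"a": 3, "b": 1})` sets aside a pool of 5,000 cents and pays 3,750 to provider `a` and 1,250 to provider `b`. The split is only possible because the lineage metadata tells us who contributed what.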
With these three concepts we can build up a three-tier ramp. Keep in mind that I am a researcher, not a lawyer, designer, or magic future fairy. The following language is intended to provide a concrete example to scaffold further investigation into whether, and if so how, to create this kind of program, not to serve as verbatim legal language.
Bronze
- All new data (data acquired after the program’s launch) used in customer-facing products has been verified as lineage-linked.
- All new models (models trained, completed, modified, or optimized after the program’s launch, whether open or closed source) used in customer-facing products use only lineage-linked data.
- All new data is sourced from Collectivized Providers.
- All Collectivized Providers receive dividends.
- All dividends are negotiated with Collectivized Providers and meet or exceed local living wage standards.
Silver
- As above, but includes all new data used in both new internal tools and new customer-facing products.
- As above, but includes all new models used in both the company’s internal tools and customer-facing products.
- All dividend and collectivization rules are the same as Bronze.
Gold
- Bronze and Silver focus on new data and models incorporated after the program’s start, but Gold standards expand to include all data and models used in all internal tools and customer-facing products.
- All dividend and collectivization rules are the same as Bronze.
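Read as a decision procedure, the Bronze, Silver, and Gold rules above could be sketched roughly like this. The `Audit` record and its yes/no fields are hypothetical simplifications of mine; a real program would need auditors and evidence behind each flag.

```python
from dataclasses import dataclass


@dataclass
class Audit:
    """Hypothetical audit findings for one company (all names are illustrative)."""
    new_customer_facing_lineage_linked: bool  # new data/models in customer-facing products
    new_internal_lineage_linked: bool         # new data/models in internal tools
    all_data_and_models_lineage_linked: bool  # everything in use, old and new
    sourced_from_collectivized_providers: bool
    providers_receive_dividends: bool
    dividends_meet_living_wage: bool


def certification_level(audit: Audit) -> str:
    """Return the highest tier whose rules the audit satisfies."""
    # The dividend and collectivization rules are shared by every tier.
    shared_rules = (audit.sourced_from_collectivized_providers
                    and audit.providers_receive_dividends
                    and audit.dividends_meet_living_wage)
    if not (shared_rules and audit.new_customer_facing_lineage_linked):
        return "None"
    if audit.all_data_and_models_lineage_linked:
        return "Gold"
    if audit.new_internal_lineage_linked:
        return "Silver"
    return "Bronze"
```

The tiers differ only in how much of a company’s data and models must be lineage-linked; failing any dividend or collectivization rule means no certification at all.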
The first reaction people close to this field are likely to have is that these standards are far too difficult to achieve. If you include any kind of handwriting recognition in your product, for example, that model is likely based on the MNIST (or more recent EMNIST) dataset18, a data set commonly used to compare different pattern recognition techniques. Where did the data come from? “High school students and employees of the United States Census Bureau.” Clearly, MNIST doesn’t meet our lineage-linked standard. The argument is not that these grandparented-in data sets are useless and should be deleted. By all means use them for research. But when it comes to products and profits, building on a backbone of free source-less data is both extractive of undervalued labor and stunts the overall economy. These kinds of datasets need expiration dates, making them a product of their time but not suitable for use today.
“But open source!” you might cry. Data providers, and for that matter developers, contributing labor to open source projects must be compensated when companies then turn that work into capital. Nadia Eghbal, Harvard Kennedy School researcher, estimated in 2016 that “open source was worth at least $143M of Instagram’s $1B acquisition.” 19 Fortunately GitHub’s newly announced donation button is a tiny step toward just that future. 20
In order for a data sourcing certification program to work, it must be trustworthy but also communicate its value in the form of prestige and rarity. Breaking with the status quo is difficult. For environmental sustainability, people break with the status quo by abstaining from beef; for urban sustainability, people stop using congestion-inducing ride-hailing apps; and to create a sustainable data economy, we have to choose to stop profiting off the unpaid labor of an invisible other. We have seen from history what happens when standards are lax, market penetration is low, and fakes are rampant. In order to get the benefits of not having our data stolen, of lower cognitive loads when buying software, of not being subjected to addictive mechanisms in our apps, of higher profits from a wave of ethical consumerism, of good pay and working environments for on-demand workers, and of a sustainable attention future, we have to gently peel away our dependence on extractivism and decide the long-term benefits are worth the short-term hassle.
- Newsom wants companies collecting personal data to share the wealth with Californians by Jazmine Ulloa at the LA Times
- How to Do Nothing: Resisting the Attention Economy by Jenny Odell
- Sustainable Agriculture Research and Education Center
- Bio-dynamic Agriculture
- Organic Agriculture and Post-2015 Development Goals Building on the Comparative Advantage of Poor Farmers
- Is LEED a True Leader? Studying the Effectiveness of LEED Certification in Encouraging Green Building
- Airborne Particulate Matter in Two Multi-Family Green Buildings: Concentrations and Effect of Ventilation and Occupant Behavior
- Indoor environmental quality and occupant satisfaction in green-certified buildings by Sergio Altomonte, Stefano Schiavon, Michael G. Kent & Gail Brager
- LEED Facts
- Energy Efficiency Trends in Residential and Commercial Buildings, 2008, U.S. Department of Energy
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford
- Ghost Work, Page 157, Mary L. Gray and Siddharth Suri
- Ghost Work, Page 157, Mary L. Gray and Siddharth Suri
- Open source was worth at least $143M of Instagram’s $1B acquisition by Nadia Eghbal