The Archive Problem
AI can process everything we've ever recorded. Should it?
In a Berlin basement, history waits in 16,000 garbage bags. Six hundred million fragments of paper, torn by desperate hands when the Wall fell in 1989. East Germany’s secret police had forty years of surveillance to destroy, but their shredders failed them.
Agents spent the regime’s final days tearing documents by hand into pieces as small as fingernails, working through 111 kilometers of files. The bags were supposed to burn. They didn’t.
For decades, those fragments occupied a liminal space between existence and access. The Stasi had documented everything, but in their final hours, they accidentally invented a protection we hadn’t named yet: practical obscurity.
You could hold the pieces. You couldn’t read them.
Then came the ePuzzler project. Pattern-recognition software developed by the Fraunhofer Institute began reassembling the fragments like a jigsaw puzzle made of secrets.
Torn edges matched and handwriting reconnected across millions of scraps. What researchers estimated would take 375 years became possible in a decade.
The technical capacity existed. The question nobody had adequately answered: should they use it?
The Stasi files present an unusually clean ethical case: victims still alive and desperate to learn what had been said about them. Who had informed. What the apparatus knew.
Germany’s answer was access. The Stasi Records Agency made the files available to anyone documented in them.
Imagine if Facebook offered that option.
But most collections don’t come with such clear moral coordinates.
The British kept meticulous records of their Indian subjects, the French documented Algerians, and Americans tracked enslaved people as property. Colonial administrators were nothing if not thorough.
These weren’t neutral repositories. They were records of power documenting what those with authority chose to preserve.
Corporations retain decades of internal communications. Police departments generate petabytes of footage. Families preserve letters and photographs. The people documented often had no say and no expectation of review.
That was fine when protection was practical.
Handwritten colonial records exist in principle but not in practice when transcription requires years of specialist labor. Body camera footage exists but remains invisible when reviewing it takes longer than recording it.
Your great-grandmother’s letters exist but remain private when they’re illegible to anyone unfamiliar with her particular hand.
AI dissolves these protections with the casual efficiency of someone finding your diary and announcing they’ve scanned it for sentiment analysis.
Transkribus can now transcribe 350-year-old handwritten documents with 95% accuracy. Over 20,000 users feed historical text into searchable databases.
Your great-grandmother’s intimate letters are no longer protected by her terrible handwriting. They’re searchable. They’re analyzable. They’re data.
Police departments are turning to companies like Truleo and Veritone to process footage they could never review manually. For roughly $50,000 annually, less than one officer’s salary, these systems scan hundreds of hours of video in seconds, flagging incidents, transcribing speech, and identifying behaviors.
One estimate suggests body camera footage equivalent to 25 million copies of Barbie gets collected but rarely reviewed. Institutions deploy systems that make it searchable because the economics are irresistible. Nobody requires permission.
The National Archives acknowledged the transformation in October 2024 when Archivist Dr. Colleen Shogan announced a strategic framework embracing AI integration. The statement reads like someone discovering they’ve opened Pandora’s box and deciding to organize the contents alphabetically.
The choice isn’t whether to use AI. It’s what to do with the capacity it creates.
In 1989, the U.S. Supreme Court recognized “practical obscurity” in Department of Justice v. Reporters Committee for Freedom of the Press. The case involved FBI rap sheets on millions of Americans.
The information was technically public. But the Court held that compilation into a searchable database changed its nature.
Information that was public in pieces could still be private in aggregate. The practical difficulty of gathering it provided protection.
Brilliant legal reasoning. Based on an assumption that survived exactly as long as computing stayed slow.
Pattern-recognition software performs a kind of digital alchemy, turning scattered documents into comprehensive profiles. Cross-reference names across centuries in seconds. Find patterns across millions of records no human researcher could track. Assemble portraits from fragments never meant to form a whole.
The protection wasn’t that information was hidden. The protection was that gathering it cost more than it was worth.
That economic barrier just evaporated.
Ask the genealogists. The practical obscurity that protected your ancestors isn’t protecting you.
The 23andMe revelation arrives via email: “You have new DNA relatives!”
When Pew Research surveyed direct-to-consumer genetic testing customers, 27% reported discovering previously unknown close relatives. For a quarter of customers, that email is the algorithmic dissolution of a family narrative.
Meet Rydra. She mailed off her 23andMe kit on a whim, curious about her Eastern European ancestry. Three months later: “You have new DNA relatives!” The match: “Close Family: Half-Sibling.” Rydra is an only child. Was an only child.
Her father’s wartime posting in France suddenly made a different kind of sense. Her grandmother’s insistence on never discussing “that period” made sense too.
Death was the ultimate information firewall. Rydra’s father is fifteen years dead. The secret outlived him by exactly as long as it took pattern-matching to decide it shouldn’t.
The pattern isn’t just about secrets revealed. It’s about who gets to decide what becomes visible. Grandpa couldn’t consent to genetic databases because genetic databases didn’t exist. His daughter who took the test consented for herself but not for him. His granddaughter who got matched never consented at all.
Three generations. Three different consent contexts. One database that treats them all as equivalent data points.
Helen Nissenbaum’s framework of “contextual integrity” describes how privacy actually works: not as secrecy, but as appropriate flows of information.
Your diary entry isn’t a public statement, even if both are technically readable. Data shared in one context becomes a violation when algorithms move it into contexts with different norms. AI doesn’t respect context. Only correlation.
A police interaction recorded for accountability is not footage for AI pattern analysis. Even if both are technically the same video.
Collections violate contextual integrity by design, preserving information beyond its original context.
Your great-grandfather’s letters to his mistress were contextually private when written, became family property when discovered, historical record when donated, searchable text when digitized.
The dead man never consented to these transitions. The framework of consent makes no sense when applied to someone who’s been gone for a century.
American law is remarkably clear: the dead have no privacy rights.
Privacy dies with you.
For 130 years, courts have repeated this logic. The deceased cannot object, cannot sue, cannot keep secrets that weren’t buried with them.
This made sense when information decayed as fast as bodies did. What survived was physical, limited, manageable through familiar mechanisms of property, inheritance, and estate.
It makes considerably less sense when AI can reconstruct what was thought lost. When pattern recognition can infer what was never recorded. When the dead leave behind genetic code that continues expressing itself in descendants who never asked to be data points.
Horse-and-buggy privacy law for autonomous vehicles.
The HIPAA Privacy Rule protects health information for fifty years after death. Long enough for your medical records to observe a period of mourning before entering the public domain.
Then it’s open season.
The European Union’s GDPR doesn’t apply to the dead at all. Some scholars have proposed “posthumous data guardians” to represent the privacy interests of the deceased.
The proposal sounds reasonable until you imagine it in practice.
Who applies for this job? Who decides what grandmother would have wanted when her progressive granddaughter and conservative son disagree? What framework determines the values of someone who died before social media existed, before genetic testing existed, before the concept of “data” applied to persons?
The gig economy will solve this. It solves everything. Platforms will emerge for posthumous privacy advocacy. Five-star ratings based on customer satisfaction scores from the living, measuring fidelity to the wishes of the dead, adjudicated by algorithms trained on correspondence the deceased never consented to have analyzed.
This isn’t dark humor. This is where the logic leads. We’ll build systems to represent the privacy interests of people who can’t consent, trained on data they couldn’t have consented to share, making decisions according to frameworks they couldn’t have anticipated, enforced by institutions they couldn’t have imagined.
We’ll call it protection.
GDPR includes exceptions that preserve historical materials. The right to erasure doesn’t apply to “archiving purposes in the public interest.”
The exemption exists because erasing history serves power, not truth. But the subjects of documentation have no mechanism to object, even when that documentation was never meant to be historically significant.
You have rights over your data while alive. Your family has some rights briefly after you die. Eventually, your data becomes historical material exempt from protections you could have claimed, available to analysis you could not have anticipated, serving purposes you could not have consented to.
We treat death as ending personhood but not data-hood. Your body gets buried. Your information gets excavated.
The people most affected aren’t always the dead. When police use algorithms to examine body camera footage, they reconstruct encounters citizens assumed were ephemeral.
You were pulled over in 2019. Minor traffic violation. Awkward exchange with the officer. You forgot about it within a week.
The footage exists. Somewhere in a police department server farm, your interaction sits with hundreds of thousands of others, recorded for accountability but rarely reviewed.
Last year, the department deployed Truleo. Your traffic stop is now a data point in behavioral analysis. The humiliating interaction is now training material. Your permission was never requested, because it was never legally required.
When researchers use pattern-matching to examine colonial files, they surface patterns in populations that had no say in being documented. The records survive. The subjects mostly don’t. Their descendants inherit both.
When human researchers accessed historical materials, they looked for answers to questions. When AI processes those same materials, it generates questions humans wouldn’t have thought to ask.
Pattern-matching across colonial records doesn’t just reveal what administrators documented. It infers what they systematically avoided documenting.
Analysis of police body camera footage doesn’t just find specific incidents. It reconstructs behavioral patterns across entire departments, revealing discrimination so structural that no individual officer could see it.
The capacity isn’t just to find what’s there. It’s to infer what was deliberately kept out.
Archivists have been thinking about these problems longer than most technologists have been alive.
The International Council on Archives Code of Ethics notes that archivists should “take care that corporate and personal privacy as well as national security are protected without destroying information.” The tension is built into the profession: preserve everything, respect everyone, reconcile the two.
Pattern-recognition software makes the reconciliation harder by removing the friction that used to enable compromise. When collections were practically inaccessible, you could preserve everything while protecting most of it. Material existed but couldn’t be efficiently exploited. Researchers accessed what they specifically sought. Casual exposure was impossible because casual access was impossible.
That bargain no longer holds.
AI doesn’t search. It processes. It doesn’t find specific records. It identifies patterns across all records.
The colonial repository doesn’t release individual documents. It trains models on comprehensive documentation of colonized populations. The surveillance archive doesn’t surface specific incidents. It enables systematic behavioral analysis of every recorded encounter.
The question isn’t whether to preserve. Archivists have decided that question, and the answer is yes.
The question isn’t whether AI will be used. Institutions have decided that question too, and the answer is also yes.
The question is what ethical frameworks apply when capacity to analyze far exceeds any consent that could have been obtained.
Watch what happens as this becomes visible: policy proposals for tighter access controls, delayed release periods, posthumous data advocates.
These proposals exist because nobody wants to state the obvious. We’re retroactively applying consent frameworks to situations where consent was structurally impossible, pretending better policy can resolve contradictions built into the nature of historical materials themselves.
Pattern-matching doesn’t create the problem. It reveals that collections were always problematic. Protected only by inefficiency.
The colonial administrator who documented his subjects assumed those documents would serve colonial purposes. He was wrong.
The police officer whose camera recorded an interaction assumed the footage would be reviewed only for specific incidents. She was wrong.
The person who wrote letters to their lover assumed those letters would be burned or forgotten. Wrong too.
Every collection is a time capsule of assumptions about what documentation would eventually mean. AI detonates those assumptions by enabling uses that couldn’t have been anticipated.
The dead couldn’t consent to being data because they couldn’t imagine becoming data. The living inherit the consequences.
What’s revealing isn’t that AI can process historical materials. It’s that we’re choosing to build these capabilities without seriously questioning whether we should. All while maintaining elaborate consent frameworks for the living that we know won’t protect them once they’re dead.
We’ve quietly accepted a regime where personhood ends at death but data-hood continues indefinitely. Subject to analysis by technologies that didn’t exist while we were alive. Under frameworks nobody consented to.
Disney’s copyright extends 95 years. Your medical privacy expires at 50.
If your great-grandmother had drawn a cartoon mouse instead of writing letters about her syphilis diagnosis, we’d protect her work longer than her dignity.
That’s not a technical problem. That’s a statement about what we value.
You’re not just inheriting this problem from the dead. You’re creating it for whoever comes after you.
Every email sent is documentation-in-progress. Every photo. Every text. Every search query.
We’re all generating records that will outlive us, analyzed by technologies that don’t exist yet, under frameworks that haven’t been written. We’re writing letters to a future we won’t know, using ink that never fades.
The question isn’t what frameworks apply to Stasi documents. It’s what frameworks will apply to your Google search history when you’re dead. When your descendants want to understand you. When some future AI wants training data on how humans thought about privacy before privacy ended.
Nobody can consent to analysis by technologies they can’t imagine.
Your great-grandchildren will have the same problem with your data that we have with our great-grandparents’ letters. The original context is gone. The consent framework makes no sense. Practical obscurity isn’t protecting anyone anymore.
We haven’t figured out what replaces obscurity as privacy’s last defense. But every search you run tonight, every message you send, every photo you take is already a letter to those great-grandchildren.
They’ll read it. All of it. Using tools you can’t imagine.
And there’s nothing you can do about that except know, finally, that privacy already ended. We just haven’t noticed because the bill hasn’t come due.