Private GitHub Data Lingers in Copilot Training
Security researchers are warning that data exposed to the internet, even for a moment, can linger in online generative AI chatbots like Microsoft Copilot long after the data is made private.
[…]
Lasso co-founder Ophir Dror told TechCrunch that the company found content from its own GitHub repository appearing in Copilot because it had been indexed and cached by Microsoft’s Bing search engine. Dror said the repository, which had been mistakenly made public for a brief period, had since been set to private, and accessing it on GitHub returned a “page not found” error.
[…]
Lasso extracted a list of repositories that were public at any point in 2024 and identified the repositories that had since been deleted or set to private. Using Bing’s caching mechanism, the company found more than 20,000 since-private GitHub repositories still had data accessible through Copilot, affecting more than 16,000 organizations.
Any passwords or keys that were ever made public, however briefly, should be revoked. But other sensitive information may now be cached as well, and it was not obvious to me that it would remain accessible via Copilot when it no longer shows up in Bing itself.
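As a rough illustration of the kind of check Lasso describes, here's a minimal sketch (not their actual tooling) that flags repositories that now return 404 on GitHub but still surface in a site-restricted Bing query. The candidate list, the API key placeholder, and the Bing Web Search endpoint are my assumptions; whether Copilot itself will reproduce the cached content is a separate question this doesn't answer.

```python
import requests

BING_KEY = "YOUR_BING_SEARCH_API_KEY"  # hypothetical placeholder
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"  # Bing Web Search API

def github_repo_is_gone(owner: str, repo: str) -> bool:
    """True if GitHub now returns 404 for the repo (deleted or made private)."""
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
    return r.status_code == 404

def bing_still_indexes(owner: str, repo: str) -> bool:
    """True if a site-restricted Bing query still returns results for the repo."""
    params = {"q": f"site:github.com/{owner}/{repo}"}
    headers = {"Ocp-Apim-Subscription-Key": BING_KEY}
    r = requests.get(BING_ENDPOINT, headers=headers, params=params)
    r.raise_for_status()
    return bool(r.json().get("webPages", {}).get("value"))

# Hypothetical repos known to have been public at some point in 2024.
candidates = [("example-org", "example-repo")]
for owner, repo in candidates:
    if github_repo_is_gone(owner, repo) and bing_still_indexes(owner, repo):
        print(f"{owner}/{repo}: gone from GitHub but still surfaced by Bing")
```

Even if a repository passes a check like this cleanly, any secrets it ever contained should still be rotated; the cache you can query is not necessarily the only copy that was made.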
Previously:
- OpenAI Failed to Deliver Opt-out Tool
- Microsoft’s Suleyman on AI Scraping
- AI Companies Ignoring Robots.txt
- Slack AI Privacy
- ChatGPT Is Ingesting Corporate Secrets