Social media giants urged to tackle data-scraping privacy risks

A joint statement signed by regulators at a dozen international privacy watchdogs, including the U.K.’s ICO, Canada’s OPC and Hong Kong’s PCPD, has urged mainstream social media platforms to protect users’ public posts from scraping, warning that they face a legal responsibility to do so in most markets.

“In most jurisdictions, personal information that is ‘publicly available’, ‘publicly accessible’ or ‘of a public nature’ on the internet, is subject to data protection and privacy laws,” they write. “Individuals and companies that scrape such personal information are therefore responsible for ensuring that they comply with these and other applicable laws. However, social media companies and the operators of other websites that host publicly accessible personal information (SMCs and other websites) also have data protection obligations with respect to third-party scraping from their sites. These obligations will generally apply to personal information whether that information is publicly accessible or not. Mass data scraping of personal information can constitute a reportable data breach in many jurisdictions.”

The statement was also signed by privacy regulators in Australia, Switzerland, Norway, New Zealand, Colombia, Jersey, Morocco, Argentina and Mexico, all of them members of the Global Privacy Assembly’s international enforcement cooperation working group. Its timing coincides with the ongoing hype around generative AI models, which typically require large amounts of training data and could encourage more entities to scrape the internet in a bid to acquire data sets and jump on the generative AI bandwagon.

High-profile generative AI systems, such as OpenAI’s large language model ChatGPT, have relied (at least in part) upon data posted online for training. A class action lawsuit filed against the U.S. company in June, which CNN Business reported on, alleges it secretly scraped “massive amounts of personal data from the internet”.

Among the privacy risks the regulators highlight is the use of data scraping for targeted cyberattacks such as social engineering and phishing; identity fraud; and for the monitoring, profiling and surveilling of individuals, such as using data to populate facial recognition databases and provide unauthorised access to authorities — a clear swipe at Clearview AI, which has faced a number of enforcements from international regulators (including several across the EU) over its use of scraped data to power a facial recognition ID tool which it sold to law enforcement and other users.

They also warn that scraped data can be used for unauthorised political or intelligence gathering purposes, including by foreign governments or intelligence agencies, and to pump out unwanted direct marketing or spam.

They don’t directly cite the training of AI models as one of these “key” privacy risks but generative AI tools which have been trained on people’s data without their knowledge or consent could be repurposed for a number of the malicious use cases they cite, including to impersonate people for targeted cyberattacks, identity fraud, or to monitor/surveil individuals.

As well as the statement being made public, the regulators note that a copy has been sent directly to YouTube’s parent company, Alphabet; TikTok’s parent ByteDance; Meta (owner of Instagram, Facebook and Threads); Microsoft (LinkedIn); Sina Corp (Weibo); and X (aka, the platform previously known as Twitter) — so mainstream global social media platforms are clearly front-and-center as the international watchdogs consider the privacy risks posed by data scraping.

Some platforms have of course already had major data scandals linked to data scraping — such as the 2018 Cambridge Analytica data misuse scandal which hit Facebook after a developer on its platform was able to extract data on millions of users without their knowledge or consent as a result of lax permissions the company applied; or the $275 million General Data Protection Regulation (GDPR) penalty Facebook was handed last year in relation to a data scraping incident that affected 530 million users as a result of insecure product design. (The latter incident is also subject to a lawsuit by an Irish digital rights group that’s challenging the DPA’s enforcement finding that there was no security breach.)

While the regulators’ joint statement contains a clear shot across the bows of mainstream social media sites on the need to be proactive about protecting users’ information from scraping, it carries no commensurately clear warning that failure to act and protect people’s data will result in enforcement action, which risks diluting the statement’s impact somewhat.

Instead, the watchdogs urge platforms to “carefully consider the legality of different types of data scraping in the jurisdictions applicable to them and implement measures to protect against unlawful data scraping”.

“Techniques for scraping and extracting value from publicly accessible data are constantly emerging and evolving. Data security is a dynamic responsibility and vigilance is paramount,” they also write. “As no one safeguard will adequately protect against all potential privacy harms associated with data scraping, SMCs and other websites should implement multi-layered technical and procedural controls to mitigate the risks.”

Recommended measures mentioned in the letter to limit the risk of user data being scraped include having a designated in-house team or roles focused on data scraping risks; ‘rate limiting’ the number of visits per hour or day by one account to other account profiles, and limiting access if unusual activity is detected; and monitoring how quickly and aggressively a new account starts looking for other users, and taking steps to respond to abnormal activity.
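To make the ‘rate limiting’ idea concrete, here is a minimal sketch of a per-account cap on profile views over a sliding time window. It is an illustration only, not something taken from the regulators’ statement or from any platform’s actual systems: the class name `ProfileViewLimiter`, the account IDs and the thresholds are all hypothetical.

```python
from collections import defaultdict, deque


class ProfileViewLimiter:
    """Sliding-window cap on profile views per account (hypothetical sketch)."""

    def __init__(self, max_views=100, window_seconds=3600):
        # Both thresholds are assumed values chosen for illustration.
        self.max_views = max_views
        self.window_seconds = window_seconds
        # Maps account_id -> timestamps (in seconds) of that account's recent views.
        self._views = defaultdict(deque)

    def allow(self, account_id, now):
        """Record the view and return True if the account is under its cap."""
        recent = self._views[account_id]
        # Discard timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_views:
            return False  # Over the cap: the caller could block or flag the account.
        recent.append(now)
        return True


# Example: an account allowed 3 profile views per hour makes 4 in 30 seconds.
limiter = ProfileViewLimiter(max_views=3, window_seconds=3600)
results = [limiter.allow("acct_123", t) for t in (0, 10, 20, 30)]
print(results)  # [True, True, True, False]
```

A real deployment would presumably pair a check like this with the other signals the letter mentions, such as how aggressively a brand-new account starts looking up other users.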

They also suggest platforms take steps to detect scrapers by identifying patterns in bot activity — such as having systems to spot suspicious IP address activity.

Another recommendation is taking steps to detect bots, such as deploying CAPTCHAs, and blocking IP addresses where data scraping activity is identified (albeit bots can solve CAPTCHAs, so that piece of advice is already looking outdated).

Platforms are also advised to take appropriate legal action against scrapers, such as sending ‘cease and desist’ letters; requiring the deletion of scraped information; obtaining confirmation of the deletion; and taking other legal action to enforce terms and conditions prohibiting data scraping.

Platforms may also have a requirement to notify affected individuals and privacy regulators under existing data breach laws, the watchdogs warn.

The social media giants who were sent a copy of the letter are being encouraged to respond with feedback within a month demonstrating how they will meet regulators’ expectations.

Individuals told ‘think long term’

The letter does also include some advice for individuals to take steps to help protect themselves against the risks of scraping — including suggesting web users pay attention to platforms’ privacy policies; think carefully about what they choose to share online; and make use of any settings that allow them to control the visibility of their posts.

“Ultimately, we encourage individuals to think long term,” they add. “How would a person feel years later, about the information that they share today? While SMCs and other websites may offer tools to delete or hide information, that same information can live forever on the web if it has been indexed or scraped, and onward shared.”

The letter also urges individuals who are concerned their data may have been scraped “unlawfully, or improperly” to contact the platform or website in question and if they do not get a satisfactory response it suggests they file a complaint with their relevant data protection authority. So the regulators are encouraging users to be more vigilant about scraping which could, ultimately, lead to an uptick in investigations and enforcements in this area.

The dozen international regulators signing the joint statement all hail from non-European Union markets. But, as noted above, EU data protection regulators are already active on data scraping risks through enforcements taken under the bloc’s GDPR.

They are also closely watching developments in generative AI services — so concerns raised in the letter look broadly aligned with issues already on the radar of the bloc’s data protection authorities.

Notably, Italy’s privacy watchdog slapped ChatGPT with a local stop-processing order earlier this year, which led to a brief break in service while OpenAI rushed out disclosures and controls. Google’s Bard AI chatbot took longer to launch in the EU than in some other regions after its lead EU privacy regulator in Ireland raised similar concerns. But EU DPAs are simultaneously coordinating on how best to apply the local data protection rules to these novel AI chatbots, including vis-a-vis the crux issue of the lawfulness of the data processing used to train the models in light of the GDPR’s framework. So decisions on the core legality of tools like ChatGPT remain pending in the EU.

Earlier this year, France’s DPA, the CNIL, also warned that protection against data scraping will be a key plank of an AI action plan it announced in May.
