Deep learning pioneer Andrew Ng says companies should get ‘data-centric’ to achieve A.I. success

The key is better data, not bigger data, the Landing AI founder and CEO says.

Andrew Ng is among the pioneers of deep learning—the use of large neural networks in A.I. He’s also one of the most thoughtful A.I. experts on how real businesses are using the technology. His company, Landing AI, where Ng is founder and CEO, is building software that makes it easy for people, even those without coding skills, to build and maintain A.I. systems. This should allow almost any business to adopt A.I.—especially computer vision applications. Landing AI’s customers include major manufacturing firms such as toolmaker Stanley Black & Decker, electronics manufacturer Foxconn, and automotive parts maker Denso.

Ng has become an evangelist for what he calls “data-centric A.I.” The basic premise is that state-of-the-art A.I. algorithms are increasingly ubiquitous thanks to open-source repositories and the publication of cutting-edge A.I. research. Companies that would struggle to hire PhDs from top computer science schools can nonetheless access the same software code that Google or NASA might use. The real differentiator between businesses that are successful at A.I. and those that aren’t, Ng argues, comes down to data: what data is used to train the algorithm, how it is gathered and processed, and how it is governed. Data-centric A.I., Ng tells me, is the practice of “smartsizing” data so that a successful A.I. system can be built using the least amount of data possible. And he says that “the shift to data-centric A.I.” is the most important change businesses need to make today to take full advantage of A.I.—calling it as important as the shift to deep learning that has occurred over the past decade.

Ng says that if data is carefully prepared, a company may need far less of it than it thinks. With the right data, he says, companies with just a few dozen or a few hundred examples can have A.I. systems that work as well as those built by consumer internet giants that have billions of examples. One of the keys to extending the benefits of A.I. to companies beyond the online giants, he says, is to use techniques that enable A.I. systems to be trained effectively on much smaller datasets.

What’s the right data? Well, Ng has some tips, including making sure that data is what he calls “y consistent.” In essence, this means there should be a clear boundary between when something receives a particular classification label and when it doesn’t. (For example, take an A.I. designed to find defects in pills for a pharma company. The system will perform better, from less training data, if any scratch below a certain length is labeled “not defective” and any scratch longer than that threshold is labeled “defective,” rather than scratch lengths being labeled inconsistently.)
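To make that rule concrete, here is a minimal sketch in Python; the threshold value and function name are hypothetical illustrations, not anything published by Landing AI:

```python
# Hypothetical "y consistent" labeling rule for the pill-inspection example:
# a single explicit threshold decides the label, so every annotator maps the
# same measurement to the same class.

DEFECT_THRESHOLD_MM = 0.3  # assumed value; a real threshold would come from QA specs

def label_scratch(scratch_length_mm: float) -> str:
    """Label a pill scratch from one measurable criterion, consistently."""
    return "defective" if scratch_length_mm > DEFECT_THRESHOLD_MM else "not defective"

# Examples on either side of the boundary always get the same label:
assert label_scratch(0.29) == "not defective"
assert label_scratch(0.31) == "defective"
```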

He says that one way to spot data inconsistencies is to assign the same images in a training set to multiple people to label. If their labels don’t agree, the person designing the system can make a call on the correct label, or that example can be discarded from the training set. Ng also urges those curating data sets to clarify labeling instructions by tracking down ambiguous examples. These are tricky cases that are likely to lead to inconsistent labels. Any examples that remain unclear or confusing should be eliminated from the data set altogether, he says. Finally, he says people should analyze the errors an A.I. system makes to figure out which subsets of examples tend to trip the system up. Adding just a few additional examples in those key subsets leads to faster performance improvements than adding examples where the software is already doing well. He also says that A.I. users should see data curation, data improvement, and retraining the A.I. on updated data as an ongoing cycle, not something a user does only once.
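A minimal sketch of that multiple-labeler check, assuming a simple (image, annotator, label) record format that the article does not specify, might look like this:

```python
from collections import defaultdict

# Give the same images to several annotators, then surface every image whose
# labels disagree so the system's designer can adjudicate the correct label
# or drop the example from the training set.
annotations = [
    ("img_001", "alice", "defective"),
    ("img_001", "bob", "defective"),
    ("img_002", "alice", "defective"),
    ("img_002", "bob", "not defective"),  # disagreement: review or discard
]

labels_by_image = defaultdict(set)
for image_id, _annotator, label in annotations:
    labels_by_image[image_id].add(label)

needs_review = [img for img, labels in labels_by_image.items() if len(labels) > 1]
print(needs_review)  # ['img_002']
```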

The idea of treating the building and training of A.I. models as a continuous cycle, not a one-off project, also comes across in a recent report on A.I. adoption from consulting firm Accenture. (Yes, Accenture is a sponsor of this newsletter.) It found that only 12% of the 1,200 companies it looked at globally have advanced their A.I. maturity to the stage where they are seeing superior growth and business transformation. (Another 25% are somewhat advanced in their deployment of A.I., while the rest are still just running pilot projects, if anything.) What sets that 12% apart? Well, one factor Accenture identifies is that they have “industrialized” A.I. tools and processes and created a strong A.I. core team. Other key factors are organizational too: they have top executives who champion A.I. as a strategic priority; they invest heavily in A.I. talent; they design A.I. responsibly from the start; and they prioritize both long- and short-term A.I. projects.

With that, here’s the rest of this week’s A.I. news.

Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

Spotify buys startup that replicated Val Kilmer’s voice for Top Gun: Maverick. Spotify has bought Sonantic, a London-based startup that uses A.I. technology to create highly realistic digital versions of people’s voices—and which can also create completely novel voices for film, television, audiobooks, or podcasts. Sonantic got a lot of attention last month when the actor Val Kilmer revealed that the company’s technology had enabled him to deliver his lines in the new Top Gun sequel, even though his actual voice has been badly impaired as a result of throat cancer. The financial terms of the deal were not disclosed, according to TechCrunch.

FTC warns companies against relying on A.I. for content moderation. The Federal Trade Commission issued a report to the U.S. Congress urging lawmakers to use “great caution” in considering A.I. as a possible solution for removing harmful online content, fake accounts, and misinformation. In recent years, companies such as Meta and Google have increasingly promoted A.I.-based content moderation as key to mitigating the deluge of harmful online content. But the Commission said that A.I. systems had inherent failings—such as an inability to understand context and meaning, and difficulty keeping training data current with real-world examples—that meant they would never be entirely accurate or provide full coverage. It also said there was a risk of bias and discrimination in using such systems and that their use could open the door to more invasive forms of mass surveillance.

Suspended Google A.I. researcher says the chatbot he claims is “sentient” has hired a lawyer. Blake Lemoine, the Google A.I. researcher who made headlines earlier this month with claims that an A.I. chatbot the company has created is “sentient,” told Wired earlier this week that the software had hired a lawyer. Lemoine, who says that Google suspended him from his job after he raised ethical concerns about how the company was treating LaMDA, says the chatbot asked him to invite a lawyer to his house and have a dialogue with the chatbot, and that during that conversation, the chatbot agreed to retain the lawyer. Most A.I. experts dispute Lemoine’s contention that LaMDA is sentient. Google also says it suspended Lemoine with pay after he violated the company’s confidentiality policies by publicly leaking transcripts of LaMDA’s dialogues, and also because he engaged in provocative behavior, such as earlier attempts to secure legal representation for LaMDA.

Autonomous ship completes Atlantic crossing. The Mayflower Autonomous Ship, named after the vessel that brought the Pilgrims to America, completed a 3,500-mile autonomous journey from the Azores to Halifax, Canada. (I have written about the ship previously here, here, and here.) The ship had run into mechanical and electrical issues during its attempt to make a first-ever autonomous voyage from England to the U.S. East Coast. These forced the ship to divert to the Azores for repairs and then later forced ProMare, the marine charity spearheading the project with technology support from IBM, to alter the ship’s landing destination from Virginia to Halifax. (The ship had been due to make its passage to the U.S. in 2020 to commemorate the 400th anniversary of the Pilgrims’ voyage, but it was delayed first by the pandemic and then by an aborted crossing attempt in 2021 caused by a ruptured exhaust pipe.)

Elon Musk favors government regulation of A.I. In an interview with Bloomberg News’ editor-in-chief at the Qatar Economic Forum, the billionaire reiterated his call for a government body to regulate artificial intelligence. “I’ve said for a long time that I think there ought to be an AI regulatory agency that oversees artificial intelligence for the public good,” he said. “And I think that for anything where there is a risk to the public, whether that’s say, the Food and Drug Administration or Federal Aviation Administration or the Communications Commission, whether it’s a public risk or a public good at stake, it’s good to have a sort of a government referee and a regulatory body.” Musk also said he hoped to have a prototype of Tesla’s humanoid A.I. robot to show off in September.

EYE ON A.I. RESEARCH

Training A.I. on multiple kinds of data makes it better, but may also make it easier to attack. That’s the finding of researchers from Zhejiang University of Technology in Hangzhou, China. The group looked at five different A.I. systems that had been trained to analyze both the images and the text in social media posts to try to identify misinformation. This kind of A.I. tends to perform better than systems trained on just images or just text. The team wanted to see if being multi-modal also made A.I. systems more robust, better able to withstand attacks by malicious actors who might want to sneak their misinformation past the detector. But in their research, the team discovered that the opposite was true: the multi-modal systems were more likely to fail in the face of adversarial attacks than simpler A.I. systems trained on just images or text. What’s more, they found that the multi-modal misinformation detection A.I.s were particularly vulnerable to visual attacks—even small adjustments to images that were imperceptible to humans could throw these systems off. The systems did better on text, experiencing just a 10% decline in performance in the face of text-based adversarial attacks. You can read the non-peer-reviewed research here on the repository arxiv.org.
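For a sense of what such an “imperceptible” image attack looks like in code, here is a generic sketch of the classic fast gradient sign method (FGSM). The paper’s actual attack is not specified in this item, so treat this as an illustration of the category rather than the researchers’ method; model and loss_fn stand in for any differentiable image classifier and its loss:

```python
import torch

def fgsm_perturb(model, loss_fn, image, target, epsilon=2 / 255):
    """Nudge each pixel slightly in the direction that increases the loss.

    With a small epsilon the change is invisible to a human eye but can
    still flip the classifier's prediction.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()  # gradient of the loss with respect to the pixels
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()  # keep pixels in valid range
```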

FORTUNE ON A.I.

Amazon’s about to launch its drone deliveries program. Citizens in the area aren’t happy—by Chris Morris

Commentary: The battle between autocracy and democracy has blinded us to the A.I. oligopoly—by Wendell Wallach

Elon Musk warns rivals Lucid and Rivian that unless they slash costs they’re going bankrupt—by Christiaan Hetzner

A.I. robotaxis already running in China could be coming soon to the U.S.—by Jaclyn Trop

BRAIN FOOD

A.I. for scouting professional athletes. Data analytics has been making big inroads in professional sports for at least two decades. But in many sports, such as soccer, it is still challenging to find metrics that allow true apples-to-apples comparisons between two potential prospects. Many teams have continued to rely heavily on a network of human scouts. But now several teams are experimenting with an app called AiScout, according to Forbes, that promises to move teams to a more scientific way of assessing player skills. The platform sounds a bit like Pymetrics (the game-based system for assessing potential new hires’ cognitive skills and personalities) for athletes: teams set skill drills for the athletes they are interested in, and the results are benchmarked against the performance of the team’s existing players on the exact same tests. The story says the method has already helped teams such as the Premier League’s Chelsea spot potential new talent, but that it may really level the playing field for smaller teams that don’t have the recruiting resources of a big club.