It has been over 100 days since the LLM blitzkrieg hit people and businesses alike. While many had been working on LLM-centric solutions for months before that event, the virality was unprecedented. Some organisations have been early adopters (e.g. Expedia), even those whose public image is one of inertia (e.g. Air India), while others are taking a measured approach. The caution is primarily on account of privacy and security concerns.
This wave is not so much about creative copywriting as it is about using AI to truthfully answer questions from existing information. But can LLMs do that, or at least help you do it? On their own, they are brilliant language-completion tools, not fact-dispensing oracles. Chat layers built on LLMs tend to fictionalise a great deal when used for either purpose.
As part of our user research exercises and various customer pilots, we spoke with several professionals who have been championing the cause of using coachable assistants to get quick answers (check out KNO Plus). Here are some of them:
- Chief Product Officer, at a leading global tax SaaS company
- Professor, Information Systems, at a top Indian business school
- Strategy Lead, In-licensing, at a global pharma company
- Customer Success Leader, at a global fintech specialising in merchant billing
- Director, Fixed Income Portfolio, at a European wealth management firm
- Director, Legal at a data infrastructure company
Here are some synthesised findings from our conversations.
Most believe that text-generative AI applications for businesses will need to evolve across two levels.
Level 1, Answering things: Plugins and applications that can answer questions. The consumer could be an employee of the business or their end user.
Level 2, Doing things: Applications that can consume simple instructions in natural language and guide users to complete a task, e.g. creating an ‘opportunity’ in Salesforce CRM without opening the application itself.
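To make Level 2 concrete, here is a minimal sketch of such a flow in Python. It assumes the simple_salesforce client, and `parse_instruction` is a hypothetical stand-in for an LLM function-calling step that maps free text to structured fields; treat it as an illustration, not a production implementation.

```python
# A minimal sketch of a Level 2 flow: the user types an instruction, an LLM
# extracts structured fields, and the assistant creates the record via the
# Salesforce API. `parse_instruction` is a hypothetical stand-in for an LLM
# function-calling step; record creation uses the simple_salesforce client.
from simple_salesforce import Salesforce

def parse_instruction(text: str) -> dict:
    # Hypothetical: an LLM with a tool schema would map free text such as
    # "Create a $50k opportunity for Acme closing end of September" to the
    # fields an Opportunity requires (Name, StageName, CloseDate).
    return {
        "Name": "Acme - New Business",
        "StageName": "Prospecting",
        "CloseDate": "2023-09-30",
        "Amount": 50000,
    }

def create_opportunity(instruction: str, sf: Salesforce) -> str:
    fields = parse_instruction(instruction)
    result = sf.Opportunity.create(fields)  # POST to /sobjects/Opportunity
    return result["id"]                     # id of the newly created record

# Usage (credentials are placeholders):
# sf = Salesforce(username="user@example.com", password="...", security_token="...")
# create_opportunity("Create a $50k opportunity for Acme closing end of September", sf)
```

The key design point is that the model never touches the CRM directly; it only proposes structured fields, which the application then submits through the normal API.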
Most applications are in the Level 1 phase: they are trying to answer questions. At Alltius, we not only answer questions but also find answers from sources of choice, communicated in a medium of convenience (e.g. widgets, web apps, APIs, plugins and more).
Most Level 1 use cases are analogous to a human completing the task, whether for oneself or for someone else. For example:
- The associate spending days reading through financial reports to come up with an ‘Industry Attractiveness Report’
- The end user of a SaaS application reading a product wiki and figuring out the logical next step
- The research scholar mining through journal articles for a comprehensive literature review
In all these use cases, the expectation is that a ‘Level 1 Assistant’ will meet the same bar of trust one would place on a human in order to get a well-synthesised, accurate answer. So what would it take to earn this trust?
This trust, they say, manifests along two axes.
A. Quality of synthesis
Remember that the user is okay not spending hours reading through and vetting information. Instead, he or she is willing to place trust in a generative-AI-powered assistant. It is natural that users expect the answers to have:
Accuracy
- Should look right: At least on the first pass. This is table stakes, and unfortunately most applications stop here.
- Must be testable: Users must be able to validate the AI model’s answers.
- Should disclose sources: Models must reveal the sources used to construct answers.
- Should come from many types of sources: Should use a holistic set of sources (text, tickets, videos, etc.) to triangulate better.
- Should self-correct: The underlying models must consume user feedback and auto-correct over time.
- Should favour precision over recall: Wrong answers will create more distrust than an inability to answer.
This is precisely why businesses will more readily trust applications that generate SQL queries and manipulate graphs from natural-language instructions than an AI co-pilot that constructs free-form answers to questions.
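To illustrate the point, here is a minimal sketch of why generated SQL earns trust more easily than generated prose: the model’s output is an inspectable artefact that can be vetted before it touches any data. `generate_sql` is a hypothetical stand-in for an LLM call, and the guardrails shown are deliberately crude.

```python
# A minimal sketch: the generated SQL is checked against simple guardrails
# before execution, so a wrong query fails loudly instead of producing a
# plausible-sounding but fabricated answer.
import sqlite3

ALLOWED_TABLES = {"invoices", "merchants"}

def generate_sql(question: str) -> str:
    # Hypothetical LLM call, e.g. "Total billed per merchant" ->
    return "SELECT merchant_id, SUM(amount) FROM invoices GROUP BY merchant_id"

def run_safely(question: str, conn: sqlite3.Connection) -> list:
    sql = generate_sql(question)
    lowered = sql.lower()
    if not lowered.startswith("select"):
        raise ValueError("Only read-only SELECT statements are allowed")
    if not any(table in lowered for table in ALLOWED_TABLES):
        raise ValueError("Query does not reference a known table")
    return conn.execute(sql).fetchall()
```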
Speed
- The real 10x benefit comes when the magic happens in seconds.
- Speed matters in data ingestion: for example, reading a variety of HTML structures, interpreting tables in images, interpreting video transcripts and eventually putting them all together (see the sketch after this list).
- Speed also matters in how quickly models train on bespoke, customised data.
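As a minimal sketch of that ingestion step, the snippet below normalises heterogeneous sources into uniform text chunks, using BeautifulSoup for HTML. OCR for tables in images and speech-to-text for video are real requirements but are elided here; the function names and chunk size are illustrative assumptions.

```python
# A minimal sketch of the ingestion step described above: heterogeneous
# sources are normalised into uniform text chunks before any model sees them.
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    # Flatten arbitrary HTML structures into plain text.
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

def chunk(text: str, size: int = 500) -> list[str]:
    # Split text into fixed-size word chunks for downstream indexing.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest(source: str, kind: str) -> list[str]:
    # Uniform chunks regardless of the original format.
    text = html_to_text(source) if kind == "html" else source
    return chunk(text)
```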
Reusability
- These applications can also earn trust if they deliver answers in copy-ready, easily consumed formats.
- The language, tone and tenor count, and so does the ability to tune them to the style of a favourite copywriter.
At Alltius, our AI assistants get coached in seconds and answer even faster. For our large enterprise customers, we spend disproportionately more time fine-tuning models to meet stringent precision and recall metrics even before going live. The reference sources are clearly laid out, and customers can pressure-test answers in a convenient playground.
This brings us to the second axis.
B. Data Privacy and Security
“Can you please not send the information to the LLM? Can we set your engine up in our cloud?”
It is natural that businesses worry so much about their data. However, the hyperventilation is spread across a spectrum of data types, driven primarily by how many people should have access to them. If we were to grade them, the data would sit in five buckets:
- Public and generic: Not specific to the business; information meant for public consumption, e.g. a 100-page government report on data privacy
- Public but company-specific: Information about the company that is meant for public consumption, e.g. product support wikis
- Non-public, non-sensitive: Work-in-progress reports, analyses and requirement documents that will eventually be public, e.g. a WIP PRD, a leave policy
- Non-public, business-sensitive: Information that needs to stay within the company but is accessible to most employees, e.g. a townhall slide deck, journal subscriptions
- Non-public, highly sensitive: Customer conversations, business plans and other bespoke information that few people have access to, e.g. a leadership meeting transcript
Three very distinct solution spaces emerge as a result.
- Business grade: Mostly out-of-the-box, business-grade solutions applied to public or non-sensitive data. Mid-market companies with lots of public documentation are likely to opt for this.
- Enterprise grade: Self-hosted LLMs where data never leaves the contours of the enterprise, hosted on its cloud and dispensed through a channel of choice. The AI models themselves could, however, be generic.
- Tailored platinum grade: Very large companies with Kevlar-like expectations are likely to build custom models for in-house consumption.
At Alltius, we are engaging with customers in the first two buckets. While we offer out-of-the-box AI assistants that train on publicly available information, we are also creating closed environments and insourced language models for large enterprises with bank-grade privacy and security protocols.
Please tell us how you feel about this topic. If you have a use case to share or discuss, you can contact us here.