Exploring the Opportunities and Risks for Generative AI and Corporate Databases: Data Use and Dependability

In our initial article, we examined the less-discussed but rapidly expanding use of generative artificial intelligence (GenAI) systems that leverage large language models (LLMs) to query data stored in corporate databases.  To assess the risks and potential guidelines related to this evolving conduit to information, practitioners in both corporate and legal settings must understand the many uses (and lineage) of structured and semi-structured data within their organizations, as outlined in that article.

To help guide how practitioners assess the impacts of this technology, we will explore the risks and potential downstream implications associated with changes to how structured and semi-structured data is used and accessed through GenAI, as well as the continued need to ensure accuracy in results.

Structured and Semi-Structured Data: Foundations of Organizational Environments

Relational databases have formed the backbone of organizational data environments for many years.  Companies rely on relational databases to support various critical back-office tasks, from invoice, inventory, and supply chain management to compliance, human resources, reporting, and data analytics.  Whether stored on-premises, in the cloud, or in a hybrid of the two, databases provide a foundation for the consistency and integrity of vast amounts of information.  The data stored in databases is in great demand by “data consumers”—users across a company’s landscape who require a variety of information to perform their functions and accomplish their goals.

Beyond supporting corporate functionality and analysis, relational databases play a crucial role in many legal matters.  Corporate database content, including adjacent system logs, is increasingly a source of information responsive to legal matters.  As a result, legal professionals frequently work with database owners to obtain, organize, and analyze this information.

Organizations have also turned to semi-structured formats to store data.  Semi-structured data types include XML, HTML, CSV, and even email, and generally rely on tags or other markers (rather than a defined schema) to separate different elements.  NoSQL (nonrelational) databases are one type of semi-structured data store seeing increased adoption.  These tools allow organizations to handle diverse and rapidly generated records, including the customer sales orders and system log files inherent in e-commerce.  The proliferation of data lakes within organizations has also resulted in further reliance on semi-structured data, because data lakes can manage disparate data types from various operational feeds.
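To make the distinction concrete, the following is a minimal Python sketch, not drawn from any particular system.  The table name, field names, and sample records are all hypothetical.  It contrasts a relational table, where the schema fixes the columns in advance, with JSON records, where each record carries its own field names and may differ from its neighbors.

```python
import json
import sqlite3

# Structured: the relational schema fixes the columns before any data arrives.
# (Table and column names are hypothetical.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
)
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 120.50)")
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]

# Semi-structured: each record names its own fields, so records can differ.
# (Sample records are invented for illustration.)
raw_records = [
    '{"order_id": 2, "customer": "Globex", "total": 75.0}',
    '{"order_id": 3, "customer": "Initech", "total": 310.25, "coupon": "SPRING"}',
    '{"order_id": 4, "customer": "Umbrella", "items": [{"sku": "A1", "qty": 2}]}',
]
records = [json.loads(raw) for raw in raw_records]


def field_inventory(recs):
    """Return the sorted set of field names seen across all records."""
    return sorted({key for rec in recs for key in rec})


def records_missing(recs, field):
    """Return the order_id of every record that lacks the given field."""
    return [rec["order_id"] for rec in recs if field not in rec]


print(table_columns)
print(field_inventory(records))
print(records_missing(records, "total"))
```

In this sketch, the fourth record has no "total" field at all.  A person (or a GenAI tool) summing order totals across such records would need to know that the gap exists and decide how to treat it, which is the kind of nuance that a defined schema would otherwise enforce.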

Accuracy and Dependability of Data

Data consumers within organizations are fully cognizant of the wealth of information in structured and semi-structured data stores.  Until recently, the prospect of direct access to this information remained remote.  Instead, many corporate and legal users have relied on reports, analyses, and visualizations curated by smaller, more technical teams with direct access to the underlying data (which we will refer to as a “database analyst team”).  However, with novel implementations of GenAI, data consumers are beginning to see direct access within their reach.

Although applying GenAI to structured and semi-structured data will likely provide operational efficiencies, organizations must understand that the process will not be without pitfalls.  They must implement appropriate safeguards to mitigate the risk of improper usage and analysis of data retrieved from structured and semi-structured data stores using GenAI tools.

To begin, companies already rely on established frameworks that ensure the accuracy of reporting from data stored in structured and semi-structured form.  Specific individuals are trained on the nuances, composition, and lineage of data stored in these sources, as well as the subtleties of the query languages that enable them to retrieve the information they require.  This groundwork promotes the accuracy and dependability of the data reported.  For example, reports created from databases and used to disseminate information throughout a company undergo testing and ongoing monitoring to ensure their accuracy.

As companies consider using GenAI to allow for more widespread access to structured and semi-structured data, they must work to ensure that the output from GenAI is accurate, understood, and treated appropriately.  Allowing direct data access for managers and junior personnel may improve efficiency, but it also increases the risk of bypassing existing accuracy safeguards.  The figure below illustrates the potential paradigm shift in information access.

 

[Figure: GenAI replacing a database analyst team and its “reports & analysis” output]

The figure depicts GenAI replacing a database analyst team (and its “reports & analysis” output).  This change removes safeguards built into the existing arrangement of information gatekeepers and validated reports.  To compensate, organizations should consider adding controls that ensure the accuracy and appropriate contextual use of data retrieved using GenAI.  For example, organizations should consider ongoing training for any data consumers given access to the underlying data via GenAI, as well as additional procedural safeguards to ensure that data consumers use GenAI output effectively and appropriately and, most critically, validate the integrity of that output.
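One possible form of such a procedural safeguard is sketched below in Python.  It is illustrative only and rests on several assumptions: the invoices table, the sample rows, both queries, and the tolerance are hypothetical, and the "generated" query simply stands in for text a GenAI tool might return.  The idea is to compare the result of a GenAI-generated query against a query that the database analyst team has already tested, and to flag any figure that differs.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with invented invoice rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?, ?)",
        [
            (1, "East", 100.0, "posted"),
            (2, "East", 50.0, "void"),
            (3, "West", 200.0, "posted"),
        ],
    )
    return conn


# Assumed to be maintained and tested by the analyst team: excludes voided invoices.
VALIDATED_SQL = (
    "SELECT region, SUM(amount) FROM invoices "
    "WHERE status = 'posted' GROUP BY region"
)

# Hypothetical GenAI output: plausible, but it omits the status filter.
GENERATED_SQL = "SELECT region, SUM(amount) FROM invoices GROUP BY region"


def is_single_select(sql):
    """Crude guard only: accept one statement that begins with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def compare_to_baseline(conn, generated_sql, validated_sql, tolerance=0.01):
    """Return {key: (generated, validated)} for every figure that disagrees."""
    if not is_single_select(generated_sql):
        raise ValueError("generated query is not a single SELECT statement")
    generated = dict(conn.execute(generated_sql).fetchall())
    validated = dict(conn.execute(validated_sql).fetchall())
    discrepancies = {}
    for key in sorted(set(generated) | set(validated)):
        gen_value, val_value = generated.get(key), validated.get(key)
        if gen_value is None or val_value is None:
            discrepancies[key] = (gen_value, val_value)
        elif abs(gen_value - val_value) > tolerance:
            discrepancies[key] = (gen_value, val_value)
    return discrepancies


conn = build_demo_db()
issues = compare_to_baseline(conn, GENERATED_SQL, VALIDATED_SQL)
print(issues)
```

Here the generated query counts a voided invoice that the validated query excludes, so the East region total is flagged.  The statement check in this sketch is deliberately simple and would not, on its own, be an adequate security control.  In practice, an organization would likely pair a comparison of this kind with read-only database credentials and human review of any flagged discrepancy.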

These same concerns extend to legal professionals because they, too, are data consumers.  For example, attorneys searching for relevant data within a database typically have relied on database owners or a database analyst team to provide reports or assist with data analysis.  With the implementation of GenAI, attorneys can query these datasets directly, potentially unlocking insights that would not have been found before.  However, attorneys are unlikely to possess the comprehensive understanding of the datasets and controls necessary to ensure the accurate interpretation of data retrieved through GenAI.

Conclusion

Structured and semi-structured data serve as the basis for many operational and analytical business processes.  GenAI tools give technical users and non-technical data consumers (including managers and attorneys) a new method of querying that data.  Understanding and verifying the data returned by queries has always been imperative.  While GenAI tools may give data consumers faster and more efficient access to data, it is essential that data consumers still understand and verify the data to ensure accuracy and reliability.  Corporate and legal professionals should look to develop efficient, noninvasive, yet sound guidelines before providing access to structured and semi-structured data via GenAI.