Exploring the Opportunities and Risks for Generative AI and Corporate Databases: An Introduction
Generative artificial intelligence (gen AI) systems that leverage large language models (LLMs) are proliferating faster than even the most ardent tech enthusiasts had envisioned. Many legal professionals are captivated by this technology and are busy analyzing and discussing the benefits, risks, and impacts of gen AI on both the business and the practice of law.
Much attention has been dedicated to the use of gen AI for document summarization, categorization, search, and first-draft generation. Almost every practice area is busy assessing the impact of this technology on its field, including e-discovery and information governance. One area that has not received as much attention, however, is the risk involved when business users rely on gen AI to query data stored in databases.
Harnessing Data in Corporate Database Reporting
Databases serve as the backbone of modern business, allowing organizations to keep pace with the storage, management, and analysis of an ever-growing amount of data. Beyond scaling to meet future storage needs, these systems offer data integrity, fast retrieval, security, and analytics. But these advantages require specialized knowledge to query the stored content appropriately, knowledge that is typically the province of database administrators (DBAs), data engineers, and other power users. For example, relational databases, the most widely used type, require knowledge of structured query language (SQL) to access the contents stored within them.
At the same time, "data consumers"—i.e., users across a company’s landscape who require a variety of information to perform their functions and accomplish their goals—continue to thirst for data in various forms, including raw data that has not been aggregated or summarized. But the gatekeepers of this information (generally DBAs, engineers, and information security personnel) typically balk at providing direct access to databases and instead point data consumers to canned reports and analytical/visualization tools. These gatekeepers harbor concerns, which are not unfounded, that users lacking detailed knowledge of a database’s schema may not be able to write appropriate queries or understand the true context of what a query returns.
Many of these data consumers have the business and analytical acumen, however, to derive value from raw data that is not easily represented in prewritten reports and dashboards. To capitalize on these skills, some organizations have begun leveraging gen AI to let data consumers query raw data. For example, security and system administrators are beginning to take advantage of plain English queries that rely on LLMs to help search log files. Log files are typically stored in semi-structured formats (e.g., JSON, TXT, XML) rather than in databases, and without LLMs to assist, users must rely on purpose-built query languages, not plain English, to search this data. As there will always be a desire and need for data consumers to query raw data, it is only natural that organizations will extend plain English query capabilities to database searches.
Enter Gen AI for Databases
Organizations are now experimenting with gen AI to search databases, eliminating the hurdles of understanding complex database structures and mastering specific query languages. For relational databases, these efforts build on earlier text-to-SQL applications that relied on strict, handcrafted rules. Today, organizations are working to use LLMs to convert plain English requests into SQL queries that are in turn used to search a database. These initiatives are progressing—although still behind semi-structured data search—and show promise that data consumers will be able to bypass query languages like SQL and search databases in plain English as well.
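To make the text-to-SQL flow concrete, the following is a minimal sketch in Python of how such a pipeline might be wired together. The llm_generate_sql() helper, the table, and the column names are all hypothetical placeholders invented for illustration; in a real system the helper would call whatever LLM service the organization has approved, with the database schema supplied as context.

```python
import sqlite3

# Hypothetical helper: in a real system this would call the organization's
# approved LLM service with the schema and the user's plain English request.
def llm_generate_sql(question: str, schema: str) -> str:
    # Stubbed response for illustration only; an LLM would generate this text.
    return (
        "SELECT SUM(billed_amount) AS total_revenue "
        "FROM invoices WHERE fiscal_year = 2024"
    )

# Demo database with an invented schema (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (billed_amount REAL, fiscal_year INTEGER, operating_unit TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1200.0, 2024, "East"), (800.0, 2024, "West"), (500.0, 2023, "East")],
)

schema = "invoices(billed_amount, fiscal_year, operating_unit)"
question = "What was total revenue last year?"

sql = llm_generate_sql(question, schema)  # plain English -> SQL
print(sql)
print(conn.execute(sql).fetchall())       # [(2000.0,)]
```

The appeal is obvious: the data consumer never writes SQL. The risk, as discussed below, is that everything hinges on how faithfully the generated query captures what the user actually meant.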
Risks and Associated Guardrails
Nonetheless, organizations must proactively prepare for the inevitable availability of methods that give data consumers the ability to independently query structured and semi-structured data. While testing and implementing gen AI for database queries, organizations must identify and properly address the new risks these tools introduce.
By way of example, consider a tool that leverages gen AI to allow data consumers to independently search financial information stored in a database, using plain English queries to determine total revenue for a fiscal year. Providing this kind of access introduces new vulnerabilities, including:
- Context and Nuance: Questions about revenue such as “What was total revenue last year?” will likely yield vastly different queries and results from a question with more context and intentional parameters, such as “Give me total billed amounts for the entire year of 2024 in all operating units” (see the sketch following this list).
- Consistency and Accuracy: Slight changes to searches entered in plain English may map to different database queries, resulting in varied findings.
- Unauthorized Access: Users may be able to obtain sensitive information, including sales results for individual employees.
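The sketch below illustrates the first two risks using an invented invoices table. The two SQL statements are hypothetical examples of what an LLM might plausibly generate for the two phrasings above, not output from any particular model; the point is that a vague prompt and a precise prompt can map to different queries and return different totals.

```python
import sqlite3

# Invented demo data (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (billed_amount REAL, invoice_date TEXT, operating_unit TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1200.0, "2024-03-15", "East"),
     (800.0, "2024-11-02", "West"),
     (500.0, "2023-06-30", "East")],
)

# "What was total revenue last year?" -- a model might read "last year" as a
# rolling period, the prior calendar year, or the current fiscal year.
vague = "SELECT SUM(billed_amount) FROM invoices WHERE invoice_date >= '2023-06-01'"

# "Give me total billed amounts for the entire year of 2024 in all operating units."
specific = ("SELECT SUM(billed_amount) FROM invoices "
            "WHERE invoice_date BETWEEN '2024-01-01' AND '2024-12-31'")

print(conn.execute(vague).fetchall())     # [(2500.0,)] -- includes 2023 activity
print(conn.execute(specific).fetchall())  # [(2000.0,)] -- calendar year 2024 only
```

Both answers are "correct" for the query that was run; only one reflects what the user likely intended, which is why the controls below matter.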
To mitigate these potential risks, organizations should consider certain controls and guardrails, including:
- Context and Nuance: All users of these tools should receive high-level training on the basics of the available data and appropriate practices for prompting queries against the underlying database.
- Consistency and Accuracy: Organizations should enforce “human-in-the-loop” controls by requiring a subject matter expert to validate results prior to using the content downstream.
- Unauthorized Access: Administrators should leverage concepts such as role-based access, search controls, and logging to ensure that only users with appropriate credentials may access sensitive information (a simplified sketch follows this list).
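As a simplified illustration of those access controls, the sketch below checks LLM-generated SQL against a role-based table allowlist and a read-only rule before it runs, and logs every attempt. The role names, table list, and check logic are illustrative assumptions, not a complete access-control design; production systems would typically enforce permissions in the database itself as well.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)

# Illustrative policy: which tables each role may query (hypothetical roles and tables).
ROLE_TABLES = {
    "finance_analyst": {"invoices", "operating_units"},
    "sales_rep": {"operating_units"},
}

def guard_query(user: str, role: str, sql: str) -> bool:
    """Allow only read-only queries against tables permitted for the user's role."""
    logging.info("user=%s role=%s sql=%s", user, role, sql)

    # Read-only check: reject anything that is not a single SELECT statement.
    if not sql.lstrip().lower().startswith("select") or ";" in sql.rstrip().rstrip(";"):
        return False

    # Table allowlist: every table referenced in FROM/JOIN must be permitted for this role.
    referenced = set(re.findall(r"\b(?:from|join)\s+(\w+)", sql, flags=re.IGNORECASE))
    return referenced <= ROLE_TABLES.get(role, set())

# A finance analyst may total invoices; a sales rep may not.
sql = "SELECT SUM(billed_amount) FROM invoices WHERE fiscal_year = 2024"
print(guard_query("analyst01", "finance_analyst", sql))  # True
print(guard_query("rep02", "sales_rep", sql))            # False
```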
These examples represent only a few of the risks, and mitigating controls, associated with applying gen AI to database search. This area of risk will undoubtedly be a focal point for organizations looking to allow more data consumers to independently query databases.
Conclusion
Our perspective is in no way an anti-AI stance, nor is it a recommendation to place a moratorium on using gen AI to retrieve critical information from corporate databases. Quite the opposite: we support embracing this evolving technology responsibly and, quite frankly, taking advantage of it at every reasonable opportunity, while keeping its limitations and risks in mind. We all have an opportunity to learn how companies implement gen AI to let users query corporate databases responsibly, to develop strategies that reduce risk and stay ahead of potential problems, and to help data consumers use gen AI for efficient, responsible data retrieval. The challenge, however, is to ensure that all data returned by prompts and gen AI query tools is sound, verifiable, and contextually appropriate. Understanding evolving practices and guidance will be key to maintaining a balance of reliability and risk.
Diana Fasching is a managing director at Redgrave in Virginia. Fasching focuses on information governance and technical aspects of e-discovery, identifying and advising on issues and solutions related to the life cycle of information and integration of emerging technologies.
Michael Kearney is a director at Redgrave in Washington, D.C. He advises clients regarding data protection, information privacy, information governance, e-discovery, and emerging artificial intelligence issues. Kearney is a certified artificial intelligence governance professional (AIGP).
Glen Mattfeld is a director at Redgrave in Virginia. He helps clients optimize enterprise-wide systems and data storage and management practices, focusing on collaboration and operational efficiency. Mattfeld is a certified information management professional (CIMP).