How the cloud copes with the problem of dirty data

Terabytes of data were a big deal some 15 years ago; now, companies are managing hundreds of petabytes of data and will soon face exabytes. AI is only going to compound the problem.

Indeed, as businesses begin to harness generative AI, they need to ensure that its input (the information used to train it) and its output are secure, accurate and compliant with regulations, said Krishna Subramanian, COO, president and co-founder of Komprise, a data management company.

Dirty data leading to bad AI outcomes is a problem born in the cloud: many organizations run their AI processing there because the heavy compute requirements make the cloud less expensive than on-premises resources.

The solution proposed by Komprise is also a cloud phenomenon. It extends the company’s core business, which focuses on helping customers manage unstructured data, meaning data that exists outside of structured databases. That type of information makes up 90% of today’s data, including video files, audio recordings, genomic files, medical imaging and other healthcare information, energy sector information and so on.

Komprise is a hybrid cloud solution, with a management console that runs in the cloud, managing data that can reside either in the cloud or on-premises. 

Companies need to store and optimize all this unstructured data, and much of it ends up plunked in the cloud. But the cloud has varying cost tiers. In Amazon Web Services, for instance, storage options range from S3 Standard, which is more expensive but quickly accessible, to the less expensive but slower-to-retrieve S3 Glacier and S3 Glacier Deep Archive storage classes.
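To make the tiering tradeoff concrete, here is a minimal Python sketch of an age-based placement rule. It is an illustration only, not Komprise's method: the function name `pick_tier` and the 90-day and 365-day thresholds are hypothetical, and a real policy would be driven by actual access patterns, retrieval needs and pricing. The returned strings match Amazon S3 storage class identifiers.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
WARM_AFTER = timedelta(days=90)    # untouched this long: move to Glacier
COLD_AFTER = timedelta(days=365)   # untouched this long: move to Deep Archive


def pick_tier(last_accessed: datetime, now: datetime) -> str:
    """Return an S3 storage class name based on time since last access."""
    age = now - last_accessed
    if age >= COLD_AFTER:
        return "DEEP_ARCHIVE"  # cheapest to store, slowest to retrieve
    if age >= WARM_AFTER:
        return "GLACIER"       # cheaper than Standard, slower to retrieve
    return "STANDARD"          # most expensive, quickly accessible
```

A file last read two weeks ago would stay in `STANDARD` under these assumed thresholds, while one untouched for two years would be a candidate for `DEEP_ARCHIVE`.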

Komprise has found a niche for itself by providing automated tools that manage and move data cost-effectively. The tools rely on a “global file index,” a database of an enterprise’s unstructured data. Using this index, Komprise can also find all data related to particular projects.
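The idea of a file index can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Komprise's product: `FileRecord`, `build_index` and `find_project` are hypothetical names, the index covers only local directories, and "related to a project" is approximated by a keyword match on the file path.

```python
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    modified: float  # POSIX timestamp of last modification


def build_index(roots):
    """Walk each root directory and record basic metadata for every file."""
    index = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                try:
                    info = os.stat(full_path)
                except OSError:
                    continue  # skip files that vanish or cannot be read
                index.append(FileRecord(full_path, info.st_size, info.st_mtime))
    return index


def find_project(index, keyword):
    """Return records whose path contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [rec for rec in index if needle in rec.path.lower()]
```

Because queries run against the index rather than the storage systems themselves, a search such as `find_project(index, "genomics")` does not need to touch the underlying files again.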

“We help companies save money on unstructured data, and we help customers make money by better using their historical data,” Subramanian said. 

Komprise’s customers are in data-intensive industries, such as healthcare, biotech, state and local governments, manufacturing, financial services, social media, entertainment and the energy sector. But AI presents new business data management challenges and opportunities for Komprise, Subramanian said. 

AI’s big, messy entrance 

Businesses are looking to AI to automate tasks employees are now doing. IBM CEO Arvind Krishna said last month that the company expects to be able to replace 7,800 jobs with AI over time and that he expects 30% of back-office functions will be replaceable by AI over the next five years.

To do that work, businesses must use existing large language models (LLMs) or build their own. Either path has potential problems, Subramanian said.

Businesses need to be able to report on what training data was used in generative AI. Hybrid-cloud-based data management can help with the job of curating, auditing and moving data to feed an LLM, Subramanian said. 

Most businesses are not likely to build their own LLM. They’ll use pre-trained models, such as ChatGPT, which run in the cloud. But companies still must know how their employees interact with the model and ensure they aren’t giving away proprietary information. And it hasn’t exactly been smooth sailing thus far. Samsung banned employee use of generative AI tools after proprietary information leaked. OpenAI changed ChatGPT’s policies to better protect privacy.

“There are all these gotchas you must be careful of,” Subramanian said. “If you’re using someone else’s generative AI, you need a data management framework to handle security, privacy, data lineage, data ownership and data governance.”

Businesses need guardrails — policies and governance that control when employees can use generative AI and what information can be made available to the AI. And businesses need to understand the policies of the company providing the AI. These protections are important because, in the long run, businesses will need AI trained on corporate data rather than limited to public data sets.

Subramanian said that businesses can use a data management product to audit what information is being shared with generative AI, creating a paper trail of activity, and to track the output.
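As a rough illustration of what such a paper trail might look like, the Python sketch below checks each prompt against a blocklist before it is sent to a generative AI service and records the decision. Everything in it is an assumption made for the example: the function name `audit_prompt`, the single pattern (a U.S. Social Security number format) and the in-memory list standing in for a log store. It stores a hash of the prompt rather than the text itself, so the audit log does not become another copy of sensitive data.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Hypothetical blocklist: one pattern resembling a U.S. Social Security number.
BLOCKED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]


def audit_prompt(user, prompt, log):
    """Append an audit entry to `log`; return True if the prompt may be sent."""
    blocked = any(pattern.search(prompt) for pattern in BLOCKED_PATTERNS)
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        # Fingerprint of the prompt, so the log holds no raw prompt text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "allowed": not blocked,
    }
    log.append(json.dumps(entry))
    return not blocked
```

A caller would forward the prompt to the AI service only when `audit_prompt` returns `True`; either way, the attempt is recorded.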

Companies will want to ensure that AI doesn’t leak proprietary information, such as customer data. “If you log into your bank, it should only show your bank accounts and balances. It shouldn’t show your neighbor’s balances,” Subramanian concluded.