Given the rapid pace of change in the AI and generative AI markets, staying up to date on the latest news, technologies and announcements can be challenging. With that in mind, I'm starting a new blogging series focused on what I view as the most important industry developments -- and, of course, my takes on why they matter.
Enhancing code generation with CodeLlama 70B release from Meta
I recently spoke about the role of generative AI in application development as part of TechTarget and BrightTALK's Generative AI Summit. Research from TechTarget's Enterprise Strategy Group (ESG) showed that one in five organizations surveyed believe companies will make a significant investment in generative AI capabilities for application and software development in 2024. These organizations will look to accelerate software delivery and improve developer efficiency across several use cases tied to faster and better code creation and documentation.
Reflecting this trend, Meta recently released CodeLlama 70B, an open source model built for code generation, which can be downloaded for free. It's trained on 1 trillion tokens of code, and the context window has significantly increased to 100,000 tokens.
The model can process and generate longer, more complex code in Python, C++, Java, PHP and other common languages based on natural-language prompts or existing code snippets. What's more, Meta claims it can do so faster and more accurately than ever before. But while Llama's performance has significantly improved, Meta is still chasing OpenAI's GPT-4.
Release of Hermes training data from Nous Research highlights continued importance of transparency
Another area I find fascinating is the vetting of training data for some of the largest models in use today. This is where the increasingly important topic of indemnity comes into play, which I'll discuss in more detail in a follow-up blog.
After the recent headlines about the New York Times' lawsuit against OpenAI and Microsoft over copyright infringement, understanding indemnity clauses will be critical going forward for large language model (LLM) creators and end users alike. And it's not just about outputs and responses -- it's also about the data used to train LLMs.
Last week, applied research group Nous Research released the entire data set used to train its OpenHermes 2.5 and Nous Hermes 2 models, which included over a million data points. By attributing virtually every piece of data in the training set to somewhere within the open source ecosystem, the group is setting new standards for openness and transparency.
OpenAI facing more legal pressures from EU
Over the past year, various countries have restricted the use of ChatGPT for a number of reasons, with Italy notably imposing and then lifting a ban. On Jan. 29, Italy's data protection authority gave OpenAI 30 days to respond to its allegations that ChatGPT breaches the EU's General Data Protection Regulation.
The issues seem to center around OpenAI's processing of personal data to train its AI models. Although fines could reach up to €20 million or 4% of the company's global annual turnover, whichever is higher, if violations are found, the bigger concern for OpenAI could be a potential mandate to change its data collection or processing practices as they relate to EU member states.
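To make the fine ceiling concrete, here is a minimal sketch of the arithmetic. It assumes the GDPR's higher penalty tier, where the cap is the greater of €20 million or 4% of worldwide annual turnover. The function name and the turnover figures are hypothetical, chosen only for illustration, and do not reflect OpenAI's actual financials.

```python
def gdpr_max_fine_eur(global_annual_turnover_eur: int) -> float:
    """Ceiling on a higher-tier GDPR fine (illustrative helper, not legal advice).

    Assumption: the cap is the greater of a fixed EUR 20 million
    or 4% of global annual turnover.
    """
    fixed_cap_eur = 20_000_000
    turnover_cap_eur = global_annual_turnover_eur * 4 / 100
    return max(fixed_cap_eur, turnover_cap_eur)


# Hypothetical turnover figures, for illustration only
for turnover in (100_000_000, 500_000_000, 2_000_000_000):
    cap = gdpr_max_fine_eur(turnover)
    print(f"turnover EUR {turnover:,} -> maximum fine EUR {cap:,.0f}")
```

Under this rule, the percentage-based cap only exceeds the fixed €20 million figure once turnover passes €500 million, which is why the 4% figure matters most for large companies.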
ESG's research showed that 95% of organizations have some form of active compliance guidelines in association with data used in AI projects. However, when companies withhold information on the exact data used to train proprietary and commercial models, this level of pushback is to be expected.
With more regulation expected across global AI markets, organizations must be prepared to act quickly -- not only to prevent legal issues, but also to ensure accuracy and lack of bias in the data used for AI training and insights. Ultimately, this will be crucial to businesses' ability to prove that they are developing AI in a responsible, trustworthy way.
Big investments from big tech into generative AI unicorns trigger FTC investigation
With ESG's research highlighting that skills gaps are a key challenge for organizations looking to implement generative AI, it's no surprise that organizations will be turning to leaders at the forefront of innovation. Several organizations stand out as such leaders, including major cloud providers such as Microsoft, Google and Amazon, as well as AI unicorns like OpenAI and Anthropic.
But all five are currently under scrutiny as part of a recent Federal Trade Commission (FTC) investigation into the partnerships and investments among them, including Microsoft's $10 billion investment in OpenAI, as well as Amazon's $4 billion investment and Google's $300 million investment in Anthropic. And these are just the larger partnership examples tied to the latest FTC investigation -- there are many other examples of strategic partnerships between big tech and generative AI companies.
In my view, the crux of this FTC investigation is the nature of those investment agreements and their impacts on market share and competition. My take is that the recent transformative advances in generative AI would not have been possible without these partnerships and investments, which have significantly accelerated innovation, even outside of the specifically cited technology companies.
Voltron Data acquires Claypot AI for real-time AI
Voltron Data's recent acquisition of Claypot AI, which works with both batch and streaming data, is clearly in line with Voltron's objectives. Voltron recently unveiled Theseus, a composable, accelerated data processing engine built on GPUs, although to date the company has primarily focused on local and batch data.
With the Claypot acquisition, Voltron will now be able to deliver streaming data analytics using the same open standards on which the company was founded, including integrations of open source technology such as Apache Arrow, Apache Parquet and Ibis. Importantly, these two teams have been working together for some time now building a streaming data back end. For customers, this development promises access to real-time AI capabilities, alongside several other core AI lifecycle technologies, such as feature engineering enablement and MLOps.
Mike Leone is a principal analyst at TechTarget's Enterprise Strategy Group, where he covers data, analytics and AI.
Enterprise Strategy Group is a division of TechTarget. Its analysts have business relationships with technology vendors.