So You Just Got Assigned Your First GenAI Project. (Post 1 of 3)
The Main Models You Are Likely To Work With (Today)
There was a startling announcement from Amazon and Anthropic last week: they will be collaborating to build one of the world's most powerful AI supercomputers. This is sending shockwaves through the AI and GenAI ecosystem, as they are throwing down the gauntlet to the other dominant players (Meta, Google, OpenAI).
There are a lot of folks with strong experience building software who have not yet had the opportunity to implement solutions that involve GenAI… and for the most part, the information aimed at them comes either in the form of “TAKE THIS COURSE” or “LET ME IMPRESS YOU”.
I thought it might be helpful to provide a breakdown of GenAI and how some of the Large Language Models (LLMs) differ, both in terms of how your company might be likely to incorporate them and in terms of how everyone’s role, and where/when they provide value, might shift in these scenarios. My hope is to break this down in a way that is helpful to seasoned software professionals: more of an index of how things might differ from what you already know than a “how to” guide.
As I have mentioned in my videos, critical thinkers play a crucial role in ensuring quality, compliance, and adaptability throughout the lifecycle of these features: not just deploying amazing new features, but also allowing the company to remain relevant without sacrificing quality (or creating legal issues) during a period where AI capabilities are moving faster than the systems we use today were ever intended to handle at scale. New companies entering these spaces will struggle not just with understanding how to best make use of their most skilled engineers, but also with how to best support them in an entirely different lifecycle.
With this in mind, in this post I will offer a few ways to break down how the models differ, how the work differs, and what you may want to consider before starting your first project.
The Main Models You Are Likely To Work With (Today)
How do the main model offerings differ from one another (at a high level), and what does that mean for how an engineer or tester gets involved?
OpenAI:
OpenAI's GPT models are accessed via API endpoints, making them straightforward to integrate into existing applications without requiring infrastructure management. This simplicity means testers focus on prompt engineering, guardrails, and evaluation tooling rather than infrastructure concerns.
New Challenges: Those focusing on quality of responses will need to develop skills in prompt engineering and API testing. They’ll likely work closely with product teams to define edge cases in outputs and collaborate with data scientists to evaluate response quality. Familiarity with tools like Postman or custom API testing frameworks will be valuable, but not as valuable as learning how to build custom evaluation tooling (likely in Python) that allows you to perform exhaustive scoring of responses under various conditions and ranges of inputs; a minimal sketch of what I mean follows below. The testing is all about creativity, challenging the perception of what is possible, and analysis skills. It is also a good idea to brush up on the statistics you learned in school so you can speak to the significance of your findings and use appropriately sized samples.
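To make that concrete, here is a minimal sketch of the kind of evaluation tooling I mean. It assumes the official openai Python package with an API key in your environment; the model name, prompts, and simple keyword check are placeholders you would replace with your own test cases and scoring logic.

# Minimal response-evaluation sketch (assumes the official "openai" Python package
# and OPENAI_API_KEY set in the environment; model name and checks are placeholders).
from openai import OpenAI

client = OpenAI()

# Each test case pairs an input prompt with a simple expectation to score against.
test_cases = [
    {"prompt": "Summarize our refund policy in one sentence.", "must_contain": "refund"},
    {"prompt": "List three risks of storing passwords in plain text.", "must_contain": "password"},
]

results = []
for case in test_cases:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model your team has approved
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0,  # keeps runs comparable while you build up your scoring
    )
    text = response.choices[0].message.content
    results.append(case["must_contain"].lower() in text.lower())

# A pass rate over a handful of cases is not statistically significant on its own;
# this is where the sample-size and significance thinking mentioned above comes in.
print(f"Passed {sum(results)} of {len(results)} checks")

In practice the keyword check would give way to richer scoring (rubrics, embeddings, or a second model acting as a judge), but the loop structure stays the same.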
Meta:
Meta's Llama models are open-source and self-hosted, allowing for extensive customization. This approach gives companies significant control over data privacy and model behavior, but this also requires more technical expertise to set up and maintain. The time (and cost) spent on training and tuning is significant, so there is a longer lag between updating model versions unless your company is extremely well funded.
New Challenges: In addition to what was mentioned about OpenAI, evaluating the quality of responses may require engineers typically focused on product development to learn more about their company’s infrastructure management (e.g., containerization with Docker or Kubernetes) and model fine-tuning techniques. For testers, there are now opportunities to work very closely with Data Science as well as with DevOps (to ensure the model is deployed securely and reliably). They may well find some resistance to this at companies that hold a closed mindset about the role of a software tester, but if your company is using Llama, getting closer to this work will put a tester right next to the relevant problems and help them become more valuable. A rough sketch of what local inference looks like follows below.
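As an illustration of what self-hosting means at its simplest, here is a sketch using the Hugging Face transformers library to load and query a Llama-family model locally. The model ID and settings are placeholders (the Llama weights are gated and need to be requested on Hugging Face), and most teams would put this behind a proper serving layer and container rather than calling it inline.

# Rough local-inference sketch (assumes the "transformers" and "torch" packages,
# a GPU with enough memory, and access to the gated Llama weights on Hugging Face;
# the model ID and generation settings below are placeholders).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; whichever Llama variant your team hosts
    device_map="auto",
)

output = generator(
    "Explain what prompt injection is in two sentences.",
    max_new_tokens=128,
)
print(output[0]["generated_text"])

Everything the OpenAI section said about evaluation tooling still applies; the difference is that here your team also owns the serving stack that this snippet glosses over.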
Google:
Google's Gemini (formerly Bard) models offer both endpoint access and custom training options through Google Cloud. This flexibility allows companies to start with pre-trained models while gradually moving toward deeper integration.
New Challenges: Most of what was described for OpenAI and Meta remains true here for product engineers and testers alike, except that familiarity with Google Cloud tools like Vertex AI for custom training workflows is also extremely valuable (actually, if the company uses GCP… there is value in getting comfortable with these anyway). They may also need to collaborate with cloud engineers on deployment strategies while ensuring compliance with their country's Responsible AI (RAI) guidelines.
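For teams already on GCP, a first hands-on step might look like the sketch below, which uses the Vertex AI SDK for Python. The project ID, region, and model name are placeholders, and the same evaluation-loop thinking from the OpenAI section applies once you swap in this client.

# Minimal Vertex AI sketch (assumes the "google-cloud-aiplatform" package,
# gcloud authentication already configured, and a project with Vertex AI enabled;
# project, location, and model name are placeholders).
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")  # placeholder; pick the Gemini variant your team uses
response = model.generate_content("Draft a one-line release note for a login bug fix.")
print(response.text)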
Anthropic:
This is of course the company that prompted this post. Anthropic's Claude models really emphasize efficiency (e.g., prompt caching) and, like OpenAI's, are accessed via API. These models are cost-effective, making them attractive for businesses focused on scalability without heavy infrastructure investment. It should be noted that DARPA recently revealed that these models also seem to be far more resistant to Universal Suffix attacks and other malicious attempts to trick AIs into bypassing their own restrictions.
New Challenges: This one is clearly about to change dramatically, but if you are a product engineer or tester working on this today, beyond what was mentioned about OpenAI you will also want to focus on measuring and testing cost-performance trade-offs as part of your deployment strategies and evaluations. Your company likely selected Claude for cost reasons. You'll not only want to refine your ability to evaluate efficiency metrics, but also become proficient at reporting your work upwards in terms of latency and token usage (see the sketch below).
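Here is a minimal sketch of capturing those two numbers on every call. It assumes the official anthropic Python package with an API key in your environment; the model name and prompt are placeholders.

# Latency and token-usage measurement sketch (assumes the official "anthropic"
# Python package and ANTHROPIC_API_KEY in the environment; model name is a placeholder).
import time
import anthropic

client = anthropic.Anthropic()

start = time.perf_counter()
message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use the Claude model your team selected
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the last release in three bullet points."}],
)
elapsed = time.perf_counter() - start

# The usage object reports input and output tokens, which is what most cost models are built on.
print(f"Latency: {elapsed:.2f}s")
print(f"Input tokens: {message.usage.input_tokens}, output tokens: {message.usage.output_tokens}")

Logging these per test case gives you the raw material for the cost-performance conversations you will be having with management.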
Don’t Forget the Emerging Chinese Models:
I have not personally worked with any of these yet, but models like Tencent's Hunyuan-Large and Alibaba's Qwen are gaining traction globally. Most are open-source and are taking advantage of massive involvement from developers to quickly catch up with established competitors… especially when solutions need to be localized for specific markets.
New Challenges: Stay tuned on this one, and keep your eyes open; these models could well find their way into your ecosystem. Based on where things are today, anyone working with them may need to develop expertise in handling multilingual capabilities and cultural nuances in outputs.
In the next post, we will try to dive a bit more into how you need to work differently in a GenAI dev environment, and how even working with different models requires different considerations.



