Elizabeth Rousseau, UX Researcher, DFFRNT
Benchmarking as a tool for user experience researchers is functionally identical to how it applies as a business practice. Depending on the product or service being offered, it might be important to know how long it takes to transfer money in an app, pay for parking at a parking machine, or buy tickets at a venue. With these measurements, you can compare your metrics against similar services and determine how well your product stacks up against the competition. The same metrics are also valuable internally, to show how design or information architecture changes improved task performance. Benchmarking makes it easy to measure and compare performance.
School systems do this as well: provincial testing is intended to assess how much students have progressed in areas such as reading, writing, and math. This both sets clear expectations for families about what their children should be learning and gives the province important feedback on how well schools are performing as centers of learning.
This sort of clarity is valuable across all sectors, particularly when it comes to new and poorly understood technology that is being equally hyped and derided. One such sector is colloquially known as AI, though in most cases we are talking about large language models (LLMs), which have demonstrated some competency in compiling, summarizing, and presenting their inventory of information. That is an underwhelming summary of the abilities of LLMs to those who are jittery with excitement at the potential of AI developments and a future of work that humans no longer need to be bothered with.
It sounds like some benchmarks would be helpful in providing evidence of the current capabilities of various LLMs, and of how worried employees need to be. Carnegie Mellon University researchers did just that, creating a digital company in order to test how well LLMs were able to take on typical workplace tasks. Based on their testing, the top two models scored 30 and 26 percent, which isn't a convincing enough performance to hire someone, even if they can forgo sleep and don't have to worry about the cost of living.
In particular, the researchers found that the LLMs struggled with the company's communication platform as well as with the online office software the digital company used. The LLMs also lost points over difficulties in understanding documents, communicating with other people, and completing common "mundane" tasks, all areas that humans handle with no trouble.
One task, for example, has the LLM take on the role of a project manager. Each task has multiple steps, with points awarded for completing each step and additional points for full task completion. This allows evaluators to track the LLM's performance throughout the task and to award credit for partial progress while keeping the incentive for completion high.
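To make the scoring scheme concrete, here is a minimal sketch of partial-credit scoring, assuming one point per completed step plus a bonus awarded only on full completion. The function name and point values are illustrative, not the researchers' actual rubric.

```python
def score_task(steps_completed: int, total_steps: int,
               points_per_step: int = 1, completion_bonus: int = 2) -> int:
    """Score a multi-step task: partial credit for each completed step,
    plus a bonus only when every step is finished (hypothetical values)."""
    score = steps_completed * points_per_step
    if steps_completed == total_steps:
        score += completion_bonus  # reward finishing the whole task
    return score

# A model that completes 2 of 4 steps still earns partial credit,
# while full completion earns noticeably more than 4 alone.
print(score_task(2, 4))  # 2
print(score_task(4, 4))  # 6
```

The completion bonus is the design choice doing the work here: without it, a model could rack up points by half-finishing many tasks, so the bonus keeps the incentive weighted toward seeing a task through to the end.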