Evaluating the usability of Power BI's AI-powered Quick Measure Suggestions feature.
As Microsoft started to integrate OpenAI’s GPT models across its product suite, we partnered with the Power BI team to evaluate the usability of the Quick Measure Suggestions (QMS) feature before launching it to general availability.
The QMS feature allows users to quickly create new measures using natural language prompts instead of writing formulas in Data Analysis Expressions (DAX). This functionality is designed to enhance efficiency and accessibility for users, especially those less familiar with DAX. Our evaluation aimed to identify barriers to discovering the feature, challenges in submitting natural language prompts, and best practices for setting user expectations.
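For context, a DAX measure of the kind QMS generates from a prompt such as "total sales for the United States of America" might look like the sketch below (the Sales and Geography table and column names are hypothetical, not taken from the study dataset):

```dax
-- Illustrative only: the kind of measure QMS might suggest for a prompt like
-- "total sales for the United States of America".
-- Table and column names (Sales, Geography) are hypothetical.
US Total Sales =
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    Geography[Country] = "United States of America"
)
```

Writing this by hand requires knowing both the DAX functions and the exact field names; QMS aims to remove that barrier for users who are less fluent in DAX.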
NOTE: This case study focuses only on the research side of the usability study. If you would like to learn more about this project and the design recommendations we made based on the research findings, feel free to reach out to me ☺.
study objective
To validate the overall usability of the feature and gather data to quantify various metrics, including AI trust score, learnability, functionality success, and comprehension.
focus area - 01
Identifying barriers that users may face in discovering the QMS feature.
focus area - 02
Identifying challenges and pain points that users may encounter while submitting a natural language prompt through the QMS input box.
focus area - 03
Defining effective methods to inform users about the behavior of an AI product in production that may not always provide accurate outputs.
Question 01
Are participants able to discover the QMS feature without being instructed?
Question 02
What challenges and pain points might a participant encounter when submitting a natural language prompt through the QMS input box?
Question 03
Are participants able to evaluate and validate a suggestion received from the AI generator when multiple answers are provided?
Question 04
In case of any errors in the experience, what barriers do participants face in understanding and recovering from them?
The QMS feature targets existing Power BI users with experience in DAX. A critical requirement was that participants have some experience with DAX so that we could validate their expectations against the outcome.
In collaboration with the Power BI team, we recruited 6 participants who used Power BI daily for their personal or professional work. The recruitment process, including screening, signing a Non-Disclosure Agreement with Microsoft, and scheduling, was conducted through the User Interviews platform. Additionally, all our participants were based in the United States due to data regulations in certain geographic regions.
- Business professionals who use Power BI to create reports every week.
- They should have enough understanding of DAX to validate DAX formulas.
- All participants based within the United States, because of data regulations in other regions.
Our tasks were designed to first familiarize participants with the dataset they would use during the study. This was followed by checking the discoverability of the feature, and then by performing calculations in Power BI at various difficulty levels.
When writing our script, we iterated on our language several times, because participants tended to reuse the exact words from the task description in their prompts to the Quick Measure Suggestions feature.
Where we tested:
Participants received a Microsoft Teams link prior to the remote test session. Each session had 1 moderator, 1 administrator, and at least 1 notetaker.
What we tested:
Each participant was given 5+2 scenarios and tasks that involved exploring and using the Power BI QMS feature.
How we tested:
Participants were observed during these tasks and asked follow-up questions. Sessions lasted about 60 minutes, and participants were asked to fill out a post-task survey.
Typing the prompt in certain ways led to correct results, but those ways were not always natural language. The prompt has to match the data exactly; the feature is not intelligent enough to recognize values entered the way people naturally write them. For instance, the user has to enter “United States of America” rather than “U.S.” or “United States”.
5/6 participants noted that there was a specific way to prompt, and 5/6 participants tried rewriting the prompt to get their expected output. There is no grammar or spell check while writing the prompt; when a prompt with a typo is submitted, the feature returns an error. 3/6 participants mentioned that the way they write prompts could affect how close the result is to their expectations.
The blue underline indicates a match with an existing field in the dataset, but participants interpreted it as other things, such as auto-suggestions or a filtering system for specific terms.
4/6 participants found the input box's auto-suggestions confusing.
3/6 participants found it hard to distinguish between:
- Clicking a blue underline: possible prompts filtered for that term
- Auto-suggestions while typing: suggested prompts.
Because this feature is launching after people have become used to ChatGPT, participants expected more than just calculations. The feature is limited to providing formulas, but people wanted direction toward their final outcome, which could be a graph or a refined calculation.
While 5/6 participants struggled with typos, the pilot participant expected an autocorrect feature. Participants felt that they had to ask very specific questions in a specific style, and that users might need training to use the feature efficiently.
Participants wanted to name the measure they were creating before adding the calculation as a card to their dashboards, but they could not find a way to do so. They missed the formula bar at the top left, where they could actually edit the measure name.
Some participants also expected a typo recognizer so they would not make mistakes when writing a natural language prompt.
5 out of 6 participants did not notice the variations of suggested measures shown below the first expanded suggestion card, even though they appeared multiple times throughout the test.
Furthermore, participants reported that the variations looked similar to each other and that they could not easily distinguish them.
Every output provides a “Preview value”, which is not optimal for certain types of prompts and leads to confusion, particularly for prompts with multiple variables and for categorical or time/trend-related ones.
The current design is most suitable for displaying singular, text-based answers. The “Preview value” is a middle step, but participants' expectation is the final output (an analysis or a visualization).
We used the System Usability Scale (SUS): ten statements that participants rated as “Strongly Agree”, “Agree”, “Neutral”, “Disagree”, or “Strongly Disagree”.
We got a SUS score of:
60.7
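For reference, the standard SUS calculation converts each of the ten Likert responses to a 0–4 contribution (the response minus 1 for the odd-numbered, positively worded items; 5 minus the response for the even-numbered, negatively worded items), sums the contributions, and multiplies by 2.5 to yield a 0–100 score per participant; the 60.7 reported above would then be the average of the per-participant scores (an assumption on our reporting, as is typical for SUS):

$$\text{SUS} = 2.5 \left[ \sum_{i \in \{1,3,5,7,9\}} (r_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - r_i) \right], \quad r_i \in \{1,\dots,5\}$$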
The AI Trust Score is a quantitative metric that has been developed by Microsoft to assess the degree of trust users have in an AI system or model. This score is determined by a set of multi-dimensional statements that users respond to, providing insight into their level of trust in the AI. These statements assess the effectiveness of the AI feature in enhancing job efficiency and proficiency, as well as the user's understanding of when and how to use the feature in their job role.
Due to NDA, we are unable to share the AI Trust Score here.
The Quick Measure Suggestions feature demonstrated high discoverability, with users following three distinct paths to reach the natural language input box. While no participants had difficulty beginning to write a natural language prompt, they faced challenges in understanding the suggestions and other interface elements once they started typing.
Despite these challenges, participants, particularly those new to or unfamiliar with DAX, expressed strong enthusiasm for the feature's potential. All participants agreed that they would be willing to incorporate QMS into their everyday work.