Role: UX Researcher
Timeline: Jan - Mar 2023
Team: 4 Researchers
Tools: UserInterviews.com, FigJam

Power BI

Evaluating the usability of Power BI's AI-powered Quick Measure Suggestions feature.

As Microsoft started to integrate OpenAI’s GPT models across its product suite, we partnered with the Power BI team to evaluate the usability of the Quick Measure Suggestions (QMS) feature before launching it to general availability.

The QMS feature allows users to quickly create new measures using natural language prompts instead of writing formulas in Data Analysis Expressions (DAX). This functionality is designed to enhance efficiency and accessibility for users, especially those less familiar with DAX. Our evaluation aimed to identify barriers to discovering the feature, challenges in submitting natural language prompts, and best practices for setting user expectations.
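
As a rough illustration of what this looks like in practice (the table, column, and measure names below are hypothetical and not taken from the study dataset or Microsoft's actual output), a prompt such as "total sales for the United States of America" might map to a DAX measure along these lines:

    // Illustrative only: table and column names are made up for this example.
    US Total Sales =
    CALCULATE(
        SUM(Sales[SalesAmount]),
        Geography[Country] = "United States of America"
    )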

NOTE: This case study focuses only on the research side of the usability study. If you would like to learn more about this project and the design recommendations we made based on the research findings, feel free to reach out to me ☺.

Research Brief

study objective

To validate the overall usability of the feature and gather data to quantify various metrics, including AI trust score, learnability, functionality success, and comprehension.

focus area - 01

Identifying barriers that users may face in discovering the QMS feature.

focus area - 02

Identifying challenges and pain points that users may encounter while submitting a natural language prompt through the QMS input box.

focus area - 03

Defining effective methods to inform users about the behavior of an AI product in production that may not always provide accurate outputs.

Research Questions

Question 01

Are participants able to discover the QMS feature without being instructed?

Question 02

What challenges and pain points might participants encounter when submitting a natural language prompt through the QMS input box?

Question 03

Are participants able to evaluate and validate a suggestion received from the AI generator when multiple answers are provided?

Question 04

When errors occur in the experience, what barriers do participants face in understanding and recovering from them?

Participant Recruitment

Who are QMS's target users?

The QMS feature targets existing Power BI users with experience in DAX. This was a critical requirement: participants needed enough DAX experience for us to validate their expectations against the feature's output.

How did we recruit participants?

In collaboration with the Power BI team, we recruited 6 participants who used Power BI daily for their personal or professional work. The recruitment process, including screening, signing a Non-Disclosure Agreement with Microsoft, and scheduling, was conducted through the User Interviews platform. Additionally, all our participants were based in the United States due to data regulations in certain geographic regions.

What were the participant profiles?

- Business professionals who use Power BI to create reports every week.
- They should have enough understanding of DAX to validate DAX formulas.
- All participants were based in the United States because of data regulations in certain regions.

Task Flows

Prompting to Prompt... How did we design the tasks?

Our tasks were designed to first give participants an idea of the dataset they would use during the study. This was followed by checking the discoverability of the feature, and then performing calculations of varying difficulty in Power BI.

When writing our script, we had to iterate on our language several times because participants would reuse the exact words from our task descriptions in their prompts to the Quick Measure Suggestions feature.

What was the study like?

Where we tested:
Participants were sent a Microsoft Teams link prior to the remote test session. Each session had one moderator, one administrator, and at least one notetaker.

What we tested:
Participants were given 5+2 scenarios and tasks involving exploring and using the Power BI QMS feature.

How we tested:
Participants were observed during these tasks and asked follow-up questions. Each session lasted about 60 minutes, and participants filled out a post-task survey.

Findings

Finding 01: Not Really Natural Language

Typing the prompt in certain ways led to correct results, but those phrasings are not always natural language. Prompts have to be specific to the data, and the feature is not intelligent enough to recognize variables described naturally. For instance, the user has to enter “United States of America”, not “U.S.” or “United States”.

5/6 participants noted that there was a specific way to prompt, and 5/6 tried rewriting their prompt to get the expected output. There is no grammar or spell check on the prompt; when a typo slips through, the feature returns an error. 3/6 participants mentioned that the way they write prompts could affect how close the result is to their expectations.

Finding 02: Confusing interaction with the input box

The blue underline indicates a match with an existing field in the dataset, but participants interpreted it as other things, such as auto-suggestions or a filtering system for specific terms.

4/6 participants found the input box's auto-suggestions confusing.
3/6 participants confused two interactions:
- Clicking a blue-underlined term: possible prompts filtered for that term.
- Auto-suggestion while typing: suggested prompts.

Finding 03: Expectations inspired by ChatGPT

Because this feature is launching after people have become used to ChatGPT, participants expect more than just calculations. The feature is limited to providing formulas, but people want direction toward their final outcome, which could be a graph or a refined calculation.

While 5/6 participants struggled with typos, the pilot participant expected an autocorrect feature. Participants felt they had to ask very specific questions in a specific style, and that users might need training to use the feature efficiently.

Finding 04: Interface shortcomings

Participants wanted to name the measure they were creating before adding the calculation as a card to their dashboards, but they could not find a way to do that. They missed the formula bar at the top left, where the measure name can actually be edited.

Some participants also expected a typo recognizer so they would not make mistakes while writing a natural language prompt.

Finding 05: Poor discoverability of additional suggestions

5 out of 6 participants did not notice the variations of suggested measures shown below the first expanded suggestion card, even though they appeared multiple times throughout the test.

Furthermore, participants reported that the variations looked similar to each other and were difficult to distinguish.

Finding 06: Unexpected output

Every output provides a “Preview value”, which is not optimal for certain types of prompts and leads to confusion, particularly for prompts with multiple variables and for categorical or time/trend-related ones.

The current design is best suited to displaying singular, text-based answers. The “Preview value” is an intermediate step, but participants expect the final output (an analysis or a visualization).

Metrics

SUS Score

We used the System Usability Scale (SUS), a list of ten statements that participants scored as “Strongly Agree”, “Agree”, “Neutral”, “Disagree”, or “Strongly Disagree”.

We got a SUS score of:

60.7
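
For context, the standard SUS scoring method (not specific to this study) converts each of the ten responses to a number from 1 to 5 and combines them as:

    SUS = 2.5 × [ Σ(odd items: response - 1) + Σ(even items: 5 - response) ]

This yields a score from 0 to 100; a 60.7 sits slightly below the commonly cited benchmark average of 68.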

AI Trust Score

The AI Trust Score is a quantitative metric that has been developed by Microsoft to assess the degree of trust users have in an AI system or model. This score is determined by a set of multi-dimensional statements that users respond to, providing insight into their level of trust in the AI. These statements assess the effectiveness of the AI feature in enhancing job efficiency and proficiency, as well as the user's understanding of when and how to use the feature in their job role. 

Due to NDA, we are unable to share the AI Trust Score here.

Good Things

What went well

The Quick Measure Suggestions feature demonstrated high discoverability, with users following three distinct paths to reach the natural language input box. While no participants had difficulty beginning to write a natural language prompt, they faced challenges in understanding the suggestions and other interface elements once they started typing.

Despite these challenges, participants, particularly those new to or unfamiliar with DAX, expressed strong enthusiasm for the feature's potential. All participants agreed that they would be willing to incorporate QMS into their everyday work.
