Measurement Operations

Basic data coding

Getting from data to numbers

A collection of shapes inside the original, big rectangle. Some shapes are grayed out and have question marks in them, indicating that their data has not yet been found.

What is data coding?

Once you have your datasets, you need to marshal them into a format that can be read uniformly. While mixed data sources help ensure that you have a more well-rounded understanding of your part of the big why than is possible with a single data type, those mixed data sources present a problem once you want to use them together to create a compilation.

The math you need to make all the numbers in your data adhere to a common rule is called normalization, which is explained in the next section. But what about when you have data that isn’t even in number form? These are the quotations, illustrations, photographs, and other media that are not numbers, but are still research and data. This could include both current qualitative data and historical data. You know this data is important, but how do you integrate it alongside the more easily counted quantitative data?

Social scientists have been working on this problem for many years. Through a process called “data coding,” you can assign numerical value to non-numerical data.

Designing your coding system

Like building compiled indicators, there’s no perfect way to assign numerical value to non-numeric data. In fact, a 2020 study found that there’s a distinct “lack of standard methods for quantification of qualitative results.”1 For this reason, when creating your coding system, it’s useful to apply a version of our principles for using mixed methods in data collection. We want it to be:

    • Defensible
    • Replicable
    • Verifiable

Which is to say: you will have to design this coding system, and you will have to be able to explain it, defend it, and build it in a way such that others can independently verify its logic.

Tools to use

At this point, you might consider what tools to use for coding. For the general practitioner, we recommend spreadsheets or basic relational databases as your primary tools, with your coding logic documented in them. If you have more data than you can manage using one of these two tools, we recommend that you reach out to a data analytics team, a data science team, or an evaluation science team. We especially caution against using artificial intelligence or machine learning systems, as the outputs from those tools (at the time this guide was written) can be quite unreliable and resist independent verification.

If you do not have access to data professionals, that’s fine. You will simply need to narrow your work down to a point where you can manage the amount of data on your own and with your generalist’s skill set. Check out the problem framing section of the discovery guide to learn how to re-frame and scope your work.

Why code by hand

You might wonder why we encourage you to code qualitative data by hand. It’s not simply to make your life harder; the by-hand approach addresses five concerns:

  1. Scope: If you have a bigger data set than you can grapple with by hand, that’s a sign that you probably need to either scope your project back to a size you can handle, or you need to call in professional researchers.

  2. Cost: A by-hand method is usually the cheapest way to move through this process. Buying tools to do this work for us costs additional budget dollars, but using our brains is already included in our job descriptions.

  3. Velocity: A by-hand method is the fastest way to move through this process as long as your dataset is small (less than 1,000 data points). Any tool you want to use will require a learning curve, which takes time. Doing this work by hand avoids that learning curve.

  4. Learning opportunity: The manual method allows for greater learning. Thinking through and doing this work yourself helps you learn the parameters of coding and gather insights into the process, which leads to a greater familiarity with your data and greater practice at critically engaging with it.

  5. Product acquisition: After you’ve learned, you can buy. After you have a grasp of this process with a manual method, you can use what you’ve learned to determine which automated tool would work best, if you’re going to work this process often in the future. Smarter, more precise purchases make budget dollars go farther.

Coding systems depend on research methods

With some qualitative data, it’s easy to assign numerical value to the resulting dataset. For example, data held in a Likert scale (one of those sliding scales of preference, usually with “strongly agree” on one end and “strongly disagree” on the other) is qualitative data expressed quantitatively. If your research includes sentiment data, for example, research participants have probably been asked to express their data on a Likert scale, and have therefore already used numbers to express their feelings.

Slightly more complicated are research studies where multiple choice questions have been used as the research platform. In multiple choice answers, numerical values can be assigned to each question, or numerical values can be assigned to clusters of answers. This method of coding is like magazine or online quizzes that you might have seen.
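As a sketch, both cases reduce to a lookup from answer labels to numbers. The scale values and answer clusters below are hypothetical examples, not drawn from any particular study:

```python
# A minimal sketch of coding Likert and multiple-choice answers into
# numbers. The specific values and clusters here are assumptions.

# Likert responses map directly onto an ordered numeric scale.
LIKERT_SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

def code_likert(responses):
    """Convert a list of Likert labels into their numeric codes."""
    return [LIKERT_SCALE[r.lower()] for r in responses]

# Multiple-choice answers can be coded per answer, or by cluster:
# here answers "a" and "b" cluster into one value, "c" and "d" another.
ANSWER_CLUSTERS = {"a": 1, "b": 1, "c": 2, "d": 2}

def code_multiple_choice(answers):
    """Convert multiple-choice answer letters into cluster codes."""
    return [ANSWER_CLUSTERS[a] for a in answers]

print(code_likert(["Agree", "strongly disagree"]))  # [4, 1]
print(code_multiple_choice(["a", "c", "b"]))        # [1, 2, 1]
```

Either way, the lookup table itself is the coding logic you would document so others can verify it.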

In either of these cases, you might consider reaching out to data analysts to help you accurately parse the data. Because this research should have been designed with the help of a “Voice of the Customer” or data analytics team before the research took flight, that team might be easy to find. Otherwise, consider whether you can manage it in a spreadsheet on your own, with your skill set, or whether the data set is simply not feasible for you to use.

The hardest task: coding completely open-response data

By far the most difficult coding task researchers face is when the qualitative research is entirely subjective in nature; that is, research where participants might have been asked a series of questions (semi-structured interviews) or been part of an open conversation (unstructured or open-ended interviews), or when the data is non-textual, like photographic, sonic, or textural.

This research, in the words of one team, produces “mountains of words.”2

Themes and codes

The simplest way to convert subjective qualitative data points into a numerical representation is “straight count”. This is the method we recommend, as it’s the simplest. After compiling the straight count, you can then treat your qualitative dataset like any quantitative data set, choosing whether to weight the scores or not, and normalizing the data so it shares common units of measure with your other data. (Processes for normalization and weighting are provided in later sections of this guide.) Below, please find two examples of how to compile straight counts from qualitative data.

Example 1: Coding, then counting

In 2017, the Veterans Experience Office at the U.S. Department of Veterans Affairs set out to understand how community veteran engagement boards, known as CVEBs, were functioning. To do this, the design research team undertook a five-city, workshop-based approach.

After gathering the data, the team synthesized and coded it into themes, then, finding that the themes themselves warranted further division, parsed them again. The synthesized data ended up looking like this:

  1. Themes dwelling on current concerns

    • Group & Community History
    • VA Relationship
    • CVEB Self Perception
    • Values: Conceptual
    • Values: Practical
    • Experiential Know-How
    • Quantitative Know-How
    • Challenges
    • Best practices
  2. Themes dwelling on future concerns

    • VA Relationship
    • Conceptual Thinking
    • Metrics
    • Communication
    • Specific Projects
    • Concerns
    • Network
    • Workflow
    • Service Awareness & Navigation

Then, the team did a straight count of the number of data points that appeared in each theme. From that straight count, they were able to understand where the attention and resources for each group clustered, where additional attention and resources might be allocated, and how each group might share and gain best practices from the other.
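A straight count like this can be sketched as a tally of coded data points per theme. The data points below are invented for illustration; only the theme names come from the example above:

```python
# A minimal sketch of a straight count: each coded data point records
# the theme(s) it was coded into, and we tally data points per theme.
# The notes here are hypothetical, not the VEO team's actual data.
from collections import Counter

coded_data_points = [
    {"note": "monthly meeting with VA staff", "themes": ["VA Relationship"]},
    {"note": "group founded after local summit", "themes": ["Group & Community History"]},
    {"note": "no shared measures across groups", "themes": ["Metrics", "Challenges"]},
]

def straight_count(data_points):
    """Tally how many data points were coded into each theme."""
    counts = Counter()
    for point in data_points:
        counts.update(point["themes"])
    return counts

print(straight_count(coded_data_points))
```

The resulting counts are what let you see where attention and resources cluster across themes.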

Example 2: Defining, gathering, counting

To understand how GSA’s digital ecosystem functions, the agency’s Enterprise Digital Experience (EDX) team compiled the EDX Index. This index is a composite of six datasets related to website performance, which are:

  1. Accessibility
  2. Performance/SEO
  3. User behavior
  4. USWDS use
  5. Required site links
  6. Customer-centricity

The first five in the list are quantitative, using units of measurement like time, number of connections, and implementation instances, while the sixth, customer-centricity, is gathered qualitatively using semi-structured interviews.3 The team first researched best practices for customer-centricity as defined in federal policy and law, as well as in private industry. They then melded those best practices with digital practice in GSA, outputting the following as customer-centric themes for GSA digital teams:

  1. The team’s ability to state the website’s audience
  2. The team’s ability to state the website’s purpose
  3. The team’s implementation and use of a repeatable customer feedback mechanism
  4. The team’s ability to take action based on customer feedback
  5. The team’s ability to measure the impact of those actions

In designing the research, the team defined each answer to be binary, yes or no. Yes gets 2 points; no gets 1. There are no zero values in the scoring. They designed the answer structure this way so that, even though semi-structured interviews would yield thick data, that data could easily be used alongside the quantitative data.
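The scoring design above can be sketched as follows. The question keys are paraphrases of the five customer-centricity themes, and the sample answers are invented:

```python
# A minimal sketch of the binary scoring described above: "yes" scores
# 2, "no" scores 1, and no zero values exist. Question keys paraphrase
# the five themes; the example answers are assumptions.
SCORES = {"yes": 2, "no": 1}

QUESTIONS = [
    "can_state_audience",
    "can_state_purpose",
    "has_feedback_mechanism",
    "acts_on_feedback",
    "measures_impact",
]

def score_team(answers):
    """Sum binary scores across the five customer-centricity questions."""
    return sum(SCORES[answers[q]] for q in QUESTIONS)

answers = {
    "can_state_audience": "yes",
    "can_state_purpose": "yes",
    "has_feedback_mechanism": "yes",
    "acts_on_feedback": "no",  # resources in place, but no action taken
    "measures_impact": "no",
}
print(score_team(answers))  # 8: three yeses at 2 points, two nos at 1
```

Because every answer is a 1 or a 2, the qualitative interview data slots directly into the same arithmetic as the five quantitative datasets.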

Given that human communication is nuanced, however, conversation rarely yields a clean yes or no. For example, sometimes a team may have the authority to take action, and the right talent on the team, but action isn’t taken. In that case the score for EDX Index question 4 is “no,” because even though the resources are in place, nothing is happening. Something is impeding action. This is the type of nuance that won’t be found in quantitative studies, or even in text-based qualitative ones, except possibly in open text fields.

Coding it into values of 2 and 1 is useful for scoring, but the EDX team leveraged the qualitative format to gather better data than would be possible using quantitative methods alone.

The risk of “double counting”

One common question is whether one data point counts in more than one theme. Since qualitative data is “thick data”,4 individual data points can count in more than one theme, as long as the decision to do so is defensible, replicable, and verifiable. One of the trickiest situations is when the researcher interprets the respondent’s information based on word choice, tone, or the answer’s cadence. Here’s an example:

A researcher is interviewing a veteran with mobility issues about the experience they have walking through a VA medical center parking lot. When the researcher asks the veteran, “Do you feel safe walking through the parking lot?” the veteran answers “Yes,” then hesitates, and adds in an embarrassed tone, “…But I try not to walk alone when it’s icy in winter.”

The researcher notes and documents that the veteran hesitated when they answered, and that their tone sounded embarrassed. The professionalism and good faith of the researcher is key here. Without registering the situational context (“icy”), the hesitation, and the tone, the data point becomes a simple “yes,” which is not fully accurate to the answer. But by including those points, the researcher’s experience interviewing and communicating adds a vital layer of nuance to the data.

When coding the data, the researcher confirms that their notes are accurate. For these reasons, the team codes the data point into:

  1. The “Feels safe” theme.
  2. A weather-related theme.
  3. A theme around veteran feelings of independence.

Through one simple statement, data supporting three themes emerges. This is the nature of data based on verbal, human communication: it’s layered and nuanced. If you have conducted qualitative research, that nuance must be registered in your data set. It’s not double counting; it’s breaking the data out into accurate layers of meaning.
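One way to keep multi-theme coding defensible, replicable, and verifiable is to log each data point–theme pair with the evidence behind it. A minimal sketch, with hypothetical field names, identifiers, and theme labels:

```python
# A minimal sketch of a coding log for the parking-lot example above:
# one row per (data point, theme) pair, each recording its evidence,
# exported as CSV for a spreadsheet. All identifiers are assumptions.
import csv
import io

coding_log = [
    {"data_point": "interview-07: feels safe in parking lot",
     "theme": "Feels safe",
     "evidence": "answered yes"},
    {"data_point": "interview-07: feels safe in parking lot",
     "theme": "Weather-related barriers",
     "evidence": "avoids walking alone when icy"},
    {"data_point": "interview-07: feels safe in parking lot",
     "theme": "Independence",
     "evidence": "hesitation and embarrassed tone, per notes"},
]

def to_csv(rows):
    """Write the coding log as CSV, one row per (data point, theme) pair."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["data_point", "theme", "evidence"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(coding_log))
```

Counting rows per theme in a log like this yields the straight count, while the evidence column preserves the reasoning that makes each assignment verifiable by others.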

Review of key concepts

Assigning numerical value to non-numeric data is a serious proposition, but not one that should dissuade you from treating qualitative data with the dignity and importance it deserves for illuminating your big why. Key concepts to keep in mind are:

  1. Ensuring that your coding logic is defensible, replicable, and verifiable.
  2. Documenting or gaining access to the documentation underpinning the core data points.
  3. Coding by hand as far as you can, and reaching out to experts when the amount of data or the analysis of it outstrips your generalist’s skill set.

Footnotes

  1. van Grootel L, Balachandran Nair L, Klugkist I, van Wesel F. Quantitizing findings from qualitative studies for integration in mixed methods reviewing. Res Synth Methods. 2020 May;11(3):413-425. doi: 10.1002/jrsm.1403. Epub 2020 Mar 15. PMID: 32104971; PMCID: PMC7317911.

  2. Johnson BD, Dunlap E, Benoit E. Organizing “mountains of words” for data analysis, both qualitative and quantitative. Subst Use Misuse. 2010 Apr;45(5):648-70. doi: 10.3109/10826081003594757. Erratum in: Subst Use Misuse. 2010 Jun;45(7-8):1279. PMID: 20222777; PMCID: PMC2838205.

  3. For more specifics on this work, please see Meyers, A and Monroe, A. Determining the true value of a website: A case study. 16 April 2024.

  4. Wang, T. Why Big Data Needs Thick Data. Ethnography Matters. 13 May 2013.
