Draft Synthetic Data Manifesto | NewMR


Ray Poynter, 30 September, 2024


Manifesto for Synthetic Data
In this note I set out what I believe Synthetic Data to be, why we need to define it, and some guidelines that I think vendors and buyers should adopt. I have been involved in a wide range of discussions with a wide range of organisations, but these views are my own; they do not represent the views of anybody else.

Scope – humans only
In this note I am only talking about the situation where data has been created to replace or augment data from or about people. Synthetic data could be created for other purposes, for example to represent companies and organisations, but this note does not cover anything except the case of synthetic data in the context of information relating to humans.

Why we need to define Synthetic Data
Synthetic data is already being sold, purchased and used, so a ‘wait-and-see’ approach is not appropriate. Buyers and users of research need and want to know what data and processes are being used to create the results they are using. If the data being used is not 100% unmodified data collected from real people, they should be told. Moreover, they should be told in ways that allow them to assess the reliability and validity of the results and advice being offered.

From a vendor’s point of view, the name used for data that has been constructed rather than collected does not matter. Indeed, a vendor might be well advised to avoid the term synthetic data from a marketing point of view. However, if a vendor is using data that is not data collected from real people, they should, in my opinion, declare it. Defining synthetic data provides a way of flagging that vendors and buyers of information are dealing with a category that needs declaring.

A Definition of Synthetic Data
Data that has been created to replace data that could or would otherwise have been collected from humans.

Note that this definition does not say how the data has been created; it does not assume that AI, LLMs, or any particular algorithm was used. Synthetic data is data that has been created instead of, or in addition to, collecting it from people.

Examples of Synthetic Data

This list is not exhaustive, but it is intended to be helpful.

  • Synthetic Survey Responses. A data set where some or all of the data has been created. Variations include:
    • 100% of the data has been created.
    • Some of the cases have been created, e.g. to compensate for under-sampling some groups.
    • Some of the cells were created, e.g. to compensate for missing responses.
    • Some of the fields were created, e.g. to add information that had not been collected from the research participants.
  • Synthetic Personas. AI entities that are created so that questions, often qualitative questions, can be asked of the personas.
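
The survey-response variations above can be told apart mechanically if every cell carries a provenance flag. The following is a minimal sketch, not an established method. It assumes a hypothetical representation in which each case is a dict mapping a field name to a (value, created) pair; the function name `describe_created_data` and the keys it returns are illustrative only.

```python
from typing import Any

# Hypothetical representation (an assumption for this sketch): each case
# maps a field name to a (value, created) pair, where created=True means
# the value was constructed rather than collected from a person.
Case = dict[str, tuple[Any, bool]]


def describe_created_data(cases: list[Case]) -> dict[str, Any]:
    """Summarise how much of a data set was created, and in what way."""
    total_cells = sum(len(case) for case in cases)
    created_cells = sum(
        1 for case in cases for _, created in case.values() if created
    )

    # A case counts as created when every one of its cells is created.
    created_cases = [
        i for i, case in enumerate(cases)
        if case and all(created for _, created in case.values())
    ]

    # A field counts as created when it is created in every case that has it.
    fields = {name for case in cases for name in case}
    created_fields = sorted(
        name for name in fields
        if all(case[name][1] for case in cases if name in case)
    )

    # Remaining created cells: neither in a created case nor a created field.
    isolated_cells = sum(
        1
        for i, case in enumerate(cases)
        for name, (_, created) in case.items()
        if created and i not in created_cases and name not in created_fields
    )

    return {
        "share_created": created_cells / total_cells if total_cells else 0.0,
        "created_cases": created_cases,
        "created_fields": created_fields,
        "isolated_created_cells": isolated_cells,
    }


# Made-up example data: one created field, one created case, one filled-in cell.
example = [
    {"age": (34, False), "brand": ("A", False), "segment": ("x", True)},
    {"age": (51, False), "brand": ("B", True), "segment": ("y", True)},
    {"age": (29, True), "brand": ("A", True), "segment": ("x", True)},
]
summary = describe_created_data(example)
```

A summary like this gives a buyer a starting point for asking which of the variations apply; it says nothing about how the created values were produced.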

Guidelines for anybody providing data or findings that include or are based on Synthetic Data

  1. Tell the buyer/user if the data is not 100% raw, unmodified responses from real humans.
  2. Explain the extent and nature of the created data.
  3. Outline the theoretical background to the approach you have used.
  4. Outline the limitations, biases, and risks in this approach.
  5. If you have used AI to create data, draw the buyer/user’s attention to the 20 AI questions developed by ESOMAR.
  6. Help the user/buyer assess the validity and reliability of the data and results.
  7. Keep updating your estimates of the accuracy and reliability of the approaches over time.
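
One way a vendor might operationalise the guidelines above is to keep a disclosure record alongside each deliverable and check it for gaps before release. This is a sketch under stated assumptions: the class name `Disclosure`, its field names, and the `missing_items` method are all hypothetical and are not drawn from ESOMAR or any other standard.

```python
from dataclasses import dataclass


@dataclass
class Disclosure:
    """Hypothetical disclosure record; fields mirror guidelines 1 to 7."""

    all_raw_human_responses: bool            # guideline 1
    extent_and_nature: str = ""              # guideline 2
    theoretical_background: str = ""         # guideline 3
    limitations_biases_risks: str = ""       # guideline 4
    ai_used: bool = False                    # guideline 5 applies only if True
    esomar_ai_questions_referenced: bool = False
    validity_and_reliability_notes: str = ""  # guideline 6
    last_accuracy_review: str = ""           # guideline 7, e.g. a date

    def missing_items(self) -> list[int]:
        """Return the guideline numbers that still need attention."""
        if self.all_raw_human_responses:
            # Nothing was created or modified, so nothing needs declaring.
            return []
        text_fields = {
            2: self.extent_and_nature,
            3: self.theoretical_background,
            4: self.limitations_biases_risks,
            6: self.validity_and_reliability_notes,
            7: self.last_accuracy_review,
        }
        missing = [n for n, text in text_fields.items() if not text.strip()]
        if self.ai_used and not self.esomar_ai_questions_referenced:
            missing.append(5)
        return sorted(missing)


# Made-up example: only the extent of created data has been documented.
draft = Disclosure(
    all_raw_human_responses=False,
    extent_and_nature="20% of cases created to offset under-sampling",
    ai_used=True,
)
gaps = draft.missing_items()
```

A non-empty result only shows that a field is blank; whether the text in a filled field is adequate remains a matter of judgement.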

Are there better names than Synthetic Data?
I am sure there are hundreds of better names for synthetic data. For example, people have discussed the merits of terms like synthetic respondents and virtual participants. However, at the moment, the market has settled on the term Synthetic Data, so that is the term I use when describing this approach. If and when the market moves to another term, so will I.

Draft?
This document is the result of many discussions and lots of reading. However, I am sure that others can improve it. I would love to hear your suggestions.


