Trusting a Distributed Data Pipeline

Duration: 40 minutes


Richard Guy

Principal Applied Researcher


Specializing in data mining, modeling, and engineering, Richard works with a team of data mining researchers at Microsoft.

About The Session

Conclusions you draw from data are valid only if they rest on a correct interpretation of your data set. In many organizations, the responsibility for collecting and aggregating data is distributed across teams, so it can be hard to ensure that everyone who uses a data set understands the limitations of the signals the pipeline produces.

As an example, many companies make important decisions about which events constitute an “active user,” and these decisions are reflected in the pipeline code. Changes to a pipeline may not be communicated to all downstream users, leading to misinformed conclusions even from correctly executed analyses.
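
To make the "active user" example concrete, here is a minimal sketch of how such a decision can live inside pipeline code. Everything in it is hypothetical and chosen for illustration: the event types, the two definitions (`ACTIVE_KINDS_V1`, `ACTIVE_KINDS_V2`), and the `active_users` helper do not come from the talk. It assumes a producer quietly stops counting a bare login as activity, while a downstream consumer keeps running the same analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    kind: str  # hypothetical event types: "login", "page_view", "purchase"


# Hypothetical definitions of "active". The business decision is implicit
# in which event kinds appear in each set.
ACTIVE_KINDS_V1 = frozenset({"login", "page_view", "purchase"})
ACTIVE_KINDS_V2 = frozenset({"page_view", "purchase"})  # logins no longer count


def active_users(events, active_kinds):
    """Return the set of user IDs with at least one qualifying event."""
    return {e.user_id for e in events if e.kind in active_kinds}


events = [
    Event("alice", "login"),
    Event("bob", "page_view"),
    Event("carol", "purchase"),
    Event("alice", "login"),
]

# Same raw events, same downstream analysis, different answers.
count_v1 = len(active_users(events, ACTIVE_KINDS_V1))
count_v2 = len(active_users(events, ACTIVE_KINDS_V2))
print(count_v1, count_v2)
```

In this sketch the consumer's counting logic is correct under both definitions; only the meaning of "active" has changed, which is the kind of silent shift the session is concerned with.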

In this talk, Richard will share three key questions to help ensure that you are interpreting your data correctly and drawing accurate conclusions.

Key Questions

  • For data consumers: what business decisions are implicit rather than explicit in the data that you are using?
  • For data producers: who is using your data, and are they aware of changes that you make?
  • For organizations: how does your organization prevent unintentional changes to the meaning of a data set?