Review for Thesis Yi Yang

[Thesis Yi Yang]

Truth Discovery in Streaming Data and Crowdsourcing Applications

(Doctoral Thesis, 2019)

"In this thesis, I focus on the truth discovery models to assess data veracity. ... This thesis advances truth discovery in applications where data is collected from data streams and crowdsourcing applications, specifically studies how to use object correlation in streaming data truth discovery and how to improve the accuracy and efficiency of streaming data truth discovery."

Chapter 4

Dynamic Source Weight Computation for Truth Inference over Data Streams

On the other hand, many existing truth discovery methods designed for streaming data achieve high efficiency but sacrifice accuracy.

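The proposed solution is Dynamic Source Weight Computation (DSWC). A threshold is set first; when the unit error falls below this threshold, the iterative weight computation is skipped. Because the source weights are unknown, a prediction model is needed to predict them.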

Related Work

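First, the problem definition: suppose we have $J$ objects and $I$ users. At timestamp $p$, each object $j$ is observed by a set of users $I_j^p$, where $I_j^p \subseteq I$. We use $x_{ij}^p$ to denote the value of object $j$ observed by user $i$ at timestamp $p$. The truth at that timestamp is denoted by $z_j^p$.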

The most common approach is weighted aggregation. The high-level view is:

$z_j^p = \frac{\sum_{i\in I_j^p} a_i^p \times x_{ij}^p + c}{\sum_{i\in I_j^p} a_i^p + b}$
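To make the aggregation formula concrete, here is a minimal Python sketch for a single object at a single timestamp. The function name `weighted_truth`, the dictionary layout, and the default values of `b` and `c` are assumptions made for illustration; the notes do not say how the thesis sets these constants.

```python
def weighted_truth(observations, weights, b=0.0, c=0.0):
    """Estimate z_j^p for one object.

    observations: {user: value} for the users in I_j^p (those who observed j).
    weights:      {user: weight a_i^p}.
    b, c:         the constants from the formula; defaulting them to 0 is an
                  assumption of this sketch.
    """
    numerator = sum(weights[i] * x for i, x in observations.items()) + c
    denominator = sum(weights[i] for i in observations) + b
    return numerator / denominator


# Three users report a value for the same object; u1 is trusted twice as much.
obs = {"u1": 10.0, "u2": 12.0, "u3": 20.0}
w = {"u1": 2.0, "u2": 1.0, "u3": 1.0}
z = weighted_truth(obs, w)  # (2*10 + 12 + 20) / (2 + 1 + 1) = 13.0
```

Only users who actually reported a value contribute to either sum, which matches the restriction to $I_j^p$ in the formula.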

$a_i^p$ denotes the weight of user $i$, which reflects that user's reliability. The following are different update strategies:

CRH¹: $a_i^p = \log\frac{\sum_{i'\in I}\sum_{j\in J_{i'}^p}(x_{i'j}^p - z_j^p)^2}{\sum_{j\in J_i^p}(x_{ij}^p - z_j^p)^2}$

DyOP²: $a_i^p = \frac{|J_i^p|}{\sum_{j\in J_i^p}(x_{ij}^p - z_j^p)^2}$

GTM³: $a_i^p = \frac{2(\beta_1 + 1) + |J_i^p|}{2\beta_2 + \sum_{j\in J_i^p}(x_{ij}^p - z_j^p)^2}$ (where $\beta_1$ and $\beta_2$ are hyperparameters of the Inverse-Gamma distribution)
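The three update rules differ only in how they turn a user's squared error into a weight, which a short Python sketch can show side by side. It reads $J_i^p$ as the set of objects user $i$ reported at timestamp $p$. The function names, the dictionary layout, and the `EPS` guard against a zero loss are assumptions of this sketch, not part of the cited methods.

```python
import math

EPS = 1e-12  # guard against a zero loss; an assumption, not in the formulas


def squared_loss(claims, truths):
    """Sum of (x_ij - z_j)^2 over the objects this user reported (J_i^p)."""
    return sum((x - truths[j]) ** 2 for j, x in claims.items())


def crh_weights(all_claims, truths):
    """CRH: log(total loss of all users / loss of this user)."""
    losses = {i: squared_loss(c, truths) for i, c in all_claims.items()}
    total = max(sum(losses.values()), EPS)
    return {i: math.log(total / max(loss, EPS)) for i, loss in losses.items()}


def dyop_weight(claims, truths):
    """DyOP: number of reported objects divided by the squared loss."""
    return len(claims) / max(squared_loss(claims, truths), EPS)


def gtm_weight(claims, truths, beta1, beta2):
    """GTM: the same ratio, smoothed by the Inverse-Gamma hyperparameters."""
    return (2 * (beta1 + 1) + len(claims)) / (2 * beta2 + squared_loss(claims, truths))


truths = {"a": 2.0, "b": 2.0}
all_claims = {
    "u1": {"a": 1.0, "b": 2.0},  # squared loss 1
    "u2": {"a": 4.0, "b": 2.0},  # squared loss 4
}
crh = crh_weights(all_claims, truths)
dyop = {i: dyop_weight(c, truths) for i, c in all_claims.items()}
gtm = {i: gtm_weight(c, truths, beta1=1.0, beta2=1.0) for i, c in all_claims.items()}
```

In all three rules the user with the smaller squared error (`u1`) receives the larger weight. CRH needs the losses of every user at once, while DyOP and GTM can be computed per user.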

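Iterative methods can achieve high accuracy, but the iterative process is computationally expensive. DynaTD, DynaTD+s, and iCRH were proposed for streaming data scenarios. None of them iterates: each computes the result only once per timestamp instead of running until convergence.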
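The trade-off between the two families can be illustrated with a generic Python sketch. It is not an implementation of DynaTD, DynaTD+s, or iCRH; it only contrasts "alternate until convergence" with "one pass per timestamp". It uses a DyOP-style weight update for concreteness, and all function names, the tolerance `tol`, the iteration cap `max_iter`, and the `eps` guard are assumptions.

```python
def estimate_truths(claims, weights):
    """Weighted average per object; claims is {user: {object: value}}."""
    num, den = {}, {}
    for i, user_claims in claims.items():
        for j, x in user_claims.items():
            num[j] = num.get(j, 0.0) + weights[i] * x
            den[j] = den.get(j, 0.0) + weights[i]
    return {j: num[j] / den[j] for j in num}


def update_weights(claims, truths, eps=1e-12):
    """DyOP-style weight: |J_i| / squared error (eps guard is an assumption)."""
    return {
        i: len(c) / max(sum((x - truths[j]) ** 2 for j, x in c.items()), eps)
        for i, c in claims.items()
    }


def iterative_td(claims, tol=1e-6, max_iter=100):
    """Alternate truth and weight updates until the truths stop changing."""
    weights = {i: 1.0 for i in claims}
    truths = estimate_truths(claims, weights)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        weights = update_weights(claims, truths)
        new_truths = estimate_truths(claims, weights)
        change = max(abs(new_truths[j] - truths[j]) for j in truths)
        truths = new_truths
        if change < tol:
            break
    return truths, weights, n_iter


def one_pass_td(claims, prev_weights):
    """Single pass: reuse the previous weights once, then update them once."""
    truths = estimate_truths(claims, prev_weights)
    return truths, update_weights(claims, truths)


claims = {
    "u1": {"a": 10.0, "b": 20.0},
    "u2": {"a": 11.0, "b": 21.0},
    "u3": {"a": 30.0, "b": 5.0},  # an outlier source
}
iter_truths, iter_weights, n_iter = iterative_td(claims)
pass_truths, pass_weights = one_pass_td(claims, {i: 1.0 for i in claims})
```

With uniform starting weights, the single pass returns the plain average for object `a`, whereas the iterative version moves the estimate toward the two sources that agree, at the cost of several passes over the data.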

The consequence of adopting this approach is that the incremental methods cannot compute accurate source weights at each timestamp, which results in large errors when inferring object truths.

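To balance efficiency and accuracy, Li et al. proposed ASRA⁴ (adaptive source reliability assessment scheme). It first analyzes the weights from the previous timestamp: if the resulting error is below a threshold, those weights are used to estimate the current truths; otherwise the weights are updated iteratively. However, ASRA has several limitations: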

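ASRA assumes that all users are online at every timestamp, i.e., $\forall t\in[1,T], |I_j^t|=|I|$
ASRA's evolution estimation model cannot guarantee that the weights converge at each timestamp
ASRA cannot make use of prior knowledge (note: this point mainly serves to set up the author's own contribution)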

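Therefore, we propose the DSWC scheme. First, define the unit error $\phi_j^{p/q}$, which denotes the deviation of the truth estimated at time $q$ using the weights from time $p$. It is defined as follows: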

$\phi_j^{p/q} = (\frac{z_j^p - z_j^{p/q}}{m_j^q})^2$

where $m_j^q = \max\{x_{ij}^q\}_{i\in I_j^q}$

Note

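In these notes, $i$ denotes a user and $j$ denotes an object, so $x_{ij}^q$ is the value of object $j$ observed by user $i$ at time $q$, and $m_j^q$ is the largest value observed for object $j$ at time $q$.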

If $\phi_j^{p/q} < \epsilon$, the weights $\{a_i^p\}$ are used directly to estimate the truth at time $q$.
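The unit-error test can be sketched in Python as follows. The sketch follows the formula exactly as transcribed in these notes, comparing $z_j^p$ with the estimate $z_j^{p/q}$ obtained by applying the time-$p$ weights to the time-$q$ observations. Requiring every object to pass the threshold before the weights are reused is an assumption of the sketch, as are the function names and data layout; it also assumes $m_j^q \neq 0$.

```python
def weighted_truth(obs, weights):
    """Plain weighted average over the users who reported a value."""
    return sum(weights[i] * x for i, x in obs.items()) / sum(weights[i] for i in obs)


def unit_error(z_p, z_pq, m_q):
    """phi_j^{p/q} = ((z_j^p - z_j^{p/q}) / m_j^q)^2, assuming m_j^q != 0."""
    return ((z_p - z_pq) / m_q) ** 2


def can_reuse_weights(prev_truths, prev_weights, obs_q, epsilon):
    """Return (reuse?, per-object unit errors).

    obs_q is {object: {user: value}} at time q. Reusing the weights only when
    every object is below epsilon is an assumption of this sketch.
    """
    errors = {}
    for j, obs in obs_q.items():
        z_pq = weighted_truth(obs, prev_weights)  # time-q truth from time-p weights
        m_q = max(obs.values())                   # largest value observed for j at q
        errors[j] = unit_error(prev_truths[j], z_pq, m_q)
    return all(e < epsilon for e in errors.values()), errors


prev_weights = {"u1": 2.0, "u2": 1.0}
prev_truths = {"a": 10.0}
obs_q = {"a": {"u1": 10.0, "u2": 13.0}}

# z^{p/q} = (2*10 + 13) / 3 = 11, m = 13, so phi = (1/13)^2, about 0.0059.
reuse_loose, errors = can_reuse_weights(prev_truths, prev_weights, obs_q, epsilon=0.01)
reuse_strict, _ = can_reuse_weights(prev_truths, prev_weights, obs_q, epsilon=0.001)
```

With the looser threshold the previous weights are reused and the iterative update is skipped; with the stricter one the check fails and the weights would have to be recomputed.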

To avoid iteration, we need a model that predicts the current weights.

Four Research Questions:

1. How to use object relations in truth discovery in a dynamic environment
   - How to model object relations
   - How to efficiently discover object truths
2. How to achieve both high accuracy and high efficiency for streaming data truth discovery
   - How to discover object truths efficiently
   - How to further improve streaming data truth discovery efficiency while achieving high accuracy
3. How to model humans' own characteristics in the truth discovery model
   - How to model humans' guessing behaviors
   - How to better model humans' labeling process
4. How to effectively use a small amount of ground truths to better aggregate continuous object truths

Outlines:

Probabilistic Truth Discovery with Object Correlations (PTDCorr)
Dynamic Source Weight Computation Truth Discovery (DSWC)
Crowdsourced Truth Discovery modeling Guessing and task Difficulty (CTDGD)
Confusion-aware Truth Inference (CTI)
Optimization-based Semi-supervised Truth Discovery (OpSTD)

PTDCorr, iPTDCorr, DSWC => data streams
CTDGD, CTI => sources are human
OpSTD => when a small set of ground truths are available

Three Frameworks:

Iterative Framework
Optimization Framework
Probabilistic Graphical Model Framework

Methods for Static Data:

1. Li, Q., Li, Y., Gao, J., Zhao, B., Fan, W. & Han, J. (2014). Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 1187–1198).
2. Li, Y., Li, Q., Gao, J., Su, L., Zhao, B., Fan, W. & Han, J. (2015). On the discovery of evolving truth. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 675–684).
3. Zhao, B. & Han, J. (2012). A probabilistic model for estimating real-valued truth from conflicting sources. Proc. of QDB.
4. Li, T., Gu, Y., Zhou, X., Ma, Q. & Yu, G. (2017). An effective and efficient truth discovery framework over data streams. doi: 10.5441/002/edbt.2017.17.