Research Tracks
During the Conference
If you have any problems during a session and are unable to reach us via Zoom or Chime, please send us a message via Facebook Messenger or Whova. Our staff will be available online to answer your questions.
The Web Conference 2020 Staff
How to attend The Web Conference 2020 online
Please be reminded that all times shown in the program are in CST (China Standard Time), UTC/GMT +8. You may use the Time Zone Converter to convert CST to your local time.
These are the instructions on how to attend the conference online, together with guidelines for Session Chairs and presenters. Please click the link here to preview or download them.
Each event and program will be hosted in Zoom; Chime is only a backup.
The Web Conference 2020 Staff
Research Tracks (1)
Web Mining-A (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | Off-policy Learning in Two-stage Recommender Systems. Jiaqi Ma (University of Michigan), Zhe Zhao (Google), Xinyang Yi (Google), Ji Yang (Google), Minmin Chen (Google), Jiaxi Tang (Simon Fraser University), Lichan Hong (Google) and Ed H. Chi (Google).
Abstract: Recommender systems in industrial production often need to serve billions of users with a million-level candidate item space to recommend from. Moreover, the systems are required to respond to users' requests in real time, within milliseconds. The large scale and the strict latency requirements have led to numerous technical challenges. One major challenge is how to serve users efficiently with highly personalized content. To achieve this goal, a two-stage approach is widely used, where an efficient candidate generation model generates a candidate set of hundreds of items from the whole item space at the first stage, and then, at the second stage, a more powerful ranking model re-ranks the candidate items and recommends the top few items to the user. Another major challenge is how to get enough labeled data to train such large-scale recommender systems. Fortunately, abundant logged user feedback (e.g., user clicks or dwell time) generated by historical recommender systems is available and commonly used as training data. However, such data are inherently biased because the feedback can only be observed on items recommended by the historical systems. Researchers have therefore applied off-policy correction to the learning of recommender systems to address such biases. However, most existing work only studied off-policy correction on single-stage systems. In this work, we demonstrate that naively applying the existing off-policy correction methods to two-stage recommender systems is sub-optimal, and we propose an efficient two-stage off-policy policy gradient method to correct the bias in two-stage systems. We also conduct experiments on real-world datasets with large action spaces and show the effectiveness of the proposed method.
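As a concrete illustration of the off-policy correction idea described in this abstract, the snippet below sketches a single-stage importance-weighted (REINFORCE-style) objective; it is a minimal NumPy example with invented variable names, not the authors' two-stage method.

```python
import numpy as np

def off_policy_objective(new_probs, logged_propensities, actions, rewards, cap=10.0):
    """Importance-weighted objective whose gradient w.r.t. the policy parameters
    is the off-policy-corrected policy gradient.

    new_probs:           (N, A) action probabilities under the policy being trained
    logged_propensities: (N,)   probability the historical (logging) policy assigned
                                to the action it actually recommended
    actions:             (N,)   indices of the logged actions
    rewards:             (N,)   observed feedback (e.g., click = 1, no click = 0)
    """
    pi_new = new_probs[np.arange(len(actions)), actions]
    # Importance weights correct for feedback being observed only on items
    # that the historical policy chose to show; capping limits variance.
    weights = np.minimum(pi_new / np.maximum(logged_propensities, 1e-8), cap)
    return float(np.mean(weights * rewards * np.log(np.maximum(pi_new, 1e-8))))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=8)       # current policy over 5 candidate items
logged = rng.uniform(0.05, 0.5, size=8)         # logging-policy propensities
acts = rng.integers(0, 5, size=8)
clicks = rng.integers(0, 2, size=8)
print(off_policy_objective(probs, logged, acts, clicks))
```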
11:30-12:00 | Efficient Neural Interaction Function Search for Collaborative Filtering. Quanming Yao (4Paradigm), Xiangning Chen (University of California, Los Angeles), James Kwok (The Hong Kong University of Science and Technology), Yong Li (Tsinghua University) and Cho-Jui Hsieh (University of California, Los Angeles).
Abstract: The interaction function (IFC), which captures interactions among items and users, is of great importance in collaborative filtering (CF). The inner product is the most popular IFC due to its success in low-rank matrix factorization. However, interactions in real-world applications can be highly complex. Many other operations (such as plus and concatenation) have also been proposed, and can possibly offer better performance than the inner product. In this paper, motivated by the success of automated machine learning, we propose to search for proper interaction functions (SIF) for CF tasks. We first design an expressive search space for SIF by reviewing and generalizing existing CF approaches. We then propose to represent the search space as a structured multi-layer perceptron and design a stochastic gradient descent algorithm that can simultaneously update both architectures and learning parameters. Experimental results demonstrate that the proposed method can be much more efficient than popular AutoML approaches, and also obtains much better prediction performance than state-of-the-art CF approaches.
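To make the notion of an interaction-function search space more tangible, here is a small NumPy sketch of three candidate interaction functions (inner product, element-wise plus, and concatenation) of the kind such a space generalizes over; the vectors and weights are toy values, and this is not the SIF search procedure itself.

```python
import numpy as np

def inner(u, v):              # classic matrix-factorization interaction
    return float(u @ v)

def plus(u, v, w):            # element-wise sum projected to a score
    return float((u + v) @ w)

def concat(u, v, w):          # concatenation followed by a linear projection
    return float(np.concatenate([u, v]) @ w)

rng = np.random.default_rng(1)
d = 8
u, v = rng.normal(size=d), rng.normal(size=d)        # user / item embeddings
w_plus, w_cat = rng.normal(size=d), rng.normal(size=2 * d)

# A search procedure would score learned variants of these candidates on
# validation data rather than merely printing them.
for name, score in [("inner", inner(u, v)),
                    ("plus", plus(u, v, w_plus)),
                    ("concat", concat(u, v, w_cat))]:
    print(name, round(score, 4))
```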
12:00-12:30 | Clustering and Constructing User Coresets to Accelerate Large-scale Top-K Recommender Systems. Jyun-Yu Jiang (University of California, Los Angeles), Patrick H. Chen (University of California, Los Angeles), Cho-Jui Hsieh (University of California, Los Angeles) and Wei Wang (University of California, Los Angeles).
Abstract: Top-K recommender systems aim to generate few but satisfactory personalized recommendations for various practical applications, such as item recommendation for e-commerce and link prediction for social networks. However, the numbers of users and items can be enormous, leading to myriad potential recommendations as well as a bottleneck in evaluating and ranking all possibilities. Existing Maximum Inner Product Search (MIPS) based methods treat the item ranking problem for each user independently, and the relationship between users has not been explored. In this paper, we propose a novel model for clustering and navigating for top-K recommenders (CANTOR) to expedite the computation of top-K recommendations based on latent factor models. A clustering-based framework is first presented to leverage user relationships to partition users into affinity groups, each of which contains users with similar preferences. CANTOR then derives a coreset of representative vectors for each affinity group by constructing a set cover with a theoretically guaranteed difference to the user latent vectors. Using these representative vectors in the coreset, approximate nearest neighbor search is then applied to obtain a small set of candidate items for each affinity group, to be used when computing recommendations for each user in the group. This approach can significantly reduce the computation without compromising the quality of the recommendations. Extensive experiments are conducted on six publicly available large-scale real-world datasets for item recommendation and personalized link prediction. The experimental results demonstrate that CANTOR significantly speeds up matrix factorization models with high precision. For instance, CANTOR achieves a 355.1x speedup for inferring recommendations in a million-user network with 99.5% precision@1 relative to the original system, while the state-of-the-art method only obtains a 93.7x speedup with 99.0% precision@1.
12:30-12:45 | Adaptive Hierarchical Translation-based Sequential Recommendation. Yin Zhang (Texas A&M University), Yun He (Texas A&M University), Jianling Wang (Texas A&M University) and James Caverlee (Texas A&M University).
Abstract: Existing sequential recommenders mainly focus on modeling sequential patterns using user activity sequences. However, purely sequence-based recommendation usually faces challenges in capturing general item relations that are not easily discovered from highly personalized user sequences. Hence, we propose a novel adaptive hierarchical translation-based recommendation model called HierTrans. Specifically, HierTrans first extends traditional item-level relations to the category level, to help capture dynamic sequence patterns that can generalize across users and time. Then, unlike item-level relation based methods, we build a novel hierarchical temporal graph that contains item multi-relations at the category level and user dynamic sequences at the item level, to facilitate capturing item multi-relations inside user dynamic sequences. Based on the graph, HierTrans adaptively aggregates the high-order multi-relations among items and dynamic user preferences to capture the dynamic joint influence for next-item recommendation. In particular, different from traditional translation-based recommenders that assume a user's translation vector is static and identical, the user translation vector in HierTrans can adaptively change based on both a user's previously interacted items and the item relations inside the user's sequences, as well as the user's personal dynamic preference. Experiments on public datasets demonstrate that the proposed model consistently outperforms state-of-the-art sequential recommendation methods and uncovers meaningful patterns in user sequences.
12:45-13:00 | Graph Enhanced Representation Learning for News Recommendation. Suyu Ge (Tsinghua University), Chuhan Wu (Tsinghua University), Fangzhao Wu (Microsoft), Tao Qi (Tsinghua University) and Yongfeng Huang (Department of Electronic Engineering, Tsinghua University).
Abstract: With the explosion of online news, personalized news recommendation has become increasingly important for online news platforms to help their users find information of interest. Existing news recommendation methods achieve personalization by building accurate news representations from news content and user representations from users' direct interactions with news (e.g., clicks), while ignoring the high-order relatedness between users and news. Here we propose a news recommendation method which can enhance the representation learning of users and news by modeling their relatedness in a graph setting. In our method, users and news are both viewed as nodes in a bipartite graph constructed from historical user click behaviors. For news representations, a transformer architecture is first exploited to build news semantic representations, which are then combined with information from neighboring news in the graph via a graph attention network. For user representations, we not only represent users from their historically clicked news, but also attentively incorporate the representations of their neighboring users in the graph. Experiments were conducted on a large-scale real-world dataset. The improved performance validates the effectiveness of our proposed method.
Social Network-A (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | TRAP: Two-level Regularized Autoencoder-based Embedding for Power-law Distributed Data. Dongmin Park (Korea Advanced Institute of Science and Technology), Hwanjun Song (Korea Advanced Institute of Science and Technology), Minseok Kim (Korea Advanced Institute of Science and Technology) and Jae-Gil Lee (Korea Advanced Institute of Science and Technology).
Abstract: Finding low-dimensional embeddings of sparse high-dimensional data objects is important in many applications such as recommendation, graph mining, and natural language processing (NLP). Recently, autoencoder (AE)-based embedding approaches have achieved state-of-the-art performance in many tasks, especially in top-k recommendation tasks with user embedding and node classification tasks with node embedding. However, we find that many real-world data follow the power-law distribution with respect to data object sparsity. When learning AE-based embeddings of these data, dense inputs move away from sparse inputs in the embedding space even when they are highly correlated. As a result, the embedding is distorted, which we call the polarization problem. In this paper, we propose TRAP, which leverages two-level regularizers to effectively alleviate this problem. (i) The macroscopic regularizer adds a regularization term to the loss function to generally prevent dense input objects from being distant from other sparse input objects. (ii) The microscopic regularizer introduces a new object-wise parameter to individually entice each object toward correlated neighbor objects rather than uncorrelated ones. Importantly, TRAP can be easily coupled with existing AE-based embedding methods through a simple modification. In extensive experiments on two representative embedding tasks using six real-world datasets, TRAP boosted the performance of the state-of-the-art algorithms by up to 31.53% and 94.99%, respectively.
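To illustrate the general flavor of a regularized autoencoder loss like the one sketched in this abstract, here is a generic NumPy example that adds a term pulling each object toward a correlated neighbor in embedding space; the formulation and names are ours, not the exact TRAP regularizers.

```python
import numpy as np

def regularized_embedding_loss(recon_err, emb, neighbor_idx, lam=0.1):
    """Autoencoder loss plus a generic embedding regularizer.

    recon_err:    (N,)   per-object reconstruction error from the autoencoder
    emb:          (N, d) current embeddings of the objects
    neighbor_idx: (N,)   index of one correlated neighbor per object
    lam:          weight of the regularization term

    The extra term discourages objects (e.g., dense inputs) from drifting away
    from correlated neighbors, the kind of distortion described above.
    """
    pull = np.sum((emb - emb[neighbor_idx]) ** 2, axis=1)
    return float(np.mean(recon_err) + lam * np.mean(pull))

rng = np.random.default_rng(2)
embeddings = rng.normal(size=(6, 4))
print(regularized_embedding_loss(rng.uniform(size=6), embeddings,
                                 rng.integers(0, 6, size=6)))
```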
11:30-12:00 | Graph Representation Learning via Graphical Mutual Information Maximization. Zhen Peng (Xi'an Jiaotong University), Wenbing Huang (Tsinghua University), Minnan Luo (Xi'an Jiaotong University), Qinghua Zheng (Xi'an Jiaotong University), Yu Rong (Tencent AI Lab), Tingyang Xu (Tencent AI Lab) and Junzhou Huang (University of Texas at Arlington).
Abstract: The richness of the content of information networks such as social networks and communication networks provides unprecedented potential for learning high-quality expressive representations without external supervision. This paper investigates how to preserve and extract the abundant information from graph-structured data into embedding space in an unsupervised manner. To this end, we propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computation from vector space to the graph domain, where measuring mutual information from the two aspects of node features and topological structure is indispensable. GMI exhibits several benefits: first, it is invariant to isomorphic transformations of input graphs, an inevitable constraint in many existing graph representation learning algorithms; second, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE; finally, our theoretical analysis confirms its correctness and rationality. With the aid of GMI, we develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder. Extensive experiments on transductive as well as inductive node classification and link prediction demonstrate that our method outperforms state-of-the-art unsupervised counterparts, and sometimes even exceeds the performance of supervised ones.
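Since the abstract mentions estimating GMI with MINE-style methods, the following toy NumPy sketch shows the Donsker-Varadhan lower bound such estimators maximize; in practice the critic is a trained neural network, whereas here a hand-picked critic on a simple discrete pair keeps the example short.

```python
import numpy as np

def dv_mi_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound used by MINE-style estimators:
    I(X; Y) >= E_joint[T(x, y)] - log E_marginals[exp(T(x, y))]."""
    return float(np.mean(joint_scores) - np.log(np.mean(np.exp(marginal_scores))))

rng = np.random.default_rng(3)
n = 20000
x = rng.choice([-1.0, 1.0], size=n)
y = np.where(rng.random(n) < 0.1, -x, x)   # y agrees with x 90% of the time
critic = lambda a, b: a * b                # hand-picked critic for illustration
joint = critic(x, y)                       # samples from the joint distribution
marginal = critic(x, rng.permutation(y))   # shuffling y breaks the dependence
print(dv_mi_lower_bound(joint, marginal))  # about 0.37 nats, close to the true MI here
```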
12:00-12:30 | Task-Oriented Genetic Activation for Large-Scale Complex Heterogeneous Graph Embedding. Zhuoren Jiang (Sun Yat-sen University), Zheng Gao (Indiana University Bloomington), Jinjiong Lan (Alibaba), Hongxia Yang (Alibaba Group), Yao Lu (Sun Yat-sen University) and Xiaozhong Liu (Indiana University Bloomington).
Abstract: The recent success of deep graph embedding has advanced methodologies for characterizing graphical information. However, in real-world applications, such methods still struggle with the challenges of heterogeneity, scalability, and multiplexity. To address these challenges, in this study, we propose a novel solution, Genetic hEterogeneous gRaph eMbedding (GERM), which enables flexible and efficient task-driven vertex embedding in a complex heterogeneous graph. Unlike prior efforts in this line of work, we employ a task-oriented genetic activation strategy to efficiently generate an “Edge Type Activated Vector” (ETAV) over the edge types in the graph. The generated ETAV can not only reduce incompatible noise and navigate the heterogeneous graph random walk at the graph-schema level, but also activate an optimized subgraph for efficient representation learning. By revealing the correlation between graph structure and task information, model interpretability can be enhanced as well. Meanwhile, an activated heterogeneous skip-gram framework is proposed to encapsulate both the topological and task-specific information of a given heterogeneous graph. Through extensive experiments on both scholarly and e-commerce datasets, we demonstrate the efficacy and scalability of the proposed methods via various search/recommendation tasks. GERM not only outperforms state-of-the-art models, but also significantly reduces running time.
12:30-12:45 | ROSE: Role-based Signed Network Embedding. Amin Javari (University of Illinois at Urbana-Champaign), Tyler Derr (Michigan State University), Pouya Esmalian (Sharif University), Jiliang Tang (Michigan State University) and Kevin Chang (University of Illinois at Urbana-Champaign).
Abstract: In real-world networks, nodes might have more than one type of relation. Signed networks are an important class of such networks, consisting of two types of relations: positive and negative. Recently, embedding signed networks has attracted increasing attention. In general, existing models rely on a path-based closeness measure defined based on social theories. However, this strategy has major drawbacks, including the incompleteness of such theories in explaining real-world signed networks. We propose a new approach for embedding signed networks that addresses these shortcomings by relying on a network transformation based strategy. The main idea is that rather than finding the similarities of two nodes based on the complex relationships/paths between them, we can find their similarities through simple paths/relationships between the different roles they carry. Based on this idea, the model can be described in three steps: (1) the input directed signed network is transformed into an undirected, unsigned bipartite network where each node is mapped to a set of nodes denoted as role-nodes, each of which captures a certain role that a node in the original network plays; (2) the network of role-nodes is embedded; (3) the original network is encoded by aggregating the embedding vectors of its role-nodes. According to our experiments, the proposed technique substantially outperforms the existing models on link prediction and label prediction tasks.
12:45-13:00 | Learning Temporal Interaction Graph Embedding via Coupled Memory Networks. Zhen Zhang (Zhejiang University), Jiajun Bu (Zhejiang University), Martin Ester (Simon Fraser University), Jianfeng Zhang (Alibaba Group), Chengwei Yao (Zhejiang University), Zhao Li (Alibaba Group) and Can Wang (Zhejiang University).
Abstract: With the increasing demand for mining rich knowledge from graph-structured data, graph embedding has become a research focus in both the academic and industrial communities due to its powerful capability. The majority of existing work learns node embeddings in the context of static, plain or attributed, homogeneous graphs. However, many real-world applications frequently involve bipartite graphs with temporal and attributed interaction edges, called temporal interaction graphs. The temporal interactions usually imply different facets of interest and may even evolve over time, posing huge challenges for learning effective node representations. Furthermore, most existing graph embedding models embed all the information of each node into a single vector representation, which is insufficient to characterize a node's multifaceted properties. In this paper, we propose a novel framework named TigeCMN to learn node representations from a sequence of temporal interactions. Specifically, we devise two coupled memory networks to store and update node embeddings in external matrices explicitly and dynamically, which forms deep matrix representations and can enhance the expressiveness of the node embeddings. Then, we generate each node embedding from two parts: a static embedding that encodes its stationary properties, and a dynamic embedding, induced from the memory matrix, that models its temporal interaction patterns. We conduct extensive experiments on various real-world datasets covering the tasks of node classification, recommendation and visualization. The experimental results empirically demonstrate that TigeCMN outperforms the state of the art by varying margins.
User Modeling-A (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | Learning to Hash with Graph Neural Networks for Recommender Systems. Qiaoyu Tan (Texas A&M University), Ninghao Liu (Texas A&M University), Xing Zhao (Texas A&M University), Hongxia Yang (Alibaba), Jingren Zhou (Alibaba) and Xia Hu (Texas A&M University).
Abstract: Graph representation learning has been extensively studied for recommender systems in recent years. Despite its effectiveness in generating continuous embeddings for objects in user-item interaction networks, the computational cost of inferring users' preferences over a large corpus of items is tremendous. To overcome the computational barriers, hashing is often adopted to facilitate efficient approximations of k-nearest-neighbor search. However, such approaches may not yield optimal hash codes to support high-quality retrieval, due to the separate learning of embedding and hashing. The joint learning of effective hashing and embedding still remains an open challenge. In this paper, we focus on the problem of hashing with graph neural networks (GNNs) for high-quality retrieval. We propose a simple yet effective discrete representation learning framework for jointly learning continuous and discrete codes. Specifically, a deep hashing approach with GNNs (HashGNN) is presented, which consists of two components: a GNN encoder for learning node representations, and a hash layer for encoding representations into hash codes. The whole architecture is trained end-to-end by jointly optimizing two losses, i.e., a reconstruction loss from reconstructing observed links, and a ranking loss from preserving the relative ordering of hash codes. A novel discrete optimization strategy based on the straight-through estimator (STE) with guidance is proposed. The key idea is to avoid gradient magnification in the back-propagation of STE with continuous embedding guidance: we begin by learning an easier network that mimics the continuous embedding and let it evolve during training, until it finally goes back to STE. Comprehensive experiments over three publicly available datasets and one real-world company dataset demonstrate that our model not only achieves performance comparable with its continuous counterpart but also runs multiple times faster during inference.
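For readers unfamiliar with the straight-through estimator (STE) mentioned above, the sketch below shows the plain (hard-tanh-clipped) variant in NumPy: the forward pass binarizes an embedding into hash codes, and the backward pass lets gradients flow through as if the sign were the identity. The paper's guided STE variant goes further; this is only the standard building block.

```python
import numpy as np

def ste_forward(z):
    """Binarize a continuous embedding into hash codes in {-1, +1}."""
    return np.where(z >= 0, 1.0, -1.0)

def ste_backward(z, grad_codes):
    """Straight-through estimator: treat sign() as a clipped identity so that
    gradients w.r.t. the codes flow back to the continuous embedding."""
    return grad_codes * (np.abs(z) <= 1.0)

z = np.array([0.3, -1.7, 0.05, -0.4])         # continuous node embedding
codes = ste_forward(z)                        # -> [ 1. -1.  1. -1.]
upstream = np.array([0.5, 0.1, -0.2, 0.4])    # dL/d(codes) from a ranking loss
print(codes, ste_backward(z, upstream))       # gradient is zeroed where |z| > 1
```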
11:30-12:00 | Learning the Structure of Auto-Encoding Recommenders. Farhan Khawar (The Hong Kong University of Science and Technology), Leonard Poon (The Education University of Hong Kong) and Nevin L. Zhang (The Hong Kong University of Science and Technology).
Abstract: Autoencoder-based recommenders have recently shown state-of-the-art performance in the recommendation task due to their ability to model non-linear item relationships effectively. However, existing autoencoder-based recommenders use fully-connected neural network layers and do not employ structure learning. This can lead to inefficient training, especially when the data is sparse, as is common in collaborative filtering, which in turn results in lower generalization ability and reduced performance. In this paper, we introduce structure learning for autoencoder recommenders by taking advantage of the inherent item groups present in the collaborative filtering domain. Due to the nature of items in general, we know that certain items are more related to each other than to other items. Based on this, we propose a method that first learns groups of related items and then uses this information to determine the connectivity structure of an auto-encoding neural network. This results in a network that is sparsely connected. The sparse structure can be viewed as a prior that guides the network training. Empirically, we demonstrate that the proposed structure learning enables the autoencoder to converge to a local optimum with a much smaller spectral norm and generalization error bound than the fully-connected network. The resulting sparse network considerably outperforms state-of-the-art methods such as Mult-VAE/Mult-DAE on multiple benchmark datasets. In particular, our method achieves more than 13% improvement over Mult-VAE across all metrics on the MSD dataset when the same number of parameters and flops are used. It also has better cold-start performance.
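A minimal sketch of how learned item groups can induce a sparsely connected autoencoder layer, as described above: a binary connectivity mask derived from group membership prunes the dense weight matrix. Group assignments, sizes, and the one-unit-per-group wiring are invented for the example.

```python
import numpy as np

item_group = np.array([0, 0, 1, 1, 1, 2])     # group id of each of 6 items (toy)
n_hidden = item_group.max() + 1               # here: one hidden unit per group

mask = np.zeros((len(item_group), n_hidden))  # item-to-hidden connectivity
mask[np.arange(len(item_group)), item_group] = 1.0

rng = np.random.default_rng(4)
W = rng.normal(size=mask.shape) * mask        # dense weights pruned by the mask
x = rng.integers(0, 2, size=len(item_group))  # one user's binary feedback vector
print(mask.astype(int))
print(x @ W)                                  # activations of the sparse layer
```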
12:00-12:30 | Directional and Explainable Serendipity Recommendation. Xueqi Li (Hunan University), Wenjun Jiang (Hunan University), Weiguang Chen (Hunan University), Jie Wu (Temple University), Guojun Wang (Guangzhou University) and Kenli Li (Hunan University).
Abstract: Serendipity recommendation has attracted more and more attention in recent years. It aims to provide recommendations that not only cater to users' preferences but also broaden their horizons. However, existing approaches usually measure user-item relevance with a scalar instead of a vector, ignoring the directionality of user preferences, which increases the risk of unrelated recommendations. To address this limitation, we propose a user-preference-aware and explainable serendipity recommendation method. Specifically, we (1) extract users' long-term preferences (which we call preference directions) with an unsupervised model, the Gaussian mixture model (GMM), and capture their short-term demands (which we call current demands) with a capsule network; (2) generate recommendations by combining preference directions with current demands; and (3) make the first attempt to provide explanations for serendipitous recommendations via a back-routing scheme. Extensive experiments on real-world datasets show that our approach effectively improves serendipity and explainability, and provides a gain in diversity compared with existing serendipity-based methods.
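As a small illustration of step (1) above, the snippet below fits a Gaussian mixture model to a toy set of item embeddings a user has interacted with; each component mean can be read as one long-term "preference direction". The dimensions and component count are arbitrary, and the capsule-network part for short-term demands is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Toy stand-in for embeddings of items one user interacted with over a long period.
history = np.vstack([rng.normal(loc=[0.0, 3.0], scale=0.3, size=(40, 2)),
                     rng.normal(loc=[4.0, 0.0], scale=0.3, size=(25, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(history)
print(gmm.means_)    # each component mean acts as one long-term preference direction
print(gmm.weights_)  # how strongly the user leans toward each direction
```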
12:30-12:45 | Attentive Sequential Model of Latent Intent for Next Item Recommendation. Md Mehrab Tanjim (University of California San Diego), Congzhe Su (Etsy), Ethan Benjamin (Etsy), Diane Hu (Etsy), Liangjie Hong (Etsy) and Julian McAuley (University of California San Diego).
Abstract: Users exhibit different intents across e-commerce services (e.g., discovering new items, purchasing gifts, etc.), which drive them to interact with a wide variety of items in multiple ways (e.g., click, add-to-cart, add-to-favorite, purchase). To give better recommendations, it is important to capture user intent in addition to considering users' historic interactions. However, these intents are by definition latent, as we observe only a user's interactions and not their underlying intent. To discover such latent intents, and use them effectively for recommendation, in this paper we propose an Attentive Sequential model of latent intent. Our model first learns item similarities from users' interaction histories via a self-attention layer, then uses a Temporal Convolutional Network layer to obtain a latent representation of the user's intent from her actions on a particular category. We use this representation to guide an attentive model to predict the next item. Results from our experiments show that our model can capture the dynamics of user behavior and preferences, leading to state-of-the-art performance across datasets from two major e-commerce platforms, namely Etsy and Alibaba.
12:45-13:00 | Exploiting Aesthetic Preference in Deep Cross Networks for Cross-domain Recommendation. Jian Liu (Soochow University), Pengpeng Zhao (Soochow University), Fuzhen Zhuang (Chinese Academy of Sciences), Yanchi Liu (Rutgers University), Victor S. Sheng (Texas Tech University), Jiajie Xu (Soochow University), Xiaofang Zhou (The University of Queensland) and Hui Xiong (Rutgers University, New Jersey).
Abstract: Visual aesthetics of products plays an important role in the decision process when purchasing appearance-first products, e.g., clothes. Indeed, a user's aesthetic preference, which serves as a personality trait and a basic requirement, is domain independent and can be used as a bridge between domains for knowledge transfer. However, existing work has rarely considered the aesthetic information in product images for cross-domain recommendation. To this end, in this paper, we propose new deep Aesthetic Cross-Domain Networks (ACDN), in which parameters characterizing personal aesthetic preferences are shared across networks to transfer knowledge between domains. Specifically, we first leverage an aesthetic network to extract aesthetic features. Then, we integrate these features into a cross-domain network to transfer users' domain-independent aesthetic preferences. Moreover, network cross-connections are introduced to enable dual knowledge transfer across domains. Finally, experimental results on real-world datasets show that our proposed model ACDN outperforms benchmark methods in terms of recommendation accuracy. The results also show that users' aesthetic preferences are effective in alleviating the data sparsity issue in cross-domain recommendation.
Society (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | Facebook Ads Monitor: An Independent Auditing System for Political Ads on Facebook. Márcio Silva (Universidade Federal de Mato Grosso do Sul), Lucas Santos de Oliveira (Universidade Estadual do Sudoeste da Bahia), Athanasios Andreou (EURECOM), Pedro Olmo Vaz de Melo (UFMG), Oana Goga (UPMC) and Fabricio Benevenuto (Federal University of Minas Gerais (UFMG)).
Abstract: The 2016 United States presidential election was marked by the abuse of targeted advertising on Facebook. Concerned with the risk of the same kind of abuse happening in the 2018 Brazilian elections, we designed and deployed an independent auditing system to monitor political ads on Facebook in Brazil. To do so, we first adapted a browser plugin to gather ads from the timelines of volunteers using Facebook. We managed to convince more than 2,000 volunteers to help our project and install our tool. Then, we used a Convolutional Neural Network (CNN) with word embeddings to detect political Facebook ads. To evaluate our approach, we manually label a collection of 20,000 ads as political or non-political and provide an in-depth evaluation of the proposed approach for identifying political ads, comparing it with classic supervised machine learning methods. Finally, we deployed a real system that shows the ads identified as related to politics. We then compare our detected political ads with the archive of all political ads provided by Facebook, and we notice that not all political ads were tagged as such on the platform. Our results imply that the decision about what is and is not a political ad should not be made by one platform alone, and they emphasize the need for independent auditing platforms.
11:30-12:00 | Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide. Daniele Rama (ISI Foundation), Kyriaki Kalimeri (ISI Foundation), Yelena Mejova (ISI Foundation), Michele Tizzoni (ISI Foundation) and Ingmar Weber (Qatar Computing Research Institute).
Abstract: In the global move toward urbanization, making sure the people remaining in rural areas are not left behind in terms of development and policy considerations is a priority for governments worldwide. However, it is increasingly challenging to track important statistics concerning this sparse, geographically dispersed population, resulting in a lack of reliable, up-to-date data. In this study, we examine the usefulness of the Facebook Advertising platform, which offers a digital "census" of over two billion of its users, in measuring potential rural-urban inequalities. We focus on Italy, a country where about 30% of the population lives in rural areas. First, we show that the population statistics that Facebook produces suffer from instability across time and incomplete coverage of sparsely populated municipalities. To overcome this limitation, we propose an alternative methodology for estimating Facebook Ads audiences that nearly triples the coverage of rural municipalities, from 19% to 55%, and makes fine-grained sub-population analysis feasible. Using official national census data, we evaluate our approach and confirm known significant urban-rural divides in terms of educational attainment and income. Extending the analysis to Facebook-specific user "interests" and behaviors, we provide further insights on the divide, for instance finding that rural areas show a higher interest in gambling. Notably, we find that the most predictive features of income in rural areas differ from those in urban centres, suggesting researchers need to consider a broader range of attributes when examining rural wellbeing. The findings of this study illustrate the necessity of improving existing tools and methodologies to include under-represented populations in digital demographic studies; failing to do so could result in misleading observations, conclusions, and, most importantly, policies.
12:00-12:30 | Social Interactions or Business Transactions? What customer reviews disclose about Airbnb marketplace. Giovanni Quattrone (Middlesex University), Antonino Nocera (University of Pavia), Licia Capra (University College London) and Daniele Quercia (King's College London).
Abstract: Airbnb is one of the most successful examples of sharing economy marketplaces. With rapid and global market penetration, understanding its attractiveness and evolving growth opportunities is key to planning business decision making. There is ongoing debate, for example, about whether Airbnb is a hospitality service that fosters social exchanges between hosts and guests, as the sharing economy manifesto originally stated, or whether it is (or is evolving into) a purely business transaction platform, the way hotels have traditionally operated. To answer these questions, a scalable market analysis approach is needed, affording platform owners the ability to easily examine their market over time and across different locations. In this paper, we propose to do so by means of a novel market analysis approach that exploits customers' reviews. Using a combination of thematic analysis and machine learning techniques, we first build a platform-specific dictionary of themes and sub-themes discussed in guests' reviews. Using quantitative linguistic analysis based on this dictionary, we then illustrate how to answer a variety of market research questions, at fine levels of thematic, temporal and spatial granularity.
12:30-12:45 | Analyzing the Use of Audio Messages in WhatsApp Groups. Alexandre Maros (UFMG), Jussara Almeida (UFMG), Fabrício Benevenuto (UFMG) and Marisa Vasconcelos (IBM).
Abstract: WhatsApp is a free messaging app with more than one billion monthly active users that has become one of the main communication platforms in many countries, including Saudi Arabia, Germany, and Brazil. In addition to allowing the direct exchange of messages among pairs of users, the app also enables group conversations, where multiple people can interact with one another. A number of recent studies have shown that WhatsApp groups play an important role as an information dissemination platform, especially during important social mobilization events. In this paper, we build upon those prior efforts by taking a first look into the use of audio messages in WhatsApp groups, a type of content that is becoming increasingly important on the platform. We present a methodology to analyze audio messages shared in WhatsApp groups, characterizing their content properties (e.g., topics and language characteristics), their propagation dynamics, and the impact of different types of audio (e.g., speech versus music) on such dynamics.
12:45-13:00 | Using Facebook Data to Measure Cultural Distance between Countries: the Case of Brazilian Cuisine. Carolina Vieira (UFMG), Filipe Ribeiro (UFOP), Pedro Olmo Vaz de Melo (UFMG), Fabricio Benevenuto (Federal University of Minas Gerais (UFMG)) and Emilio Zagheni (Max Planck Institute for Demographic Research).
Abstract: Measuring affinity to a particular culture has been an active area of research. Many cultural aspects characterize regions in terms of cultural attributes, such as clothing, music, art, and food. As one of the central aspects, the cuisine of a country can effectively reflect one of the dominant facets of its culture. As such, the number of people interested in a typical national dish can be used to estimate the prevalence of that culture inside a host region. In this study, we measure the global spread of Brazilian food culture across countries by exploring Facebook users' preferences for typical Brazilian dishes through the Facebook Advertising Platform. First, to decide which dish should be considered typically Brazilian, we used spatial analysis to understand the distribution of interests around the world and to quantify how typical the dish is in Brazil and among Brazilian immigrants. This methodology can be generalized to other countries to infer cultural elements that immigrants carry to other countries during the migration process. The interest in typical Brazilian dishes can be used to characterize countries in terms of their exposure to Brazilian culture. While evaluating the cultural distance between Brazil and the countries most preferred by Brazilian immigrants, we explore several measures of distance by comparing them in the context of affinity to Brazilian cuisine in different parts of the world. These measures of distance between countries, evaluated in terms of cultural preferences, can complement other metrics of distance applied to gravity-type models, for example, in order to explain flows of people between countries.
Security (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | An Empirical Study of the Use of Integrity Verification Mechanisms for Web Subresources. Bertil Chapuis (UNIL-HEC Lausanne), Olamide Omolola (TU Graz), Mauro Cherubini (UNIL-HEC Lausanne), Mathias Humbert (armasuisse S+T) and Kévin Huguenin (UNIL-HEC Lausanne).
Abstract: Web developers can (and do) include subresources such as scripts, stylesheets and images in their webpages. Such subresources might be stored on remote servers such as content delivery networks (CDNs). This practice creates security and privacy risks should a subresource be corrupted, as was recently the case for the British Airways websites. The subresource integrity (SRI) recommendation, released in mid-2016 by the W3C, enables developers to include digests in their webpages so that web browsers can verify the integrity of subresources before loading them. In this paper, we conduct the first large-scale longitudinal study of the use of SRI on the Web by analyzing massive crawls (~3B unique URLs) of the Web over the last 3.5 years. Our results show that the adoption of SRI is modest (~3.40%), but grows at an increasing rate and is highly influenced by the practices of popular library developers (e.g., Bootstrap) and CDN operators (e.g., jsDelivr). We complement our analysis of SRI with a survey of web developers (N = 227): it shows that a substantial proportion of developers know SRI and understand its basic functioning, but most of them ignore important aspects of the specification, such as the handling of malformed digests. The results of the survey also show that the integration of SRI by developers is mostly manual, hence not scalable and error-prone. This calls for better integration of SRI in build tools.
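For readers unfamiliar with SRI, the snippet below shows how an integrity value of the form used by the recommendation (an algorithm prefix plus a base64-encoded digest, e.g. sha384-...) can be computed; the script content is a stand-in.

```python
import base64
import hashlib

def sri_digest(resource_bytes: bytes, alg: str = "sha384") -> str:
    """Return the value a developer would put in an SRI integrity attribute:
    '<alg>-<base64 digest of the subresource bytes>'."""
    digest = hashlib.new(alg, resource_bytes).digest()
    return f"{alg}-{base64.b64encode(digest).decode('ascii')}"

script = b"console.log('hello');"   # stand-in for a fetched script file
print(sri_digest(script))
# A browser recomputes this digest when fetching the subresource and refuses to
# load it if the value does not match the one declared in the page.
```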
11:30-12:00 | Conquering Cross-source Failure for News Credibility: Learning Generalizable Representations beyond Content Embedding. Yen-Hao Huang (National Tsing Hua University), Ting-Wei Liu (National Tsing Hua University), Ssu-Rui Lee (National Tsing Hua University), Fernando Henrique Calderon Alvarado (National Tsing Hua University) and Yi-Shin Chen (National Tsing Hua University).
Abstract: False information on the Internet has caused severe damage to society. Researchers have proposed methods to determine the credibility of news and have obtained good results. As different media sources (publishers) have different content generators (writers) and may focus on different topics or aspects, the word/topic distribution of each media source diverges from the others. We identify a generalizability challenge for existing content-based methods: they fail to perform consistently on news from media sources that are not in the training set, which we call cross-source failure. A cross-source setting can cause a decrease of 15-19% in accuracy for current methods; content-sensitive features are considered one of the major causes of cross-source failure for content-based approaches. To overcome this challenge, we propose a credibility pattern embedding neural network (CPENN), which focuses on function words and syntactic structure to learn generalizable representations for credibility analysis and further reinforce cross-source robustness across different media. Experiments with cross-validation on 194 real-world media sources show that the proposed method learns generalizable features and outperforms state-of-the-art methods on unseen media sources. Extensive analysis of the learned feature representations demonstrates a strength of the proposed method compared to current content embedding approaches. We envision that CPENN is more robust for real-life unreliable news detection due to its good generalizability.
12:00-12:30 | Valve: Securing Function Workflows on Serverless Computing Platforms. Pubali Datta (University of Illinois at Urbana-Champaign), Prabuddha Kumar (Stony Brook University), Tristan Morris (Silicon Valley Bank), Michael Grace (Samsung Electronics), Amir Rahmati (Stony Brook University) and Adam Bates (University of Illinois at Urbana-Champaign).
Abstract: Serverless computing has quickly emerged as a dominant cloud computing paradigm, allowing developers to rapidly prototype event-driven applications using a composition of small functions that each perform a single logical task. However, many such application workflows are based in part on publicly-available functions written by third parties, creating the potential for functions to behave in unexpected, or even malicious, ways. At present, developers are not in total control of where and how their data flows, creating significant security and privacy risks in growth markets that have embraced serverless (e.g., IoT). As a practical means of addressing this problem, we present Valve, a serverless platform that enables developers to exert complete and fine-grained control over the information flows in their applications. Valve enables workflow developers to reason about function behaviors, and to specify restrictions, through auditing of network-layer information flows. By proxying network requests and propagating taint labels across network flows, Valve is able to restrict function behavior without code modification. We demonstrate that Valve is able to defend against known serverless attack behaviors, including container reuse-based persistence and data exfiltration over cloud platform APIs, with less than 10% runtime overhead, 4.7% deployment overhead and 8.28% teardown overhead.
12:30-12:45 | Detecting Undisclosed Paid Editing in Wikipedia. Nikesh Joshi (Boise State University), Francesca Spezzano (Boise State University), Mayson Green (Boise State University) and Elijah Hill (Boise State University).
Abstract: Wikipedia, the free and open-collaboration online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia's fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained by volunteer editors for free, with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violated such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing it. In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework using a set of features based on both the content of the articles and the edit-history patterns of the users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring tool currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors, with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.
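A toy sketch of the general recipe described above (content plus edit-history features feeding a supervised classifier); the feature set, values, and model choice here are invented placeholders, not the paper's actual framework.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented placeholder features per article: [promotional-word ratio,
# external-link count, creator account age in days, creator edit count].
X = np.array([[0.12, 9, 3, 14],
              [0.01, 2, 900, 5200],
              [0.15, 12, 1, 6],
              [0.02, 1, 1500, 8800]], dtype=float)
y = np.array([1, 0, 1, 0])   # 1 = undisclosed paid article, 0 = regular article

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[0.10, 7.0, 2.0, 20.0]]))   # new article resembling the paid ones
print(clf.feature_importances_)                # which signals the model leans on
```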
Search (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai (Carnegie Mellon University) and Jamie Callan (Carnegie Mellon University).
Abstract: Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDTerm, a hierarchical document term weighting framework for document indexing and retrieval. It first estimates the semantic importance of a term at the passage level. The deep, fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored in a standard inverted index for efficient retrieval. The paper also proposes two approaches that enable training HDTerm without relevance labels. Experiments show that an index using HDTerm weights significantly improves retrieval accuracy over a standard term-frequency-based index and a state-of-the-art embedding-based index.
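The passage-to-document aggregation step described above can be pictured with the following small sketch; max-aggregation is just one plausible choice, and the term weights shown are invented.

```python
from collections import defaultdict

def aggregate_passage_weights(passage_weights, agg=max):
    """Fold per-passage term weights into one document-level bag of words.

    passage_weights: list of {term: weight} dicts, one per passage
    agg:             how to combine a term's weights across passages
    """
    doc = defaultdict(list)
    for weights in passage_weights:
        for term, w in weights.items():
            doc[term].append(w)
    return {term: agg(ws) for term, ws in doc.items()}

passages = [{"solar": 0.9, "panel": 0.7, "cost": 0.2},
            {"solar": 0.4, "subsidy": 0.8}]
print(aggregate_passage_weights(passages))
# The resulting weights can be scaled to integers and stored in a standard
# inverted index, as the abstract describes.
```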
11:30-12:00 | Efficient Implicit Unsupervised Text Hashing using Adversarial Autoencoder. Khoa Doan (Virginia Tech) and Chandan K. Reddy (Virginia Tech).
Abstract: Searching for documents with semantically similar content is a fundamental problem in the information retrieval domain, with various challenges, primarily in terms of efficiency and effectiveness. Despite the promise of modeling structured dependencies in documents, several existing text-hashing methods lack an efficient mechanism to incorporate such vital information. Additionally, the desired characteristics of an ideal hash function, such as robustness to noise, low quantization error and bit balance/uncorrelation, are not effectively learned by existing methods, because of the requirement to either tune additional hyper-parameters or optimize additional non-trivial cost functions. In this paper, we propose a Denoising Adversarial Binary Autoencoder (DABA) model, a novel representation learning framework that captures a structured representation of text documents in the learned hash function. Adversarial training also provides an alternative way to implicitly learn a hash function that captures all the desired characteristics of an ideal hash function. Essentially, DABA adopts a novel single-optimization adversarial training procedure that minimizes the Wasserstein distance in its primal domain to regularize the encoder's output of either a recurrent neural network or a convolutional autoencoder. We empirically demonstrate the effectiveness of the proposed method in capturing the intrinsic semantic manifold of related documents. The proposed method outperforms the current state-of-the-art shallow and deep unsupervised hashing methods on the document retrieval task over several prominent document collections.
12:00-12:30 | Adversarial Bandits Policy for Crawling Commercial Web Content. Shuguang Han (Google), Michael Bendersky (Google), Przemek Gajda (Google), Sergey Novikov (Google), Marc Najork (Google), Bernhard Brodowsky (Google) and Alexandrin Popescul (Pinterest).
Abstract: The rapid growth of commercial web content has driven the development of shopping search services to help users seek product information. Due to the dynamic nature of commercial content, an optimal recrawl policy is a key component of a shopping search service; it ensures that users have access to the most up-to-date product details. Prior studies proposed various strategies to maximize content freshness; however, they often relied on simple heuristics and overlooked the crawling resource budget. To address this, Azar et al. [5] recently proposed a joint optimization strategy, LambdaCrawl, that aims to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future change rates can be estimated. We therefore adopt a state-of-the-art deep learning model for change rate prediction, which results in a substantial improvement in content freshness over the common LambdaCrawl implementation with change rates estimated from past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement over existing recrawl strategies, it can be further improved by a unified multi-strategy recrawl policy. To this end, we employ a K-armed adversarial bandits algorithm that can provably optimize overall content freshness by combining multiple strategies. Empirical results over a large-scale production dataset demonstrate that the proposed adversarial bandits approach outperforms LambdaCrawl by a large margin, especially under tight resource budgets.
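For intuition about the K-armed adversarial bandits component, here is a minimal EXP3 sketch in NumPy that learns to favor whichever recrawl strategy yields the most reward; the reward model and numbers are toy values, not the paper's production setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(reward_fn, n_arms, horizon, gamma=0.1):
    """Minimal EXP3 adversarial bandit over competing (recrawl) strategies."""
    weights = np.ones(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(horizon):
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = rng.choice(n_arms, p=probs)
        reward = reward_fn(arm)              # e.g., freshness gained per unit of budget
        estimate = reward / probs[arm]       # importance-weighted reward estimate
        weights[arm] *= np.exp(gamma * estimate / n_arms)
        counts[arm] += 1
    return counts

# Toy setting: strategy 1 yields more freshness on average than strategies 0 and 2.
mean_reward = [0.3, 0.7, 0.4]
reward_fn = lambda arm: float(np.clip(rng.normal(mean_reward[arm], 0.1), 0.0, 1.0))
print(exp3(reward_fn, n_arms=3, horizon=2000))   # pulls concentrate on strategy 1
```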
12:30-12:45 | Ad Hoc Table Retrieval using Intrinsic and Extrinsic Similarities. Roee Shraga (Technion - Israel Institute of Technology), Haggai Roitman (IBM), Guy Feigenblat (IBM) and Mustafa Canim (IBM).
Abstract: Given a keyword query, the ad hoc table retrieval task aims at retrieving a ranked list of the top-k most relevant tables in a given table corpus. Previous work has primarily focused on designing table-centric lexical and semantic features, which can be utilized for learning-to-rank (LTR) over tables. In this work, we make novel use of intrinsic (passage-based) and extrinsic (manifold-based) table similarities for enhanced retrieval. Using the WikiTables benchmark, we study the merits of utilizing such similarities for this task. To this end, we combine both similarity types via a simple yet effective cascade re-ranking approach. Overall, our proposed approach results in significantly better table retrieval quality, which even transcends that of strong semantically-rich baselines.
12:45-13:00 | Graph-Query Suggestions for Knowledge Graph Exploration. Matteo Lissandrini (Aalborg University), Davide Mottin (Aarhus University), Themis Palpanas (Paris Descartes University) and Yannis Velegrakis (Utrecht University).
Abstract: We consider the task of exploratory search through graph queries on knowledge graphs. We propose to assist the user by expanding the query with intuitive suggestions, providing a more informative (full) query that can retrieve more detailed and relevant answers. To achieve this, we propose a model that bridges graph search paradigms with well-established information retrieval techniques. Our approach does not require any additional knowledge from the user and builds on principled language modelling approaches. We empirically show the effectiveness and efficiency of our approach on a large knowledge graph, and how our suggestions help build more complete and informative queries.
Mobile (1)
(UTC/GMT +8) 11:00-13:00, Wednesday, April 22
Meeting rooms are not available yet
Time | Title and Authors (Presenter)
11:00-11:30 | FiDo: Ubiquitous Fine-Grained WiFi-based Localization for Unlabelled Users via Domain Adaptation. Xi Chen (Samsung Electronics Canada), Hang Li (Samsung Electronics Canada), Chenyi Zhou (Samsung Electronics Canada), Xue Liu (Samsung Electronics Canada), Di Wu (Samsung Electronics Canada) and Gregory Dudek (Samsung Electronics Canada).
Abstract: Emerging location-aware applications, such as cashier-less shopping, mobile ads targeting and geo-based Augmented Reality (AR), are changing people's lives fundamentally. In order to fully support these new applications, location information with meter-level resolution (or even higher) is required anytime and anywhere. Unfortunately, most current location sources (e.g., check-in data and GPS) are either unavailable indoors or provide only house-level resolution. To fill the gap, this paper utilizes ubiquitous WiFi signals to establish a meter-level localization system, which employs WiFi propagation characteristics as location fingerprints. However, an unsolved issue with these WiFi fingerprints is their inconsistency across different users: WiFi fingerprints collected for one user may not be usable to localize another user. To address this issue, we propose a WiFi-based Domain-adaptive system, FiDo, which is able to localize many different users with labelled data from only one or two example users. FiDo contains two modules: 1) a data augmenter that introduces data diversity using a Variational Autoencoder (VAE); and 2) a domain-adaptive classifier that adjusts itself to newly collected unlabelled data using a joint classification-reconstruction structure. Compared to the state of the art, FiDo increases the average F1 score by 11.8% and improves the worst-case accuracy by 20.2%.
11:30-12:00 | Towards Fine-grained Flow Forecasting: A Graph Attention Approach for Bike Sharing Systems. Suining He (University of Michigan--Ann Arbor & The University of Connecticut) and Kang G. Shin (University of Michigan--Ann Arbor).
Abstract: Accurate bike-flow prediction at the individual station level is essential for bike sharing services. Due to the spatial and temporal complexities of traffic networks and the lack of data-driven design for bike stations, existing methods cannot predict the fine-grained bike flows to/from each station. To remedy this problem, we propose a novel data-driven spatio-temporal Graph attention convolutional neural network for Bike station-level flow prediction (GBikes). We develop data-driven and spatio-temporal designs, and model bike stations (nodes) and inter-station bike rides (edges) as a graph. In particular, we design a novel graph attention convolutional neural network (GACNN) with attention mechanisms that capture and differentiate station-to-station correlations. Multi-level temporal closeness, spatial distances and other external factors (e.g., weather and points of interest) are jointly considered for comprehensive learning and accurate prediction of bike flows at each station. Extensive experiments on a total of over 11 million trips collected from three large-scale bike-sharing systems in New York City, Chicago, and Los Angeles corroborate GBikes' significant improvement in accuracy, robustness and effectiveness over prior work.
12:00-12:30 | Dynamic Flow Distribution Prediction for Urban Dockless E-Scooter Sharing Reconfiguration. Suining He (University of Michigan--Ann Arbor & University of Connecticut) and Kang G. Shin (University of Michigan--Ann Arbor).
Abstract: Thanks to recent progress in mobile payment, IoT, electric motors, batteries and location-based services, Dockless E-scooter Sharing (DES) has become a popular means of last-mile commuting for a growing number of (smart) cities. As e-scooters are deployed dynamically and flexibly across city regions that expand and/or shrink, with subsequent social, commercial and environmental evaluation, accurate prediction of the distribution of e-scooters over reconfigured regions becomes essential for city planners and service providers. To meet this need, we propose GCScoot, a novel dynamic flow distribution prediction approach for reconfiguring urban DES systems. Based on real-world datasets with reconfiguration, we analyze the mobility features of the e-scooter distribution and flow dynamics for the data-driven designs. To adapt to dynamic reconfiguration of DES deployment, we propose a novel spatio-temporal graph capsule neural network within GCScoot to predict future dockless e-scooter flows given the reconfigured regions. GCScoot preprocesses the historical spatial e-scooter distributions into flow graph structures, where discretized city regions are considered as nodes and their mutual flows as edges. Given data-driven designs regarding distance, ride flows and region connectivity, the dynamic region-to-region correlations embedded within the temporal flow graphs are captured through the graph capsule neural network, which accurately predicts the DES flows. We have conducted extensive empirical studies on three different e-scooter datasets (>2.8 million rides in total) in populous US cities, including Austin TX, Louisville KY and Minneapolis MN. The evaluation results corroborate the accuracy and effectiveness of GCScoot in predicting the dynamic distribution of dockless e-scooter mobility.
12:30-13:00 |
Towards IP-based Geolocation via Fine-grained and Stable Webcam Landmarks Zhihao Wang (Institute of Information Engineering, Chinese Academy of Sciences), Qiang Li (School of Computer and Information Technology, Beijing Jiaotong University), Jinke Song (School of Computer and Information Technology, Beijing Jiaotong University), Haining Wang (Virginia Tech) and Limin Sun (Institute of Information Engineering, Chinese Academy of Sciences).
AbstractIP-based geolocation is essential for various location-aware Internet applications, such as online advertisement, content delivery, and online fraud prevention. Achieving accurate geolocation relies enormously on the number of high-quality (i.e., fine-grained and stable over time) landmarks. However, previous efforts to garner landmarks have been impeded by the limited number of visible landmarks on the Internet and the cost of manual effort. In this paper, we leverage the availability of numerous online webcams that are used to monitor physical surroundings as a rich source of promising high-quality landmarks for serving IP-based geolocation. In particular, we present a new framework called GeoCAM, which is designed to automatically generate qualified landmarks from online webcams, providing IP-based geolocation services with high accuracy and wide coverage. GeoCAM periodically monitors websites that are hosting live webcams and uses natural language processing techniques to extract the IP addresses and latitude/longitude of webcams for generating landmarks at large scale. We develop a prototype of GeoCAM and conduct real-world experiments to validate its efficacy. Our results show that GeoCAM can detect 282,902 live webcams hosted in webpages with 94.2% precision and 90.4% recall, and then generate 16,863 stable and fine-grained landmarks, which is two orders of magnitude more than the landmarks used in prior works. Thus, by correlating a large scale of landmarks, GeoCAM is able to provide a geolocation service with high accuracy and wide coverage. |
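For readers curious about the extraction step sketched in the abstract, the snippet below is a minimal, hypothetical illustration (not the authors' GeoCAM pipeline) of pulling an IP address and a latitude/longitude pair out of a webcam page with regular expressions; the patterns and the example string are assumptions made for illustration only.

```python
import re

# Hypothetical patterns for the two pieces of evidence GeoCAM correlates:
# an IPv4 address and a latitude/longitude pair embedded in the page text.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
LATLON_RE = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")

def extract_landmark(page_text):
    """Return (ip, lat, lon) if both pieces of evidence are found, else None."""
    ip = IP_RE.search(page_text)
    coords = LATLON_RE.search(page_text)
    if ip and coords:
        return ip.group(0), float(coords.group(1)), float(coords.group(2))
    return None

print(extract_landmark("live stream at 203.0.113.7, camera located at 40.7128, -74.0060"))
```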
Web Mining-B (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
11:00-11:30 |
Mining Points-of-Interest for Explaining Urban Phenomena: A Scalable Variational Inference Approach Christof Naumzik (ETH Zurich), Patrick Zoechbauer (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
AbstractPoints-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data: they greatly explain the spatial distributions of urban phenomena. The conventional modeling approach relies upon feature engineering, yet it ignores the spatial structure among POIs. In order to overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization in order to model the influence of POIs on a given outcome variable. Specifically, we accommodate for the spatial distributions of both the outcome and POIs. In our case, this is modeled by the sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering, instead we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, we even show that an analytical expression can be obtained. This allows fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases. |
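As a rough illustration of the distance-weighted kernel idea mentioned above, the sketch below scores a location by summing an RBF kernel over nearby POIs; the kernel form, weight, and bandwidth are assumptions for illustration (in the actual model they would be learned within the latent Gaussian-process formulation).

```python
import numpy as np

# Assumed RBF kernel; in the paper's model the per-category weight and
# bandwidth are learnable parameters, not the fixed constants used here.
def poi_influence(loc, poi_coords, weight=1.0, bandwidth=0.5):
    d2 = ((np.asarray(poi_coords, dtype=float) - np.asarray(loc, dtype=float)) ** 2).sum(axis=1)
    return weight * np.exp(-d2 / (2 * bandwidth ** 2)).sum()

# A location near the first POI receives most of its influence from it.
print(poi_influence([0.0, 0.0], [[0.1, 0.2], [1.5, 1.5]]))
```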
11:30-12:00 |
Snippext: Semi-supervised Opinion Mining with Augmented Data Zhengjie Miao (Duke University), Yuliang Li (Megagon Labs), Xiaolan Wang (Megagon Labs) and Wang-Chiew Tan (Megagon Labs).
AbstractOnline services are interested in solutions to opinion mining, which is the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data. In this paper, we study the problem of how to significantly reduce the amount of labeled training data required in fine-tuning language models for opinion mining. We describe Snippext, an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data. A novelty of Snippext is its clever use of a two-prong approach to achieve state-of-the-art (SOTA) performance with little labeled training data through: (1) data augmentation to automatically generate more labeled training data from existing ones, and (2) a semi-supervised learning technique to leverage the massive amount of unlabeled data in addition to the (limited amount of) labeled data. We show with extensive experiments that Snippext performs comparably to and can even exceed previous SOTA results on several opinion mining tasks with only half the training data required. Furthermore, it achieves new SOTA results when all training data are leveraged. By comparison to a baseline pipeline, we found that Snippext extracts significantly more fine-grained opinions, which enables new opportunities for downstream applications. |
12:00-12:30 |
Experimental Evidence Extraction in Data Science with Hybrid Table Features and Ensemble Learning Wenhao Yu (University of Notre Dame), Wei Peng (Zhejiang University), Yu Shu (Sichuan University), Qingkai Zeng (University of Notre Dame) and Meng Jiang (University of Notre Dame).
AbstractData Science has been one of the most popular fields in higher education and research activities. It takes tons of time to read the experimental sections of thousands of papers and figure out the performance of data science techniques. In this work, we build an experimental evidence extraction system to automate the integration of tables (in the paper PDFs) into a database of experimental results. First, it crops the tables and recognizes the templates. Second, it classifies the column names and row names into “method”, “dataset”, or “evaluation metric”, and then unifies all the table cells into (method, dataset, metric, score)-quadruples. We propose hybrid features including structural and semantic table features as well as an ensemble learning approach for column/row name classification and table unification. SQL statements can be used to answer questions such as whether a method is the state-of-the-art or whether the reported numbers are conflicting. |
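To make the closing sentence concrete, here is a small, self-contained sketch of answering a state-of-the-art question with SQL over extracted (method, dataset, metric, score) quadruples; the table name, column layout, and numbers are illustrative assumptions, not the authors' schema.

```python
import sqlite3

# Build a toy in-memory table of extracted quadruples (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (method TEXT, dataset TEXT, metric TEXT, score REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [("MethodA", "SQuAD", "F1", 90.1),
     ("MethodB", "SQuAD", "F1", 88.4)],
)
# "Is MethodA the state of the art on SQuAD under F1?"
best, = conn.execute(
    "SELECT method FROM results WHERE dataset='SQuAD' AND metric='F1' "
    "ORDER BY score DESC LIMIT 1"
).fetchone()
print(best == "MethodA")
```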
12:30-12:45 |
A Cue Adaptive Decoder for Controllable Neural Response Generation Weichao Wang (Northeastern University), Shi Feng (Northeastern University, China), Wei Gao (Victoria University of Wellington), Daling Wang (Northeastern University, China) and Yifei Zhang (Northeastern University, China).
AbstractIn open-domain dialogue systems, dialogue cues such as emotion, persona, and emoji can be incorporated into conversation models for strengthening the semantic relevance of generated responses. Existing neural response generation models either incorporate the dialogue cue into the decoder's initial state or embed the cue indiscriminately into the state of every generated word, which may cause the gradients of the embedded cue to vanish or disturb the semantic relevance of generated words during backpropagation. In this paper, we propose a Cue Adaptive Decoder (CueAD) that aims to dynamically determine the involvement of a cue at each generation step in the decoding. For this purpose, we extend the Gated Recurrent Unit (GRU) network with an adaptive cue representation for facilitating cue incorporation, in which an adaptive gating unit is utilized to decide when to incorporate cue information so that the cue can provide useful clues for enhancing the semantic relevance of generated words. Experimental results show that CueAD outperforms state-of-the-art baselines with large margins. |
Semantics (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
11:00-11:30 |
Relation Adversarial Network for Low Resource Knowledge Graph Completion Ningyu Zhang (Alibaba Group & AZFT Joint Lab for Knowledge Engine), Shumin Deng (Zhejiang University & AZFT Joint Lab for Knowledge Engine), Zhanlin Sun (Carnegie Mellon University), Jiaoyan Chen (University of Oxford), Wei Zhang (Alibaba Group & AZFT Joint Lab for Knowledge Engine) and Huajun Chen (Zhejiang University & AZFT Joint Lab for Knowledge Engine).
AbstractKnowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is the low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and helps learn relation-invariant features that are more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction. |
11:30-11:45 |
Fast Computation of Explanations for Inconsistency in Large-Scale Knowledge Graphs Mohamed H Gad-Elrab (Max Planck Institute for Informatics), Evgeny Kharlamov (Bosch Center for Artificial Intelligence), Daria Stepanova (Bosch Center for Artificial Intelligence), Jannik Stroetgen (Bosch Center for Artificial Intelligence) and Trung-Kien Tran (Bosch Center for Artificial Intelligence).
AbstractKnowledge graphs (KGs) are essential resources for many applications including Web search and Question Answering. As KGs are often automatically constructed (e.g., from the web) and enriched (e.g., using embedding-based completion), they may contain incorrect facts. Detecting them is a crucial, yet extremely expensive task. Prominent solutions detect and explain inconsistencies in KGs with respect to accompanying ontologies that describe the KG domain of interest. Compared to machine learning methods they are more reliable and human-interpretable but scale poorly on large KGs. In this paper, we present a novel approach to dramatically speed up the process of detecting and explaining inconsistencies in large KGs by exploiting KG abstractions that capture prominent data patterns. Though much smaller in size, KG abstractions preserve inconsistency and their explanations. Our experiments with large-scale KGs (e.g., DBpedia and Yago) demonstrate the feasibility of our approach and show that it significantly outperforms the popular baseline. The discovered inconsistency explanations in these large-scale KGs further help in making the results interpretable. |
11:45-12:15 |
What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization Caleb Belth (University of Michigan), Xinyi Zheng (University of Michigan), Jilles Vreeken (Helmholtz Center for Information Security (CISPA)) and Danai Koutra (University of Michigan).
AbstractKnowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG. In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node based on its type and information in the KG. Stepping away from the traditional support/confidence-based rule mining techniques, we propose KGIST, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle—a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGIST outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying up to 92.88% of missing entities—at least 10% more than baselines), while also being efficient for large knowledge graphs. |
12:15-12:30 |
Searching for Embeddings in a Haystack: Link Prediction on Knowledge Graphs with Subgraph Pruning Unmesh Joshi (Vrije University) and Jacopo Urbani (Vrije University).
AbstractEmbedding-based models of Knowledge Graphs (KGs) can be used to predict the existence of missing links in the KG by ranking entities according to their likelihood scores computed using the embeddings. An exhaustive computation of all likelihood scores is very expensive if the KG is large. To counter this problem, we propose a technique to reduce the search space by identifying smaller subsets of promising entities. Our technique first creates embeddings of subgraphs using the embeddings from the model. Then, it ranks the subgraphs based on the metrics and considers only the entities in the top-k subgraphs. Our empirical evaluation shows that our technique is able to reduce the search space significantly while maintaining a good recall. |
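The following is a minimal sketch, under simplifying assumptions (subgraph embeddings taken as the mean of their entity embeddings and a plain dot-product score), of how pruning to the top-k subgraphs shrinks the candidate set before the full ranking step; it is not the authors' exact procedure.

```python
import numpy as np

def candidate_entities(query_vec, subgraphs, entity_emb, k=2):
    """subgraphs: list of lists of entity ids; entity_emb: (n_entities, d) array.

    Score each subgraph by the dot product between the query embedding and the
    mean of its entity embeddings, then keep only entities in the top-k subgraphs.
    """
    scores = [query_vec @ entity_emb[ids].mean(axis=0) for ids in subgraphs]
    top = np.argsort(scores)[::-1][:k]
    keep = set()
    for i in top:
        keep.update(subgraphs[i])
    return keep

emb = np.random.rand(10, 4)
print(candidate_entities(np.ones(4), [[0, 1], [2, 3], [4, 5, 6]], emb, k=2))
```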
12:30-13:00 |
Collective Multi-type Entity Alignment Between Knowledge Graphs Qi Zhu (University of Illinois Urbana-Champaign), Hao Wei (Amazon Inc.), Bunyamin Sisman (Amazon Inc.), Da Zheng (Amazon Inc.), Christos Faloutsos (Carnegie Mellon University), Xin Luna Dong (Amazon Inc.) and Jiawei Han (University of Illinois Urbana-Champaign).
AbstractA knowledge graph (e.g., Freebase, YAGO) is a multi-relational graph representing rich factual information among entities of various types. Entity alignment is the key step towards knowledge graph integration from multiple sources. It aims to identify entities across different knowledge graphs that refer to the same real-world entity. However, current entity alignment systems overlook the sparsity of different knowledge graphs and cannot align multi-type entities with one single model. In this paper, we present a Collective Graph neural network for Multi-type entity Alignment, called CG-MuAlign. Different from previous work, CG-MuAlign jointly aligns multiple types of entities, collectively leverages the neighborhood information and generalizes to unlabeled entity types. Specifically, we propose a novel collective aggregation function tailored for this task, which (1) relieves the incompleteness of knowledge graphs via both cross-graph and self attentions, and (2) scales up efficiently with a mini-batch training paradigm and an effective neighborhood sampling strategy. We conduct experiments on real-world knowledge graphs with millions of entities and observe superior performance over existing methods. In addition, the running time of our approach is much less than that of the current state-of-the-art deep learning methods. |
Research Tracks (2)
Web Mining-A (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Identifying Referential Intention with Heterogeneous Contexts Wenhao Yu (University of Notre Dame), Mengxia Yu (Peking University), Tong Zhao (University of Notre Dame) and Meng Jiang (University of Notre Dame).
AbstractCiting, quoting, and forwarding & commenting behaviors are widely seen in academia, news media, and social media. Existing behavior modeling approaches focused on mining content and describing preferences of authors, speakers, and users. However, behavioral intention plays an important role in generating content on the platforms. In this work, we propose to identify the referential intention which motivates the action of using the referred (e.g., cited, quoted, and retweeted) source and content to support their claims. We adopt a theory in sociology to develop a schema of four types of intentions. The challenge lies in the heterogeneity of observed contextual information surrounding the referential behavior, such as referred content (e.g., a cited paper), local context (e.g., the sentence citing the paper), neighboring context (e.g., the former and latter sentences), and network context (e.g., the academic network of authors, affiliations, and keywords). We propose a new neural framework with Interactive Hierarchical Attention (IHA) to identify the intention of referential behavior by properly aggregating the heterogeneous contexts. Experiments demonstrate that the proposed method can effectively identify the type of intention of citing behaviors (on academic data) and retweeting behaviors (on Twitter), and that learning the heterogeneous contexts collectively improves performance. This work opens the door to understanding content generation from a fundamental perspective of behavioral science. |
14:00-14:30 |
In Opinion Holders’ Shoes: Modeling Cumulative Influence for View Change in Online Argumentation Zhen Guo (North Carolina State University), Zhe Zhang (IBM) and Munindar Singh (North Carolina State University).
AbstractUnderstanding how people change their views during argumentative discussions is important in applications that involve human communication, e.g., in social media and education. Existing research focuses on lexical features of individual comments, dynamics of discussions, or the personalities of participants but deemphasizes a challenging factor: the cumulative influence of the discussion on a participant's mindset that is exerted by the interplay of comments by different participants during the discussion. We make the following contributions. (1) We demonstrate the necessity of considering an individual's perception of comments from other participants for predicting persuasiveness through a human study. (2) We tackle the challenging task of predicting the points where a user's view changes considering the whole discussion, which includes massive noise and plausible alternatives. (3) We present a sequential model for cumulative influence that captures the interplay between comments as both local and nonlocal dependencies, and demonstrate its capability of selecting the most effective information for changing views. (4) We identify contextual and interactive features and propose corresponding sequence structures to incorporate these features. Our empirical evaluation using a Reddit Change My View dataset shows that contextual and interactive features are valuable in predicting view changes, and a sequential model notably outperforms the nonsequential baseline models. |
14:30-15:00 |
Few-Sample and Adversarial Representation Learning for Continual Stream Mining Zhuoyi Wang (The University of Texas at Dallas), Yigong Wang (The University of Texas at Dallas), Yu Lin (The University of Texas at Dallas), Evan Delord (The University of Texas at Dallas) and Khan Latifur (The University of Texas at Dallas).
AbstractDeep Neural Networks (DNNs) have been widely demonstrated to be effective for closed-world classification problems where the number of categories is fixed. However, DNNs notoriously fail at label prediction over non-stationary data streams, where unknown or novel classes (categories not in the training set) continuously emerge. To meet this challenge, a DNN should not only detect novel classes, but also incrementally learn new concepts from a few examples over time. Little prior work addresses both problems simultaneously. In this paper, we focus on improving both the ability of DNNs to generalize to novel classes and the effectiveness of continuously learning novel categories from only a few instances of a data stream. Unlike existing approaches that rely heavily on abundant labeled instances to train or update the model, our proposed Few-Sample and Adversarial Representation Learning (FSAR) framework first trains a joint learning model to achieve an intra-class compact and inter-class separated representation and to recognize novel classes; next, through active annotation requests, we collect a few samples belonging to such new categories and utilize episode training to exploit their intrinsic features for few-shot learning. Specifically, we implement an adversarial-confusion-based metric learning approach for the first step, which encourages robustness and generalization by reducing over-confidence on the seen classes. Once trained, FSAR is able to extract discriminative features for novel categories and is incorporated with the joint representation model to facilitate few-sample learning in the stream. We evaluated FSAR on four different datasets (CUB-200, EMNIST, FASHION-MNIST and CIFAR-10); extensive experimental results on various simulated benchmark streams show that FSAR effectively outperforms current state-of-the-art approaches. |
15:00-15:15 |
Extracting Knowledge from Web Text with Monte Carlo Tree Search Guiliang Liu (Baidu), Xu Li (Baidu), Jiakang Wang (Baidu), Mingming Sun (Baidu) and Ping Li (Baidu).
AbstractExtracting knowledge from general web text requires building a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes to apply Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete Reinforcement Learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search for the optimal predictions during inference. Empirical evaluation demonstrates that the MCTS inference substantially improves the accuracy of prediction (more than 10%) and achieves a leading performance over other state-of-the-art comparison models. |
15:15-15:30 |
Natural Key Discovery in Wikipedia Tables Leon Bornemann (Hasso Plattner Institute), Tobias Bleifuß (Hasso Plattner Institute), Dmitri V. Kalashnikov (AT&T Labs - Research), Felix Naumann (Hasso Plattner Institute) and Divesh Srivastava (AT&T Labs-Research).
AbstractWikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time. To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table’s version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research. |
Social Network-A (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
A Generic Edge-Empowered Graph Convolutional Network via Node-Edge Mutual Enhancement Pengyang Wang (University of Central Florida), Jiaping Gui (NEC Laboratories America, Inc.), Zhengzhang Chen (NEC Laboratories America, Inc.), Junghwan Rhee (NEC Laboratories America, Inc.), Haifeng Chen (NEC Laboratories America, Inc.) and Yanjie Fu (University of Central Florida).
AbstractGraph Convolutional Networks (GCNs) have been shown to be a powerful tool for analyzing graph-structured data. Most previous GCN methods focus on learning a good node representation by aggregating the representations of neighboring nodes, whereas largely ignoring the edge information. Although a few recent methods have been proposed to integrate edge attributes into GCNs to initialize edge embeddings, these methods do not work when edge attributes are (partially) unavailable. Can we develop a generic edge-empowered framework to exploit node-edge enhancement, regardless of the availability of edge attributes? To address this question, in this paper we propose a novel framework, EE-GCN, that achieves node-edge enhancement. In particular, the framework EE-GCN includes three key components: (i) Initialization: this step is to initialize the embeddings of both nodes and edges. Unlike node embedding initialization, we propose a line graph-based method to initialize the embedding of edges regardless of edge attributes. (ii) Feature space alignment: we propose a translation-based mapping method to align edge embedding with node embedding space, and the objective function is penalized by a translation loss when both spaces are not aligned. (iii) Node-edge mutually enhanced updating: node embedding is updated by aggregating embedding of neighboring nodes and associated edges, while edge embedding is updated by the embedding of associated nodes and itself. Through the above improvements, our framework provides a generic strategy for all of the spatial-based GCNs to allow edges to participate in embedding computation and exploit node-edge mutual enhancement. Finally, we present extensive experimental results to validate the improved performances of our method in terms of node classification, link prediction, and graph classification. |
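As a toy sketch of the node-edge mutually enhanced updating step (component iii), the code below lets each node aggregate its neighbouring nodes plus incident edge embeddings, while each edge is refreshed from its two endpoints and its own previous state; the plain summation and averaging are illustrative assumptions, not the paper's equations.

```python
import numpy as np

def mutual_update(node_emb, edge_emb, edges):
    """edges: list of (u, v) index pairs; embeddings are numpy arrays.

    One assumed round of mutual enhancement: nodes gather neighbour + edge
    states, edges are recomputed from their endpoints and previous state.
    """
    new_nodes = node_emb.copy()
    for i, (u, v) in enumerate(edges):
        new_nodes[u] += node_emb[v] + edge_emb[i]
        new_nodes[v] += node_emb[u] + edge_emb[i]
    new_edges = np.stack([(node_emb[u] + node_emb[v] + edge_emb[i]) / 3.0
                          for i, (u, v) in enumerate(edges)])
    return new_nodes, new_edges

nodes = np.random.rand(4, 8)
edges_emb = np.random.rand(3, 8)
print(mutual_update(nodes, edges_emb, [(0, 1), (1, 2), (2, 3)])[0].shape)
```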
14:00-14:30 |
Unsupervised Domain Adaptive Graph Convolutional Networks Man Wu (Florida Atlantic University), Shirui Pan (Monash University), Chuan Zhou (Chinese Academy of Sciences), Xiaojun Chang (Monash University) and Xingquan Zhu (Florida Atlantic University).
AbstractGraph convolutional networks (GCNs) have achieved impressive success in many graph related analytics tasks. However, most GCNs only work in a single domain (graph), incapable of transferring knowledge from/to other domains (graphs), due to the challenges in both graph representation learning and domain adaptation over graph structures. In this paper, we present a novel approach, unsupervised domain adaptive graph convolutional networks (UDA-GCN), for domain adaptation learning for graphs. To enable effective graph representation learning, we first develop a dual graph convolutional network component, which jointly exploits local and global consistency for feature aggregation. An attention mechanism is further used to produce a unified representation for each node in different graphs. To facilitate knowledge transfer between graphs, we propose a domain adaptive learning module to optimize three different loss functions, namely source classifier loss, domain classifier loss, and target classifier loss as a whole, so that our model can differentiate class labels in the source domain, distinguish samples from different domains, and predict class labels in the target domain. Experimental results on real-world datasets in the node classification task validate the performance of our method, compared to state-of-the-art graph neural network algorithms. |
14:30-15:00 |
MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding Xinyu Fu (The Chinese University of Hong Kong), Jiani Zhang (The Chinese University of Hong Kong), Ziqiao Meng (The Chinese University of Hong Kong) and Irwin King (The Chinese University of Hong Kong).
AbstractA large number of real-world graphs or networks are inherently heterogeneous, involving a diversity of node types and relationships between nodes. Heterogeneous graph embedding aims to embed rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing models usually define multiple metapaths in a heterogeneous graph to capture the composite relations and guide neighbor selection. However, these models either omit node content features, discard intermediate nodes along the metapath, or only consider one metapath. To address these three limitations, we propose a new model named Metapath Aggregated Graph Neural Network (MAGNN) to boost the final performance. Specifically, MAGNN employs three major components, i.e., the node-type-specific transformation part to encapsulate input node content, the node-level metapath instance aggregation part to incorporate semantic intermediate nodes, and the metapath-level embedding fusion part to combine messages from multiple paths. Extensive experiments on three real-world heterogeneous graph datasets for node classification, node clustering, and link prediction show that MAGNN achieves more accurate prediction results than state-of-the-art baselines. |
15:00-15:15 |
Continuous-Time Link Prediction via Temporal Dependent Graph Neural Network Liang Qu (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology), Huaisheng Zhu (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology), Qiqi Duan (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology) and Yuhui Shi (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology).
AbstractRecently, graph neural networks (GNNs) have been shown to be an effective tool for learning the node representations of networks and have achieved good performance on the semi-supervised node classification task. However, most existing GNN methods fail to take networks' temporal information into account and therefore cannot be well applied to dynamic network applications such as the continuous-time link prediction task. To address this problem, we propose a Temporal Dependent Graph Neural Network (TDGNN), a simple yet effective dynamic network representation learning framework which incorporates the network temporal information into GNNs. TDGNN introduces a novel Temporal Aggregator (TDAgg) to aggregate the neighbor nodes' features and edges' temporal information to obtain the target node representations. Specifically, it assigns the neighbor nodes' aggregation weights using an exponential distribution to bias different edges' temporal information. The performance of the proposed method has been validated on six real-world dynamic network datasets for the continuous-time link prediction task. The experimental results show that the proposed method outperforms several state-of-the-art baselines. |
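A minimal sketch, assuming a single decay rate, of biasing neighbour aggregation by edge recency with an exponential distribution, in the spirit of the temporal aggregator described above (the paper's exact parameterization may differ):

```python
import numpy as np

def temporal_weights(edge_times, now, rate=0.1):
    # Exponential-distribution density over the time gap, normalized so the
    # weights over a node's neighbours sum to one (rate is an assumed constant).
    decay = rate * np.exp(-rate * (now - np.asarray(edge_times)))
    return decay / decay.sum()

def aggregate(neighbor_feats, edge_times, now):
    w = temporal_weights(edge_times, now)
    return w @ np.asarray(neighbor_feats)

# The neighbour connected by the more recent edge (t=9) dominates the result.
print(aggregate([[1.0, 0.0], [0.0, 1.0]], edge_times=[5.0, 9.0], now=10.0))
```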
15:15-15:30 |
Asymptotic Behavior of Sequence Models Flavio Chierichetti (Sapienza University of Rome), Ravi Kumar (Google) and Andrew Tomkins (Google).
AbstractIn this paper we study the limiting dynamics of a sequential process that generalizes Pólya's urn. This process has also been studied in the context of language generation, discrete choice, repeat consumption, and models for the web graph. The process we study generates future items by copying from past items. It is parameterized by a sequence of weights describing how much to prefer copying from recent versus more distant locations. We show that, if the weight sequence follows a power law with exponent $\alpha \in [0,1)$, then the sequences generated by the model tend toward a limiting behavior in which the eventual frequency of each token in the alphabet attains a limit. Moreover, in the case $\alpha > 2$, we show that the sequence converges to a token being chosen infinitely often, and each other token being chosen only constantly many times. |
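A tiny simulation of the copying process described above may help: item t is copied from position t-j with probability proportional to a power-law weight over the recency gap j. Seeding the sequence with one copy of each token and the particular alphabet are illustrative assumptions, not part of the paper's setup.

```python
import random

def generate(n, alpha=0.5, alphabet=("a", "b", "c")):
    seq = list(alphabet)            # assumed seed so every token can be copied
    while len(seq) < n:
        t = len(seq)
        gaps = range(1, t + 1)      # j = 1 is the most recent position
        weights = [j ** (-alpha) for j in gaps]
        j = random.choices(list(gaps), weights=weights)[0]
        seq.append(seq[t - j])
    return seq

random.seed(0)
print("".join(generate(30)))        # token frequencies stabilize as n grows
```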
User Modeling-A (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Dual Learning for Explainable Recommendation: Towards Unifying User Preference Prediction and Review Generation Peijie Sun (Hefei University of Technology), Le Wu (Hefei University of Technology), Kun Zhang (University of Science and Technology of China), Yanjie Fu (University of Central Florida), Richang Hong (Hefei University of Technology) and Meng Wang (Hefei University of Technology).
AbstractIn many recommender systems, users express item opinions through two kinds of behaviors: giving rating preferences and writing detailed reviews. As both kinds of behaviors reflect users' assessment of items, review-enhanced recommender systems leverage these two kinds of user behaviors to boost recommendation performance. On the one hand, researchers proposed to better model the user and item embeddings with additional review information for enhancing preference prediction accuracy. On the other hand, some recent works focused on automatically generating item reviews for recommendation explanations with related user and item embeddings. We argue that, while the task of preference prediction with the accuracy goal is well recognized in the community, the task of generating reviews for explainable recommendation is also important to gain user trust and increase conversion rate. Some preliminary attempts have considered jointly modeling these two tasks, with the user and item embeddings shared. These studies empirically showed that these two tasks are correlated, and jointly modeling them would benefit the performance of both tasks. In this paper, we make a further study of unifying these two tasks for explainable recommendation. Instead of simply correlating these two tasks with shared user and item embeddings, we argue that these two tasks are presented in dual forms. In other words, the input of the primal preference prediction task $p(R|C)$ is exactly the output of the dual review generation task $p(C|R)$, where $R$ and $C$ denote the preference value space and the review space. Therefore, we could explicitly model the probabilistic correlation between these two dual tasks with $p(R,C)=p(R|C)p(C)=p(C|R)p(R)$. We design a unified dual framework that injects the probabilistic duality of the two tasks into the training stage. Furthermore, as the detailed rating and review information is not available for each user-item pair in the test stage, we propose a transfer learning based model for preference prediction and review generation. Finally, extensive experimental results on two real-world datasets clearly show the effectiveness of our proposed model for both user preference prediction and review generation. |
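The probabilistic duality $p(R|C)p(C) = p(C|R)p(R)$ can be read as a training-time constraint; below is a hedged sketch of adding it as a regularizer to the two task losses. The squared-gap penalty and the lambda_dual weight are assumptions for illustration, not the paper's exact objective.

```python
# Ideally log p(R) + log p(C|R) equals log p(C) + log p(R|C); the squared gap
# between the two sides is used here as an assumed duality regularizer.
def duality_penalty(log_p_r, log_p_c_given_r, log_p_c, log_p_r_given_c):
    gap = (log_p_r + log_p_c_given_r) - (log_p_c + log_p_r_given_c)
    return gap ** 2

def joint_loss(rating_loss, review_loss, logs, lambda_dual=0.1):
    """logs: (log p(R), log p(C|R), log p(C), log p(R|C)) for one user-item pair."""
    return rating_loss + review_loss + lambda_dual * duality_penalty(*logs)

print(joint_loss(0.4, 2.3, (-1.2, -3.0, -2.5, -1.8)))
```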
14:00-14:30 |
Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation Jibang Wu (University of Virginia), Renqin Cai (University of Virginia) and Hongning Wang (University of Virginia).
AbstractPredicting users' preferences based on their sequential behaviors in history is challenging and crucial for modern recommender systems. Most existing sequential recommendation algorithms focus on transitional structure among the sequential actions, but largely ignore the temporal and context information when modeling the influence of a historical event on the current prediction. In this paper, we argue that the influence of past events on a user's current action should also vary over the course of time and under different contexts. Thus, we propose a Contextualized Temporal Attention Mechanism that learns to weigh historical actions' influence on not only what action it is, but also when and how the action took place. More specifically, to dynamically calibrate the relative input dependence from the self-attention mechanism, we deploy multiple parameterized kernel functions to learn various temporal dynamics, and then use the context information to determine which of these reweighting kernels to follow for each input. In empirical evaluations on two large public recommendation datasets, our model consistently outperforms an extensive set of state-of-the-art sequential recommendation methods. |
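To illustrate the reweighting idea, the sketch below mixes two assumed kernels (a smooth exponential decay and a short-term window) with context-dependent weights and uses the result to rescale base attention scores; the kernel forms and constants are illustrative, not the model's learned parameterization.

```python
import numpy as np

def reweigh(att, time_gaps, kernel_mix):
    """att: base attention weights; kernel_mix: per-kernel mixing weights from context."""
    kernels = np.stack([np.exp(-0.1 * time_gaps),           # assumed smooth decay kernel
                        (time_gaps < 5.0).astype(float)])    # assumed short-term window kernel
    scale = kernel_mix @ kernels                              # context decides the kernel blend
    w = att * scale
    return w / w.sum()

print(reweigh(np.array([0.2, 0.5, 0.3]), np.array([1.0, 10.0, 2.0]),
              np.array([0.7, 0.3])))
```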
14:30-15:00 |
Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation Mengyue Yang (University of Chinese Academy of Sciences), Qingyang Li (Didi Research America), Zhiwei Qin (Didi Research America) and Jieping Ye (Didi Chuxing).
AbstractContextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems and online advertising, however, it is essential to take the resource consumption of exploration into consideration when maximizing the reward of bandit algorithms. In practice, there is typically non-zero cost associated with executing a recommendation (arm) in the environment, and hence, the policy should be learned with a fixed exploration cost constraint. It is challenging to learn a global optimal policy directly, since this is an NP-hard problem and significantly complicates the exploration and exploitation trade-off of bandit algorithms. Existing approaches focus on solving the problem by adopting a greedy policy which estimates the expected rewards and costs and uses a greedy selection based on each arm's expected reward/cost ratio using historical observations until the exploration resource is exhausted. However, existing methods are hard to extend to an infinite time horizon, since the learning process will be terminated when there is no more resource. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint. HATCH adopts an adaptive method to allocate the exploration resource based on the remaining resource/time and the estimation of reward distribution among different user contexts. In addition, we utilize the full contextual feature information to find the best personalized recommendation. Finally, in order to prove the theoretical guarantee of the proposed method, we present a regret bound analysis and prove that HATCH achieves a regret bound as low as $O(\sqrt{T})$. The experimental results demonstrate the effectiveness and efficiency of the proposed method on both a synthetic data set and real-world applications. |
15:00-15:15 |
Hierarchical Visual-aware Minimax Ranking Based on Co-purchase Data for Personalized Recommendation Xiaoya Chong (City University of Hong Kong), Qing Li (The Hong Kong Polytechnic University), Howard Leung (City University of Hong Kong), Qianhui Men (City University of Hong Kong) and Xianjin Chao (City University of Hong Kong).
AbstractPersonalized recommendation aims at ranking a set of items according to the learnt preference of the user. Existing methods that directly optimize for ranking sample a negative item that the user has not bought yet and assume that the user prefers the positive item that he has bought to the negative item. The strategy is to exclude irrelevant items from the dataset to narrow down the set of potential positive items and improve ranking accuracy. However, it conflicts with the goal of recommendation from the seller's point of view, which aims to enlarge that set for each user. In this paper, we diminish this limitation by proposing a novel learning method called Hierarchical Visual-aware Minimax Ranking (H-VMMR), in which a new concept of predictive sampling is proposed to sample items in a close relationship with the positive items (e.g., substitutes, complements). We set up the problem by maximizing the preference discrepancy between positive and negative items, as well as minimizing the gap between positive and predictive items based on visual features. We also build a hierarchical learning model based on co-purchase data to solve the data sparsity problem. Our method can enlarge the set of potential positive items as well as true negative items during ranking. The experimental results show that H-VMMR outperforms the state-of-the-art learning methods. |
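A rough sketch of the minimax objective stated above: a BPR-style term pushes the positive item above the sampled negative, while a second term pulls the predictive item (a substitute or complement drawn from co-purchase data) towards the positive. The squared-distance form and the weight mu are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vmmr_loss(score_pos, score_neg, score_pred, mu=0.5):
    ranking_term = -np.log(sigmoid(score_pos - score_neg))   # maximize pos-neg gap
    closeness_term = (score_pos - score_pred) ** 2           # minimize pos-predictive gap
    return ranking_term + mu * closeness_term

print(vmmr_loss(score_pos=2.0, score_neg=0.5, score_pred=1.8))
```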
15:15-15:30 |
Addressing the Target Customer Distortion Problem in Recommender Systems Xing Zhao (Texas A&M University), Ziwei Zhu (Texas A&M University), Majid Alfifi (Texas A&M University) and James Caverlee (Texas A&M University).
AbstractPredicting the potential target customers for a product is essential. However, traditional recommender systems typically aim to optimize an engagement metric without considering the overall distribution of target customers, thereby leading to serious distortion problems. In this paper, we conduct a data-driven study to reveal several distortions that arise from conventional recommenders. Toward overcoming these issues, we propose a target customer re-ranking algorithm to adjust the population distribution and composition in the Top-k target customers of an item while maintaining recommendation quality. By applying this proposed algorithm onto a real-world dataset, we find the proposed method can effectively make the class distribution of items' target customers close to the desired distribution, thereby mitigating distortion. |
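As a hedged illustration of re-ranking toward a desired target-customer composition, one can fill the Top-k list by repeatedly picking the best-scored customer from the group that is currently most under-represented; the greedy rule, group names, and proportions below are assumptions, not the paper's algorithm.

```python
def rerank_topk(scored, groups, desired, k):
    """scored: list of (customer, score); groups: customer -> group; desired: group -> target share."""
    remaining = sorted(scored, key=lambda x: -x[1])
    picked, counts = [], {g: 0 for g in desired}
    while len(picked) < k and remaining:
        # group whose target count for the next position most exceeds its current count
        need = max(desired, key=lambda g: desired[g] * (len(picked) + 1) - counts[g])
        cand = next((c for c in remaining if groups[c[0]] == need), remaining[0])
        remaining.remove(cand)
        picked.append(cand[0])
        counts[groups[cand[0]]] = counts.get(groups[cand[0]], 0) + 1
    return picked

scored = [("u1", .9), ("u2", .8), ("u3", .7), ("u4", .6)]
print(rerank_topk(scored, {"u1": "A", "u2": "A", "u3": "B", "u4": "B"},
                  {"A": 0.5, "B": 0.5}, k=2))
```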
Society (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-13:45 |
Dolphin: A Spoken Language Proficiency Assessment System for Elementary Education Zitao Liu (TAL AI Lab), Guowei Xu (TAL AI Lab), Tianqiao Liu (TAL AI Lab), Weiping Fu (TAL AI Lab), Yubi Qi (TAL AI Lab), Wenbiao Ding (TAL AI Lab), Yujia Song (TAL AI Lab), Chaoyou Guo (TAL AI Lab), Cong Kong (TAL AI Lab), Songfan Yang (TAL AI Lab) and Gale Yan Huang (TAL AI Lab).
AbstractVerbal fluency is critically important for children's growth and personal development. Due to limited and imbalanced educational resources in China, elementary students barely have chances to improve their oral language skills in class. Verbal fluency tasks (VFTs) were invented to let students practice their oral language skills after school. VFTs are simple but concrete math-related questions that ask students to not only report answers but speak out the entire thinking process. In spite of the great success of VFTs, they bring a heavy grading burden to elementary teachers. To alleviate this problem, we develop Dolphin, a verbal fluency evaluation system for Chinese elementary education. Dolphin is able to automatically evaluate both the phonological fluency and semantic relevance of students' answers to their VFT assignments. We conduct a wide range of offline and online experiments to demonstrate the effectiveness of Dolphin. In our offline experiments, we show that Dolphin improves both phonological fluency and semantic relevance evaluation performance when compared to state-of-the-art baselines on real-world educational data sets. In our online A/B experiments, we tested Dolphin with 183 teachers from 2 major cities (Hangzhou and Xi'an) in China for 10 weeks, and the results show that VFT assignment grading coverage is improved by 22%. To encourage reproducible results, we make our code public on an anonymous git repo: this https URL. |
13:45-14:15 |
Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles Shan Jiang (Northeastern University), Simon Baumgartner (Google), Abe Ittycheriah (Google) and Cong Yu (Google).
AbstractFact-checking, which investigates claims made in public to arrive at a verdict supported by evidence and logical reasoning, has long been a significant form of journalism to combat misinformation in the news ecosystem. Most of the fact-checks share common structured information (called factors) such as claim, claimant, and verdict. In recent years, the emergence of ClaimReview as the standard schema for annotating those factors within fact-checking articles has led to wide adoption of fact-checking features by online platforms (e.g., Google, Bing). However, annotating fact-checks is a tedious process for fact-checkers and distracts them from their core job of investigating claims. As a result, less than half of the fact-checkers worldwide have adopted ClaimReview as of mid-2019. In this paper, we propose the task of factoring fact-checks for automatically extracting structured information from fact-checking articles. Exploring a public dataset of fact-checks, we empirically show that factoring fact-checks is a challenging task, especially for fact-checkers that are under-represented in the dataset. We then formulate the task as a sequence tagging problem and fine-tune the pre-trained BERT models with a modification made from our observations to approach the problem. Through extensive experiments, we demonstrate the performance of our models for well-known fact-checkers and promising initial results for under-represented fact-checkers. |
14:15-14:30 |
Reducing Disparate Exposure in Ranking: A Learning To Rank Approach Meike Zehlike (MPI Software Systems) and Carlos Castillo (Universitat Pompeu Fabra).
AbstractRanked search results have become the main mechanism by which we find content, products, places, and people online. Therefore their ordering contributes not only to the satisfaction of the searcher but also to career and business opportunities, educational placement, and even social success of those searched. Over the past decade, data mining researchers have become increasingly concerned with systematic biases in data-driven ranking models, and various methods have been proposed to mitigate discrimination and inequality of opportunity. Most of those post-process a ranking and reorder its items subject to predefined fairness constraints. This procedure, however, has the disadvantage that it still allows an unfair ranking model to be trained and later deployed. In this paper we explore a new in-processing approach: DELTR, a learning-to-rank framework that addresses potential issues of discrimination and unequal opportunity in rankings at training time. We measure these problems in terms of discrepancies in the average group exposure and design a ranker that optimizes search results in terms of relevance and in terms of reducing such discrepancies. We perform an extensive experimental study showing that being “colorblind”, i.e., ignoring protected attributes such as race or gender, can be among the best or the worst choices from the perspective of relevance and exposure, depending on how much and which kind of bias is present in the training set. We show that our in-processing method performs better in terms of relevance and equality of exposure than a pre-processing and a post-processing method across all tested scenarios. |
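For concreteness, the snippet below computes the average exposure of a group under a logarithmic position discount and the resulting disparity between two groups; the discount function is a common choice assumed here and may differ from DELTR's exact definition.

```python
import math

def avg_exposure(ranking, groups, group):
    # Position bias 1 / log2(rank + 1), with ranks starting at 1.
    pos = [1.0 / math.log2(i + 2) for i, item in enumerate(ranking)
           if groups[item] == group]
    return sum(pos) / len(pos) if pos else 0.0

ranking = ["a", "b", "c", "d"]
groups = {"a": "G0", "b": "G1", "c": "G0", "d": "G1"}
disparity = avg_exposure(ranking, groups, "G0") - avg_exposure(ranking, groups, "G1")
print(round(disparity, 3))
```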
14:30-15:00 |
Examining Protest as An Intervention to Reduce Online Prejudice: A Case Study of Prejudice Against Immigrants Kai Wei (Amazon.com), Yu-Ru Lin (University of Pittsburgh) and Muheng Yan (University of Pittsburgh).
AbstractThere has been growing concern about online users using social media as a tool to spread hate and racist speech. While previous work has extensively studied online hate speech, how to effectively reduce online prejudice still remains a challenge. Over the past several decades, protests have been a frequently used intervention for countering prejudice. However, research to date has not specifically examined the effects of social protest on online prejudice. In this work, we examine the relationship between protest and online prejudice. Using panel data collected from Twitter, we focus on the changes in users' prejudice against immigrants following recent immigrant protests. The findings of this work show that protest is related to a decrease in online users' prejudice, suggesting the possibility of using protests to mitigate online prejudice. |
15:00-15:30 |
Understanding Electricity-Theft Behavior via Multi-Source Data Wenjie Hu (Zhejiang University), Yang Yang (Zhejiang University), Jianbo Wang (State Grid Taizhou Power Supply Co. Ltd.), Xuanwen Huang (Zhejiang University) and Ziqiang Cheng (Zhejiang University).
AbstractElectricity theft, the behavior that involves users conducting illegal operations on electrical meters to avoid individual electricity bills, is a common phenomenon in developing countries. Considering its harmfulness to both power grids and the public, several mechanized methods have been developed to automatically recognize electricity-theft behaviors. However, these methods, which mainly assess users' electricity usage records, can be insufficient due to the diversity of theft tactics and the irregularity of user behaviors. Moreover, one cannot fully understand the user behaviors that lurk in the massive volume of data using such mechanized methods. To address these concerns, in this paper, we propose to recognize electricity-theft behavior via multi-source data. In addition to users' electricity usage records, we analyze user behaviors by means of regional factors (non-technical loss) and climatic factors (temperature) in the corresponding transformer area. By conducting analytical experiments, we unearth several interesting patterns and thereby derive insights into how these different types of information influence users' electricity usage. For instance, electricity thieves are likely to consume much more electrical power than normal users, especially under extremely high or low temperatures. Motivated by these empirical observations, we further design a novel hierarchical framework for identifying electricity thieves. Intuitively, it uniformly leverages multi-source information to extract hierarchical correlations between this information and electricity-theft behavior. Experimental results based on a real-world dataset demonstrate that our proposed model achieves the best performance in electricity-theft detection (e.g., at least +3.0% in terms of F0.5) compared with several baselines. Last but not least, our work has been applied by the State Grid of China and used to successfully catch electricity thieves in Hangzhou with a precision of 15% during monthly on-site investigations, an improvement over the 0% attained by several other models the company had employed and kept in online testing for years. |
Security (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-13:45 |
Practical Data Poisoning Attack against Next-Item Recommendation Hengtong Zhang (SUNY at Buffalo), Yaliang Li (Alibaba Group), Bolin Ding (Alibaba Group) and Jing Gao (University at Buffalo).
AbstractOnline recommendation systems make use of a variety of information sources to provide users the items that users are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks, where malicious data samples are injected into the training set of the recommendation system by controlled users to promote or demote specific items. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendation approaches. The former often suffers from unsatisfactory performance, while the latter requires strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against blackbox recommendation systems. The proposed LOKI utilizes the reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted. Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods and is effective even when the recommendation system is equipped with an anomaly detector. |
13:45-14:15 |
Adversarial Attacks on Graph Neural Networks via Node Injections: A Hierarchical Reinforcement Learning Approach Yiwei Sun (The Pennsylvania State University), Suhang Wang (The Pennsylvania State University), Xianfeng Tang (The Pennsylvania State University), Tsung-Yu Hsieh (The Pennsylvania State University) and Vasant Honavar (The Pennsylvania State University).
AbstractIn recent years, Graph Neural Networks have achieved immense success for node classification with their power to explore the topological structure in graph data. They are widely adopted in various domains including social media, E-commerce, and FinTech applications. However, recent studies show that GNNs are vulnerable to attacks which aim at adversely impacting the node classification accuracy. Previous studies of graph adversarial attacks mainly focus on manipulating existing graph structures, which usually requires more budget to modify the existing connections in most real-world applications. In contrast, it is more practical to inject adversarial nodes into existing graphs, which can also potentially reduce the performance of the GNNs on existing nodes. Taking social networks as an example, injecting fake profiles with forged links to mislead the predicted labels on existing accounts is much easier than directly modifying the existing graph. Motivated by such observations, in this paper, we study a novel problem of node injection poisoning on graph data. Since establishing links between the injected adversarial nodes and existing nodes can naturally be formulated as a Markov Decision Process, we propose a reinforcement learning method, namely NIPA, to sequentially modify the labels and adjacent edges of those injected nodes, without changing the link structure between existing nodes. Specifically, we introduce a hierarchical Q-learning network to manipulate the labels of the adversarial nodes and their links with other nodes in the graph, and design a steering reward function to guide the RL agent so as to reduce GNN accuracy. NIPA consistently outperforms state-of-the-art methods on three benchmark datasets, demonstrating its efficacy in poisoning graph data via node injection. |
14:15-14:45 |
The Chameleon Attack: Manipulating Content Display in Online Social Media Aviad Elyashar (Ben-Gurion University of the Negev), Abigail Paradise (Ben-Gurion University of the Negev), Sagi Uziel (Ben-Gurion University of the Negev) and Rami Puzis (Ben-Gurion University of the Negev).
AbstractOnline social networks (OSNs) are ubiquitous, attracting millions of users all over the world. Being a popular communication medium, OSNs are exploited in a variety of cyberattacks. In this article, we discuss the Chameleon attack technique, a new type of OSN-based trickery where malicious posts and profiles change the way they are displayed to OSN users in order to conceal themselves before the attack or avoid detection. Using this technique, adversaries can, for example, avoid censorship by concealing true content when it is about to be inspected; acquire social capital to promote new content while piggybacking on a trending one; or cause embarrassment and serious reputation damage by tricking a victim into liking, retweeting, or commenting on a message that they would not normally endorse, without any indication of the trickery within the OSN. An experiment performed with closed Facebook groups of sports fans shows that (1) Chameleon pages can slip past the moderation filters by changing the way their posts are displayed and (2) moderators do not distinguish between regular and Chameleon pages. We list the OSN weaknesses that facilitate the Chameleon attack and propose a set of mitigation guidelines. |
14:45-15:15 |
A Generic Solver Combining Unsupervised Learning and Representation Learning for Breaking Text-Based Captchas Sheng Tian (Ant Financial Services Group; Electronic Information School, Wuhan University) and Tao Xiong (Ant Financial Services Group).
AbstractAlthough there are many alternative captcha schemes available, text-based captchas are still one of the most popular security mechanisms for maintaining Internet security and preventing malicious attacks, due to user preferences and ease of design. Over the past decade, different methods of breaking captchas have been proposed, which has helped captchas keep evolving and become more robust. However, these previous works generally require heavy expert involvement and gradually become ineffective with the introduction of new security features. This paper proposes a generic solver combining unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches, which contain a large number of unlabeled hard examples, to improve the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully-supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms the state of the art by delivering higher accuracy on various captcha schemes. We provide further discussions of potential applications of the proposed unified framework. We hope that our work can inspire the community to enhance the security of text-based captchas. |
Health (1)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Text-to-SQL Generation for Question Answering on Electronic Medical Records Ping Wang (Virginia Tech), Tian Shi (Virginia Tech) and Chandan K. Reddy (Virginia Tech).
AbstractElectronic health record (EHR) data contains comprehensive patient information and is typically stored in a relational database with multiple tables. Effective and efficient patient information retrieval from EHR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database and then executing the query against the database. However, most of the existing approaches have not been adapted to the healthcare domain due to a lack of healthcare Question-to-SQL datasets for model parameter inference. Moreover, the wide use of abbreviated terminology and possible typos in questions introduce additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep learning based TRanslate-Edit Model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question, and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables. Based on a widely used, publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. Extensive experiments are conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos. |
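To make the Question-to-SQL task concrete, a question paired with a generated query might look like the following sketch; the table and column names are invented for illustration and are not taken from the MIMICSQL schema.

```python
# Hypothetical illustration of the Question-to-SQL task described above; the
# schema (table and column names) is invented and is not the MIMICSQL schema.
question = "What is the primary diagnosis of the patient with admission id 1034?"
predicted_sql = """
SELECT diagnosis
FROM admissions
WHERE admission_id = 1034;
""".strip()

# A system of the kind described would generate `predicted_sql` from `question`
# and then execute it against the relational EHR database.
print(predicted_sql)
```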
14:00-14:30 |
Automatic Boolean Query Formulation for Systematic Review Literature Search Harrisen Scells (The University of Queensland), Guido Zuccon (The University of Queensland), Bevan Koopman (CSIRO) and Justin Clark (Institute for Evidence-Based Healthcare, Bond University).
AbstractFormulating Boolean queries for systematic review literature search is a challenging task. Commonly, queries are formulated by information specialists using the protocol specified in the review and interactions with the research team. Information specialists have in-depth experience in how to formulate queries in this domain, but may not have in-depth knowledge of the reviews' topics. Query formulation requires a significant amount of time and effort, and is performed interactively; specialists repeatedly formulate queries, attempt to validate their results, and reformulate specific Boolean clauses. In this paper, we investigate the possibility of automatically formulating a Boolean query from the systematic review protocol. We propose a novel five-step approach to automatic query formulation, specific to Boolean queries in this domain, which approximates the process by which information specialists formulate queries. In this process, we use syntax parsing to derive the logical structure of high-level concepts in a query, automatically extract and map concepts to entities in order to perform entity expansion, and finally apply post-processing operations (such as stemming and search filters). Automatic query formulation for systematic review literature search has several benefits: (i) it can provide reviewers with an indication of the types of studies that will be retrieved, without the involvement of an information specialist; (ii) it can provide information specialists with an initial query to begin the formulation process; and (iii) it can provide researchers who perform rapid reviews with a method to quickly perform searches. |
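The output of such a pipeline is a nested Boolean clause structure. A minimal sketch of what that structure and its flattened query string could look like follows; the clause layout and search terms here are invented, not drawn from the paper.

```python
# Hypothetical illustration of a nested Boolean query of the kind the pipeline
# above produces; the clause structure and terms are invented for this sketch.
query = {
    "AND": [
        {"OR": ["diabetes mellitus, type 2", "T2DM"]},
        {"OR": ["metformin", "biguanides"]},
        {"OR": ["randomized controlled trial", "randomised"]},
    ]
}

def to_boolean_string(node):
    """Flatten the nested clause structure into a single Boolean query string."""
    if isinstance(node, str):
        return f'"{node}"'
    (operator, children), = node.items()
    return "(" + f" {operator} ".join(to_boolean_string(c) for c in children) + ")"

print(to_boolean_string(query))
```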
14:30-15:00 |
Learning Contextualized Document Representations for Healthcare Answer Retrieval Sebastian Arnold (Beuth University of Applied Sciences Berlin), Betty van Aken (Beuth University of Applied Sciences Berlin), Paul Grundmann (Beuth University of Applied Sciences Berlin), Felix A. Gers (Beuth University of Applied Sciences Berlin) and Alexander Löser (Beuth University of Applied Sciences Berlin).
AbstractWe present Contextual Discourse Vectors (CDV), a distributed document representation for efficient answer retrieval from long healthcare documents. Our approach is based on structured query tuples of entities and aspects from free text and medical taxonomies. Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse. We use our continuous representations to resolve queries with short latency using approximate nearest neighbor search on sentence level. We apply the CDV model for retrieving coherent answer passages from ten English public health resources from the Web, addressing both patients and medical professionals. Because there is no end-to-end training data available for all application scenarios, we train our model with self-supervised data from Wikipedia. We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking and is able to adapt to heterogeneous domains without additional fine-tuning. |
15:00-15:15 |
Sampling Query Variations for Learning to Rank to Improve Automatic Boolean Query Generation in Systematic Reviews Harrisen Scells (The University of Queensland), Guido Zuccon (The University of Queensland), Mohamed Sharaf (The University of Queensland) and Bevan Koopman (CSIRO).
AbstractSearching medical literature for synthesis in a systematic review is a complex and labour intensive task. In this context, expert searchers construct lengthy Boolean queries. The universe of possible query variations can be massive: a single query can be composed of hundreds of field-restricted search terms/phrases or ontological concepts, each grouped by a logical operator nested to depths of sometimes five or more levels deep. With the many choices about how to construct a query, it is difficult to both formulate and recognise effective queries. To address this challenge, automatic methods have recently been explored for generating and selecting effective Boolean query variations for systematic reviews. The limiting factor of these methods is that it is computationally infeasible to process all query variations. To overcome this, we propose novel query variation sampling methods for training Learning to Rank models to rank queries. Our results show that query sampling methods do directly impact the ability of a Learning to Rank model to effectively identify good query variations. Thus, selecting good query sampling methods is a key problem for the automatic reformulation of effective Boolean queries for systematic review literature search. We find that the best sampling strategies are those which balance the diversity of queries with the quantity of queries. |
Mobile (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Nowhere to Hide: Cross-modal Identity Leakage between Biometrics and Devices Chris Xiaoxuan Lu (University of Liverpool), Yang Li (New York University), Yuanbo Xiangli (The Chinese University of Hong Kong) and Zhengxiong Li (University at Buffalo, SUNY).
AbstractAlong with the benefits of the Internet of Things (IoT) come potential privacy risks, since billions of connected devices are allowed to sense information about their users and communicate it to other parties over the Internet. Of particular interest to the adversary is the user identity, which, once obtained, can subsequently be used for many malicious attacks. While the exposure of a particular type of physical biometric or device ID has been extensively studied, the compound leakage interwoven by both sides remains unknown to users in IoT-rich environments. In this work, we explore the feasibility of compound identity leakage across cyber-physical spaces and unveil that co-located smart device IDs (e.g., smartphone MAC addresses) and physical biometrics (e.g., facial/vocal samples) are side channels to each other. Based on these side channels in combination, our presented approach enables an attacker to automatically compromise users' biometrics and device IDs in tandem. We show that our method is robust to cross-modal mismatch and various observation noise in the wild, comprehensively profiling victims with nearly zero analysis effort from the attacker. Two real-world experiments on different biometrics and WiFi MAC addresses validate the new type of privacy leakage. We show that in extreme cases, the presented approach can compromise more than 70% of device IDs and harvest multiple biometric clusters of ~94% purity at the same time. |
14:00-14:30 |
Deconstructing Google’s Web Light Service Ammar Tahir (Lahore University of Management Sciences (LUMS)), Muhammad Tahir Munir (Lahore University of Management Sciences (LUMS)), Shaiq Munir Malik (Lahore University of Management Sciences (LUMS)), Zafar Ayyub Qazi (Lahore University of Management Sciences (LUMS)) and Ihsan Ayyub Qazi (Lahore University of Management Sciences (LUMS)).
AbstractWeb Light is a transcoding service introduced by Google to show lighter and faster webpages to users searching on slow mobile clients. The service detects slow clients (e.g., users on 2G) and converts webpages on the fly into a version optimized for these clients. The service promises improved mobile web browsing experience, in particular for users from developing countries where slow networks can be common. However, there are several concerns around this service, including its effectiveness in preserving relevant content on a page, improving user performance, and showing third-party advertisements, as well as privacy concerns. In this paper, we perform the first independent, empirical analysis of Google's Web Light service to shed light on these concerns. Through extensive experiments over thousands of real Web Light pages as well as controlled experiments with synthetic Web Light pages, we (i) deconstruct how Web Light modifies webpages, (ii) investigate how ads are shown on Web Light and which ad networks are supported, (iii) measure and compare Web Light's page load performance, (iv) discuss privacy concerns for users and publishers, and (v) investigate the potential use of Web Light as a censorship circumvention tool. |
14:30-15:00 |
Read Between the Lines: An Empirical Measurement of Sensitive Applications of Voice Personal Assistant Systems Faysal Hossain Shezan (University of Virginia), Hang Hu (Virginia Tech), Jiamin Wang (Virginia Tech), Gang Wang (University of Illinois at Urbana-Champaign) and Yuan Tian (University of Virginia).
AbstractVoice Personal Assistant (VPA) systems such as Amazon Alexa and Google Home have been used by tens of millions of households. Recent work demonstrated proof-of-concept attacks against their voice interface to invoke unintended applications or operations. However, there is still a lack of empirical understanding of what types of third-party applications VPA systems support, and what consequences these attacks may cause. In this paper, we perform an empirical analysis of the third-party applications of Amazon Alexa and Google Home to systematically assess the attack surfaces. A key part of the methodology is to characterize a given application by classifying the sensitive voice commands it accepts. We develop a natural language processing tool that classifies a given voice command along two dimensions: (1) whether the voice command is designed to insert an action or retrieve information; (2) whether the command is sensitive or nonsensitive. The tool combines a deep neural network and a keyword-based model, and uses Active Learning to reduce the manual labeling effort. The sensitivity classification is based on a user study (N=404) in which we measure the perceived sensitivity of voice commands. A ground-truth evaluation shows that our tool achieves over 95% accuracy for both types of classification. We apply this tool to analyze 77,957 Amazon Alexa applications and 4,813 Google Home applications (198,199 voice commands from Amazon Alexa, 13,644 voice commands from Google Home) over two years (2018-2019). In total, we identify 19,263 sensitive "action injection" commands and 5,352 sensitive "information retrieval" commands. These commands come from 4,596 applications (5.55% of all applications), most of which belong to the "smart home" category. While the percentage of sensitive applications is small, we show that the percentage is increasing over time from 2018 to 2019. |
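A toy sketch of the two-dimensional labeling described above appears below; the keyword lists and example commands are invented, and the paper's actual tool additionally uses a deep neural network and Active Learning, which this sketch omits.

```python
# Toy keyword-based sketch of the two-dimensional voice-command labeling
# (action injection vs. information retrieval, sensitive vs. nonsensitive).
# Keyword lists and example commands are invented for illustration only.
ACTION_KEYWORDS = ("unlock", "open", "turn on", "turn off", "set", "send")
SENSITIVE_KEYWORDS = ("door", "lock", "camera", "payment", "password", "alarm")

def label_command(command: str):
    text = command.lower()
    action = any(k in text for k in ACTION_KEYWORDS)
    sensitive = any(k in text for k in SENSITIVE_KEYWORDS)
    return ("action injection" if action else "information retrieval",
            "sensitive" if sensitive else "nonsensitive")

print(label_command("Alexa, unlock the front door"))   # ('action injection', 'sensitive')
print(label_command("What is the weather tomorrow?"))  # ('information retrieval', 'nonsensitive')
```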
15:00-15:30 |
An Intent-Based Automation Framework for Securing Dynamic Consumer IoT Infrastructures Vasudevan Nagendra (Stony Brook University), Arani Bhattacharya (Stony Brook University), Vinod Yegneswaran (SRI International), Amir Rahmati (Stony Brook University) and Samir Das (Stony Brook University).
AbstractConsumer IoT is characterized by heterogeneous devices with diverse functionality and programming interfaces. This lack of homogeneity makes the integration and security management of IoT infrastructures a daunting task for users and administrators. In this paper, we introduce VISCR, a Vendor-Independent policy Specification and Conflict Resolution engine that enables conflict-free policy specification and enforcement in IoT environments. VISCR converts the topology of the IoT infrastructure into a tree-based abstraction and translates existing policies from heterogeneous vendor-specific programming languages such as Groovy-based SmartThings, OpenHAB, IFTTT-based templates, and MUD-based profiles into a vendor-independent graph-based specification. Using these two, VISCR can automatically detect rogue policies, conflicts, and bugs for coherent automation. Upon detection, VISCR infers new policies and proposes them to users as alternatives to existing policies for fine-tuning and conflict-free enforcement. We evaluated VISCR using a dataset of 907 IoT apps, programmed using heterogeneous automation specifications in a simulated smart-building IoT infrastructure. In our experiments, VISCR exposed 342 of the 907 IoT apps as exhibiting one or more violations. VISCR detected 100% of the violations reported by the existing state-of-the-art tool, while detecting new types of violations in an additional 266 apps. In terms of performance, VISCR can generate 400 abstraction trees (used in specifying policies) with 100K leaf nodes in under 1.2 seconds. In our experiments, VISCR took 80.7 seconds to analyze our infrastructure of 907 apps, a 14.2× reduction compared to the state of the art. After the initial analysis, VISCR is capable of adopting new policies with sub-second latency to handle changes. |
Web Mining-B (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
13:30-14:00 |
The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing David Zeber (Mozilla), Sarah Bird (Mozilla), Camila Oliveira (Mozilla), Walter Rudametkin (INRIA), Ilana Segall (Mozilla), Fredrik Wollsen (Mozilla) and Martin Lopatka (Mozilla).
AbstractLarge-scale web crawls have emerged as the state of the art for studying characteristics of the Web, such as the prevalence of online tracking and browser fingerprinting. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the validity of using crawls as a proxy for human browsing data has not been well studied. Crawls may fail to capture the diversity of user environments, including operating systems, geolocation, cookies, as well as content in authenticated sessions, advertisement campaigns, and other dynamic content. Moreover, the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We observe noticeable variation between simultaneous crawls, run from different operating systems on both residential personal computers and cloud services, relative to the baseline variation measured across simultaneous crawls run from a single common environment. Additionally, we note substantial variation across a collection of crawls run sequentially over time, with the specific scripts loaded, fingerprinting resources encountered, and third party resources all becoming increasingly diverse over time. We also assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals clear differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains. |
14:00-14:30 |
Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web Syed Suleman Ahmad (University of Wisconsin-Madison), Muhammad Daniyal Dar (University of Iowa), Rishab Nithyanand (University of Iowa), Narseo Vallina-Rodriguez (IMDEA Networks/ICSI) and Muhammad Fareed Zaffar (LUMS).
AbstractData generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on different types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the reproducibility of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between different crawlers. We use this framework to show how the choice of crawler can impact our understanding of the web. |
14:30-15:00 |
Power-Law Graphs Have Minimal Scaling of Kemeny Constant for Random Walks Wanyue Xu (Fudan University), Yibin Sheng (Fudan University), Zuobai Zhang (Fudan University), Haibin Kan (Fudan University) and Zhongzhi Zhang (Fudan University).
AbstractThe mean hitting time from a node $i$ to a node $j$ selected randomly according to the stationary distribution of random walks is called the Kemeny constant, which has found various applications. It was proved that over all graphs with $N$ vertices, complete graphs have the exact minimum Kemeny constant, growing linearly with $N$. Here we study numerically or analytically the Kemeny constant on many sparse real-world and model networks with scale-free small-world topology, and show that their Kemeny constant also behaves linearly with $N$. Thus, sparse networks with scale-free and small-world topology are favorable architectures with optimal scaling of Kemeny constant. We then present a theoretically guaranteed estimation algorithm, which approximates the Kemeny constant for a graph in nearly linear time with respect to the number of edges. Extensive numerical experiments on model and real networks show that our approximation algorithm is both efficient and accurate. |
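For a small, concrete instance of the quantity studied above, the Kemeny constant of a random walk on an undirected graph can be computed from the eigenvalues of the transition matrix via the standard identity K = Σ_{i≥2} 1/(1 − λ_i), using the convention that the hitting time of a node to itself is zero. The sketch below, assuming NumPy and a small illustrative graph, is not the paper's nearly-linear-time estimation algorithm.

```python
# Kemeny constant of a random walk on a small illustrative graph, computed
# exactly from the transition-matrix eigenvalues (K = sum_{i>=2} 1/(1-lambda_i));
# the paper's contribution is a scalable approximation, which this is not.
import numpy as np

# Adjacency matrix of a triangle (0-1-2) with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)                  # random-walk transition matrix

eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]    # real for reversible walks
kemeny = float(np.sum(1.0 / (1.0 - eigvals[1:])))     # skip the leading eigenvalue 1
print(kemeny)
```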
15:00-15:15 |
Evolution of a Web-Scale Near Duplicate Image Detection System Andrey Gusev (Pinterest) and Jiajing Xu (Pinterest).
AbstractDetecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, the task is challenging at web scale, with image corpora containing billions of images. In this paper, we present an efficient system for detecting near duplicate images over 7 billion images. Our system consists of three stages: candidate generation, candidate selection, and clustering. We also demonstrate that this system can be used to greatly improve the accuracy of recommendations and search results across a number of real-world applications. In addition, we describe the evolution of the system over the course of six years, sharing experiences and lessons on how new systems are designed to accommodate organic content growth as well as the latest technology. Finally, we are releasing a human-labeled dataset of ~53,000 pairs of images introduced in this paper. |
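A candidate-generation stage of the kind mentioned above is often built on locality-sensitive hashing. The following toy sketch buckets embeddings by a random-hyperplane (SimHash-style) signature so that only images sharing a bucket are compared later; the embedding dimension, number of hyperplanes, and data are illustrative assumptions, not the paper's design.

```python
# Toy SimHash-style candidate generation: bucket image embeddings by the signs
# of random projections, then form candidate pairs only within buckets.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))      # stand-in image feature vectors
hyperplanes = rng.normal(size=(16, 128))       # 16 random projections -> 16-bit signature

signatures = (embeddings @ hyperplanes.T) > 0  # (1000, 16) boolean signatures
buckets = defaultdict(list)
for image_id, signature in enumerate(signatures):
    buckets[signature.tobytes()].append(image_id)

candidate_pairs = [(a, b) for ids in buckets.values()
                   for i, a in enumerate(ids) for b in ids[i + 1:]]
print(len(candidate_pairs))
```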
Research Tracks (3)
Web Mining-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
What Changed Your Mind: The Roles of Dynamic Topics and Discourse in Argumentation Process Jichuan Zeng (The Chinese University of Hong Kong), Jing Li (The Hong Kong Polytechnic University), Yulan He (The University of Warwick), Cuiyun Gao (The Chinese University of Hong Kong), Michael Lyu (The Chinese University of Hong Kong) and Irwin King (The Chinese University of Hong Kong).
AbstractIn a world full of uncertainty, debates and argumentation contribute to the progress of science and society. Despite increasing attention to characterizing human arguments, most progress made so far focuses on the debate outcome, largely ignoring the dynamic patterns in argumentation processes. This paper presents a study that automatically analyzes the key factors in argument persuasiveness, beyond simply predicting who will persuade whom. Specifically, we propose a novel neural model that is able to dynamically track the changes of latent topics and discourse in argumentative conversations, allowing the investigation of their roles in influencing the outcomes of persuasion. Extensive experiments have been conducted on argumentative conversations from both social media and the Supreme Court. The results show that our model outperforms state-of-the-art models in identifying persuasive arguments via explicitly exploring dynamic factors of topic and discourse. We further analyze the effects of topics and discourse on persuasiveness, and find that both are useful: topics provide concrete evidence, while superior discourse styles may bias participants, especially in social media arguments. In addition, we draw some findings from our empirical results, which will help people better engage in future persuasive conversations. |
16:30-17:00 |
Fast Generating A Large Number of Gumbel-Max Variables Yiyan Qi (Xi'an Jiaotong University), Pinghui Wang (Xi'an Jiaotong University), Yuanming Zhang (Xi'an Jiaotong University), Junzhou Zhao (Xi'an Jiaotong University), Guangjian Tian (Huawei Noah's Ark Lab) and Xiaohong Guan (Xi'an Jiaotong University).
AbstractThe well-known Gumbel-Max Trick for sampling from a categorical distribution (or more generally a nonnegative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element i (or a Gumbel-Max variable i) in proportion to its positive weight v_i, the Gumbel-Max Trick first computes a Gumbel random variable g_i for each positive-weight element i, and then samples the element i with the largest value of g_i + ln v_i. Recently, applications including similarity estimation and graph embedding require generating k independent Gumbel-Max variables from the elements of high-dimensional vectors. However, this is computationally expensive for large k (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, that reduces the time complexity from O(k n^+) to O(k ln k + n^+), where n^+ is the number of positive elements in the vector of interest. Instead of computing k independent Gumbel random variables directly, we find that there exists a technique to generate these variables in descending order. Using this technique, our method FastGM computes the variables g_i + ln v_i for all positive elements i in descending order. As a result, FastGM significantly reduces the computation time because we can stop computing Gumbel random variables early for many elements, especially those with small weights. Experiments on a variety of real-world datasets show that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy or incurring additional expenses. |
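The traditional trick that FastGM improves on is simple to state in code: draw one Gumbel(0, 1) variable per positive-weight element and take the argmax of g_i + ln v_i, repeated k times. A minimal NumPy sketch of that baseline (not the FastGM algorithm itself), with a hypothetical weight vector, might look like this:

```python
# Baseline Gumbel-Max Trick as described in the abstract: O(k * n^+), which is
# exactly the cost FastGM reduces. The weight vector below is illustrative.
import numpy as np

def gumbel_max_sample(weights, k, rng=None):
    """Draw k independent Gumbel-Max variables for the positive entries of `weights`."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    pos = np.flatnonzero(weights > 0)                    # only positive-weight elements matter
    log_v = np.log(weights[pos])
    samples = []
    for _ in range(k):
        g = rng.gumbel(size=pos.size)                    # one Gumbel(0, 1) draw per element
        samples.append(int(pos[np.argmax(g + log_v)]))   # argmax of g_i + ln(v_i)
    return samples

print(gumbel_max_sample([0.0, 2.0, 1.0, 7.0], k=5))
```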
17:00-17:30 |
Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework Han Zhang (Tsinghua University), Wenhao Zheng (Alibaba Youku), Charley Chen (Tsinghua University), Kevin Gao (Tsinghua University), Yao Hu (Alibaba Youku Cognitive and Intelligent Lab), Ling Huang (AHI Fintech) and Wei Xu (Tsinghua University).
AbstractSince label collection is prohibitively expensive and time-consuming, unsupervised methods are preferred in applications such as fraud detection. Meanwhile, such applications usually require modeling the intrinsic clusters in high-dimensional data, which often display heterogeneous statistical patterns, as the patterns of different clusters may appear in different dimensions. Existing methods propose to model the data clusters on selected dimensions, yet omitting any dimension globally may damage the pattern of certain clusters. To address these issues, we propose a novel unsupervised generative framework called FIRD, which utilizes adversarial distributions to fit and disentangle the heterogeneous statistical patterns. When applied to discrete spaces, FIRD effectively distinguishes synchronized fraudsters from normal users. In addition, FIRD also provides superior performance on anomaly detection datasets compared with SOTA anomaly detection methods (over 5% average AUC improvement). The significant experimental results on various datasets verify that the proposed method can better model the heterogeneous statistical patterns in high-dimensional data and benefit downstream applications. |
17:30-17:45 |
Solving Billion-Scale Knapsack Problems Xingwen Zhang (Ant Financial Services Group), Feng Qi (Ant Financial Services Group), Zhigang Hua (Ant Financial Services Group) and Shuang Yang (Ant Financial Services Group).
AbstractReal-world resource allocation tasks are often approached by solving knapsack problems (KPs), which are NP-hard and have been tractable only at a relatively small scale. This paper examines KPs in a slightly generalized form and shows that large-scale KPs can be solved nearly optimally in a scalable distributed paradigm via synchronous coordinate descent (SCD). The proposed algorithm can be implemented with off-the-shelf distributed computing frameworks (e.g., MPI, Hadoop, Spark) fairly easily. As an example, our implementation leads to one of the most efficient KP solvers known to date, and it is capable of solving resource allocation problems at an unprecedented scale (e.g., KPs with 1 billion decision variables and 1 billion constraints can be solved within 1 hour). Both synthetic tests and live A/B experiments were conducted to analyze the performance of our approach. The system has been deployed to production and is called on a daily basis, yielding significant business impact. |
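As a reminder of the problem form, the classical single-constraint 0/1 knapsack underlying the generalization can be stated and approximated in a few lines; the toy greedy heuristic below only illustrates the problem and is not the paper's distributed SCD solver.

```python
# Toy 0/1 knapsack with a value-per-weight greedy heuristic; illustrative only,
# not the paper's synchronous coordinate descent algorithm.
def greedy_knapsack(values, weights, capacity):
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    total_value, total_weight, chosen = 0.0, 0.0, []
    for i in order:
        if total_weight + weights[i] <= capacity:
            chosen.append(i)
            total_weight += weights[i]
            total_value += values[i]
    return total_value, chosen

print(greedy_knapsack(values=[10, 7, 12, 3], weights=[4, 3, 6, 1], capacity=8))  # (20.0, [3, 0, 1])
```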
17:45-18:00 |
Efficient Online Multi-Task Learning via Adaptive Kernel Selection Peng Yang (Baidu US) and Ping Li (Baidu).
AbstractConventional multi-task models restrict the task structure to be linearly related, which may not be suitable when the data is linearly nonseparable. To remedy this issue, we propose a kernel algorithm for online multi-task classification, as the large approximation space provided by reproducing kernel Hilbert spaces often contains an accurate function. Specifically, it maintains a local-global Gaussian distribution over each task model that guides the direction and scale of parameter updates. Nonetheless, optimizing over this space is computationally expensive. Moreover, most multi-task learning methods require access to the entire data for the learning algorithm, a luxury unavailable in large-scale streaming datasets. To address this issue, we propose a random sampling technique across multiple tasks for adaptive sketching. Instead of requiring labels for all inputs, the proposed algorithm decides whether to learn an input by considering the confidence of its related tasks over label prediction. Theoretically, the algorithm learned on actively sampled labels can achieve a result comparable with one learned on all labels. Empirically, the proposed algorithm is able to achieve promising learning efficacy, while reducing the computational complexity and labeling cost simultaneously. |
Social Network-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Traffic Flow Prediction via Spatial Temporal Graph Neural Network Xiaoyang Wang (Beijing Jiaotong University), Yao Ma (Michigan State University), Yiqi Wang (Michigan State University), Wei Jin (Michigan State University), Xin Wang (Changchun Institute of Technology), Jiliang Tang (Michigan State University), Caiyan Jia (Beijing Jiaotong University) and Jian Yu (Beijing Jiaotong University).
AbstractTraffic flow analysis, prediction and management are keystones for building smart cities in the new era. With the help of deep neural networks and big traffic data, we can better understand the latent patterns hidden in the complex transportation networks. The dynamics of traffic flow not only depends on the sequential patterns in the temporal dimension but also relies on other roads in the spatial dimension. Although there are existing works on predicting the future traffic flow dynamics, the majority of them have certain limitations on modeling both spatial and temporal dependencies. In this paper, we propose a novel spatial temporal graph neural network for traffic flow prediction, which can comprehensively capture spatial and temporal patterns. In particular, the framework offers a learnable positional attention mechanism to effectively aggregate information from adjacent roads. Meanwhile, it provides a sequential component to model the traffic flow dynamics which can exploit both local and global temporal dependencies. Experimental results on various real traffic datasets demonstrate the effectiveness of the proposed framework. |
16:30-17:00 |
Graph Attention Topic Modeling Network Liang Yang (Hebei University of Technology), Yuanfang Guo (Beihang University), Xiaochun Cao (Chinese Academy of Sciences), Junhua Gu (Hebei University of Technology), Di Jin (Tianjin University), Fan Wu (Hebei University of Technology) and Chuan Wang (Chinese Academy of Sciences).
AbstractTo alleviate the overfitting issue of Probabilistic Latent Semantic Indexing (pLSI), Latent Dirichlet Allocation (LDA) introduces Dirichlet priors for latent variables. Many subsequent correlated topic modeling approaches have been proposed to capture the rich correlations among topics that the independent occurrence assumption of the introduced Dirichlet priors fails to model. However, they usually suffer from high inference complexity. In this paper, we open up a new way to overcome the overfitting issue of pLSI by using amortized inference with word embeddings as input, instead of introducing Dirichlet priors as in LDA. For generative topic models, the large number of free latent variables is the root of overfitting. To reduce the number of parameters, amortized inference replaces the inference of latent variables with a function that possesses shared (amortized) learnable parameters. The number of shared parameters is fixed and independent of the scale of the corpus. To overcome the limited applicability of amortized inference to independent and identically distributed (i.i.d.) data, a novel graph neural network, the Graph Attention TOpic Network (GATON), is introduced to model the topic structure of non-i.i.d. documents according to the following two findings. First, pLSI can be interpreted as a stochastic block model (SBM) on a specific bipartite graph. Second, the graph attention network (GAT) can be explained as semi-amortized inference of the SBM, which relaxes the i.i.d. data assumption of vanilla amortized inference. GATON provides a novel way, i.e., a graph convolution operation, to integrate word similarity and word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as the bipartite graph topology, while word embeddings, which capture word similarity, are modeled as attributes of the word nodes, and the term frequency vector is treated as the attribute of the document node. Through the weighted (attention) graph convolution operation, the word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits document classification but also significantly refines the input word embeddings. |
17:00-17:30 |
Deep Adversarial Completion for Sparse Heterogeneous Information Network Embedding Kai Zhao (Beijing University of Posts and Telecommunications), Ting Bai (Beijing University of Posts and Telecommunications), Bin Wu (Beijing University of Posts and Telecommunications), Bai Wang (Beijing University of Posts and Telecommunications), Youjie Zhang (Beijing University of Posts and Telecommunications), Yuanyu Yang (Beijing University of Posts and Telecommunications) and Jian-Yun Nie (University of Montreal).
AbstractHeterogeneous information network (HIN) contains multiple types of entities and relations. Most of existing HIN embedding methods learn the semantic information based on the heterogeneous structures between different entities, which are implicitly assumed to be complete. However, in real world, it is common that some relations are partially observed due to privacy or other reasons, resulting in a sparse network, in which the structure may be incomplete, and the "unseen" links may also be positive due to the missing relations in data collection. To address this problem, we propose a novel and principled approach: a Multi-View Adversarial Completion Model (MV-ACM). Each relation space is characterized in a single viewpoint, enabling us to use the topological structural information in each view. Based on the multi-view architecture, an adversarial learning process is utilized to learn the reciprocity (i.e. complementary information) between different relations: In the generator, MV-ACM generates the complementary views by computing the similarity of the semantic representation of the same node in different views; while in the discriminator, MV-ACM discriminates whether the view is complementary by the topological structural similarity. Then we update the node's semantic representation by aggregating neighborhoods information from the syncretic views. We conduct systematical experiments on six real-world networks from varied domains: AMiner, PPI, YouTube, Twitter, Amazon and Alibaba. Empirical results show that MV-ACM significantly outperforms the state-of-the-art approaches for both link prediction and node classification tasks. |
17:30-17:45 |
Heterogeneous Graph Transformer Ziniu Hu (University of California, Los Angeles), Yuxiao Dong (Microsoft), Kuansan Wang (Microsoft) and Yizhou Sun (University of California, Los Angeles).
AbstractRecent years have witnessed the emergent success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous networks, in which all nodes or edges have the same feature space and representation distribution, making them infeasible for representing evolving heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous and dynamic graphs. To model heterogeneity, we design node- and edge-type dependent parameters to model the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To capture graph dynamics, rather than slicing the graph based on time, we keep the whole graph with each edge/node associated with its timestamp and propose the relative temporal encoding strategy to capture the dynamic dependency with arbitrary durations. To handle Web-scale data, we design the heterogeneous mini-batch graph sampling algorithm with an inductive timestamp assignment method for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 14.6%--24.0% on various downstream tasks. |
17:45-18:00 |
Structure-Feature based Graph Self-adaptive Pooling Liang Zhang (Xidian University), Xudong Wang (Xidian University), Hongsheng Li (Xidian University), Guangming Zhu (Xidian University), Peiyi Shen (Xidian University), Ping Li (ShangHai BNC), Xiaoyuan Lu (ShangHai BNC), Syed Afaq Ali Shah (The University of Western Australia) and Mohammed Bennamoun (The University of Western Australia).
AbstractVarious methods to deal with graph data have been proposed in recent years. However, most of these methods focus on graph feature aggregation rather than graph pooling. Besides, the existing top-k selection graph pooling methods have a few problems. First, to construct the pooled graph topology, current top-k selection methods evaluate the importance of the node from a single perspective only, which is simplistic and unobjective. Second, the feature information of unselected nodes is directly lost during the pooling process, which inevitably leads to a massive loss of graph feature information. To solve these problems mentioned above, we propose a novel graph self-adaptive pooling method with the following objectives: (1) to construct a reasonable pooled graph topology, structure and feature information of the graph are considered simultaneously, which provide additional veracity and objectivity in node selection; and (2) to make the pooled nodes contain sufficiently effective graph information, node feature information is aggregated before discarding the unimportant nodes; thus, the selected nodes contain information from neighbor nodes, which can enhance the use of features of the unselected nodes. Experimental results on four different datasets demonstrate that our method is effective in graph classification and outperforms state-of-the-art graph pooling methods. |
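The top-k selection pooling that the last abstract of this session takes as its starting point can be sketched in a few lines: score nodes, keep the top-k, and slice the feature and adjacency matrices accordingly. The sketch below, assuming NumPy and externally supplied importance scores, shows that baseline, not the proposed structure-feature self-adaptive method.

```python
# Minimal top-k selection graph pooling baseline (not the proposed method).
import numpy as np

def topk_pool(X, A, scores, ratio=0.5):
    """X: node features (n, d); A: adjacency (n, n); scores: node importance (n,)."""
    k = max(1, int(ratio * X.shape[0]))
    idx = np.argsort(scores)[::-1][:k]        # indices of the k highest-scoring nodes
    X_pooled = X[idx] * scores[idx, None]     # gate the kept features by their scores
    A_pooled = A[np.ix_(idx, idx)]            # adjacency induced on the kept nodes
    return X_pooled, A_pooled, idx

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) > 0.6).astype(float)
print(topk_pool(X, A, scores=rng.random(6))[0].shape)   # (3, 4)
```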
User Modeling-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Weakly Supervised Attention for Hashtag Recommendation using Graph Data Amin Javari (University of Illinois at Urbana-Champaign), Zhankui He (UCSD), Zijie Huang (University of California, Los Angeles), Jeetu Raj (University of Illinois at Urbana-Champaign) and Kevin Chang (University of Illinois at Urbana-Champaign).
AbstractPersonalized trending hashtag recommendation for users could substantially promote user engagement in microblogging websites: users can easily discover recent microblogs aligned with their interests and information needs. However, user profiling and making personalized recommendations on microblogging websites is challenging because most users tend not to generate content data. Our core idea to address the problem is to build a network-based interest profile of users and incorporate it into hashtag recommendation. Indeed, user's followee/follower connections implicitly indicate their interests. Considering that microblogging networks are scale-free networks, to maintain the efficiency and effectiveness of the model, rather than analyzing the entire network, we model users by focusing on their links towards popular/hub nodes. That is, hashtags and hub nodes in the network are projected into a shared latent space. To predict the relevance of a user to a hashtag, a projection of the user is built by aggregating the embeddings of her hub neighbors guided by an attention model and then compared with the target hashtag. Classically, attention models with low complexity can be trained in an end to end manner. However, due to the high complexity of our problem, we propose a novel weak supervision model for the attention component, which significantly improves the effectiveness of the model. We performed extensive experiments on two datasets collected from Twitter and Weibo, and the results confirm that our method substantially outperforms the baseline methods. |
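The core aggregation step described above, building a user projection from the embeddings of the hub nodes they follow, can be sketched as a simple attention-weighted average. In the sketch below, the embedding dimension and the use of the target hashtag embedding as the attention query are illustrative assumptions, not the paper's exact weakly supervised attention model.

```python
# Attention-weighted aggregation of a user's hub-neighbor embeddings (sketch).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def user_projection(hub_embeddings, query):
    """hub_embeddings: (m, d) embeddings of followed hubs; query: (d,) hashtag embedding."""
    scores = hub_embeddings @ query          # one relevance score per hub neighbor
    alpha = softmax(scores)                  # attention weights
    return alpha @ hub_embeddings            # weighted average = user representation

rng = np.random.default_rng(1)
hubs = rng.normal(size=(5, 16))              # 5 hub neighbors, 16-dimensional embeddings
hashtag = rng.normal(size=16)                # target hashtag embedding
print(user_projection(hubs, hashtag).shape)  # (16,)
```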
16:30-17:00 |
Intention Modeling from Ordered and Unordered Facets for Sequential Recommendation Xueliang Guo (School of Computer Science, Beijing Institute of Technology), Chongyang Shi (Beijing Institute of Technology School of Computer Science) and Chuanming Liu (Computer Science and Information Engineering, National Taipei University of Technology).
AbstractRecently, sequential recommendation has attracted substantial attention from researchers due to its status as an essential service for e-commerce. Accurately understanding user intention is an important factor in improving the performance of recommendation systems. However, user intention is highly time-dependent and flexible, so it is very challenging to learn the latent dynamic intention of users for sequential recommendation. To this end, in this paper, we propose a novel intention modeling from ordered and unordered facets (IMfOU) for sequential recommendation. Specifically, the proposed global and local item embedding (GLIE) comprehensively captures the sequential context information in the sequences and highlights the important features that users care about. We further design ordered preference drift learning (OPDL) and unordered purchase motivation learning (UPML) to capture the user's preference drift process and purchase motivation, respectively. By combining the user's dynamic preferences and current motivation, the model considers not only sequential dependencies between items but also flexible dependencies, and models user purchase intention more accurately from ordered and unordered facets. Evaluation results on three real-world datasets demonstrate that our proposed approach achieves better performance than state-of-the-art sequential recommendation methods, improving AUC by an average of 2.26%. |
17:00-17:30 |
Personalized Employee Training Course Recommendation with Career Development Awareness Chao Wang (University of Science and Technology of China), Hengshu Zhu (Baidu Inc.), Chen Zhu (Baidu Talent Intelligence Center), Xi Zhang (College of Management and Economics, Tianjin University), Enhong Chen (University of Science and Technology of China) and Hui Xiong (Rutgers University).
AbstractAs a major component of strategic talent management, learning and development (L&D) aims at improving individual and organizational performance by planning tailored training for employees to increase and improve their skills and knowledge. While many companies have developed learning management systems (LMSs) for facilitating the online training of employees, a long-standing important issue is how to achieve personalized training recommendations that take into account employees' needs for future career development. To this end, in this paper, we propose an explainable personalized online course recommender system for enhancing employee training and development. A unique perspective of our system is to jointly model both the employees' current competencies and their career development preferences in an explainable way. Specifically, the recommender system is based on a novel end-to-end hierarchical framework, namely the Demand-aware Collaborative Bayesian Variational Network (DCBVN). In DCBVN, we first extract latent interpretable representations of the employees' competencies from their skill profiles with autoencoding variational inference based topic modeling. Then, we develop an effective demand recognition mechanism for learning the personal career development demands of employees. In particular, all the above processes are integrated into a unified Bayesian inference view for obtaining both accurate and explainable recommendations. Finally, extensive experimental results on real-world data clearly demonstrate the effectiveness and interpretability of DCBVN, as well as its robustness in sparse and cold-start scenarios. |
17:30-17:45 |
Understanding User Behavior For Document Recommendation Xuhai Xu (University of Washington), Ahmed Hassan Awadallah (Microsoft), Susan T. Dumais (Microsoft), Farheen Omar (Microsoft), Bogdan Popp (Microsoft), Robert Rounthwaite (Microsoft) and Farnaz Jahanbakhsh (Massachusetts Institute of Technology).
AbstractPersonalized document recommendation systems aim to provide users with a quick shortcut to the documents they may want to access next, usually with an explanation about why the document is recommended. Previous work explored various methods on better recommendations and better explanations for different domains including news, movies, products, etc. However, there are few efforts that closely study how users react to the recommended items in a document recommendation scenario. We conducted a large-scale log study of users' interaction behavior with the explainable recommendation on one of the largest cloud document platforms. Our analysis reveals a number of factors, including display position, file type, authorship, recency of last access, and most importantly, the recommendation explanations, that are associated with whether users will recognize or open the recommended documents. Moreover, we specifically focus on explanations and conducted an online experiment to investigate the influence of different explanations on user behavior. Our analysis indicates that the recommendations help users access their documents significantly faster, but sometimes users miss a recommendation and resort to other more complicated methods to open the documents. Our results suggest opportunities to improve explanations and more generally the design of systems that provide and explain recommendations for documents. |
17:45-18:00 |
Deep Rating Elicitation for New Users in Collaborative Filtering Wonbin Kweon (Pohang University of Science and Technology), Seongku Kang (Pohang University of Science and Technology), Junyoung Hwang (Pohang University of Science and Technology) and Hwanjo Yu (Pohang University of Science and Technology).
AbstractRecent recommender systems have started to use rating elicitation, which asks new users to rate a small set of seed items in order to infer their preferences and improve the quality of initial recommendations. The key challenge of rating elicitation is to choose the most "representative" seed items to best infer the new users' preferences. The state-of-the-art approaches have two critical limitations: 1) they cannot capture the non-linear characteristics of collaborative filtering (CF) information; 2) they cannot fully consider the interactions among the whole seed itemset at a time, because they select the seed items in a greedy fashion. This paper proposes a novel end-to-end deep learning framework, called DRE, which chooses all the seed items at once with consideration of the non-linear interactions. To this end, it first defines categorical distributions to sample seed items from the entire itemset, and then trains both the categorical distributions and a neural reconstruction network to infer users' preferences on the remaining items from the CF information of the sampled items. Through end-to-end training, the categorical distributions learn to select the most representative seed items while reflecting the complex non-linear interactions. Experimental results show that DRE outperforms the state-of-the-art methods in recommendation quality by accurately inferring the new users' preferences, and that its seed itemset represents the latent space better than the seed itemsets obtained by other methods. |
Crowdsourcing (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Reputation Agent: Prompting Fair Reviews in Gig Markets Carlos Toxtli (West Virginia University), Angela Richmond (Universidad Nacional Autonoma de Mexico) and Saiph Savage (West Virginia University).
AbstractGig markets rely on reviews to help customers or employers identify the workers they want to hire. However, gig markets have been plagued with unfair assessments containing inaccurate reputation signals about workers that can not only limit workers’ future job opportunities, but can also result in workers not getting paid or even being terminated from the marketplace. Unfair reviews are generally created because employers have a hard time differentiating the factors within the workers' control and the ones that have little to do with their performance (e.g., when they complain about an Uber driver getting stuck in traffic). However, because market power is typically placed in the hands of employers, a bad worker review can result in the worker losing her entire livelihood. To address this problem, we present Reputation Agent, a review validation system that helps employers to generate fair reviews. Reputation Agent implements an intelligent interface that: (1) uses deep learning to automatically detect when an individual has included unfair factors into her review (factors that are outside the control of the gig worker, according to the policies of the market); and (2) prompts the individual to reconsider her review if she has incorporated unfair factors. To study the effectiveness of Reputation Agent, we conducted a controlled experiment over different gig markets. Our experiment illustrates that across markets, Reputation Agent, in contrast with traditional approaches, motivates customers and employers to review gig workers' performance more fairly. We discuss how tools that bring more transparency to employers about the policies of a gig market can help build empathy, spark discussions around the established gig market rules, and could be used to help platform maintainers identify potential injustices towards workers generated by their interfaces. Our vision is that with truth and transparency we can bring fairer treatment of gig workers. |
16:30-17:00 |
Attention Please: Your Attention Check Questions in Survey Studies Can Be Automatically Answered Weiping Pei (Colorado School of Mines), Arthur Mayer (Colorado School of Mines), Kaylynn Tu (Colorado School of Mines) and Chuan Yue (Colorado School of Mines).
AbstractAttention check questions have become commonly used in online surveys published on popular crowdsourcing platforms as a key mechanism to filter out inattentive respondents and improve data quality. However, little research considers the vulnerabilities of this important quality control mechanism that can allow attackers including irresponsible and malicious respondents to automatically answer attention check questions for efficiently achieving their goals. In this paper, we perform the first study to investigate such vulnerabilities, and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model, that combines convolutional neural network and weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset that consists of both original and augmented questions, and demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods, adding adversarial sentences and adding typos, for survey designers to mitigate the risks posed by AC-EasyPass; however, these methods are fragile due to their limitations from both technical and usability perspectives, underlining the challenging nature of defense. We hope our work will raise sufficient attention of the research community towards developing more robust attention check mechanisms. More broadly, our work intends to prompt the research community to seriously consider the emerging risks posed by the malicious use of machine learning techniques to the quality, validity, and trustworthiness of crowdsourcing and social computing. |
17:00-17:30 |
OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation Ines Arous (University of Fribourg), Jie Yang (Amazon Research), Mourad Khayati (University of Fribourg) and Philippe Cudre-Mauroux (University of Fribourg).
AbstractFinding social influencers is a fundamental task in many online applications ranging from brand marketing to opinion mining. Existing methods heavily rely on the availability of expert labels, whose collection is usually a laborious process even for domain experts. Using open-ended questions, crowdsourcing provides a cost-effective way to find a large number of social influencers in a short time. Individual crowd workers, however, only possess fragmented knowledge that is often of low quality. To tackle those issues, we present OpenCrowd, a unified Bayesian framework that seamlessly incorporates supervised learning and crowdsourcing for effectively finding social influencers. To infer a set of influencers, OpenCrowd bootstraps the learning process using a small number of expert labels and then jointly learns a feature-based answer quality model and the reliability of the workers. Model parameters and worker reliability are updated iteratively, allowing their learning processes to benefit from each other until an agreement on the quality of the answers is reached. We derive a principled optimization algorithm based on variational inference with efficient rules update for learning OpenCrowd parameters. Experimental results on finding social influencers in different domains show that our approach substantially improves state of the art by 11.5% AUC. |
Health (2)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
StageNet: Stage-Aware Neural Networks for Health Risk Prediction Junyi Gao (IQVIA), Cao Xiao (IQVIA), Yasha Wang (Peking University), Wen Tang (Peking University Health Science Center), Lucas Glass (IQVIA) and Jimeng Sun (Georgia Institute of Technology).
AbstractDeep learning has demonstrated success in health risk prediction especially for patients with chronic and progressing conditions. Most existing works focus on learning chronic disease patterns from longitudinal patient data, but pay little attention to the disease progression stage itself. To fill the gap, we propose a Stage-aware neural Network (StageNet) model to extract disease stage information from patient data and integrate it into risk prediction. StageNet is enabled by (1) a stage-aware long short-term memory (LSTM) module that extracts health stage variations in an unsupervised manner; (2) a stage-adaptive convolutional module that incorporates stage-related variation patterns into risk prediction. We evaluate StageNet on two real-world datasets and show that StageNet outperforms state-of-the-art models on both the risk prediction and patient subtyping tasks. Compared to the best baseline model, StageNet achieves up to 12% higher AUPRC for the risk prediction task on two real-world patient datasets. StageNet also achieves over 58% higher Calinski-Harabasz score (a cluster quality metric) for the patient subtyping task. |
16:30-17:00 |
DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction Xingyao Zhang (Tsinghua University), Cao Xiao (IQVIA), Lucas Glass (IQVIA) and Jimeng Sun (Georgia Institute of Technology).
AbstractClinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How to represent longitudinal patient EHR? How to extract complex logical rules from EC? Most existing works rely on manual rule-based extraction, which is time consuming and inflexible for complex inference. To address these challenges, we propose DeepEnroll, a cross-modal inference learning model to jointly encode enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. DeepEnroll applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent longitudinal patient EHR. In addition, DeepEnroll is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated DeepEnroll on the patient-trial matching task on real-world datasets. DeepEnroll outperformed the best baseline by up to 12.4% in average F1. |
17:00-17:30 |
REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild Rahul Duggal (Georgia Institute of Technology), Scott Freitas (Georgia Institute of Technology), Cao Xiao (IQVIA), Duen Horng Chau (Georgia Institute of Technology) and Jimeng Sun (Georgia Institute of Technology).
AbstractIn recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise; and (2) compact and energy-efficient. We propose REST, a new method that simultaneously tackles both issues via (1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization while (2) enforcing sparsity on whole filters. We demonstrate that REST produces highly robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), REST achieves a macro-F1 score of 0.69 vs. 0.33 for the Vanilla model in the presence of adversarial noise while obtaining 19x parameter reduction and 15x MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that REST allows models to achieve up to 17x energy reduction and 9x faster inference. |
Economics (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
A Data-Driven Metric of Incentive Compatibility Yuan Deng (Duke University), Sébastien Lahaie (Google), Vahab Mirrokni (Google) and Song Zuo (Google).
AbstractAn incentive-compatible auction incentivizes buyers to truthfully reveal their private valuations. However, many ad auction mechanisms deployed in practice are not incentive-compatible, such as first-price auctions (for display advertising) and the generalized second-price auction (for search advertising). We introduce a new metric to quantify incentive compatibility in both static and dynamic environments. Our metric is data-driven and can be computed directly through black-box auction simulations without relying on reference mechanisms or complex optimizations. We provide interpretable characterizations of our metric and prove that it is monotone in auction parameters for several mechanisms used in practice, such as soft floors and dynamic reserve prices. We empirically evaluate our metric on ad auction data from a major ad exchange and a major search engine to demonstrate its broad applicability in practice. |
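To make the idea of a data-driven, simulation-based incentive-compatibility measure concrete, the toy computation below estimates an IC-regret-style quantity for a single-item auction treated as a black box. This is a hedged sketch with made-up bid distributions and a plain first-price rule; it is not the metric defined in the paper.

```python
# Toy, simulation-based estimate of IC-regret for a single-item auction.
# The auction is a black box: we only observe allocation and price.
# Illustrative sketch, not the paper's metric.
import random

def run_auction(my_bid, other_bids, reserve=0.0):
    """First-price auction with a reserve: returns (won, price)."""
    if my_bid >= reserve and my_bid >= max(other_bids, default=0.0):
        return True, my_bid
    return False, 0.0

def ic_regret(value, bid_distribution, n_sims=10_000, multipliers=None):
    """Average extra utility achievable by shading/overbidding vs. bidding truthfully."""
    multipliers = multipliers or [m / 20 for m in range(1, 21)]   # 0.05 .. 1.0
    regret = 0.0
    for _ in range(n_sims):
        others = bid_distribution()
        won, price = run_auction(value, others)
        truthful_utility = (value - price) if won else 0.0
        best_utility = max(
            ((value - p) if w else 0.0)
            for m in multipliers
            for w, p in [run_auction(m * value, others)]
        )
        regret += best_utility - truthful_utility
    return regret / n_sims

# In a first-price auction, truthful bidding leaves utility on the table,
# so the estimated regret is strictly positive.
print(ic_regret(value=1.0, bid_distribution=lambda: [random.random() for _ in range(3)]))
```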
16:30-17:00 |
Liquidity in Credit Networks with Constrained Agents Geoffrey Ramseyer (Stanford University), Ashish Goel (Stanford University) and David Mazieres (Stanford University).
AbstractIn order to scale transaction rates for deployment across the global web, many cryptocurrencies have deployed so-called "Layer-2" networks of private payment channels. An idealized payment network behaves like a Credit Network, a model for transactions across a network of bilateral trust relationships. Credit Networks capture many aspects of traditional currencies as well as new virtual currencies and payment mechanisms. In the traditional credit network model, if an agent defaults, every other node that trusted it is vulnerable to loss. In a cryptocurrency context, trust is manufactured by capital deposits, and thus there arises a natural tradeoff between network liquidity (i.e. the fraction of transactions that succeed) and the cost of capital deposits. In this paper, we introduce constraints that bound the total amount of loss that the rest of the network can suffer if an agent (or a set of agents) were to default - equivalently, how the network changes if agents can support limited solvency guarantees. We show that these constraints preserve the analytical structure of a credit network. Furthermore, we show that aggregate borrowing constraints greatly simplify the network structure and in the payment network context achieve the optimal tradeoff between liquidity and amount of escrowed capital. |
17:00-17:30 |
Why Do Competitive Markets Converge to First-Price Auctions? Renato Paes Leme (Google), Balasubramanian Sivan (Google) and Yifeng Teng (University of Wisconsin-Madison).
AbstractWe consider a setting in which bidders participate in multiple auctions run by different sellers, and optimize their bids for the aggregate auction. We analyze this setting by formulating a game between sellers, where a seller’s strategy is to pick an auction to run. Our analysis aims to shed light on the recent change in the Display Ads market landscape: here, ad exchanges (sellers) were mostly running second price auctions earlier and over time they switched to variants of the first price auction, culminating in Google’s Ad Exchange moving to a first price auction in 2019. Our model and results offer an explanation for why the first price auction occurs as a natural equilibrium in such competitive markets. |
17:30-17:45 |
Envy, Regret, and Social Welfare Loss Riccardo Colini Baldeschi (Facebook, Core Data Science), Stefano Leonardi (Sapienza University of Rome), Okke Schrijvers (Facebook) and Eric Sodomka (Facebook, Core Data Science).
AbstractIncentive compatibility (IC) is a desirable property for any auction mechanism, including those used in online advertising. However, in real-world applications practical constraints and complex environments often result in mechanisms that lack incentive compatibility. Recently, several papers investigated the problem of deploying black-box statistical tests to determine if an auction mechanism is incentive compatible. Unfortunately, most of those methods are costly, since they require the execution of many counterfactual experiments. In this work, we show that similar results can be obtained using the notion of IC-Envy. The advantage of IC-Envy is its efficiency: it can be computed using only the auction's outcome. In particular, we focus on two relevant environments: position auctions and Ad Types auctions. For position auctions, we show that for a large class of pricing schemes (which includes e.g. VCG and GSP), IC-Envy >= IC-Regret (and IC-Envy = IC-Regret under mild supplementary conditions). Next, we consider non-separable CTRs in the Ad Types environment. In this setting, we show that for a generalization of the GSP mechanism IC-Envy >= IC-Regret holds as well. Our theoretical results are complemented by showing that, in the position auction environment, IC-Envy can be used to bound the loss in social welfare due to advertisers' untruthful behavior. Finally, we show experimentally that IC-Envy can be used as a feature to predict IC-Regret in settings not covered by the theoretical results. In particular, using IC-Envy yields better results than training models using only price and value features. |
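The appeal of IC-Envy is that it needs only the realized outcome of a position auction. The numerical sketch below shows one plausible envy computation with made-up CTRs, values, and prices; it illustrates the general idea rather than the paper's exact formulation.

```python
# Sketch of an envy measure in a position auction: for each advertiser, compare
# their realized utility with the utility they would get from any other slot at
# that slot's realized price. Values, CTRs and prices are made up.

def ic_envy(values, ctrs, prices):
    """values[i]: value per click of advertiser i (assigned to slot i).
    ctrs[j]: click-through rate of slot j. prices[j]: realized payment for slot j."""
    envy = []
    for i, v in enumerate(values):
        own_utility = ctrs[i] * v - prices[i]
        best_other = max(ctrs[j] * v - prices[j] for j in range(len(values)))
        envy.append(max(0.0, best_other - own_utility))
    return envy

# Three slots, advertisers assigned in decreasing value order.
print(ic_envy(values=[5.0, 4.0, 2.0], ctrs=[0.30, 0.20, 0.10], prices=[1.1, 0.6, 0.2]))
```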
17:45-18:00 |
Private Data Manipulation in Optimal Sponsored Search Auction Xiaotie Deng (Peking University), Tao Lin (Peking University) and Tao Xiao (Shanghai Jiao Tong University).
AbstractThe sponsored search auction was the first successful mechanism to commercialize an Internet service, roughly 20 years ago. It is a market between a seller of online advertisement slots and many buyers who purchase the slots to place their adverts. Conceptually, the auction is repeated billions of times, making it an atypical auction in which the market maker can observe buyers' data and adapt the auction protocol to its knowledge of buyers' value distributions. We formulate the auction under the above scenario as a Private Data Manipulation game between the seller and buyers: the seller first announces an auction whose allocation and payment rules are based on the buyers' distributions, then every buyer submits a value distribution for the auction (implemented by its submitted data following this distribution), and finally the allocation and payment rules are carried out. We are interested in whether and how rational buyers would submit value distributions. Taking this consideration into account, we re-evaluate the theory, methodology and techniques that have been the most intensively studied in Internet economics. |
Systems (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Time |
Title and Authors (Presenter) |
16:00-16:30 |
AutoMAP: Diagnose Your Microservice-based Web Applications Automatically Meng Ma (Peking University), Ping Wang (Peking University), Jing Min Xu (IBM Research - China), Yuan Wang (IBM CRL), Pengfei Chen (Sun Yat-sen University) and Zonghua Zhang (IMT Lille Douai, Institut Mines-Télécom).
AbstractThe high complexity and dynamics of the microservice architecture make its application diagnosis extremely challenging. In this study, we design a novel tool, named AutoMAP, which enables dynamic generation of service correlations and automated diagnosis leveraging multiple types of metrics. In AutoMAP, we propose the concept of an anomaly behavior graph to describe the correlations between services associated with different types of metrics. Two binary operations, as well as a similarity function on the behavior graph, are defined to help AutoMAP choose an appropriate diagnosis metric in any particular scenario. Following the behavior graph, we design a heuristic investigation algorithm using forward, self, and backward random walks, with the objective of identifying the root cause services. To demonstrate the strengths of AutoMAP, we develop a prototype and evaluate it in both a simulated environment and a real-world enterprise cloud system. Experimental results clearly indicate that AutoMAP achieves over 90% precision, which significantly outperforms other selected baseline methods. AutoMAP can be quickly deployed in a variety of microservice-based systems without any system knowledge. It also supports the introduction of expert knowledge to improve accuracy. |
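The heuristic investigation step can be pictured as a plain random walk over an edge-weighted correlation graph that mixes forward, backward, and self transitions and ranks services by visit count. The sketch below is a simplified illustration under assumed edge weights and transition probabilities; it is not the AutoMAP algorithm or its metric-selection logic.

```python
# Simplified random-walk root-cause ranking over a service correlation graph.
# Edges are weighted by anomaly correlation; the walk mixes forward, backward
# and self steps and ranks services by how often they are visited.
import random
from collections import Counter

def rank_root_causes(edges, start, steps=10_000, p_forward=0.6, p_backward=0.3):
    """edges: dict (src, dst) -> correlation weight. start: anomalous front-end service."""
    fwd, bwd = {}, {}
    for (s, d), w in edges.items():
        fwd.setdefault(s, []).append((d, w))
        bwd.setdefault(d, []).append((s, w))

    def pick(options):
        total = sum(w for _, w in options)
        r, acc = random.uniform(0, total), 0.0
        for node, w in options:
            acc += w
            if r <= acc:
                return node
        return options[-1][0]

    visits, node = Counter(), start
    for _ in range(steps):
        u = random.random()
        if u < p_forward and fwd.get(node):
            node = pick(fwd[node])           # follow the direction of dependency
        elif u < p_forward + p_backward and bwd.get(node):
            node = pick(bwd[node])           # step back toward callers
        # otherwise: self step, stay on the current node
        visits[node] += 1
    return visits.most_common()

edges = {("frontend", "cart"): 0.2, ("cart", "db"): 0.9, ("frontend", "search"): 0.1}
print(rank_root_causes(edges, start="frontend"))
```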
16:30-17:00 |
Comparing the Effects of DNS, DoT, and DoH on Web Performance Austin Hounsel (Princeton University), Kevin Borgolte (Princeton University), Paul Schmitt (Princeton University), Jordan Holland (Princeton University) and Nick Feamster (University of Chicago).
AbstractNearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user's browsing history and even what smart devices they are using at home. In response to these privacy concerns, two new protocols have been proposed: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT). Instead of sending DNS queries and responses in the clear, DoH and DoT establish encrypted connections between users and resolvers. By doing so, these protocols provide privacy and security guarantees that traditional DNS (Do53) lacks. In this paper, we measure the effect of Do53, DoT, and DoH on query response times and page load times from five global vantage points. We find that although DoH and DoT response times are generally higher than Do53, both protocols can perform better than Do53 in terms of page load times. However, as throughput decreases and substantial packet loss and latency are introduced, web pages load fastest with Do53. Additionally, web pages successfully load more often with Do53 and DoT than DoH. Based on these results, we provide several recommendations to improve DNS performance, such as opportunistic partial responses and wire format caching. |
17:00-17:30 |
Understanding the Performance Costs and Benefits of Privacy-focused Browser Extensions Kevin Borgolte (Princeton University) and Nick Feamster (University of Chicago).
AbstractAdvertisements and behavioral tracking have become an invasive nuisance on the Internet in recent years. Privacy advocates and expert users consider the invasion significant enough to warrant the use of ad blockers and anti-tracking browser extensions. At the same time, one of the largest advertisement companies in the world, Google, is developing the most popular browser, Google Chrome. This conflict of interest, namely developing a browser (a user agent) while being financially motivated to track users' online behavior, possibly violating their privacy expectations, has not gone unnoticed. As a matter of fact, Google recently sparked an outrage when proposing changes to how Chrome extensions can inspect and modify requests to "improve extension performance and privacy," which would render existing privacy-focused extensions inoperable. In this paper, we analyze how eight popular privacy-focused browser extensions for Google Chrome and Mozilla Firefox, the two desktop browsers with the highest market share, affect browser performance. We measure browser performance through several metrics focused on user experience, such as page-load times, number of fetched resources, as well as response sizes. To address potential regional differences in advertisements or tracking, such as those influenced by the European General Data Protection Regulation (GDPR), we perform our study from two vantage points, the United States of America and Germany. Moreover, we also analyze how these extensions affect system performance, in particular CPU time, which serves as a proxy indicator for the battery runtime of mobile devices. Contrary to Google's claims that extensions which inspect and block requests negatively affect browser performance, we find that a browser with privacy-focused request-modifying extensions performs similarly or better on our metrics compared to a browser without extensions. In fact, even a combination of such extensions performs no worse than a browser without any extensions. Our results highlight that privacy-focused extensions not only improve users' privacy, but can also improve users' browsing experience. |
Research Tracks (4)
Web Mining-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog Shen Gao (Peking University), Xiuying Chen (Peking University), Chang Liu (Peking University), Li Liu (INCEPTION INSTITUTE OF ARTIFICIAL INTELLIGENCE), Dongyan Zhao (Peking University) and Rui Yan (Peking University).
AbstractStickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works automatically select sticker responses by matching the text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all the stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. Another challenge is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolutional sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results by a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance for all commonly-used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 350K multi-turn dialog and sticker pairs. |
11:00-11:30 |
MetaNER: Named Entity Recognition with Meta-Learning Jing Li (Inception Institute of Artificial Intelligence), Shuo Shang (Inception Institute of Artificial Intelligence) and Ling Shao (Inception Institute of Artificial Intelligence).
AbstractRecent advances in named entity recognition (NER) using deep neural models have yielded state-of-the-art performance on single domain data such as newswires. However, they still suffer from (i) requiring massive amounts of training data to avoid overfitting; (ii) huge performance degradation when there is a domain shift in the data distribution between training and testing. To make an NER system more broadly useful, it is crucial to reduce its training data requirements and transfer knowledge to other domains. In this paper, we investigate the problem of domain adaptation for NER under homogeneous and heterogeneous settings. We propose MetaNER, a novel meta-learning approach for domain adaptation in NER. Specifically, MetaNER incorporates meta-learning and adversarial training strategies to encourage robust, general and transferable representations for sequence labeling. The key advantage of MetaNER is that it is capable of accurately and quickly adapting to new unseen domains with a small amount of annotated data from those domains. We extensively evaluate MetaNER on multiple datasets under homogeneous and heterogeneous settings. The experimental results show that MetaNER achieves state-of-the-art performance against eight baselines. Impressively, MetaNER surpasses the in-domain performance using only 16.17% and 34.76% of target domain data on average for homogeneous and heterogeneous settings, respectively. We conduct experiments to further analyze the parameter settings and architectural choices. We also present a study for qualitative analysis. |
11:30-12:00 |
Generating Representative Headlines for News Stories Xiaotao Gu (University of Illinois at Urbana-Champaign), Yuning Mao (University of Illinois at Urbana-Champaign), Jiawei Han (University of Illinois at Urbana-Champaign), Jialu Liu (Google), You Wu (Google), Cong Yu (Google), Daniel Finnie (Google), Hongkun Yu (Google), Jiaqi Zhai (Google) and Nicholas Zukoski (Google).
AbstractMillions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that are reporting the same event into news stories is a common way of assisting readers in their news consumption. However, it remains a challenging research problem to efficiently and effectively generate a representative headline for each story. Automatic summarization of a document set has been studied for decades, while few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture most information with least redundancy, headlines aim to capture information jointly shared by the story articles in short length, and exclude information that is too specific to each individual article. In this work, we study the problem of generating representative headlines for news stories. We develop a distant supervision approach to train large-scale generation models without any human annotation. This approach centers on two technical components. First, we propose a multi-level pre-training framework that incorporates massive unlabeled corpus with different quality-vs.-quantity balance at different levels. We show that models trained within this framework outperform those trained with pure human curated corpus. Second, we propose a novel self-voting-based article attention layer to extract salient information shared by multiple articles. We show that models that incorporate this layer are robust to potential noises in news stories and outperform existing baselines with or without noises. We can further enhance our model by incorporating human labels, and we show our distant supervision approach significantly reduces the demand on labeled data. Finally, to serve the research community, we publish the first manually curated benchmark dataset, NewSHead, which contains 367k stories (each with 3-5 articles), 6.5 times larger than the current largest multi-document summarization dataset. |
12:00-12:15 |
Leveraging Context for Neural Question Generation in Open-domain Dialogue Systems Yanxiang Ling (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology), Fei Cai (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology), Honghui Chen (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology) and Maarten de Rijke (Informatics Institute, University of Amsterdam).
AbstractQuestion generation in open-domain dialogue systems is a challenging but less-explored task. It is aimed at enhancing interactiveness and persistence of human-machine interactions. Previous work mainly focuses on question generation in the setting of single-turn dialogues, or investigates it as a data augmentation method for machine comprehension. We propose a Context-augmented Neural Question Generation (CNQG) model that leverages the conversational context to generate questions for promoting interactiveness and persistence of multi-turn dialogues. More specifically, we formulate the task of question generation as a two-stage process. First, we employ an encoder-decoder framework to predict a question pattern, which denotes a set of representative interrogatives, and identify the potential topics from the conversational context by employing point-wise mutual information. Then, we generate the question by decoding the concatenation of the current dialogue utterance, the pattern, and the topics with an attention mechanism. To the best of our knowledge, ours is the first work on question generation in multi-turn open-domain dialogue systems. Our experimental results on two publicly available multi-turn conversation datasets show that CNQG outperforms the state-of-the-art baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. In addition, we find that CNQG allows one to efficiently distill useful features from long contexts, and maintain robust effectiveness even for short contexts. |
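The topic-identification step relies on point-wise mutual information between words; the minimal PMI computation below, over a toy co-occurrence corpus, illustrates that signal. It is a sketch of the general PMI idea, not the CNQG pipeline itself.

```python
# Point-wise mutual information (PMI) between words, computed from sentence-level
# co-occurrence counts. A toy sketch of the topic-selection signal; not the full
# CNQG pipeline.
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    word_counts, pair_counts, n = Counter(), Counter(), len(sentences)
    for sent in sentences:
        words = set(sent.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    pmi = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n
        p_x, p_y = word_counts[x] / n, word_counts[y] / n
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

corpus = [
    "I watched a great movie yesterday",
    "the movie had a great soundtrack",
    "yesterday was a rainy day",
]
table = pmi_table(corpus)
print(sorted(table.items(), key=lambda kv: -kv[1])[:5])
```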
12:15-12:30 |
Don’t Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing Subendhu Rongali (University of Massachusetts Amherst), Luca Soldaini (Amazon Alexa Search), Emilio Monti (Amazon Alexa) and Wael Hamza (Amazon Alexa AI).
AbstractVirtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic parsing component to understand which action(s) to execute for an utterance spoken by its users. Traditionally, rule-based or statistical slot-filling systems have been used to parse "simple" queries; that is, queries that contain a single action and can be decomposed into a set of non-overlapping entities. More recently, shift-reduce parsers have been proposed to process more complex utterances. These methods, while powerful, impose specific limitations on the type of queries that can be parsed; namely, they require a query to be representable as a parse tree. In this work, we propose a unified architecture based on Sequence to Sequence models and Pointer Generator network to handle both simple and complex queries. Unlike other works, our approach does not impose any restriction on the semantic parse schema. Furthermore, experiments show that it achieves state-of-the-art performance on three publicly available datasets (ATIS, SNIPS, Facebook TOP), improving exact match accuracy by 3.4% to 13.2% relative to previous systems. Finally, we show the effectiveness of our approach on two internal datasets. |
Social Network-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Adversarial Attack on Community Detection by Hiding Individuals Jia Li (The Chinese University of Hong Kong), Honglei Zhang (Georgia Institute of Technology), Zhichao Han (The Chinese University of Hong Kong), Yu Rong (Tencent AI Lab), Hong Cheng (The Chinese University of Hong Kong) and Junzhou Huang (Tencent AI Lab).
AbstractIt has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations added, can cause deep graph models to fail on node/graph classification tasks. In this paper, we extend adversarial graphs to the community detection problem, which is much more difficult. We focus on black-box attacks and aim to hide targeted individuals from the detection of deep graph community detection models, which has many applications in real-world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. We propose an iterative learning framework that takes turns updating two modules: one working on constrained graph generation and the other on the surrogate community detection model. We also find that the adversarial graphs generated by our method can be transferred to other learning-based community detection models. |
11:00-11:30 |
Beyond Rank-1: Discovering Rich Community Structure in Multi-Aspect Graphs Ekta Gujral (University of California, Riverside), Ravdeep Pasricha (University Of California Riverside) and Evangelos Papalexakis (University of California Riverside).
AbstractHow are communities in real multi-aspect or multi-view graphs structured? How can we effectively and concisely summarize and explore those communities in a high-dimensional, multi-aspect graph without losing important information? State-of-the-art studies focused on patterns in single graphs, identifying structures in a single snapshot of a large network or in time-evolving graphs and stitching them over time. However, to the best of our knowledge, there is no method that discovers and summarizes community structure from a multi-aspect graph by jointly leveraging information from all aspects. The state of the art in multi-aspect/tensor community extraction is limited to discovering clique structure in the extracted communities, or even worse, imposing a clique structure where it does not exist. In this paper, we bridge that gap by empowering tensor-based methods to extract rich community structure from multi-aspect graphs. In particular, we introduce cLL1, a novel constrained Block Term Tensor Decomposition, that is generally capable of extracting higher than rank-1 but still interpretable structure from a multi-aspect dataset. Subsequently, we propose RICHCOM, a community structure extraction and summarization algorithm that leverages cLL1 to identify rich community structure (e.g., cliques, stars, chains, etc.) while leveraging higher-order correlations between the different aspects of the graph. Our contributions are four-fold: (a) Novel algorithm: we develop cLL1, an efficient framework to extract rich and interpretable structure from general multi-aspect data; (b) Graph summarization and exploration: we provide cLL1B, a summarization and encoding scheme to discover and explore structures of communities identified by cLL1; (c) Multi-aspect graph generator: we provide a simple and effective synthetic multi-aspect graph generator; and (d) Real-world utility: we present empirical results on small and large real datasets that demonstrate performance on par with or superior to existing state-of-the-art. |
11:30-12:00 |
Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS Shweta Jain (University of California, Santa Cruz) and C. Seshadhri (University of California, Santa Cruz).
AbstractClique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, all the more so since the search space for near-cliques is orders of magnitude larger than that of cliques. We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges, in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x-100x speedup over existing algorithms for counting near-cliques. Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow. |
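For intuition about what is being counted, here is a brute-force enumerator of k-node near-cliques (subsets missing exactly a given number of edges). It only works on tiny graphs, which is exactly why sampling approaches like the one in the paper matter; it is not the Turán Shadow algorithm.

```python
# Brute-force count of k-node near-cliques (subsets missing exactly `missing`
# edges). Exponential in n -- usable only on tiny graphs, which is why
# sampling approaches are needed at scale.
from itertools import combinations

def count_near_cliques(edges, k, missing=1):
    nodes = sorted({v for e in edges for v in e})
    edge_set = {frozenset(e) for e in edges}
    full = k * (k - 1) // 2
    count = 0
    for subset in combinations(nodes, k):
        present = sum(1 for pair in combinations(subset, 2) if frozenset(pair) in edge_set)
        if present == full - missing:
            count += 1
    return count

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (0, 4), (1, 4)]
print(count_near_cliques(edges, k=4, missing=1))   # -> 3 four-node sets missing one edge
```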
12:00-12:15 |
Clustering with a faulty oracle Kasper Green Larsen (Aarhus University), Michael Mitzenmacher (Harvard University) and Charalampos Tsourakakis (Boston University).
AbstractClustering, i.e., finding groups in the data, is a problem that permeates multiple fields of science and engineering. Recently, the problem of clustering with a noisy oracle has drawn attention due to various applications including crowdsourced entity resolution [verroios2015entity], and predicting signs of interactions in large-scale online social networks [leskovec2010signed, leskovec2010predicting]. Here, we consider the following fundamental model for two clusters as proposed by Mitzenmacher and Tsourakakis [mitzenmacher2016predicting], and Mazumdar and Saha [mazumdar2017clustering]: there exist $n$ items belonging to two unknown groups. We are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $q$, where $0 < q < \frac{1}{2}$; let $\delta = 1 - 2q > 0$ be the bias. In this work, we provide a polynomial time algorithm that recovers all signs correctly with high probability in the presence of noise with $O\left(\frac{n \log n}{\delta^2} + \frac{\log^2 n}{\delta^6}\right)$ queries. This is the best known result for this problem for all but tiny $\delta$, improving on the current state-of-the-art due to Mazumdar and Saha [mazumdar2017clustering]. |
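A concrete way to think about the noisy same-cluster oracle is the naive baseline below, which repeats each pairwise query against a single reference item and takes a majority vote. It recovers the two clusters but spends far more queries than necessary, which is the kind of baseline the paper's query-efficient algorithm improves on. The oracle here is simulated and the repeat count is an arbitrary illustrative choice; this is not the paper's algorithm.

```python
# Naive two-cluster recovery with a noisy same-cluster oracle: repeat every
# query against a fixed reference item and take a majority vote. Simple but
# query-hungry (n * repeats queries). q is the corruption probability.
import random

def make_noisy_oracle(true_labels, q):
    def oracle(i, j):
        truth = true_labels[i] == true_labels[j]
        return truth if random.random() > q else not truth
    return oracle

def recover_clusters(n, oracle, repeats=25):
    labels = [0] * n                       # item 0 is the reference
    for i in range(1, n):
        yes = sum(oracle(0, i) for _ in range(repeats))
        labels[i] = 0 if yes > repeats / 2 else 1
    return labels

truth = [0, 0, 1, 0, 1, 1, 0, 1]
oracle = make_noisy_oracle(truth, q=0.2)
print(recover_clusters(len(truth), oracle))
```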
12:15-12:30 |
One2Multi Graph Autoencoder for Multi-view Graph Clustering Shaohua Fan (Beijing University of Posts and Telecommunications), Xiao Wang (Beijing University of Posts and Telecommunications), Chuan Shi (Beijing University of Posts and Telecommunications), Emiao Lu (Tencent), Ken Lin (Tencent) and Bai Wang (Beijing University of Posts and Telecommunications).
AbstractMulti-view graph clustering, which seeks a partition of the graph with multiple views that often provide more comprehensive yet complex information, has received considerable attention in recent years. Although some efforts have been made for multi-view graph clustering and achieve decent performance, most of them employ shallow models to deal with the complex relations within multi-view graphs, which may seriously restrict the capacity for modeling multi-view graph information. In this paper, we make the first attempt to employ deep learning techniques for attributed multi-view graph clustering, and propose a novel task-guided One2Multi graph autoencoder clustering framework. The One2Multi graph autoencoder is able to learn node embeddings by employing one informative graph view and content data to reconstruct multiple graph views. Hence, the shared feature representation of multiple graphs can be well captured. Furthermore, a self-training clustering objective is proposed to iteratively improve the clustering results. By integrating the self-training and autoencoder's reconstruction into a unified framework, our model can jointly optimize the cluster label assignments and embeddings suitable for graph clustering. Experiments on real-world attributed multi-view graph datasets well validate the effectiveness of our model. |
User Modeling-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Next Point-of-Interest Recommendation on Resource-Constrained Mobile Devices Qinyong Wang (The University of Queensland), Hongzhi Yin (The University of Queensland), Tong Chen (The University of Queensland), Zi Huang (The University of Queensland), Hao Wang (Alibaba AI Labs), Yanchang Zhao (CSIRO) and Quoc Viet Hung Nguyen (Griffith University).
AbstractIn the modern tourism industry, next point-of-interest (POI) recommendation is one of the most important mobile services as it effectively aids hesitating travelers in deciding the next POI to visit. Currently, most next POI recommender systems are built upon a cloud-based paradigm, where the recommendation models (usually deep learning-based) are trained and deployed on the powerful cloud servers. When a recommendation request is made by a user via mobile devices, the current contextual information (e.g., location and time) will be uploaded to the cloud servers to help the well-trained models generate personalized recommendation results. However, in reality, this paradigm heavily relies on high-quality network connectivity, and is subject to a high energy footprint in operation and increasing privacy concerns among the public. To bypass the defects of the cloud-based recommendation paradigm, in this paper, we propose a novel Light Location Recommender System (LLRec) to perform next POI recommendation locally on resource-constrained mobile devices. To make LLRec fully compatible with the limited computing resources and memory space, we leverage FastGRNN, a lightweight but effective gated Recurrent Neural Network (RNN) as its main building block, and significantly compress the model size by adopting the tensor-train composition in the embedding layer. As a compact model, LLRec maintains its robustness via an innovative teacher-student training framework, where a powerful teacher model is trained on the cloud to learn essential knowledge from available contextual data, and the simplified student model LLRec is trained under the guidance of the teacher model. The final LLRec is downloaded and deployed on users' mobile devices to generate accurate recommendations solely utilizing users' local data. As a result, LLRec significantly reduces the dependency on cloud servers, thus allowing for next POI recommendation in a stable, cost-effective and secure way. Extensive experiments on two large-scale recommendation datasets further demonstrate the superiority of our proposed solution. |
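The teacher-student training described above follows the standard knowledge-distillation recipe; a generic PyTorch-style loss of that form is sketched below. Temperature, weighting, and tensor shapes are illustrative assumptions, not LLRec's exact objective or architecture.

```python
# Generic knowledge-distillation loss: the compact on-device student is trained
# both on the ground-truth next-POI labels and on the soft predictions of a
# larger cloud-side teacher. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Hard-label term: usual cross-entropy on the observed next POI.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's softened distribution over POIs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(8, 100)          # batch of 8, 100 candidate POIs
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```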
11:00-11:30 |
Reinforced Negative Sampling over Knowledge Graph for Recommendation Xiang Wang (National University of Singapore), Yaokun Xu (Southeast University), Xiangnan He (University of Science and Technology of China), Yixin Cao (National University of Singapore), Meng Wang (HeFei University of Technology) and Tat-Seng Chua (National University of Singapore).
AbstractProperly handling missing data is a fundamental challenge in recommendation. Most present work performs negative sampling from missing data to supply the training of recommender models with negative signals. Nevertheless, existing negative sampling strategies, either static or dynamic ones, are insufficient to yield high-quality negative samples — both informative to model training and reflective of users' real tastes. In this work, we hypothesize that an item knowledge graph (KG), which provides rich and unbiased relations among users, items, and KG entities, could be useful to infer informative and factual negative samples. We develop a new negative sampling model, Knowledge Graph Policy Network (KGPolicy), which works as a reinforcement learning agent to explore high-quality negatives. Specifically, by conducting our designed exploring operations, it navigates from the target positive interaction, adaptively receives attribute-based negative signals, and ultimately yields a potential negative item to train the recommender. Empirically, matrix factorization (MF) equipped with KGPolicy achieves significant improvements over both state-of-the-art sampling methods like DNS and IRGAN, and KG-enhanced recommender models like RippleNet and KGAT. Further analysis of how the knowledge graph facilitates recommender learning provides insights into knowledge-aware negative sampling. Code and parameter settings will be released upon acceptance. |
11:30-12:00 |
Future Data Helps Training: Modeling Future Contexts for Session-based Recommendation Fajie Yuan (Tencent), Xiangnan He (University of Science and Technology of China), Haochuan Jiang (Tencent), Guibing Guo (Northeastern University), Jian Xiong (tencent), Zhezhao Xu (tencent) and Yilin Xiong (Tencent).
AbstractSession-based recommender systems have attracted much attention recently. To capture the sequential dependencies, existing methods resort either to data augmentation techniques or to left-to-right style autoregressive training. Since these methods aim to model the sequential nature of user behaviors, they ignore the future data of a target interaction when constructing the prediction model for it. However, we argue that the future interactions after a target interaction, which are also available during training, provide valuable signals on user preference and can be used to enhance the recommendation quality. Properly integrating future data into model training, however, is non-trivial to achieve, since it disobeys machine learning principles and can easily cause data leakage. To this end, we propose a new encoder-decoder framework named Gap-filling based Recommender (GRec), which trains the encoder and decoder by a gap-filling mechanism. Specifically, the encoder takes a partially-complete session sequence (where some items are masked on purpose) as input, and the decoder predicts these masked items conditioned on the encoded representation. We instantiate the general GRec framework using a convolutional neural network with sparse kernels, giving consideration to both accuracy and efficiency. We conduct experiments on two real-world datasets covering short-, medium-, and long-range user sessions, showing that GRec significantly outperforms the state-of-the-art sequential recommendation methods. More empirical studies verify the high utility of modeling future contexts under our GRec framework. |
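The gap-filling mechanism itself is easy to picture: mask a few items inside the session, feed the partially complete sequence to the encoder, and ask the decoder to predict the masked positions. The data-preparation sketch below uses an assumed mask rate and token id purely for illustration; it is not GRec's model or its exact settings.

```python
# Gap-filling data preparation for session-based recommendation: randomly mask
# items inside a session; the masked sequence is the encoder input and the
# masked items are the prediction targets.
import random

MASK = 0   # assume item ids start at 1, so 0 can serve as the mask token

def gap_fill(session, mask_rate=0.3):
    encoder_input, targets = [], []
    for item in session:
        if random.random() < mask_rate:
            encoder_input.append(MASK)
            targets.append(item)       # decoder must recover this item
        else:
            encoder_input.append(item)
            targets.append(-100)       # ignored position (conventional ignore index)
    return encoder_input, targets

session = [12, 7, 33, 21, 9, 5]
enc_in, tgt = gap_fill(session)
print(enc_in)
print(tgt)
```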
12:00-12:15 |
Recommending Themes for Ad Creative Design via Visual-Linguistic Representations Yichao Zhou (University of California, Los Angeles), Shaunak Mishra (Yahoo Research), Manisha Verma (Yahoo Research), Narayan Bhamidipati (Yahoo Research) and Wei Wang (University of California, Los Angeles).
AbstractThere is a perennial need in the online advertising industry to refresh ad creatives, i.e., images and text used for enticing online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users, and to incorporate insights from other successful campaigns in related product categories. However, given a brand, to come up with themes for a new ad is a painstaking and time consuming process for creative strategists. Among other things, strategists typically draw inspiration from the images and text used for past ad campaigns, as well as world knowledge on the brands. To automatically infer ad themes via such multimodal sources of information in past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. In particular, the theme recommender is based on aggregating results from a visual question answering (VQA) task, which ingests the following: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. To harness the multimodal nature of the above inputs, we leverage transformer based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations for the VQA task along the lines of classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics. Specifically, cross-modal representations show better performance compared to separate image and text representations. In addition, the use of multimodal information shows a significant lift over using only textual or visual information. Finally, we share creative strategy insights on selected product categories in the public dataset using our approach. |
12:15-12:30 |
A Multimodal Variational Encoder-Decoder Framework for Micro-video Popularity Prediction Jiayi Xie (Wuhan University), Yaochen Zhu (Wuhan University), Zhibin Zhang (Wuhan University), Jian Peng (Wuhan University), Jing Yi (Wuhan University), Yaosi Hu (Wuhan University), Hongyi Liu (Wuhan University) and Zhenzhong Chen (Wuhan University).
AbstractRecently, popularity prediction for user-generated content (UGC) has received substantial attention among researchers. As a particular form of UGC, micro-videos in real-world applications are usually accompanied by several types of content, such as title, tags, and background music. Unlike movies, which are published officially, micro-videos are made and uploaded arbitrarily by online users, and thus their quality cannot be guaranteed. For example, the textual modality can be irrelevant to the visual modality for the purpose of eye-catching, or even missing. Besides, whether a certain video comes into fashion after its release is also affected by many external uncertainties. Thus, the mapping from feature space to popularity space is essentially non-deterministic, and such randomness poses a great challenge for the popularity prediction of micro-videos. In light of this, we propose a multimodal variational encoder-decoder framework that can explicitly capture this randomness. Specifically, features of different modalities are stochastically embedded into hidden representations, which are then fused together by Bayesian reasoning such that information from all modalities is well utilized. Then, the learned hidden representation is fed into a recurrent neural network as a warm start to predict the popularity sequence of a certain micro-video. Experiments conducted on the real-world dataset we collected demonstrate the effectiveness of our proposed model in the micro-video popularity prediction task. |
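One common way to fuse per-modality stochastic embeddings by Bayesian reasoning is a product-of-experts combination of Gaussian posteriors, where precisions add. The sketch below illustrates that generic fusion rule only; whether the paper uses exactly this rule is not stated in the abstract, so treat it as an assumption.

```python
# Product-of-experts fusion of per-modality Gaussian posteriors: the fused
# precision is the sum of the experts' precisions, and the fused mean is the
# precision-weighted average of their means. Generic illustration of Bayesian
# multimodal fusion, not necessarily the paper's exact rule.
import numpy as np

def product_of_experts(means, variances):
    means, variances = np.asarray(means), np.asarray(variances)
    precisions = 1.0 / variances
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Three modalities (e.g., visual, textual, acoustic), 4-dimensional latent embeddings.
means = [np.zeros(4), np.ones(4), 0.5 * np.ones(4)]
variances = [np.full(4, 1.0), np.full(4, 4.0), np.full(4, 0.25)]
print(product_of_experts(means, variances))
```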
Society (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-10:45 |
The Structure of Social Influence in Recommender Networks Pantelis Pipergias Analytis (University of Southern Denmark), Daniel Barkoczi (University of Southern Denmark), Philipp Lorenz-Spreen (Max Planck Institute for Human Development) and Stefan Herzog (Max Planck Institute for Human Development).
AbstractThe ability of people to influence the opinion of others on matters of taste varies greatly—both in the offline world and in recommender systems. What are the mechanisms underlying this striking inequality? We use the weighted k-nearest-neighbor algorithm to represent an array of social learning strategies and show—using network theory—how this gives rise to networks of social influence in six real-world domains of taste. By doing so, we show three novel results that apply both to offline advice taking and online recommender settings. First, influential individuals have mainstream tastes and high dispersion in their taste similarity with others. Second, the fewer people an individual or algorithm consults (i.e., the lower k) and the more sensitive an individual or algorithm is to how similar other people are, the smaller the group of people with substantial influence. Third, the influence networks that emerge are hierarchically organized. Our results shed new light on classic empirical findings in communication and network science and can help improve our understanding of social influence in the offline and online world. |
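The social-learning strategy used in this study is the weighted k-nearest-neighbor predictor; the sketch below shows the prediction step and, implicitly, the influence edges it induces (an individual is influenced by whoever ends up in their neighbor set). Data and weighting are toy assumptions, not the authors' exact setup.

```python
# Weighted k-nearest-neighbor taste prediction: a person's missing rating is
# predicted from the k most similar others, weighted by similarity. The
# selected neighbors are the induced "influence" edges.
import numpy as np

def knn_predict(ratings, similarities, user, item, k=2):
    """ratings: (n_users, n_items) with np.nan for unrated; similarities: (n_users, n_users)."""
    candidates = [(similarities[user, v], ratings[v, item])
                  for v in range(ratings.shape[0])
                  if v != user and not np.isnan(ratings[v, item])]
    neighbors = sorted(candidates, reverse=True)[:k]        # these are the influence edges
    weights = np.array([s for s, _ in neighbors])
    values = np.array([r for _, r in neighbors])
    return float((weights * values).sum() / weights.sum())

ratings = np.array([[5.0, np.nan, 3.0],
                    [4.0, 2.0, 3.0],
                    [1.0, 5.0, np.nan],
                    [4.5, 2.5, 3.5]])
similarities = np.array([[1.0, 0.9, 0.1, 0.8],
                         [0.9, 1.0, 0.2, 0.7],
                         [0.1, 0.2, 1.0, 0.3],
                         [0.8, 0.7, 0.3, 1.0]])
print(knn_predict(ratings, similarities, user=0, item=1, k=2))
```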
10:45-11:15 |
FairRec: Two-Sided Fairness for Personalized Recommendations in Two-Sided Platforms Gourab K Patro (Indian Institute of Technology Kharagpur), Arpita Biswas (Indian Institute of Science Bangalore), Niloy Ganguly (Indian Institute of Technology Kharagpur), Krishna P. Gummadi (MPI-SWS) and Abhijnan Chakraborty (Max Planck Institute for Software Systems).
AbstractMajor online platforms today (such as Amazon, Netflix, Spotify, LinkedIn, AirBnB) can be thought of as two-sided markets with producers and customers of goods and services. Traditionally, search and recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reveals that such customer-centric design of these services may lead to unfair distribution of exposure to the producers and adversely impact their well-being. As more and more people are depending on such platforms to earn a living, it is important to ensure fairness to both producers and customers. In this work, by mapping the problem of personalized recommendation to the problem of fair allocation of indivisible goods, we propose to provide fairness guarantees for both sides. More formally, our proposed FairRec algorithm guarantees at least Maxi-Min Share (MMS) exposure for the majority of the producers, and Envy-Free up to One Good (EF1) fairness for all the customers. Extensive evaluations over multiple real-world datasets show the effectiveness of FairRec in ensuring two-sided fairness while incurring little loss in overall recommendation quality. |
11:15-11:45 |
Ten Social Dimensions of Conversations and Relationships Minje Choi (University of Michigan), Luca Maria Aiello (Nokia Bell Labs), Varga Krisztian (Nokia Bell Labs) and Daniele Quercia (Nokia Bell Labs).
AbstractDecades of social science research identified ten fundamental dimensions that provide the conceptual building blocks to describe the nature of human relationships. Yet, it is not clear to what extent these concepts are expressed in everyday language and what role they have in shaping observable dynamics of social interactions. After annotating conversational text through crowdsourcing, we train NLP tools to detect the presence of these types of interaction from conversations, and apply them to 160M messages written by geo-referenced Reddit users, 290k emails from the Enron corpus and 300k lines of dialogue from movie scripts. We show that social dimensions can be predicted purely from conversations with an AUC up to 0.98, and the combination of the predicted dimensions suggests both the types of relationships people entertain (conflict vs. support) and the types of real-world communities (wealthy vs. deprived) they shape. |
11:45-12:15 |
Quantifying Engagement with Citations on Wikipedia Tiziano Piccardi (Ecole Polytechnique Fédérale de Lausanne), Miriam Redi (Wikimedia Foundation), Giovanni Colavizza (University of Amsterdam) and Robert West (Ecole Polytechnique Fédérale de Lausanne).
AbstractWikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway summary of secondary sources: according to Wikipedia's guidelines, most facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the very heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all clicks on links leading from English Wikipedia articles to cited references during one month, and conducted the first-ever analysis of readers' interactions with citations on Wikipedia. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.3% overall; 0.6% on desktop; 0.1% on mobile). A causal analysis of the factors associated with reference clicking reveals that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that references about life events (births, deaths, marriages, etc.) are particularly popular. Taken together, our findings open the door to a deeper understanding of Wikipedia's role in a global information economy where reliability is ever less certain, and source attribution ever more vital. |
12:15-12:30 |
Learning Model-Agnostic Counterfactual Explanations for Tabular Data Martin Pawelczyk (University of Tuebingen), Klaus Broelemann (Schufa AG) and Gjergji Kasneci (University of Tuebingen).
AbstractCounterfactual explanations can be obtained by identifying the smallest change made to a feature vector to qualitatively influence a prediction in a positive way from a user's viewpoint; for example, from 'loan rejected' to 'awarded' or from 'high risk of cardiovascular disease' to 'low risk'. Previous approaches do not ensure that the produced counterfactuals are proximate (i.e., not local outliers) and connected to regions with substantial data density (i.e., close to correctly classified observations), two requirements known as counterfactual faithfulness. These requirements are fundamental when making suggestions to individuals that are indeed attainable. Our contribution is twofold. First, drawing ideas from the manifold learning literature, we develop a framework, called C-CHVAE, that generates faithful counterfactuals. Second, we suggest to complement the catalog of counterfactual quality measures [13] using a criterion to quantify the degree of difficulty for a certain counterfactual suggestion. Our real world experiments suggest that faithful counterfactuals come at the cost of higher degrees of difficulty. |
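A crude but faithful-by-construction baseline for this problem is to return the nearest correctly classified training point of the desired class: it trivially lies in a dense, correctly classified region, at the cost of usually changing more features than necessary. The sketch below illustrates that contrast with generative approaches such as C-CHVAE; the toy data and model are assumptions, not the paper's setup.

```python
# Nearest correctly-classified training point of the desired class as a
# counterfactual: trivially "faithful", but usually not a minimal change.
import numpy as np
from sklearn.linear_model import LogisticRegression

def nearest_counterfactual(model, X_train, y_train, x, desired_class):
    preds = model.predict(X_train)
    mask = (y_train == desired_class) & (preds == desired_class)   # correctly classified targets
    candidates = X_train[mask]
    distances = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argmin(distances)]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy "loan awarded" rule
model = LogisticRegression().fit(X, y)

x = np.array([-1.0, -1.0, 0.0])                  # rejected applicant
print(nearest_counterfactual(model, X, y, x, desired_class=1))
```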
Security (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Dirty Clicks: A Study of the Usability and Security Implications of Click-related Behaviors on the Web Iskander Sanchez-Rola (University of Deusto, NortonLifeLock Research Group), Davide Balzarotti (EURECOM), Christopher Kruegel (UC Santa Barbara), Giovanni Vigna (UC Santa Barbara) and Igor Santos (University of Deusto).
AbstractWeb pages have evolved into very complex dynamic applications, which are often very opaque and difficult for non-experts to understand. At the same time, security researchers push for more transparent web applications, which can help users in taking important security-related decisions about which information to disclose, which link to visit, and which online service to trust. In this paper, we look at one of the simplest but also most representative aspects that capture the struggle between these opposite demands: a mouse click. In particular, we present the first comprehensive study of the possible security and privacy implications that clicks can have from a user perspective, analyzing the disconnect that exists between what is shown to users and what actually happens afterwards. We started by identifying and classifying possible problems. We then implemented a crawler that performed nearly 2.5M clicks looking for signs of misbehavior. We analyzed all the interactions created as a result of those clicks, and discovered that the vast majority of domains are putting users at risk by either obscuring the real target of links or by not providing sufficient information for users to make an informed decision. We conclude the paper by proposing a set of countermeasures. |
11:00-11:30 |
Beyond the Front Page: Measuring Third Party Dynamics in the Field Tobias Urban (Institute for Internet Security), Martin Degeling (Ruhr-University Bochum), Thorsten Holz (Ruhr-Universität Bochum) and Norbert Pohlmann (Institute for Internet Security).
AbstractIn the modern Web, service providers often heavily rely on third parties to run their services. For example, they use ad networks to finance their services, externally hosted libraries to quickly develop them, and analytical services to gain insights into users' behavior. This can lead to a situation where service providers do not know which third parties will be embedded, for example when these third parties request additional content as is common in real-time ad auctions. In this paper, we present a large-scale measurement study that analyzes the magnitude of these new challenges. To better reflect the connectedness of third parties, we measured their relations in a model we call third party trees, which reflects the loading dependencies of all third parties embedded into a given website. Using this notion, we show that including a single third party can lead to the subsequent loading of several further parties. Our data shows that embedding a third party can lead to branches up to eight levels deep. Furthermore, our findings indicate that the services that are embedded on a page load are not always deterministic, and 93% of the analyzed websites embedded third parties that are located in regions that might not be in line with the current legal framework. An important finding of our study is that previous work that mostly focused on landing pages of websites only measured a lower bound, as subsites show a significant increase in privacy-invasive techniques. For example, our results show a 36% increase in cookie usage. |
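The third party tree notion boils down to tracking who caused whom to be loaded; given request logs with (initiator host, requested host) pairs, the tree and its depth can be computed as below. The field names and example log are hypothetical; this is a simplified sketch, not the authors' measurement pipeline.

```python
# Build a "third party tree" from (initiator_host, requested_host) edges
# observed during a page load, and compute its depth.
from collections import defaultdict

def build_tree(requests, first_party):
    children = defaultdict(set)
    for initiator, requested in requests:
        if requested != initiator:
            children[initiator].add(requested)

    def depth(node, seen):
        if node in seen:                 # guard against cyclic initiator chains
            return 0
        seen = seen | {node}
        return 1 + max((depth(c, seen) for c in children.get(node, ())), default=0)

    return children, depth(first_party, set()) - 1   # depth 0 = only first-party requests

requests = [
    ("news.example", "ads.example"),
    ("ads.example", "bidder-a.example"),
    ("bidder-a.example", "tracker.example"),
    ("news.example", "cdn.example"),
]
children, tree_depth = build_tree(requests, first_party="news.example")
print(dict(children), tree_depth)
```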
11:30-12:00 |
AutoNav: Evaluation and Automatization of Web Navigation Policies Benjamin Eriksson (Chalmers University of Technology) and Andrei Sabelfeld (Chalmers University of Technology).
AbstractUndesired navigation in browsers powers a significant class of attacks on web applications. In a move to mitigate risks associated with undesired navigation, the security community has proposed a standard that gives control to web pages to restrict navigation. The standard draft introduces a new navigate-to directive of the Content Security Policy (CSP). The directive is currently being implemented by mainstream browsers. This paper is a first evaluation of navigate-to, focusing on security, performance, and automatization of navigation policies. We present new vulnerabilities introduced by the directive into the web ecosystem, opening up for attacks such as probing to detect if users are logged in to other websites or have active shopping carts, bypassing third-party cookie blocking, exfiltrating secrets, as well as leaking browsing history. Unfortunately, the directive triggers vulnerabilities even in websites that do not use the directive in their policies. We identify both specification- and implementation-level vulnerabilities and propose countermeasures to mitigate both. To aid developers in configuring navigation policies, we develop and implement AutoNav, an automated black-box mechanism to infer navigation policies. AutoNav leverages the benefits of origin-wide policies in order to improve security without degrading performance. We evaluate the viability of navigate-to and AutoNav by an empirical study on Alexa’s top 10,000 websites. |
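For context, navigate-to follows the usual CSP source-list syntax in the standard draft. Below is a minimal sketch of how a server might attach such a policy; the allowed origins and function name are illustrative assumptions, not taken from the paper, and browser support for the draft directive has remained experimental.

```python
# Illustrative sketch only: attach a draft CSP navigate-to policy to a response.
# The allowed origins below are hypothetical examples.
def add_navigation_policy(headers):
    """Append a Content-Security-Policy header restricting navigation targets."""
    policy = "navigate-to 'self' https://checkout.example.com"
    headers.append(("Content-Security-Policy", policy))
    return headers

if __name__ == "__main__":
    print(add_navigation_policy([("Content-Type", "text/html")]))
```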
12:00-12:15 |
An Empirical Study of Android Security Bulletins in Different Vendors Sadegh Farhang (The Pennsylvania State University), Mehmet Bahadir Kirdan (Technical University of Munich), Aron Laszka (University of Houston) and Jens Grossklags (Technical University of Munich).
AbstractMobile devices encroach on almost every activity of our lives, including work and leisure, and contain a wealth of personal and sensitive information. It is, therefore, imperative that these devices uphold high security standards. A key aspect is the security of the underlying operating system platform. In particular, Android plays a critical role: it is the dominant platform in this ecosystem, with more than one billion active devices, and its openness allows different vendors to adopt it. Like other platforms, Android maintains security via monthly security patches and announces them via the Android security bulletin. To absorb this information successfully across the Android ecosystem, impeccable coordination by many different vendors is required. In this paper, we perform a comprehensive study of 3,174 Android-related vulnerabilities and study the degree to which they are reflected in the Android security bulletin, as well as in the security bulletins of leading vendors: Samsung, LG, and Huawei. In our analysis, we focus on the metadata of these security bulletins (e.g., timing, affected layers, severity, and CWE data) to better understand commonalities and differences among vendors. Some of our findings are: (i) the studied vendors in the Android ecosystem have adopted different structures for vulnerability reporting, (ii) vendors are less likely to react with delay for CVEs with Android Git repository references, (iii) vendors handle Qualcomm-related CVEs differently from the rest of the external-layer CVEs. |
12:15-12:30 |
MineThrottle: Defending against Wasm In-Browser Cryptojacking Weikang Bian (The Chinese University of Hong Kong), Wei Meng (The Chinese University of Hong Kong) and Mingxue Zhang (The Chinese University of Hong Kong).
AbstractIn-browser cryptojacking is an urgent threat to web users, where the attackers abuse the users' local computing resources without obtaining the users' consent. Many in-browser mining programs are developed in WebAssembly (Wasm) for its great performance. Several prior works have measured cryptojacking in the wild and proposed detection methods using static features and dynamic features. However, there exists no good defense mechanism within the user's browser to stop the malicious drive-by mining behavior. Users still primarily depend on ad blocking software that relies on community-maintained blacklists, which can be easily bypassed. In this work, we propose MineThrottle, a browser-based defense mechanism that defends against Wasm cryptojacking by leveraging block-level semantic information of a program. We show that cryptocurrency mining Wasm programs exhibit very different block-level semantic information from other Wasm programs (e.g., games). In particular, the majority of the computation workload is spent in a small number of basic blocks, and the instructions used differ significantly from those in other basic blocks. MineThrottle instruments Wasm code on the fly to label mining-related code blocks and detect mining behavior using block-level program profiling. It then throttles drive-by mining behavior based on a user-configurable policy. Our evaluation of MineThrottle with the Alexa top 1M websites demonstrates that it can accurately detect and mitigate in-browser cryptojacking with both a low false positive rate and a low false negative rate. |
Search (2)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Selective Weak Supervision for Neural Information Retrieval Kaitao Zhang (Tsinghua University), Chenyan Xiong (Carnegie Mellon University; Microsoft), Zhenghao Liu (Tsinghua University) and Zhiyuan Liu (Tsinghua University).
AbstractThis paper democratizes neural information retrieval to scenarios where large scale relevance training signals are not available. We revisit the classic IR intuition that anchor-document relation approximates query-document relevance and propose a reinforcement weak supervision selection method, ReInfoSelect, which learns to select anchor-document pairs that best train neural ranking models, guided by only a handful of human relevance labels. ReInfoSelect uses the NDCG on the target relevance benchmark as the reward and learns to classify whether each anchor-document pair should be used as a training signal (action). It iterates through anchor-document pairs and converges when the neural ranker's performance peaks on target relevance benchmarks. Our experiments on ClueWeb09-B and Robust04 demonstrate the necessity and effectiveness of ReInfoSelect in leveraging anchor data as weak supervision. On these TREC benchmarks, the neural rankers trained with our ReInfoSelect significantly outperform feature-based learning to rank and match the training effectiveness of Bing User Clicks, while ReInfoSelect only uses publicly available anchor data. Our human evaluation confirms that ReInfoSelect effectively leverages the reward from neural rankers to select anchors that are more similar to search queries and linked documents that are more relevant to the anchor. |
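ReInfoSelect's reward is NDCG on the target benchmark. A minimal sketch of one common NDCG@k formulation (linear gains) is given below; the function and variable names are ours, not from the paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance labels of documents in ranked order for one query.
print(ndcg_at_k([3, 2, 0, 1], k=4))
```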
11:00-11:30 |
Leveraging Passage-level Cumulative Gain for Document Ranking Zhijing Wu (Tsinghua University), Jiaxin Mao (Tsinghua University), Yiqun Liu (Tsinghua University), Jingtao Zhan (Tsinghua University), Yukun Zheng (Tsinghua University), Min Zhang (Tsinghua University) and Shaoping Ma (Tsinghua University).
AbstractDocument ranking is one of the most studied but challenging problems in information retrieval (IR) research. A number of existing document ranking models capture relevance signals at the whole-document level. Recently, more and more studies have begun to address this problem through fine-grained document modeling. Several works leveraged fine-grained passage-level relevance signals in ranking models. However, most of these works focus on context-independent passage-level relevance signals and ignore the context information, which may lead to inaccurate estimation of passage-level relevance. In this paper, we investigate how information gain accumulates with passages when users sequentially read a document. We propose the context-aware Passage-level Cumulative Gain (PCG), which aggregates relevance scores of passages and avoids the need to formally split a document into independent passages. Next, we incorporate the patterns of PCG into a BERT-based sequential model called the Passage-level Cumulative Gain Model (PCGM) to predict the PCG sequence. Finally, we apply PCGM to the document ranking task. Experimental results on two public ad hoc retrieval benchmark datasets show that PCGM outperforms most existing ranking models and also indicate the effectiveness of PCG signals. We believe that this work contributes to improving ranking performance and providing more explainability for document ranking. |
11:30-11:45 |
RLIRank: Learning to Rank with Reinforcement Learning for Dynamic Search Jianghong Zhou (Emory University) and Eugene Agichtein (Emory University).
AbstractTo support complex search tasks, where the initial information requirements are complex or may change during the search, a search engine must adapt the information delivery as the user’s information requirements evolve. To support this dynamic ranking paradigm effectively, search result ranking must incorporate both the user feedback received, and the information displayed so far. To address this problem, we introduce a novel reinforcement learning-based approach, RLIRank. We first build an adapted reinforcement learning framework to integrate the key components of the dynamic search. Then, we implement a new Learning to Rank (LTR) model for each iteration of the dynamic search, using a Long Short-Term Memory (LSTM) recurrent neural network, which estimates the gain for each next result, learning from each previously ranked document. To incorporate the user’s feedback, we develop a word-embedding variation of the classic Rocchio Algorithm, to help guide the ranking towards the high-value documents. These innovations enable RLIRank to outperform the previously reported methods from the TREC Dynamic Domain Track 2017 and exceed all the methods in the 2016 TREC Dynamic Domain track after multiple search iterations, advancing the state of the art for dynamic search. |
11:45-12:00 |
Stabilizing Neural Search Ranking Models Ruilin Li (Georgia Institute of Technology), Zhen Qin (Google), Xuanhui Wang (Google), Suming J. Chen (Google) and Donald Metzler (Google).
AbstractNeural search ranking models, which have been actively studied in the information retrieval community, have also been widely adopted in real-world industrial applications. However, due to the high non-convexity and stochastic nature of neural model formulations, the obtained models are unstable in the sense that model predictions can significantly vary across two models trained with the same configuration. In practice, new features are continuously introduced and new model architectures are explored to improve model effectiveness. In these cases, the instability of neural models leads to unnecessary document ranking changes for a large fraction of queries. Such changes lead to an inconsistent user experience and also add noise to online experiment results, thus slowing down the model development life-cycle. How to stabilize neural search ranking models during model updates is an important but largely unexplored problem. Motivated by trigger analysis, we suggest balancing the trade-off between performance improvements and the number of affected queries. We formulate this as an optimization problem where the objective is to maximize the average effect over the affected queries. We propose two heuristics and one theory-guided method to solve the optimization problem. Our proposed methods are evaluated on two of the world's largest personal search services: Gmail search and Google Drive search. Empirical results show that our proposed methods are highly effective in optimizing the proposed objective and are applicable to different model update scenarios. |
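A minimal sketch of the quantity the abstract optimizes, the average metric change restricted to queries whose rankings changed; the function and its inputs are illustrative assumptions, not the paper's implementation.

```python
def average_effect_over_affected(deltas, affected):
    """Average metric change over only the queries whose rankings changed.

    deltas:   per-query metric change (new model minus old model)
    affected: per-query booleans, True if the ranking changed for that query
    """
    changed = [d for d, a in zip(deltas, affected) if a]
    return sum(changed) / len(changed) if changed else 0.0

# Toy example: three affected queries out of five.
print(average_effect_over_affected([0.02, 0.0, -0.01, 0.05, 0.0],
                                   [True, False, True, True, False]))
```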
Mobile (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
A First Look at Commercial 5G Performance on Smartphones Arvind Narayanan (University of Minnesota), Eman Ramadan (University of Minnesota), Jason Carpenter (University of Minnesota), Qingxu Liu (University of Minnesota), Yu Liu (University of Minnesota), Feng Qian (University of Minnesota) and Zhi-Li Zhang (University of Minnesota).
AbstractWe conduct, to our knowledge, the first measurement study of commercial mmWave 5G performance on smartphones by closely examining the 5G networks of three carriers (two mmWave carriers, one mid-band 5G carrier) in three U.S. cities. We conduct extensive field tests on 5G performance in diverse urban environments. We systematically analyze the handoff mechanisms in 5G and their impact on network performance, and explore the feasibility of using location and possibly other environmental information to predict the network performance. We also study the app performance (web browsing, HTTP download, and volumetric video streaming) over 5G. Our study consumes more than 15 TB of data. Conducted when 5G had just made its debut, it provides a "baseline" for studying how 5G performance evolves, and identifies key research directions on improving 5G users' experience in a cross-layer manner. |
11:00-11:30 |
MadDroid: Characterising and Detecting Devious Ad Content for Android Apps Tianming Liu (Beijing University of Posts and Telecommunications), Haoyu Wang (Beijing University of Posts and Telecommunications), Li Li (Monash University), Xiapu Luo (The Hong Kong Polytechnic University), Feng Dong (Beijing University of Posts and Telecommunications), Yao Guo (Peking University), Liu Wang (Beijing University of Posts and Telecommunications), Tegawendé F. Bissyandé (SnT, University of Luxembourg) and Jacques Klein (University of Luxembourg).
AbstractAdvertisement drives the economy of the mobile app ecosystem. Although it is a key component of the mobile ad business model, mobile ad content has been overlooked by the research community, even though it poses a number of threats, e.g., propagating malware and undesirable content. To understand the practice of these devious ad behaviors, we perform a large-scale study on the app contents harvested through automated app testing. In this work, we first provide a comprehensive categorization of devious ad contents, including five kinds of behaviors belonging to two categories: ad loading content and ad clicking content. Then, we propose MadDroid, a framework for automated detection of devious ad contents. MadDroid leverages an automated app testing framework with a sophisticated ad view exploration strategy for effectively collecting ad-related network traffic and subsequently extracting ad contents. We then integrate dedicated approaches into the framework to identify devious ad contents. We have applied MadDroid to 40,000 Android apps and found that roughly 6% of apps deliver devious ad contents, e.g., distributing malicious apps that cannot be downloaded via traditional app markets. Experiment results indicate that devious ad contents are prevalent, suggesting that our community should invest more effort into the detection and mitigation of devious ads towards building a trustworthy mobile advertising ecosystem. |
11:30-12:00 |
Mobile App Squatting Yangyu Hu (BUPT), Haoyu Wang (Beijing University of Posts and Telecommunications), Ren He (Beijing University of Posts and Telecommunications), Li Li (Monash University), Gareth Tyson (Queen Mary University of London), Ignacio Castro (Queen Mary University of London), Yao Guo (Peking University), Lei Wu (Zhejiang University) and Guoai Xu (Beijing University of Posts and Telecommunications).
AbstractDomain squatting, the adversarial tactic where attackers register domain names that mimic popular ones, has been observed for decades. However, there has been growing anecdotal evidence that this style of attack has spread to other domains. In this paper, we explore the presence of squatting attacks in the mobile app ecosystem. In "App Squatting", attackers release apps with identifiers (e.g., app, package or developer name) that are confusingly similar to those of popular apps or well-known Internet brands. This paper presents the first in-depth measurement study of app squatting to show its prevalence and implications. We first identify 11 common deformation approaches of app squatters and propose "AppCrazy", a tool for automatically generating variations of app identifiers. We have applied AppCrazy to the top-500 most popular apps in Google Play, generating 224,322 deformation keywords which we then use to test for app squatters on popular markets. Through this, we confirm the scale of the problem, identifying 10,553 squatting apps (an average of over 20 squatting apps for each legitimate one). Our investigation reveals that more than 51% of the squatting apps are malicious, with some being extremely popular (up to 10 million downloads). Meanwhile, we also find that app markets have not been successful in identifying and eliminating squatting apps. Our findings demonstrate the urgency to identify and prevent app squatting abuses. To this end, we have publicly released all the identified squatting apps, as well as our tool AppCrazy. |
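As a rough illustration of deformation-keyword generation: the paper's AppCrazy covers 11 deformation approaches, whereas only three generic families are sketched below, with names of our own choosing.

```python
def deformations(name):
    """Generate a few simple squatting-style variants of an app identifier.

    Only three generic deformation families are shown (omission, adjacent
    swap, duplication); they illustrate the idea, not the paper's full set.
    """
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:])                 # omit one character
        variants.add(name[:i] + name[i] * 2 + name[i + 1:])   # duplicate it
        if i + 1 < len(name):
            swapped = list(name)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            variants.add("".join(swapped))                    # swap adjacent characters
    variants.discard(name)
    return sorted(variants)

print(deformations("whatsapp")[:10])
```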
Web Mining-B (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Keywords Generation Improves E-Commerce Session-based Recommendation Yuanxing Liu (Harbin Institute of Technology), Zhaochun Ren (Shandong University), Wei-Nan Zhang (Harbin Institute of Technology), Wanxiang Che (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology) and Dawei Yin (JD.com).
AbstractBy exploring fine-grained user behaviors, session-based recommendation predicts a user's next action from short-term behavior sessions. Most previous work learns about a user's implicit behavior by merely taking the last click action as the supervision signal. However, in e-commerce scenarios, large-scale products with elusive click behaviors make such a task challenging because of the low inclusiveness problem, i.e., many relevant products that satisfy the user's shopping intention are neglected by recommenders. Since similar products with different IDs may share the same intention, we argue that the textual information (e.g., keywords of product titles) from sessions can be used as additional supervision signals to tackle the above problem by learning more of the intention shared within similar products. Therefore, to improve the performance of e-commerce session-based recommendation, we explicitly infer the user's intention by generating keywords entirely from the click sequence in the current session. In this paper, we propose the e-commerce session-based recommendation model with keywords generation (abbreviated as ESRM-KG) to integrate keywords generation into e-commerce session-based recommendation. Specifically, the ESRM-KG model first encodes an input action sequence into a high-dimensional representation; then it presents a bi-linear decoding scheme to predict the next action in the current session; in parallel, the ESRM-KG model uses the high-dimensional representation from its encoder to generate explainable keywords for the whole session. We carried out extensive experiments in the context of click prediction on a large-scale real-world e-commerce dataset. Our experimental results show that the ESRM-KG model outperforms state-of-the-art baselines with the help of keywords generation. We also discuss how keywords generation helps e-commerce session-based recommendation with case studies and error analysis. |
11:00-11:30 |
LightRec: a Memory and Search-Efficient Recommender System Defu Lian (University of Science and Technology of China), Haoyu Wang (University at Buffalo), Zheng Liu (MSRA), Jianxun Lian (MSRA), Enhong Chen (University of Science and Technology of China) and Xing Xie (MSRA).
AbstractDeep recommender systems have achieved remarkable improvements in recent years. Despite their superior ranking precision, running efficiency and memory consumption turn out to be severe bottlenecks in reality. To overcome both limitations, we propose LightRec, a lightweight recommender system which enjoys fast online inference and economic memory consumption. The backbone of LightRec is a total of $B$ codebooks, each of which is composed of $W$ latent vectors, known as codewords. On top of such a structure, LightRec has an item represented as an additive composition of $B$ codewords, which are optimally selected from each of the codebooks. To effectively learn the codebooks from data, we devise an end-to-end learning workflow, where challenges on the inherent differentiability and diversity are conquered by the proposed techniques. In addition, to further improve the representation quality, several distillation strategies are employed, which better preserve user-item relevance scores and relative ranking orders. LightRec is extensively evaluated with four real-world datasets, which gives rise to two empirical findings: 1) compared with the state-of-the-art lightweight baselines, LightRec achieves over 11% relative improvement in terms of recall performance; 2) compared to conventional recommendation algorithms, LightRec merely incurs negligible accuracy degradation while leading to more than 27x speedup in top-k recommendation. |
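A minimal sketch of the additive codebook composition described above, with toy sizes for $B$, $W$, and the embedding dimension; this is our own illustration, not LightRec's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, W, d = 4, 8, 16          # toy sizes: B codebooks, W codewords each, dimension d
codebooks = rng.normal(size=(B, W, d))

def compose_item(codeword_ids):
    """Additive composition: one codeword selected from each codebook, summed."""
    return sum(codebooks[b, w] for b, w in enumerate(codeword_ids))

item_vec = compose_item([3, 0, 5, 2])
print(item_vec.shape)        # (16,) -- the item itself only needs B small indices to store
```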
11:30-12:00 |
A Generalized and Fast-converging Non-negative Latent Factor Model for Predicting User Preferences in Recommender Systems Ye Yuan (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China), Xin Luo (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China), Mingsheng Shang (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China) and Di Wu (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China).
AbstractRecommender systems (RSs) commonly describe their user-item preferences with a high-dimensional and sparse (HiDS) matrix filled with non-negative data. A non-negative latent factor (NLF) model relying on a single latent factor-dependent, non-negative and multiplicative update (SLF-NMU) algorithm is frequently adopted to process such an HiDS matrix. However, an NLF model mostly adopts Euclidean distance for its objective function, which is naturally a special case of α-β-divergence. Moreover, it frequently suffers from slow convergence. To address these issues, this study proposes a generalized and fast-converging non-negative latent factor (GFNLF) model. Its main idea is two-fold: a) adopting α-β-divergence for its objective function, thereby enhancing its representation ability for HiDS data; b) deducing its momentum-incorporated non-negative multiplicative update (MNMU) algorithm, thereby achieving fast convergence. Empirical studies on two HiDS matrices emerging from real RSs demonstrate that with carefully-tuned hyperparameters, a GFNLF model outperforms state-of-the-art models in both computational efficiency and prediction accuracy for missing data of an HiDS matrix. |
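For context, the Euclidean special case mentioned above admits the classic multiplicative update, sketched below on a dense toy matrix. A real NLF/GFNLF model updates only over the observed entries of the sparse HiDS matrix and, in GFNLF, adds momentum and the general α-β-divergence; this sketch is ours, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((20, 15))          # toy dense non-negative matrix
k = 5
W = rng.random((20, k)) + 1e-3
H = rng.random((k, 15)) + 1e-3
eps = 1e-9

# Classic multiplicative updates for min ||V - WH||_F^2 (the Euclidean special
# case); latent factors stay non-negative by construction.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(V - W @ H))
```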
12:00-12:30 |
Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation Chong Chen (Tsinghua University), Min Zhang (Tsinghua University), Weizhi Ma (Tsinghua University), Yiqun Liu (Tsinghua University) and Shaoping Ma (Tsinghua University).
AbstractTo provide more accurate recommendations, it is important to go beyond modeling user-item interactions and take context information into account. Factorization Machines (FM) with negative sampling are a popular solution for context-aware recommendation. However, this can be insufficient as sampling is not robust and usually leads to non-optimal performance in practice. While several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions, they either focus on the rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for ranking tasks. In this paper, we propose to learn FM without sampling for ranking tasks, which is particularly intended for context-aware recommendation. Despite its soundness, such a non-sampling strategy poses a strong efficiency challenge in learning the model. To address this, we design a new framework named Efficient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging efficiency issue via novel designs of memorization strategies. Through extensive experiments on three real-world public datasets, we show that 1) the proposed ENSFM consistently and significantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves significant advantages in training efficiency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for the Top-K recommendation task. |
Semantics (2)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Open Knowledge Enrichment for Long-tail Entities Ermei Cao (Nanjing University), Difeng Wang (Nanjing University), Jiacheng Huang (Nanjing University) and Wei Hu (Nanjing University).
AbstractKnowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts about long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly by completing missing links or filling in missing values. However, they only tackle a part of the enrichment problem and lack specific considerations for long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the proposed approach. |
11:00-11:30 |
Correcting Knowledge Base Assertions Jiaoyan Chen (University of Oxford), Xi Chen (Jarvis Lab Tencent), Ian Horrocks (University of Oxford), Ernesto Jimenez-Ruiz (City, University of London; University of Oslo) and Erik B. Myklebust (Norwegian Institute for Water Research; University of Oslo).
AbstractThe usefulness and usability of knowledge bases (KBs) is often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB. |
11:30-12:00 |
LOVBench: Ontology Ranking Benchmark Niklas Kolbe (University of Luxembourg), Pierre-Yves Vandenbussche (Elsevier), Sylvain Kubler (Université de Lorraine) and Yves Le Traon (University of Luxembourg).
AbstractOntology search and ranking are key building blocks to establish and reuse shared conceptualisations of domain knowledge on the Web. However, the effectiveness of proposed ontology ranking models is difficult to compare since these are often evaluated on diverse datasets that are limited by their static nature and scale. In this paper, we first introduce the LOVBench dataset as a benchmark for ontology term ranking. With inferred relevance judgments for more than 7000 queries, LOVBench is large enough to perform a comparison study using learning to rank (LTR) with complex ontology ranking models. Instead of relying on relevance judgments from a few experts, we consider implicit feedback from many actual users collected from the Linked Open Vocabularies (LOV) platform. Our approach further enables continuous updates of the benchmark, capturing the evolution of ontologies' relevance in an ever-changing data community. Second, we compare the performance of several feature configurations from the literature using LOVBench in LTR settings and discuss the results in the context of the observed real-world user behaviour. Our experimental results show that feature configurations which (i) are well-suited to the user behaviour, (ii) cover all feature types, and (iii) consider decomposition of features can significantly improve the ranking performance. |
12:00-12:30 |
LOREM: Language-consistent Open Relation Extraction from Unstructured Text Tom Harting (Delft University of Technology), Sepideh Mesbah (Delft University of Technology) and Christoph Lofi (Delft University of Technology).
AbstractWe introduce a Language-consistent multi-lingual Open Relation Extraction Model (LOREM) for finding relation tuples of any type between entities in unstructured texts. LOREM does not rely on language-specific knowledge or external NLP tools such as translators or PoS-taggers, and exploits information and structures that are consistent over different languages. This allows our model to be easily extended to new languages with only limited training effort, and also provides a performance boost for any given single language. An extensive evaluation performed on 5 languages shows that LOREM outperforms state-of-the-art mono-lingual and cross-lingual open relation extractors. Moreover, experiments on languages with no or only little training data indicate that LOREM generalizes to languages other than those it was trained on. |
Social Network-B (1)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Query-Efficient Correlation Clustering David García-Soriano (ISI Foundation), Konstantin Kutzkov (IT University of Copenhagen), Francesco Bonchi (Fondazione ISI) and Charalampos Tsourakakis (Harvard University).
AbstractCorrelation clustering is arguably the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities. This is often infeasible to compute or even just to store. In this paper we study query-efficient algorithms for correlation clustering. Specifically, we devise a correlation clustering algorithm that, given a budget of $Q$ queries, attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O(n^3/Q)$, where $\mathrm{OPT}$ is the optimal cost of the instance. Its running time is $O(Q)$, and can be easily made non-adaptive with the same guarantees. Up to constant factors, our algorithm yields a provably optimal trade-off between the number of queries $Q$ and the worst-case error attained, even for adaptive algorithms. Finally, we perform an experimental study of our proposed method on both synthetic and real data, showing the scalability and the accuracy of our algorithm. |
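A query-budgeted, pivot-style clustering sketch in the spirit of the setting above; it is our own illustration under a similarity-oracle assumption, not the paper's exact algorithm, and leftover vertices become singletons once the budget is exhausted.

```python
import random

def budgeted_pivot_clustering(vertices, similar, budget):
    """Pivot-style correlation clustering under a query budget.

    `similar(u, v)` is the pairwise similarity oracle (True/False); each call
    consumes one query. When the budget runs out, remaining vertices are
    returned as singleton clusters.
    """
    remaining = list(vertices)
    clusters, queries = [], 0
    while remaining:
        pivot = random.choice(remaining)
        remaining.remove(pivot)
        cluster, rest = [pivot], []
        for v in remaining:
            if queries >= budget:
                rest.append(v)
                continue
            queries += 1
            (cluster if similar(pivot, v) else rest).append(v)
        clusters.append(cluster)
        remaining = rest
        if queries >= budget:
            clusters.extend([[v] for v in remaining])
            break
    return clusters

# Toy oracle: two ground-truth groups, no noise.
group = {v: v % 2 for v in range(10)}
print(budgeted_pivot_clustering(range(10), lambda u, v: group[u] == group[v], budget=30))
```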
11:00-11:30 |
Structural Deep Clustering Network Deyu Bo (Beijing University of Posts and Telecommunications), Xiao Wang (Beijing University of Posts and Telecommunications), Chuan Shi (Beijing University of Posts and Telecommunications), Meiqi Zhu (Beijing University of Posts and Telecommunications), Emiao Lu (Tencent Ltd) and Peng Cui (Tsinghua University).
AbstractClustering is a fundamental task in data analysis. Recently, deep clustering, which derives inspiration primarily from deep learning approaches, has achieved state-of-the-art performance and attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoders, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract useful representations from the data itself, rather than from the structure of the data, which receives scarce attention in representation learning. Motivated by the great success of Graph Convolutional Networks (GCN) in encoding the graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by the autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of data, from low-order to high-order, are naturally combined with the multiple representations learned by the autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint, and the autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our proposed model consistently performs better than state-of-the-art techniques. |
11:30-12:00 |
Near-Perfect Recovery in the One-Dimensional Latent Space Model Yu Chen (University of Pennsylvania), Sampath Kannan (University of Pennsylvania) and Sanjeev Khanna (University of Pennsylvania).
AbstractSuppose a graph $G$ is stochastically created by uniformly sampling vertices along a line segment and connecting each pair of vertices with a probability that is a known decreasing function of their distance. We ask if it is possible to reconstruct the actual positions of the vertices in $G$ by only observing the generated unlabeled graph. We study this question for two natural edge probability functions --- one where the probability of an edge decays exponentially with the distance and another where this probability decays only linearly. We initiate our study with the weaker goal of recovering only the order in which vertices appear on the line segment. For a segment of length $n$ and a precision parameter $\delta$, we show that for both exponential and linear decay edge probability functions, there is an efficient algorithm that correctly recovers (up to reflection symmetry) the order of all vertices that are at least $\delta$ apart, using only $O(n/\delta^2)$ samples (vertices). Building on this result, we then show that $O(n^2 \log n/\delta^2)$ vertices (samples) are sufficient to additionally recover the location of each vertex on the line to within a precision of $\delta$. We complement this result with an $\Omega(n^{1.5}/\delta)$ lower bound on samples needed for reconstructing positions (even by a computationally unbounded algorithm), showing that the task of recovering positions is information-theoretically harder than recovering the order. We give experimental results showing that our algorithm recovers the positions of almost all points with great accuracy. |
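A minimal sketch of the generative model itself (uniform positions on a segment, edges with exponentially decaying probability); the reconstruction algorithms are the paper's contribution and are not reproduced here, and the parameter values below are arbitrary toy choices.

```python
import math
import random

def sample_latent_line_graph(n, length=1.0, rate=5.0, seed=0):
    """Sample the 1-D latent space model: uniform positions on a segment, and
    an edge between u, v with probability exp(-rate * |x_u - x_v|)."""
    rng = random.Random(seed)
    positions = [rng.uniform(0.0, length) for _ in range(n)]
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < math.exp(-rate * abs(positions[u] - positions[v])):
                edges.add((u, v))
    return positions, edges

positions, edges = sample_latent_line_graph(n=50)
print(len(edges))
```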
User Modeling-B (1)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
10:30-11:00 |
"What Apps Did You Use?": Understanding the Long-term Evolution of Mobile App Usage Tong Li (The Hong Kong University of Science and Technology; University of Helsinki), Mingyang Zhang (The Hong Kong University of Science and Technology), Hancheng Cao (Stanford University), Yong Li (Tsinghua University), Sasu Tarkoma (University of Helsinki) and Pan Hui (The Hong Kong University of Science and Technology; University of Helsinki).
AbstractThe prevalence of smartphones has promoted the popularity of mobile apps in recent years. Although many efforts have been made to understand mobile app usage, existing studies are based primarily on short-term datasets with limited time spans, e.g., a few months. As a result, many fundamental facts on the long-term evolution of mobile app usage remain unknown. In this paper, we aim to gain insight into how mobile app usage evolves over a long-term period. We first introduce an app usage collection platform named Carat, from which we gathered detailed app usage records of 1,465 mobile users over six years from 2012 to 2017 around the globe. We then conduct the first study of the long-term evolution processes at both the macro level, i.e., app-category usage, and the micro level, i.e., exact app usage. We discover that, at both levels, there is a growth stage triggered by the development of technologies. Also, there is a plateau stage at both levels, caused by high correlations across app categories and the Pareto effect of app usage, respectively. Additionally, the evolution of exact app usage undergoes an elimination stage owing to fierce competition among apps. Nevertheless, the diversity of app-category usage and app usage exhibits opposite trends: the diversity of app-category usage declines, while app usage diversifies. Our study provides useful implications for app developers, market intermediaries, and service providers. |
11:00-11:30 |
Financial Defaulter Detection on Online Credit Payment via Multi-view Attributed Heterogeneous Information Network Qiwei Zhong (Alibaba Group, Hangzhou China), Yang Liu (Institute of Computing Technology, Chinese Academy of Sciences), Xiang Ao (Institute of Computing Technology, Chinese Academy of Sciences), Binbin Hu (Ant Financial Services Group, Hangzhou China), Jinghua Feng (Alibaba Group, Hangzhou China), Jiayu Tang (Alibaba Group, Hangzhou China) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
AbstractDefault user detection is one of the backbones of credit risk forecasting and management. It aims at, given a set of corresponding features, e.g., patterns extracted from trading behaviors, predicting the polarity indicating whether a user will fail to make required payments in the future. Recent efforts attempted to incorporate attributed heterogeneous information networks (AHIN) for extracting complex interactive features of users and achieved remarkable success on discovering specific default users such as fraud and cash-out users. In this paper, we consider default users, a more general concept in credit risk, and propose a multi-view attributed heterogeneous information network based approach coined MAHINDER to address the special challenges. First, multiple views of user behaviors are adopted to learn personal profiles due to the endogenous aspect of financial default. Second, local behavioral patterns are specifically modeled since financial default is adversarial and accumulated. On real datasets containing 1.38 million users on the Alibaba platform, we investigate the effectiveness of MAHINDER, and the experimental results show that the proposed approach improves AUC by over 2.8% and Recall@Precision=0.1 by over 13.1% compared with the state-of-the-art methods. Meanwhile, MAHINDER has as good interpretability as tree-based methods like GBDT, which facilitates deployment on online platforms. |
11:30-12:00 |
Do Podcasts and Music Compete with One Another? Understanding Users’ Audio Streaming Habits Ang Li (University of Pittsburgh), Alice Wang (Spotify), Zahra Nazari (Spotify), Praveen Chandar (Spotify) and Benjamin Carterette (Spotify).
AbstractOver the past decade, podcasts have been one of the fastest growing online streaming media. Many online audio streaming platforms such as Pandora and Spotify that traditionally focused on music content have started to incorporate services related to podcasts. Although incorporating new media types such as podcasts has created tremendous opportunities for these streaming platforms to expand their content offering, it also introduces new challenges. Since the functional use of podcasts and music may largely overlap for many people, the two types of content may compete with one another for the finite amount of time that users may allocate for audio streaming. As a result, incorporating podcast listening may influence and change the way users have originally consumed music. Adopting quasi-experimental techniques, the current study assesses the causal influence of adding a new class of content on user listening behavior by using large-scale observational data collected from a widely used audio streaming platform. Our results demonstrate that podcast and music consumption compete slightly but do not replace one another -- users open another time window to listen to podcasts. In addition, users who have added podcasts to their music listening demonstrate significantly different consumption habits for podcasts vs. music in terms of streaming time, duration, and frequency. Taking all the differences as input features to a machine learning model, we demonstrate that a podcast listening session is predictable at the start of a new listening session. Our study provides a novel contribution for online audio streaming and consumption services to understand their potential consumers and to best support their current users with an improved recommendation system. |
12:00-12:30 |
Algorithmic Effects on the Diversity of Consumption on Spotify Ashton Anderson (University of Toronto), Lucas Maystre (Spotify, Inc.), Ian Anderson (Spotify, Inc.), Rishabh Mehrotra (Spotify, Inc.) and Mounia Lalmas (Spotify, Inc.).
AbstractOn many online platforms, users can engage with millions of pieces of content, which they discover either by searching on their own or through algorithmically-generated recommendations. The user experience, in turn, is largely shaped by the content that users interact with. In this work, we study the user experience on Spotify, a popular music streaming service, through the lens of diversity, i.e., how coherent the set of items a user consumes is, and find that it is a fundamental attribute. We construct high-fidelity embeddings of millions of songs based on listening behavior on Spotify, and use these embeddings to quantify how musically diverse every user is. We find that musical diversity is strongly associated with important metrics, such as user conversion and retention. On the other hand, we find that algorithmically-driven listening through recommendations pushes users towards being less musically diverse. Furthermore, we study users who become more diverse in their consumption over time and find that they do so by reducing their algorithmic consumption and increasing their organic consumption. Finally, we deploy a randomized experiment to further shed light on the relationship between recommendation and musical diversity. Our work illuminates a central tension in online platforms: how to recommend content that users are likely to enjoy in the short term while simultaneously ensuring they can remain diverse in their consumption in the long term. |
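One simple, illustrative way to quantify a user's musical diversity from song embeddings is the mean pairwise cosine distance of the songs they streamed; this is an assumption for illustration, not necessarily the paper's exact metric.

```python
import numpy as np

def mean_pairwise_cosine_distance(song_vectors):
    """One illustrative diversity score: average cosine distance between all
    pairs of a user's streamed-song embeddings (higher = more diverse)."""
    X = np.asarray(song_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)
    return float(np.mean(1.0 - sims[iu]))

rng = np.random.default_rng(0)
print(mean_pairwise_cosine_distance(rng.normal(size=(20, 32))))
```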
Research Tracks (5)
Web Mining-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
13:30-14:00 |
NERO: A Neural Rule Grounding Framework for Label-Efficient Relation Extraction Wenxuan Zhou (University of Southern California), Hongtao Lin (Pinterest Inc.), Bill Yuchen Lin (University of Southern California), Ziqi Wang (Tsinghua University), Junyi Du (University of Southern California), Leonardo Neves (Snapchat Inc.) and Xiang Ren (University of Southern California).
AbstractDeep neural models for relation extraction tend to be less reliable when perfectly labeled data is limited, despite their success in label-sufficient scenarios. Instead of seeking more instance-level labels from human annotators, here we propose to annotate frequent surface patterns to form labeling rules. These rules can be automatically mined from large text corpora and generalized via a soft rule matching mechanism. Prior works use labeling rules in an exact matching fashion, which inherently limits the coverage of sentence matching and results in a low-recall issue. In this paper, we present a neural approach to ground rules for RE, named NERO, which jointly learns a relation extraction module and a soft matching module. One can employ any neural relation extraction model as the instantiation of the RE module. The soft matching module learns to match rules with semantically similar sentences such that raw corpora can be automatically labeled and leveraged by the RE module (with much better coverage) as augmented supervision, in addition to the exactly matched sentences. Extensive experiments and analysis on two public and widely-used datasets demonstrate the effectiveness of the proposed NERO framework, compared with both rule-based and semi-supervised methods. Through user studies, we find that the time efficiency for a human to annotate rules and sentences is similar (0.30 vs. 0.35 min per label). In particular, NERO’s performance using 270 rules is comparable to the models trained using 3,000 labeled sentences, yielding a 9.5x speedup. Moreover, NERO can predict for unseen relations at test time and provide interpretable predictions. We will release our code to the community for future research in this direction. |
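A rough sketch of the soft rule matching idea, assuming pre-computed rule and sentence embeddings; NERO learns the matcher jointly with the extractor, so the threshold, names, and pre-computed vectors here are our own assumptions.

```python
import numpy as np

def soft_match_label(rule_vecs, rule_labels, sentence_vec, threshold=0.8):
    """Soft rule matching sketch: assign a sentence the relation label of the
    most similar rule embedding if cosine similarity exceeds a threshold."""
    s = sentence_vec / np.linalg.norm(sentence_vec)
    R = rule_vecs / np.linalg.norm(rule_vecs, axis=1, keepdims=True)
    sims = R @ s
    best = int(np.argmax(sims))
    return rule_labels[best] if sims[best] >= threshold else None

rng = np.random.default_rng(0)
rules = rng.normal(size=(3, 16))
print(soft_match_label(rules, ["born_in", "works_for", "located_in"], rules[1] + 0.01))
```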
14:00-14:30 |
Discriminative Topic Mining via Category-Name Guided Text Embedding Yu Meng (University of Illinois at Urbana-Champaign), Jiaxin Huang (University of Illinois Urbana-Champaign), Guangyuan Wang (University of Illinois at Urbana-Champaign), Zihan Wang (University of Illinois at Urbana-Champaign), Chao Zhang (Georgia Institute of Technology), Yu Zhang (University of Illinois at Urbana-Champaign) and Jiawei Han (University of Illinois at Urbana-Champaign).
AbstractMining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Topic models, which discover latent topics via modeling the corpus generative process, have proven fruitful on this task. However, such purely unsupervised approaches often generate topics that do not fit the user’s particular need and yield suboptimal performance on downstream tasks. To this end, we propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine distinctive topics from text corpora. This new task not only helps the user understand clearly and distinctively the topics he/she is most interested in, but also directly benefits keyword-driven classification tasks. We develop a novel category-name guided text embedding method, CatE, for discriminative topic mining. We conduct a comprehensive set of experiments to show that CatE mines a high-quality set of topics guided by category names only, and benefits a variety of downstream applications including weakly-supervised classification and lexical entailment direction identification. |
14:30-15:00 |
Generating Multi-hop Reasoning Questions to Improve Machine Reading Comprehension Jianxing Yu (Sun Yat-sen University), Xiaojun Quan (Sun Yat-sen University), Qinliang Su (Sun Yat-sen University) and Jian Yin (Sun Yat-sen University).
AbstractThis paper focuses on the topic of multi-hop question generation, which aims to generate questions that require reasoning over multiple sentences and relations to obtain answers. In particular, we first build an entity graph to integrate various entities scattered across the text by capturing their contextual relations. We then extract the sub-graph satisfying certain conditions on the relations and reasoning type, so as to obtain the reasoning chain for each question. Guided by the chain, we propose a holistic generator-evaluator network to form the questions, where such guidance helps ensure the reasonability of generated questions, which need multi-hop deduction to correspond to the answers. The generator is a sequence-to-sequence model, designed with several techniques to make the questions syntactically and semantically valid. The evaluator optimizes the generator network by employing a hybrid mechanism combining supervised and reinforced learning. Experimental results on the HotpotQA dataset demonstrate the effectiveness of our approach, where the generated samples can be used as pseudo training data to alleviate the data shortage problem for neural networks and help achieve state-of-the-art results on multi-hop machine comprehension. |
15:00-15:15 |
Anchored Model Transfer and Soft Instance Transfer for Cross-Task Cross-Domain Learning: A Study Through Aspect-Level Sentiment Classification Yaowei Zheng (Beihang University), Richong Zhang (Beihang University), Suyuchen Wang (Beihang University), Samuel Mensah (Beihang University) and Yongyi Mao (University of Ottawa).
AbstractSupervised learning relies heavily on readily available labeled data to infer an effective classification function. However, proposed methods under the supervised learning paradigm are faced with the scarcity of labeled data within domains, and are not generalized enough to adapt to other tasks. Transfer learning has proved to be a worthy choice to address these issues, by allowing knowledge to be shared across domains and tasks. In this paper, we propose two transfer learning methods, Anchored Model Transfer (AMT) and Soft Instance Transfer (SIT), which are both based on multi-task learning, account for model transfer and instance transfer, and can be combined into a common framework. We demonstrate the effectiveness of AMT and SIT for aspect-level sentiment classification, showing competitive performance against baseline models on benchmark datasets. Interestingly, we show that the integration of both methods (AMT+SIT) achieves state-of-the-art performance on the same task. |
15:15-15:30 |
Unsupervised Dual-Cascade Learning with Pseudo-Feedback Distillation for Query-Focused Extractive Summarization Haggai Roitman (IBM Research AI), Guy Feigenblat (IBM Research AI), Doron Cohen (IBM Research AI), Odellia Boni (IBM Research AI) and David Konopnicki (IBM Research AI).
AbstractWe propose Dual-CES -- a novel unsupervised, query-focused, multi-document extractive summarizer. Dual-CES builds on top of the Cross Entropy Summarizer (CES) and is designed to better handle the tradeoff between saliency and focus in summarization. To this end, Dual-CES employs a two-step dual-cascade optimization approach with saliency-based pseudo-feedback distillation. Overall, Dual-CES significantly outperforms all other state-of-the-art unsupervised alternatives. Dual-CES is even shown to be able to outperform strong supervised summarizers. |
Social Network-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Keyword Search over Knowledge Graphs via Static and Dynamic Hub Labelings Yuxuan Shi (Nanjing University), Gong Cheng (Nanjing University) and Evgeny Kharlamov (Bosch Center for Artificial Intelligence).
AbstractKeyword search is a prominent approach to querying Web data that has been extensively studied. For graph-structured data, a widely accepted semantics for keywords is based on group Steiner trees. For this NP-hard problem, existing algorithms with provable quality guarantees have prohibitive run time on large graphs. In this paper, we propose a series of practical approximation algorithms with a guaranteed quality of computed answers and very low run time. Our algorithms rely on Hub Labeling (HL), a structure that labels each vertex in a graph with a list of vertices reachable from it, which we use to compute distances and shortest paths. We devise two HLs: a conventional static HL that uses a new heuristic to improve the existing pruned landmark labeling, and a novel dynamic HL that inverts and aggregates query-relevant static labels to more efficiently process vertex sets. We show that our approach allows us to compute a reasonably good approximation of answers to keyword queries in milliseconds on knowledge graphs with millions of vertices. |
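For context, the core Hub Labeling distance query looks up hubs common to two vertex labels and takes the minimum combined distance; the toy labels below are illustrative assumptions, not data from the paper.

```python
def hub_distance(label_u, label_v):
    """Hub Labeling distance query: min over hubs common to both labels of
    d(u, hub) + d(hub, v). Each label maps hub -> distance."""
    common = set(label_u) & set(label_v)
    return min(label_u[h] + label_v[h] for h in common) if common else float("inf")

# Toy labels for two vertices sharing hubs 'a' and 'c'.
label_u = {"a": 2, "b": 5, "c": 1}
label_v = {"a": 4, "c": 3, "d": 2}
print(hub_distance(label_u, label_v))   # 4, via hub 'c'
```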
14:00-14:30 |
ASER: A Large-scale Eventuality Knowledge Graph Hongming Zhang (The Hong Kong University of Science and Technology), Xin Liu (The Hong Kong University of Science and Technology), Haojie Pan (The Hong Kong University of Science and Technology), Yangqiu Song (The Hong Kong University of Science and Technology) and Cane Wing-Ki Leung (Wisers AI Lab).
AbstractUnderstanding human language requires complex world knowledge. However, existing large-scale knowledge graphs mainly focus on knowledge about entities while ignoring knowledge about activities, states, or events, which are used to describe how entities or things act in the real world. To fill this gap, we develop ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11-billion-token unstructured textual data. ASER contains 15 relation types belonging to five categories, 194 million unique eventualities, and 64 million unique edges among them. Both human and extrinsic evaluations demonstrate the quality and effectiveness of ASER. |
14:30-15:00 |
paper2repo: GitHub Repository Recommendation for Academic Papers Huajie Shao (University of Illinois at Urbana-Champaign), Dachun Sun (University of Illinois at Urbana-Champaign), Jiahao Wu (University of Illinois Urbana-Champaign), Zecheng Zhang (University of Illinois at Urbana-Champaign), Aston Zhang (Amazon), Shuochao Yao (University of Illinois at Urbana-Champaign), Shengzhong Liu (University of Illinois at Urbana-Champaign), Tianshi Wang (University of Illinois at Urbana-Champaign), Chao Zhang (Georgia Institute of Technology) and Tarek Abdelzaher (University of Illinois at Urbana-Champaign).
AbstractGitHub has become a popular social application platform, where a large number of users post their open-source projects. In particular, an increasing number of researchers release repositories of source code related to their research papers in order to attract more people to follow their work. Motivated by this trend, we describe a novel item-item cross-platform recommender system, paper2repo, that recommends relevant repositories on GitHub that match a given paper in an academic search system such as Microsoft Academic. The key challenge is to identify the similarity between an input paper and its related repositories across the two platforms, without the benefit of human labeling. Towards that end, paper2repo integrates text encoding and constrained graph convolutional networks (GCN) to automatically learn and map the embeddings of papers and repositories into the same space, where proximity offers the basis for recommendation. To make our method more practical in real-life systems, labels used for model training are computed automatically from features of user actions on GitHub. In machine learning, such automatic labeling is often called distant supervision. To the authors' knowledge, this is the first distant-supervised cross-platform (paper to repository) matching system. We evaluate the performance of paper2repo on real-world data sets collected from GitHub and Microsoft Academic. Results demonstrate that it outperforms other state-of-the-art recommendation methods. |
15:00-15:15 |
Twitter User Location Inference Based on Representation Learning and Label Propagation Hechan Tian (State Key Laboratory of Mathematical Engineering and Advanced Computing), Meng Zhang (State Key Laboratory of Mathematical Engineering and Advanced Computing), Xiangyang Luo (State Key Laboratory of Mathematical Engineering and Advanced Computing), Fenlin Liu (State Key Laboratory of Mathematical Engineering and Advanced Computing) and Yaqiong Qiao (State Key Laboratory of Mathematical Engineering and Advanced Computing).
AbstractSocial network user location prediction technology has been widely used in various geospatial applications like public health monitoring and local advertising recommendation. Due to insufficient consideration of relationships between users and location indicative words, most existing prediction methods estimate label propagation probabilities solely based on statistical features, such as mention frequency and the number of common followed users, resulting in large location prediction errors. In this paper, a Twitter user location prediction method based on representation learning and label propagation is proposed. Firstly, a heterogeneous connection relation graph is constructed based on relationships between Twitter users and relationships between users and location indicative words, and relationships unrelated to geographic attributes are filtered out. Then, vector representations of users are learnt by using a series of user node sequences generated from the connection relation graph. Finally, label propagation probabilities between adjacent users are calculated based on the vector representations, and the locations of unknown users are predicted through iterative label propagation. Experiments on two representative Twitter datasets - GeoText and TwUs - show that the proposed method can accurately calculate label propagation probabilities based on vector representations and improve the accuracy of location prediction. Compared with existing typical Twitter user location prediction methods - GCN and MLP-TXT+NET - the median error distance of the proposed method is reduced by 18% and 16%, respectively. |
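A minimal sketch of weighted iterative label propagation as described above; in the paper the propagation probabilities come from learned user representations, whereas here they are supplied as toy inputs with names of our own choosing.

```python
def propagate_locations(neighbors, weights, known, iterations=10):
    """Iterative weighted label propagation over a user graph.

    neighbors[u] lists u's neighbors, weights[(u, v)] is the propagation
    probability, known maps users to fixed location labels.
    """
    labels = dict(known)
    for _ in range(iterations):
        updated = dict(labels)
        for u in neighbors:
            if u in known:
                continue
            votes = {}
            for v in neighbors[u]:
                if v in labels:
                    w = weights.get((u, v), weights.get((v, u), 0.0))
                    votes[labels[v]] = votes.get(labels[v], 0.0) + w
            if votes:
                updated[u] = max(votes, key=votes.get)
        if updated == labels:
            break
        labels = updated
    return labels

nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
w = {("a", "b"): 0.7, ("a", "c"): 0.2}
print(propagate_locations(nbrs, w, {"b": "NYC", "c": "LA"}))
```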
15:15-15:30 |
Are These Comments Triggering? Predicting Triggers of Toxicity in Online Discussions Hind Almerekhi (Qatar Foundation), Haewoon Kwak (Qatar Computing Research Institute), Joni Salminen (Qatar Computing Research Institute, HBKU; and Turku School of Economics) and Bernard Jansen (Qatar Computing Research Institute, Hamad Bin Khalifa University).
AbstractManaging the safety of online discussions from toxicity is a challenge that online communities struggle with. Therefore, identifying the causes or triggers of toxicity is essential for preventing toxic comments from manifesting in online discussions. In this research, we begin with defining toxicity triggers within discussion threads as non-toxic contributions that lead to other toxic comments. Then, we build an LSTM neural network for toxicity trigger detection using more than 221 thousand submissions containing more than 2.2 million comments from Reddit. The prediction model includes text-based features and derives features from past studies that pertain to shifts in sentiment, topic flow, and discussion context across comments in discussion threads. Our findings show that triggers of toxicity contain identifiable features, such as named entities and that incorporating shift features with the discussion context improves the performance of the prediction model by 6%, achieving an overall AUC score of 0.87. Topic and sentiment shifts frequently occur in discussions that contain toxicity triggers, indicating that shift analyses combined with the discussion context are useful for toxicity trigger detection in online discussions. We discuss implications for online communities and also provide a rich dataset for further analysis of online toxicity and its root causes. |
User Modeling-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Deep Global and Local Generative Model for Recommendation Huafeng Liu (Beijing Jiaotong University), Jingxuan Wen (Beijing Jiaotong University), Zhicheng Wu (Beijing Jiaotong University), Jiaqi Wang (Beijing Jiaotong University), Liping Jing (Beijing Jiaotong University) and Jian Yu (Beijing Jiaotong University).
AbstractDeep generative models, especially the variational auto-encoder (VAE), have been successfully employed by more and more recommendation systems. The reason is that they combine the flexibility of probabilistic generative models with the powerful non-linear feature representation ability of deep neural networks. The existing VAE-based recommendation models are usually proposed under a global assumption by incorporating simple priors, e.g., a single Gaussian, to regularize the latent variables. This strategy, however, is ineffective when the user is simultaneously interested in different kinds of items, i.e., the user’s preference may be highly diverse. In this paper, we thus propose a Deep Global and Local Generative Model for recommendation to consider both local and global structure among users (DGLGM) under the Wasserstein auto-encoder framework. Besides keeping the global structure like the existing models, DGLGM adopts a non-parametric Mixture Gaussian distribution with several components to capture the diversity of the users’ preferences. Each component corresponds to one local structure, and its optimal size can be determined via the automatic relevance determination technique. These two parts can be seamlessly integrated and enhance each other. The proposed DGLGM can be efficiently inferred by minimizing its penalized upper bound with the aid of a local variational optimization technique. Meanwhile, we theoretically analyze its generalization error bounds to guarantee its performance in sparse feedback data with diversity. By comparing with the state-of-the-art methods, the experimental results demonstrate that DGLGM consistently benefits the recommendation system in the top-N recommendation task. |
14:00-14:30 |
A Category-Aware Deep Model for Successive POI Recommendation on Sparse Check-in Data Fuqiang Yu (Shandong University), Lizhen Cui (Shandong University), Wei Guo (Shandong University), Xudong Lu (Shandong University), Qingzhong Li (Shandong University) and Hua Lu (Aalborg University).
AbstractIn location-based social networks (LBSNs), considerable amounts of POI check-in data have been accumulated. As a result, successive point-of-interest (POI) recommendation is increasingly popular. Existing successive POI recommendation methods only predict where a user will go next, ignoring when this behavior will occur. In this work, we focus on predicting POIs that will be visited by users in the next 24 hours, a more meaningful and rational task. Moreover, as check-in data in LBSNs are very sparse, it is challenging to accurately capture user preferences in temporal patterns. In this paper, we propose a category-aware deep model CatDM that incorporates POI category and geographical influence to reduce the search space and overcome data sparsity. We design two deep encoders based on LSTM to model the time series data. The first encoder captures user preferences in POI categories, whereas the second encoder exploits user preferences in POIs. Considering clock influence in the second encoder, we divide each user’s check-in history into several different time windows and develop a personalized attention mechanism for each window to facilitate CatDM to exploit temporal patterns. Moreover, to sort the candidate set, we consider four specific dependencies: user-POI, user-category, POI-time and POI-user current preferences. Extensive experiments are conducted on two large real datasets. The experimental results demonstrate that our CatDM outperforms the state-of-the-art models for successive POI recommendation on sparse data. |
14:30-15:00 |
Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning Gaole He (Renmin University of China), Junyi Li (School of Information, Renmin University of China), Xin Zhao (Renmin University of China, School of Information), Peiju Liu (Peking University) and Ji-Rong Wen (School of Information, Renmin University of China).
AbstractThe task of Knowledge Graph Completion (KGC) aims to automatically infer the missing fact information in a Knowledge Graph (KG). In this paper, we take a new perspective that aims to leverage rich user-item interaction data (user interaction data for short) for improving the KGC task. Our work is inspired by the observation that many KG entities correspond to online items in application systems. However, the two kinds of data sources have very different intrinsic characteristics, and a simple fusion strategy is likely to hurt the original representation performance. To address this challenge, we propose a novel adversarial learning approach for leveraging user interaction data for the KGC task. Our generator is isolated from user interaction data, and improves itself according to the feedback from the discriminator. The discriminator takes the learned useful information from user interaction data as input, and gradually enhances its evaluation capacity in order to identify the fake samples generated by the generator. To discover the implicit entity preferences of users, we design an elaborate collaborative learning algorithm based on graph neural networks, which is jointly optimized with the discriminator. Such an approach effectively alleviates the issues of data heterogeneity and semantic complexity for the KGC task. Extensive experiments on three real-world datasets have demonstrated the effectiveness of our approach on the KGC task. |
15:00-15:15 |
Influence Function based Data Poisoning Attacks to Top-N Recommender Systems Minghong Fang (Iowa State University), Neil Zhenqiang Gong (Duke University) and Jia Liu (Iowa State University).
AbstractRecommender systems are an essential component of web services used to engage users. Popular recommender systems model user preferences and item properties using a large amount of crowdsourced user-item interaction data, e.g., rating scores; then the top-N items that best match a user's preference are recommended to the user, where the matching is determined by the modeled user preferences and item properties. In this work, we show that an attacker can launch a data poisoning attack against a recommender system, i.e., an attacker can spoof a recommender system into making recommendations as the attacker desires by injecting fake users with carefully crafted user-item interaction data. Specifically, an attacker can spoof a recommender system into recommending a target item to as many normal users as possible. We focus on matrix factorization based recommender systems because they have been widely deployed in industry. Given the number of fake users the attacker can inject, we formulate the crafting of rating scores for the fake users as an optimization problem, whose objective is to maximize the number of normal users to whom the target item is recommended. However, this optimization problem is challenging to solve as it is a non-convex integer programming problem. To address the challenge, we develop several techniques to indirectly solve the optimization problem. For instance, we leverage the influence function to select a subset of normal users who are influential to the recommendations and solve our formulated optimization problem based on these influential users. We show the effectiveness of our attacks on two benchmark datasets. Moreover, we show that even if the recommender system detects fake users based on statistical analysis of their rating scores, our attacks are still effective as the detector misses a large fraction of fake users. |
15:15-15:30 |
Few-Shot Learning for New User Recommendation in Location-based Social Networks Ruirui Li (University of California, Los Angeles), Xian Wu (University of Notre Dame), Xiusi Chen (University of California, Los Angeles) and Wei Wang (University of California, Los Angeles).
AbstractThe proliferation of GPS-enabled devices, such as smartphones, establishes the prosperity of location-based social networks (LBSNs), which results in a tremendous amount of user check-ins. These check-ins bring in preeminent opportunities to understand users' preferences and facilitate matching between users and businesses. However, the user check-ins are extremely sparse due to the huge user and business bases, which makes matching a daunting task. In this work, we investigate the recommendation problem in the context of identifying potential new customers for businesses in LBSNs. In particular, we focus on investigating the geographical influence, composed of geographical convenience and geographical dependency. In addition, we leverage metric-learning-based few-shot learning to fully utilize the user check-ins and facilitate the matching between users and businesses. To evaluate our proposed method, we conduct a series of experiments to extensively compare with 13 baselines using two real-world datasets. The results demonstrate that the proposed method outperforms all these baselines by a significant margin. |
Society (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-13:45 |
War of Words: The Competitive Dynamics of Legislative Processes Victor Kristof (Ecole Polytechnique Fédérale de Lausanne), Matthias Grossglauser (Ecole Polytechnique Fédérale de Lausanne) and Patrick Thiran (Ecole Polytechnique Fédérale de Lausanne).
AbstractA body of law is an example of a dynamic corpus of text documents that is jointly maintained by a group of editors, who compete and collaborate in complex constellations. Our goal is to develop predictive models for this process, thereby shedding light on the competitive dynamics of parliamentarians making laws. For this purpose, we curated a dataset of 450 000 legislative edits introduced by European parliamentarians over the last ten years. An edit modifies the status quo of a law, and may be in competition with another edit if it modifies the same part of that law. We propose a model for predicting the success of such edits, in the face of both inertia of the status quo, and competition between overlapping edits. We include various features of the parliamentarians and of the edits to analyze the dynamics of the legislative process. The parameters of this model can be interpreted in terms of the influence of parliamentarians and of the controversy of laws. We show that the intrinsic influence of the parliamentarians helps them pass edits for laws of high controversy, but is of lesser importance for laws of low controversy. We finally show that incorporating additional latent features further boosts the predictive power by 14%, and that these features lend themselves to meaningful interpretation. |
13:45-14:15 |
Keeping out the Masses: Understanding the Popularity and Implications of Internet Paywalls Panagiotis Papadopoulos (Brave Software Inc.), Peter Snyder (Brave Software Inc.), Dimitrios Athanasakis (Brave Software Inc.) and Benjamin Livshits (Brave Software Inc., Imperial College London).
AbstractFunding the production of quality online content is a pressing problem for content producers. The most common current funding method, online advertising, is rife with well-known performance and privacy harms, and an intractable subject-agent conflict; many users do not want to see advertisements, depriving the site of needed funding. Because of these negative aspects of advertisement-based funding, paywalls are an increasingly popular alternative for websites. This shift to an increasingly “pay-for-access” web has potentially huge implications for the web and society. Instead of a system where information (nominally) flows freely, paywalls create a web where high-quality information is available to fewer and fewer people, leaving other web users with less information, possibly of lower quality and accuracy. Despite the potential significance of a move from an “advertising-but-open” web to a “paywalled” web, we find this issue understudied. This work addresses this gap in our understanding by measuring how widely paywalls have been adopted, what kinds of sites use paywalls, and the distribution of policies enforced by paywalls. A partial list of our findings includes that (i) paywall use has increased, and at an increasing rate (2x more paywalls every 6 months), (ii) paywall adoption differs by country (e.g., 18.75% in the US, 12.69% in Australia), (iii) paywall deployment significantly changes how users interact with the site (e.g., higher bounce rates, fewer incoming links), (iv) the median cost of annual paywall access is 108 USD per site, and (v) paywalls are in general trivial to circumvent. Finally, we present the design of a novel, automated system for detecting whether a site uses a paywall, through the combination of runtime browser instrumentation and repeated programmatic interactions with the site. We intend this classifier to augment future, longitudinal measurements of paywall use and behavior. |
14:15-14:45 |
Herding a Deluge of Good Samaritans: How GitHub Projects Respond to Increased Attention Danaja Maldeniya (University of Michigan), Ceren Budak (University of Michigan), Lionel P. Robert Jr. (University of Michigan) and Daniel M. Romero (University of Michigan).
AbstractCollaborative crowdsourcing is a well-established model of work in the information economy. This is apparent nowhere more than in the case of open source software development. The structure and operating dynamics of these virtual and loosely-knit teams differ from traditional organizations. As a result, little is known about how their behavior may change in response to an increase in external attention. To understand these changes, we analyze millions of actions of thousands of contributors in over 1200 open source software projects that topped the GitHub Trending Projects page and thus experienced a large increase in attention, in comparison to a control group of projects identified through propensity score matching. In carrying out our research, we use the lens of organizational change management, which considers the challenges teams face during rapid growth and how they adapt their work routines, organizational structure, and management style. We show that, relative to the control group, trending results in an explosive growth in the effective team size. However, most newcomers make only shallow and transient contributions such as reporting and fixing a specific bug, while a few show levels of commitment matching that of the original members. In response, the original team transitions towards administrative roles, responding to requests and reviewing and integrating work done by newcomers. In the resulting aftermath, trending projects evolve towards a more distributed coordination model with newcomers becoming more central, albeit in limited ways. Additionally, project teams become more modular with subgroups specializing in different aspects of the project. We discuss broader implications for collaborative crowdsourcing teams that face attention shocks. |
14:45-15:15 |
Don’t Let Me Be Misunderstood: Comparing Intentions and Perceptions in Online Discussions Jonathan P. Chang (Cornell University), Justin Cheng (Facebook) and Cristian Danescu-Niculescu-Mizil (Cornell University).
AbstractDiscourse involves two perspectives: a person's intention in making an utterance and others' perception of that utterance. Previous studies of online discussions have largely taken the latter third-party perspective, e.g., relying on crowdsourced labels to quantify properties like sentiment and subjectivity. By contrast, in this work we present a computational framework for exploring both perspectives: the speaker's intentions and how they are perceived. Intention is, however, difficult to capture as only the actual author of an utterance knows their intention with certainty. To address this, we combine logged data about public comments on Facebook with a survey of almost 20,000 people about their intentions in writing these comments or about their perceptions of comments that others had written. In particular, we focus on judgments of whether a comment was stating a fact or an opinion, since prior work has shown that these are often confused. We show that intentions and perceptions diverge in consequential ways. People are more likely to perceive opinions than to intend them, and linguistic cues that signal how an utterance is intended can differ from those that signal how it will be perceived. Furthermore, this misalignment between intentions and perceptions can be linked to the future health of a conversation: when a comment whose author intended to share a fact is misperceived as sharing an opinion, the subsequent conversation is more likely to derail into uncivil behavior than when the comment is perceived as intended. Altogether, these findings may inform the design of discussion platforms that better promote positive interactions. |
Security (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Go See a Specialist? Predicting Cybercrime Sales on Online Anonymous Markets from Vendor and Product Characteristics Rolf van Wegberg (Delft University of Technology), Fieke Miedema (Delft University of Technology), Ugur Akyazi (Delft University of Technology), Arman Noroozian (Delft University of Technology), Bram Klievink (Leiden University) and Michel van Eeten (Delft University of Technology).
AbstractMany cybercriminal entrepreneurs lack the skills and techniques to provision certain parts of their business model, leading them to outsource these parts to specialized criminal vendors. Online anonymous markets, from Silk Road to AlphaBay, have been used to search for these products and contract with their criminal vendors. While one listing of a product generates high sales numbers, another identical listing fails to sell. In this paper, we investigate which factors determine the performance of cybercrime products. Does success depend on the characteristics of the product or of the vendor? Or neither? To answer this question, we analyze scraped data on the business-to-business cybercrime segments of the AlphaBay market (2015-2017), consisting of 7,543 listings from 1,339 vendors which have been sold at least 126,934 times. We construct variables to capture price and product differentiators, like refund policies and customer support. We capture the influence of vendor characteristics by identifying five distinct vendor profiles based on latent profile analysis of six properties, such as experience and reputation. We leverage these product and vendor characteristics to empirically predict the number of sales of cybercrime solutions, whilst controlling for the lifespan and the type of solution. We find that all vendor profiles - either positively or negatively - influence sales. Consistent with earlier insights into carding forums, we identify prevalent product differentiators to be influencing the relative success of a product. While all these product differentiators do correlate significantly with product performance, their explanatory power is lower than that of vendor profiles. When outsourcing, the vendor seems to be of more importance to the buyers than product differentiators or the price. |
14:00-14:30 |
Filter List Generation for Underserved Regions Alexander Sjösten (Chalmers University of Technology), Peter Snyder (Brave Software Inc.), Antonio Pastor (Universidad Carlos III de Madrid), Panagiotis Papadopoulos (Brave Software Inc.) and Benjamin Livshits (Brave Software Inc & Imperial College of London).
AbstractFilter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g. ads, trackers, paywall libraries) so that they can be blocked by browsers and extensions. Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on "popular" websites, or resources that appear on a large number of "unpopular" websites. A crowd-sourcing strategy performs poorly, however, for parts of the web with small "crowds", such as regions of the web serving languages with (relatively) few speakers. This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages. We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 2,270 ad and ad-related resources (1,901 unique) when applied to 6,475 pages targeting these three regions. We hope that this work can be part of an increased effort at ensuring that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions. |
14:30-14:45 |
I’ve Got Your Packages: Harvesting Customers’ Delivery Order Information using Package Tracking Number Enumeration Attacks Simon Woo (SKKU), Hyoungshick Kim (Sungkyunkwan University), Hanbin Jang (SKKU) and Woojung Ji (SKKU).
AbstractA package tracking number (PTN) is widely used to monitor and track a shipment. Usually, a package tracking number, which is a sequence of digits, is associated with information about a sender and a receiver, as well as the package delivery status. Through the lenses of security and privacy, however, a package tracking number can possibly reveal certain personal information, leading to privacy breaches. In this work, we examine the privacy issues associated with online package tracking systems used in the top three most popular package delivery service providers in the world (FedEx, DHL, and UPS) and found that those websites reveal users' personal data given a PTN. Moreover, we discovered that PTNs are highly structured and predictable via PTN enumeration attacks. Further, such users' personal data from PTNs can be massively collected. We found that there is no security policy to limit the number of consecutive attempts in package tracking services. We experimented and analyzed more than one million package tracking records obtained from FedEx, DHL, and UPS, and showed that within 5 attempts, an attacker can efficiently guess more than 90% of PTNs for FedEx and DHL, and close to 50% of PTNs for UPS by exploiting consecutive PTN patterns. In addition, we present two practical concrete case studies: 1) to infer business transaction information and 2) to uniquely identify recipients. We demonstrate that some companies can intentionally obtain their competitors' business and customer information with massively collected PTNs. Also, we found that more than 109 recipients can be uniquely identified with fewer than 10 comparisons by linking the PTN information with the online people search service, Whitepages. Our research is the first to uncover how PTNs can be used to leak other personal information, and to reveal the fact that current PTN systems can be misused to jeopardize user privacy. |
14:45-15:15 |
The Pod People: Understanding Manipulation of Social Media Popularity via Reciprocity Abuse Janith Weerasinghe (New York University), Bailey Flanigan (Drexel University), Aviel Stein (Drexel University), Damon McCoy (New York University) and Rachel Greenstadt (New York University).
AbstractOnline Social Network (OSN) users' demand to increase their account popularity has driven the creation of an underground ecosystem that provides services or techniques to help users manipulate content curation algorithms. One method of subversion that has recently emerged occurs when users form groups, called pods, to facilitate reciprocity abuse, where each member reciprocally interacts with content posted by other members of the group. We collect 1.8 million Instagram posts that were posted in pods hosted on Telegram. We first summarize the properties of these pods and how they are used, uncovering that they are easily discoverable by Google search and have a low barrier to entry. We then create two machine learning models for detecting Instagram posts that have gained interaction through two different kinds of pods, achieving 0.91 and 0.94 AUC, respectively. Finally, we find that pods are effective tools for increasing users' Instagram popularity: we estimate that pod utilization leads to a significantly increased level of likely organic comment interaction on users' subsequent posts. |
Search (3)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Generating Clarifying Questions for Information Retrieval Hamed Zamani (Microsoft), Susan Dumais (Microsoft), Nick Craswell (Microsoft), Paul Bennett (Microsoft) and Gord Lueck (Microsoft).
AbstractSearch queries are often short, and the underlying user intent may be ambiguous. This makes it challenging for search engines to predict possible intents, only one of which may pertain to the current user. To address this issue, search engines often diversify the result list and present documents relevant to multiple intents of the query. An alternative approach is to ask the user a question to clarify her information need. Asking clarifying questions is particularly important for scenarios with "limited bandwidth" interfaces, such as voice-only and small-screen devices. In addition, our user studies and large-scale online experiment show that asking clarifying questions is also useful in web search. Although some recent studies have pointed out the importance of asking clarifying questions, generating clarifying questions for open-domain search tasks remains unstudied and is the focus of this paper. The lack of training data for this task, even within the major search industry, makes it challenging. To mitigate this issue, we first identify a taxonomy of clarification for open-domain search queries by analyzing large-scale query reformulation data sampled from Bing search logs. This taxonomy leads us to a set of question templates and a simple yet effective slot filling algorithm. We further use this model as a source of weak supervision to automatically generate clarifying questions for training. Furthermore, we propose supervised and reinforcement learning models for generating clarifying questions learned from weak supervision data. We also investigate methods for generating candidate answers for each clarifying question, so users can select from a set of pre-defined answers. Human evaluation of the clarifying questions and candidate answers for hundreds of search queries demonstrates the effectiveness of the proposed solutions. |
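The template-plus-slot-filling idea can be pictured with a toy example like the one below; the templates and aspect lists here are invented for illustration, whereas the paper derives its taxonomy, templates, and slot filler from Bing query-reformulation data:

```python
# Toy clarifying-question generation: pick a template and fill its slots with
# candidate aspects (hard-coded here; mined from query reformulations in the paper).
TEMPLATES = [
    "What do you want to know about {query}?",
    "Which aspect of {query} are you interested in: {aspects}?",
    "Are you looking for {aspect} of {query}?",
]

def clarify(query, aspects):
    if not aspects:
        return TEMPLATES[0].format(query=query)
    if len(aspects) > 1:
        return TEMPLATES[1].format(query=query, aspects=", ".join(aspects))
    return TEMPLATES[2].format(query=query, aspect=aspects[0])

print(clarify("jaguar", ["the car brand", "the animal"]))
```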
14:00-14:30 |
Leading Conversational Search by Suggesting Useful Questions Corbin Rosset (Microsoft), Chenyan Xiong (Microsoft), Xia Song (Microsoft), Daniel Campos (Microsoft), Nick Craswell (Microsoft), Saurabh Tiwary (Microsoft) and Paul Bennett (Microsoft).
Abstract"People Also Ask" question suggestion is a popular feature in commercial search engines and a crucial gateway to lead users to more conversational search experiences. This paper fundamentally studies this question suggestion function, including offline metrics, suggestion models, weak supervision data, and online experiments. We first establish a novel offline evaluation metric, Usefulness, which reaches beyond just relevance and requires leading the search session with more conversational "next-turn" suggestions. We construct the first public benchmark dataset for Useful question suggestion. Then we develop two suggestion systems, a BERT retrieval model and a GPT-2 generation model. To guide the suggestion models to provide more Useful questions, we invent a new inductive training method that guides suggestion models to more next-turn suggestions using weak supervision from coherent and informative search sessions mined from the search log. Our offline experiments demonstrate the crucial role our "next-turn" inductive training plays in improving Usefulness over a strong online system. Our online A/B evaluation shows that our "next-turn" focused question suggestions receive 8% more user clicks than the previous system. |
14:30-14:45 |
Matching Cross Network for Learning to Rank in Personal Search Zhen Qin (Google), Zhongliang Li (Google), Michael Bendersky (Google) and Donald Metzler (Google).
AbstractRecent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-the-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach. |
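A minimal sketch of the two ingredients named in the abstract: an element-wise multiplication match of query and document embeddings, followed by a latent-cross style gating with side information. The dimensions, the single projection matrix, and the linear scoring head are illustrative assumptions, not the production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 32, 8                                  # embedding dim, side-information dim
W_side = rng.normal(scale=0.1, size=(s, d))   # projects side info onto the matching space
w_out = rng.normal(scale=0.1, size=d)         # toy linear scoring head

def score(query_emb, doc_emb, side_feats):
    match = query_emb * doc_emb               # element-wise matching signal
    gate = 1.0 + side_feats @ W_side          # latent cross: (1 + projected side info)
    hidden = match * gate                     # side info modulates the match features
    return float(hidden @ w_out)

q, doc, side = rng.normal(size=d), rng.normal(size=d), rng.normal(size=s)
print(score(q, doc, side))
```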
14:45-15:00 |
End-to-End Deep Attentive Personalized Item Retrieval for Online Content-sharing Platforms Jyun-Yu Jiang (University of California, Los Angeles), Tao Wu (Google), Georgios Roumpos (Google), Heng-Tze Cheng (Google), Xinyang Yi (Google), Ed Chi (Google), Harish Ganapathy (Google), Nitin Jindal (Google), Pei Cao (Google) and Wei Wang (University of California, Los Angeles).
AbstractModern online content-sharing platforms host billions of items like music, videos, and products uploaded by various providers for users to discover items of their interests. To satisfy the information needs, the task of effective item retrieval (or item search ranking) given user search queries has become one of the most fundamental problems for online content-sharing platforms. Moreover, the same query can represent different search intents for different users, so personalization is also essential for providing more satisfactory search results. Different from other similar research tasks, such as ad-hoc retrieval and product retrieval with copious words and reviews, items in content-sharing platforms usually lack sufficient descriptive information and related meta-data as features. In this paper, we propose the end-to-end deep attentive model (EDAM) to deal with personalized item retrieval for online content-sharing platforms using only discrete personal item history and queries. Each discrete item in the personal item history of a user and its content provider are first mapped to embedding vectors as continuous representations. A query-aware attention mechanism is then applied to identify the relevant contexts in the user history and construct the overall personal representation for a given query. Finally, an extreme multi-class softmax classifier aggregates the representations of both query and personal item history to provide personalized search results. We conduct extensive experiments on a large-scale real-world dataset with hundreds of millions of users from a large-scale online content-sharing platform. The experimental results demonstrate that our proposed approach significantly outperforms several competitive baseline methods. It is also worth mentioning that this work utilizes a massive dataset from a real-world commercial content-sharing platform for personalized item retrieval to provide more insightful analysis from an industrial perspective. |
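The query-aware attention step can be sketched as a softmax over history items weighted by their affinity with the query, pooled into one personal representation. This is a toy rendering under random embeddings; the full EDAM model also embeds content providers and ends in an extreme multi-class softmax classifier:

```python
import numpy as np

def personal_representation(query_emb, history_embs):
    """Attention-pool a user's item history, weighted by relevance to the query."""
    logits = history_embs @ query_emb                  # query-item affinities
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                           # softmax attention weights
    return weights @ history_embs                      # weighted sum of history items

rng = np.random.default_rng(0)
query = rng.normal(size=32)
history = rng.normal(size=(20, 32))                    # 20 previously consumed items
print(personal_representation(query, history).shape)   # (32,)
```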
Mobile (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Privacy-preserving AI Services Through Data Decentralization Christian Meurisch (TU Darmstadt), Bekir Bayrak (TU Darmstadt) and Max Mühlhäuser (TU Darmstadt).
AbstractUser services increasingly base their actions on AI models, e.g., to offer personalized and proactive support. However, the underlying AI algorithms require a continuous stream of personal data---leading to privacy issues, as these algorithms typically run in the provider's cloud, and thus, users have to share data out of their sovereign territory. Current privacy-preserving concepts are either not applicable to such AI-based services or to the disadvantage of any party. This paper presents PrivAI, a new decentralized and privacy-by-design platform for overcoming the need for sharing user data to benefit from personalized AI services. In short, PrivAI complements existing approaches to personal data stores, but strictly enforces the confinement of raw user data. PrivAI further addresses the resulting challenges by (1) dividing AI algorithms into cloud-based general model training and a subsequent local personalization step, and by (2) loading confidential AI models into a trusted execution environment, and thus, protecting provider's intellectual property. Our experiments show the feasibility and effectiveness of PrivAI with comparable performance as currently-practiced approaches. |
14:00-14:30 |
CellRep: Usage Representativeness Modeling and Correction Based on Multiple City-Scale Cellular Networks Zhihan Fang (Rutgers University), Guang Wang (Rutgers University), Shuai Wang (Southeast University), Chaoji Zuo (Rutgers University), Fan Zhang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) and Desheng Zhang (Rutgers University).
AbstractUnderstanding representativeness in cellular web logs at city scale is essential for web applications. Most of the existing work on cellular web analyses or applications is built upon data from a single network in a city, which may not be representative of the overall usage patterns since multiple cellular networks coexist in most cities in the world. In this paper, we conduct the first comprehensive investigation of multiple cellular networks in a city with a 100% user penetration rate. We study the correlation and difference of web usage patterns (e.g., internet access services) between diverse cellular networks along spatial and temporal dimensions, to quantify how representative web usage from a single network is of the usage patterns of all users in the same city. Moreover, relying on three external datasets, we study the correlation between the representativeness and contextual factors (e.g., Point-of-Interest, population, and mobility) to explain the potential causalities for the representativeness difference. We found that contextual diversity is a key reason for representativeness difference, and representativeness has a significant impact on the performance of real-world applications. Based on the analysis results, we further design a correction model to address the bias of single cellular networks and improve representativeness by 45.8%. |
14:30-15:00 |
What is the Human Mobility in a New City: Transfer Mobility Knowledge Across Cities Tianfu He (Harbin Institute of Technology), Jie Bao (JD Intelligent City Research), Ruiyuan Li (JD Intelligent City Research), Sijie Ruan (JD Intelligent City Research), Yanhua Li (Worcester Polytechnic Institute (WPI)), Li Song (Meituan-Dianping), Hui He (Room B618,Dorm.10, HIT,Harbin,China,150001) and Yu Zheng (JD Intelligent City Research).
AbstractHuman mobility, e.g., GPS trajectories of vehicles, sharing bikes, and mobile devices, reflects people's travel patterns and preferences, which are especially crucial for urban applications such as urban planning and business location selection. However, collecting a large set of human mobility data is not easy because of the privacy and commercial concerns, as well as the high cost to deploy sensors and a long time to collect the data, especially in newly developed cities. Realizing this, in this paper, based on the intuition that the human mobility is driven by the mobility intentions reflected by the origin and destination (or OD) features, as well as the preference to select the path between them, we investigate the problem to generate mobility data for a new target city, by transferring knowledge from mobility data and multi-source data of the source cities. Our framework contains three main stages: 1) mobility intention transfer, which learns a latent unified mobility intention distribution across the source cities, and transfers the model of the distribution to the target city; 2) OD generation, which generates the OD pairs in the target city based on the transferred mobility intention model, and 3) path generation, which generates the paths for each OD pair, based on a utility model learned from the real trajectory data in the source cities. Also, a demo of our trajectory generator is publicly available online for two city regions. Extensive experiment results over four regions in China validate the effectiveness of the proposed solution. Besides, an on-field case study is presented in a newly developed region, i.e., Xiongan, China. With the generated trajectories in the new city, many trajectory mining techniques can be applied. |
Web Mining-B (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Learning to Classify: A Flow-Based Relation Network for Encrypted Traffic Classification Wenbo Zheng (School of software Engineering, Xi'an Jiao Tong University), Shaocong Mo (Zhejiang University) and Yang Zhao (Xi'an Jiaotong University).
AbstractAs the size and source of network traffic increase, so does the challenge of monitoring and analyzing network traffic. The main challenges in classifying encrypted traffic are the imbalanced nature of network data and an excessive dependence on data size. In this paper, we propose a meta-learning approach, named RBRN, to address these problems. The RBRN is an end-to-end classification model that learns representative features from the raw flows and then classifies them in a unified framework. Moreover, we design a hallucinator to produce additional training samples for the imbalanced classification, and then focus on meta-learning to classify unseen categories from few labeled samples. We validate the effectiveness of the RBRN on a real-world network traffic dataset, and the experimental results demonstrate that the RBRN can achieve an excellent classification performance and outperform other methods on encrypted traffic classification. |
14:00-14:30 |
Adaptive Probabilistic Word Embedding Shuangyin Li (Department of Computer Science, South China Normal University), Yu Zhang (Southern University of Science and Technology), Rong Pan (Sun Yat-sen University) and Kaixiang Mo (The Hong Kong University of Science and Technology).
AbstractWord embeddings have been widely used and proven to be effective in many natural language processing and text modeling tasks. It is obvious that one ambiguous word could have very different semantics in various contexts, which is called polysemy. Most existing works aim at generating only one single embedding for each word while a few works build a limited number of embeddings to present different meanings for each word. However, it is hard to determine the exact number of senses for each word as the word meaning is dependent on contexts. To address this problem, we propose a novel Adaptive Probabilistic Word Embedding (APWE) model, where the word polysemy is defined over a latent interpretable semantic space. Specifically, at first each word is represented by an embedding in the latent semantic space and then based on the proposed APWE model, the word embedding can be adaptively adjusted and updated based on different contexts to obtain the tailored word embedding. Empirical comparisons with state-of-the-art models demonstrate the superiority of the proposed APWE model. |
14:30-15:00 |
The POLAR Framework: Polar Opposites Enable Interpretability of Pre-Trained Word Embeddings Binny Mathew (IIT Kharagpur), Sandipan Sikdar (RWTH Aachen University), Florian Lemmerich (RWTH Aachen University) and Markus Strohmaier (RWTH Aachen University & GESIS).
AbstractWe introduce ‘POLAR’ - a framework that adds interpretability to pre-trained word embeddings via the adoption of semantic differentials. Semantic differentials are a psychometric construct for measuring the semantics of a word by analysing its position on a scale between two polar opposites (e.g., cold - hot, soft - hard). The original idea of our approach is to transform existing, pre-trained word embeddings via semantic differentials to a new “polar” space where dimensions are interpretable. The framework allows for selecting discriminative dimensions from a set of polar dimensions provided by an oracle. We show that the interpretable dimensions selected by our framework align with human judgement. We also demonstrate the effectiveness of our framework by deploying it to various downstream tasks where our interpretable word embeddings achieve a performance that is comparable to the original word embeddings. These results together demonstrate that interpretability could be added to word embeddings without compromising on the performance. Our work is relevant for researchers or engineers interested in interpreting trained word embeddings. |
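The core transformation can be approximated in a few lines: each polar dimension is the difference vector between two opposite words, and a word's interpretable coordinates are its projections onto those directions. The snippet below is a simplified projection with made-up vectors; the paper formalises the transformation as a change of basis over an oracle-provided set of opposites:

```python
import numpy as np

def polar_transform(word_vec, opposites, emb):
    """Project a word onto interpretable 'polar' axes, e.g. cold->hot, soft->hard."""
    axes = np.stack([emb[b] - emb[a] for a, b in opposites])   # one direction per pair
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    return axes @ word_vec                                     # one coordinate per polar axis

rng = np.random.default_rng(0)
vocab = ["cold", "hot", "soft", "hard", "lava"]
emb = {w: rng.normal(size=50) for w in vocab}                  # stand-in embeddings
pairs = [("cold", "hot"), ("soft", "hard")]
print(polar_transform(emb["lava"], pairs, emb))                # 2 interpretable scores
```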
15:00-15:15 |
Domain Adaptation with Category Attention Network for Deep Sentiment Analysis Dongbo Xi (Institute of Computing Technology, Chinese Academy of Sciences), Fuzhen Zhuang (Institute of Computing Technology, Chinese Academy of Sciences), Ganbin Zhou (Tencent), Xiaohu Cheng (Tencent), Fen Lin (Tencent) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
AbstractDomain adaptation tasks such as cross-domain sentiment classification aim to utilize existing labeled data in the source domain and unlabeled or few labeled data in the target domain to improve the performance in the target domain via reducing the shift between the data distributions. Existing cross-domain sentiment classification methods need to distinguish pivots, i.e., the domain-shared sentiment words, and non-pivots, i.e., the domain-specific sentiment words, for excellent adaptation performance. In this paper, we first design a Category Attention Network (CAN), and then propose a model named CAN-CNN to integrate CAN and a Convolutional Neural Network (CNN). On the one hand, the model regards pivots and non-pivots as unified category attribute words and can automatically capture them to improve the domain adaptation performance; on the other hand, the model makes an attempt at interpretability to learn the transferred category attribute words. Specifically, the optimization objective of our model has three different components: 1) the supervised classification loss; 2) the distributions loss of category feature weights; 3) the domain invariance loss. Finally, the proposed model is evaluated on three public sentiment analysis datasets and the results demonstrate that CAN-CNN can outperform other various baseline methods. |
15:15-15:30 |
Review-guided Helpful Answer Identification in E-commerce Wenxuan Zhang (The Chinese University of Hong Kong), Wai Lam (The Chinese University of Hong Kong), Yang Deng (The Chinese University of Hong Kong) and Jing Ma (The Chinese University of Hong Kong).
AbstractProduct-specific question answering platforms can greatly help to address the concerns of potential customers. However, the user-provided answers on such platforms often vary a lot in their qualities. Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing. Accurately predicting the helpfulness of an answer to a given question and thus identifying helpful answers is becoming a demanding need. Since the helpfulness of an answer depends on multiple perspectives instead of only topical relevance investigated in typical QA tasks, common answer selection algorithms are insufficient for tackling this task. In this paper, we propose the Review-guided Answer Helpfulness Prediction (RAHP) model that not only considers the interactions between QA pairs but also investigates the opinion coherence between the answer and crowds' opinions reflected in the reviews, which is another important factor to identify helpful answers. Moreover, we tackle the task of determining opinion coherence as a language inference problem and explore the utilization of pre-training strategy to transfer the textual inference knowledge obtained from a specifically designed trained network. Extensive experiments conducted on real-world data across seven product categories show that our proposed model achieves superior performance on the prediction task. |
Semantics (3)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network Jiaming Shen (University of Illinois at Urbana-Champaign), Zhihong Shen (Microsoft), Chenyan Xiong (Microsoft), Chi Wang (Microsoft), Kuansan Wang (Microsoft) and Jiawei Han (University of Illinois at Urbana-Champaign).
AbstractA taxonomy consists of machine-interpretable semantics and provides valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of (query concept, anchor concept) pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data. Extensive experiments on three large-scale datasets from different domains demonstrate both the effectiveness and the efficiency of TaxoExpan for taxonomy expansion. |
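The self-supervision idea can be illustrated directly: existing parent-child edges yield positive (anchor, query) pairs, and randomly sampled non-parents yield negatives. The sketch below assumes a toy dict-of-children taxonomy and omits the position-enhanced GNN that encodes each anchor's local egonet:

```python
import random

def make_self_supervision(taxonomy, n_negatives=2, seed=0):
    """Build (anchor, query, label) triples from an existing taxonomy.

    taxonomy: dict mapping a concept to the list of its direct children.
    """
    rng = random.Random(seed)
    concepts = list(taxonomy)
    examples = []
    for parent, children in taxonomy.items():
        for child in children:
            examples.append((parent, child, 1))       # true hypernym pair
            for _ in range(n_negatives):              # corrupt the anchor
                neg = rng.choice(concepts)
                if child not in taxonomy.get(neg, []):
                    examples.append((neg, child, 0))
    return examples

toy = {"science": ["physics", "biology"], "physics": ["optics"], "biology": []}
print(make_self_supervision(toy))
```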
14:00-14:15 |
Enhanced-RCNN: An Efficient Method for Learning Sentence Similarity Shuang Peng (Ant Financial Services Group), Hengbin Cui (Ant Financial Services Group), Niantao Xie (MOE Key Laboratory of Computational Linguistics, Peking University), Sujian Li (MOE Key Laboratory of Computational Linguistics, Peking University), Jiaxing Zhang (Ant Financial Services Group) and Xiaolong Li (Ant Financial Services Group).
AbstractLearning sentence similarity is a fundamental research topic and has been explored using various deep learning methods recently. In this paper, we further propose an enhanced recurrent convolutional neural network (Enhanced-RCNN) model for learning sentence similarity. Compared to the state-of-the-art BERT model, the architecture of our proposed model is far less complex. Experimental results show that our similarity learning method outperforms the baselines and achieves the competitive performance on two real-world paraphrase identification datasets. |
14:15-14:45 |
Generalizing Tensor Decomposition for N-ary Relational Knowledge Bases Yu Liu (Tsinghua University), Quanming Yao (4Paradigm) and Yong Li (Tsinghua University).
AbstractWith the rapid development of knowledge bases (KBs), the link prediction task, which completes KBs with missing facts, has been broadly studied, especially in binary relational KBs (a.k.a. knowledge graphs), with powerful tensor decomposition related methods. However, the ubiquitous n-ary relational KBs with higher-arity relational facts have received less attention, and existing translation-based and neural-network-based approaches have weak expressiveness and high complexity in modeling various relations. Tensor decomposition has not been considered for n-ary relational KBs, while directly extending tensor decomposition related methods of binary relational KBs to the n-ary case does not yield satisfactory results due to exponential model complexity and their strong assumptions on binary relations. To generalize tensor decomposition for n-ary relational KBs, in this work, we propose GETD, a generalized model based on Tucker decomposition and Tensor Ring decomposition. The existing negative sampling technique is also generalized to the n-ary case for GETD. In addition, we theoretically prove that GETD is fully expressive to completely represent any KBs. Extensive evaluations on two representative n-ary relational KB datasets demonstrate the superior performance of GETD, significantly improving the state-of-the-art methods by over 15%. Moreover, GETD further obtains the state-of-the-art results on the benchmark binary relational KB datasets. |
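For the binary-relation special case mentioned above, a Tucker-style scoring function is a single tensor contraction of the core with the subject, relation, and object embeddings. The toy sketch below uses random parameters and omits the tensor-ring factorisation of the core that GETD introduces to keep the n-ary case tractable:

```python
import numpy as np

rng = np.random.default_rng(0)
de, dr = 20, 10                                    # entity / relation embedding dims
core = rng.normal(scale=0.1, size=(de, dr, de))    # Tucker core tensor
E = rng.normal(size=(500, de))                     # entity embeddings
R = rng.normal(size=(30, dr))                      # relation embeddings

def tucker_score(s, r, o):
    """Plausibility score of the triple (s, r, o) under a Tucker decomposition."""
    return float(np.einsum("ijk,i,j,k->", core, E[s], R[r], E[o]))

print(tucker_score(s=3, r=7, o=42))
```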
14:45-15:15 |
Dynamic Graph Convolutional Networks for Entity Linking Junshuang Wu (Beihang University), Richong Zhang (Beihang University), Yongyi Mao (University of Ottawa), Hongyu Guo (National Research Council Canada, Ottawa, Canada), Masoumeh Soflaei Shahrbabak (University of Ottawa) and Jinpeng Huai (Beihang University).
AbstractEntity linking, which maps named entity mentions in a document into the proper entities in a given knowledge base, has shown to significantly benefit from modeling the entity relatedness through Graph Convolutional Networks (GCN). Nevertheless, existing GCN entity linking models fail to take into account the fact that the structured graph for a set of entities not only depends on the contextual information of the given document but also adaptively changes on different aggregation layers of the GCN network, resulting in insufficiency in terms of capturing the relatedness between entities. In this paper, we propose a dynamic GCN architecture to effectively cope with this challenge. The graph structure in our model is dynamically computed and modified during training. Through aggregating information from dynamic linked nodes, our GCN model can collectively identify the entity mappings between the document and the knowledge base and to efficiently capture the topical coherence among various entity mentions in the entire document. Empirical studies on benchmark entity linking data sets confirm the superior performance of our proposed strategy and the benefits of the dynamic graph structure. |
Social Network-B (2)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation Nikhil Goyal (IIT Delhi), Harsh Jain (IIT Delhi) and Sayan Ranu (IIT Delhi).
AbstractGraph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic, scalable technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being 40 times better in quality across a comprehensive set of 10 different metrics. |
14:00-14:30 |
Smaller, Faster & Lighter KNN Graph Constructions Rachid Guerraoui (Ecole Polytechnique Fédérale de Lausanne), Anne-Marie Kermarrec (EPFL, Mediego), Olivier Ruas (Peking University) and Francois Taiani (Univ Rennes, Inria, CNRS, IRISA).
AbstractWe propose GoldFinger, a new compact and fast-to-compute binary representation of datasets to approximate Jaccard’s index. We illustrate the effectiveness of GoldFinger on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that GoldFinger can drastically accelerate a large range of existing KNN algorithms with little to no overhead. As a side effect, we also show that the compact representation of the data protects users’ privacy for free by providing k-anonymity and l-diversity. Our extensive evaluation of the resulting approach on several realistic datasets shows that our approach delivers speedups of up to 78.9% compared to the use of raw data while only incurring a negligible to moderate loss in terms of KNN quality. To convey the practical value of such a scheme, we apply it to item recommendation, and show that the loss in recommendation quality is negligible. |
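The idea of a compact binary sketch for Jaccard's index can be illustrated as follows: each user's item set is hashed into a fixed-width bit vector, and the index is estimated from the popcounts of the AND and OR of two fingerprints. The width and hashing below are arbitrary choices for illustration, not the exact GoldFinger construction:

```python
def fingerprint(items, bits=1024):
    """Hash an item set into a compact bit vector (stored as a Python int)."""
    fp = 0
    for it in items:
        fp |= 1 << (hash(it) % bits)
    return fp

def jaccard_estimate(fp_a, fp_b):
    """Approximate the Jaccard index from the two fingerprints' popcounts."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 0.0

a = fingerprint({"item%d" % i for i in range(100)})
b = fingerprint({"item%d" % i for i in range(50, 150)})
print(jaccard_estimate(a, b))   # true Jaccard of the two sets is 50/150 ~ 0.33
```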
14:30-14:45 |
Negative Purchase Intent Identification in Twitter Samed Atouati (Télécom Paristech), Xiao Lu (BNP Paribas Asset Management) and Mauro Sozio (Télécom ParisTech).
AbstractSocial network users often express their discontent with a product or a service from a company. Such reactions are more pronounced in the aftermath of a corporate scandal, such as a corruption scandal or food poisoning at a chain restaurant. In our work, we focus on identifying negative purchase intent in a tweet, i.e. the intent of a user not to purchase any product or consume any service from a company. We develop a binary classifier for this task, which consists of a generalization of logistic regression leveraging the locality of purchase intent in posts from Twitter. We conduct an extensive experimental evaluation against state-of-the-art approaches on a large collection of tweets, showing the effectiveness of our approach in terms of F1 score. We also provide some preliminary results on which kinds of corporate scandals might affect the purchase intent of customers the most. |
14:45-15:00 |
Using Cliques with Higher-order Spectral Embeddings Improves Graph Visualizations Huda Nassar (Stanford University), David Gleich (Purdue University), Austin Benson (Cornell University), Shweta Jain (University of California Santa Cruz) and Caitlin Kennedy (Purdue University).
AbstractIn the simplest setting, graph visualization is the problem of producing a set of two-dimensional coordinates for each node that meaningfully shows connections and communities in a graph. Among other uses, a meaningful layout often helps interpret the results of network science tasks such as community detection and link prediction. There are several existing graph visualization techniques in the literature that are based on spectral methods, graph embeddings, or optimizing graph distances. Despite the large number of methods, it is still often challenging or extremely time consuming to produce meaningful layouts of graphs with hundreds of thousands of vertices. Existing methods often either fail to produce a visualization in a meaningful time window, or produce a layout colorfully called a ``hairball'', which looks like a filled ellipse with small hairs emerging and does not illustrate any internal structure in the graph. Here, we show that adding higher-order information based on cliques to a classic eigenvector-based graph visualization technique enables it to produce meaningful plots of large graphs. We further evaluate these visualizations on a number of graph visualization metrics and find that our method outperforms existing techniques on a metric that uses random walks to measure local structure. Finally, we show many examples of how our algorithm successfully produces layouts of large networks. |
User Modeling-B (2)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
RLPer: A Reinforcement Learning Model for Personalized Search Jing Yao (Renmin University of China), Zhicheng Dou (Renmin University of China), Jun Xu (Renmin University of China) and Ji-Rong Wen (Renmin University of China).
AbstractPersonalized search improves generic ranking models by taking user interests into consideration and returning more accurate search results to individual users. In recent years, machine learning and deep learning techniques have been successfully applied in personalized search. Most existing personalization models simply regard the search history as a static set of user behaviours and learn fixed ranking strategies based on the recorded data. Though improvements have been observed, these methods ignore the dynamic nature of the search process: search is a sequence of interactions between the search engine and the user. During the search process, the user's interests may dynamically change. It would be more helpful if a personalized search model could track the whole interaction process and update its ranking strategy continuously. In this paper, we propose a reinforcement learning based personalization model, referred to as RLPer, to track the sequential interactions between the users and the search engine with a hierarchical Markov Decision Process (MDP). In RLPer, the model (agent) interacts with the user (environment) by returning a document list in the high-level MDP, while sampling document pairs under each query as training data to update the ranking model at the low level. Experimental results on query logs from a commercial search engine verify that our proposed model can significantly outperform state-of-the-art personalized search models. |
14:00-14:30 |
Personalized Ranking with Importance Sampling Defu Lian (University of Science and Technology of China), Qi Liu (University of Science and Technology of China) and Enhong Chen (University of Science and Technology of China).
AbstractAs the task of predicting a personalized ranking on a set of items, item recommendation has become an important way to address information overload. Optimizing a ranking loss aligns better with the ultimate goal of item recommendation, so many ranking-based methods have been proposed, such as collaborative filtering with the Bayesian Personalized Ranking (BPR) loss and the Weighted Approximate-Rank Pairwise (WARP) loss. However, ranking-based methods cannot consistently beat regression-based models with the gravity regularizer. The key challenge in ranking-based optimization is that it is difficult to make full use of the limited number of negative samples, particularly when they are not very informative. To this end, we propose a new ranking loss based on importance sampling so that more informative negative samples can be better used. We then design a series of negative samplers, from simple to complex, whose negative samples range from less to more informative. With these samplers, the loss function is easy to use and can be optimized by popular solvers. The proposed algorithms are evaluated on five real-world datasets of varying size and difficulty. The results show that they consistently outperform state-of-the-art item recommendation algorithms, and the relative improvements with respect to NDCG@50 are more than 19.2\% on average. Moreover, the loss function is verified to make better use of negative samples and to require fewer negative samples when they are more informative. |
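To make the importance-sampling idea concrete, here is an illustrative sketch only, not the paper's exact loss: sampled negatives are re-weighted by softmax importance weights (score minus proposal log-probability), so negatives the model currently ranks too high contribute more to a pairwise ranking loss. All tensor shapes and the weighting form are our own assumptions.

```python
# Hedged sketch: importance-weighted pairwise ranking loss over sampled negatives.
import torch
import torch.nn.functional as F

def weighted_ranking_loss(pos_score, neg_scores, proposal_logprobs):
    # Importance weights: higher-scored negatives (relative to the sampler) get more weight.
    log_w = neg_scores.detach() - proposal_logprobs
    w = F.softmax(log_w, dim=-1)
    # Pairwise logistic loss between the positive item and each sampled negative.
    pair_loss = F.softplus(neg_scores - pos_score.unsqueeze(-1))
    return (w * pair_loss).sum(dim=-1).mean()

pos  = torch.randn(32)              # scores of positive items for 32 users
neg  = torch.randn(32, 20)          # scores of 20 sampled negatives per user
logq = torch.full((32, 20), -3.0)   # log-probabilities under the negative sampler
print(weighted_ranking_loss(pos, neg, logq))
```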
14:30-15:00 |
Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions Feiyang Pan (Institute of Computing Technology, Chinese Academy of Sciences), Xiang Ao (Institute of Computing Technology, Chinese Academy of Sciences), Pingzhong Tang (Tsinghua University), Min Lu (Tencent), Dapeng Liu (Tencent), Lei Xiao (Tencent) and Qing He (Institute of Computing Technology, CAS).
AbstractIt is often observed that the probabilistic predictions given by a machine learning system can disagree with averaged actual outcomes on specific subsets of data, which is also known as the issue of miscalibration. It is responsible for the unreliability of practical machine learning systems. For example, in an online advertising system, an ad can receive a click-through rate prediction of 0.1 over some population of users where its actual click rate is 0.15. In such cases, the probabilistic predictions have to be fixed before deployment. In this paper, we first introduce an evaluation metric for calibration, coined field-level calibration error, that measures bias in predictions over the input fields that the decision-maker is concerned with. We show that existing post-hoc calibration methods yield limited improvements on the new field-level metric and completely fail to improve other non-calibration metrics such as the AUC score. To this end, we propose Neural Calibration, a simple yet powerful post-hoc calibration method that learns to calibrate by making full use of the field-aware information over the validation set. We present extensive experiments on five large-scale datasets, including a default prediction dataset, an insurance dataset, and three user response prediction tasks for advertising. The results show that Neural Calibration significantly improves upon uncalibrated predictions in common metrics, including the negative log-likelihood, Brier score, AUC, and the field-level calibration error, and consistently outperforms existing methods. |
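A minimal sketch of a field-level calibration error, following the abstract's description: for each value of a chosen input field, compare the average prediction with the average observed outcome, then take the size-weighted average of the absolute gaps. The exact definition in the paper may differ; the field names and toy numbers below are ours.

```python
# Hedged sketch: per-field-value calibration gap, weighted by group size.
from collections import defaultdict

def field_level_calibration_error(preds, labels, field_values):
    groups = defaultdict(lambda: [0.0, 0.0, 0])   # sum(pred), sum(label), count
    for p, y, f in zip(preds, labels, field_values):
        g = groups[f]
        g[0] += p; g[1] += y; g[2] += 1
    n = len(preds)
    return sum(c / n * abs(sp / c - sy / c) for sp, sy, c in groups.values())

preds  = [0.10, 0.12, 0.30, 0.28]
labels = [0,    1,    0,    1   ]
field  = ["US", "US", "UK", "UK"]
print(field_level_calibration_error(preds, labels, field))
```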
15:00-15:30 |
Improving Learning Outcomes with Gaze Tracking and Automatic Question Generation Rohail Syed (University of Michigan), Kevyn Collins-Thompson (University of Michigan), Paul Bennett (Microsoft Research AI), Mengqiu Teng (University of Michigan), Shane Williams (Microsoft Research AI), Wendy Tay (Independent) and Shamsi Iqbal (Microsoft Research AI).
AbstractAs AI technology advances, the opportunity to improve educational outcomes by integrating AI technology with an overall learning experience offers promise. We investigate forward-looking interactive reading experiences that leverage both automatic question generation and attention signals, such as gaze tracking, to improve short- and long-term learning outcomes. We aim to expand the known pedagogical benefits of adjunct questions to more general reading scenarios, by investigating the benefits of adjunct questions generated only after, and based on, the participant's gaze attention behavior when reading an article. We compare manually-written and Automatic Question Generation (AQG) as potential question sources. We further investigate gaze and reading patterns indicative of low vs high learning in both short- and long-term scenarios (one-week followup). We show AQG-generated adjunct questions have promise as a way to scale to a wide variety of reading material where the cost of manually curating questions may be prohibitive. |
Research Tracks (6)
Web Mining-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Discovering Mathematical Objects of Interest – A Study of Mathematical Notations André Greiner-Petter (University of Wuppertal), Moritz Schubotz (University of Wuppertal), Fabian Müller (FIZ Karlsruhe), Corinna Breitinger (University of Konstanz), Howard Cohl (National Institute of Standards and Technology), Akiko Aizawa (National Institute of Informatics) and Bela Gipp (University of Wuppertal).
AbstractMathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study of the distributions of mathematical notation in two large scientific corpora: the open-access arXiv (2.5B mathematical objects) and the mathematical reviewing service zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use cases, for example, assisting semantic extraction systems, improving scientific search engines, and facilitating specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (i.e., linking $P_{n}^{(\alpha, \beta)}\!\left(x\right)$ with `Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as a first contribution to math recommendation systems. To expedite future research projects, we make our source code and data available. |
16:30-17:00 |
eDarkFind: Unsupervised Multi-view Learning for Sybil Account Detection Ramnath Kumar (BITS Pilani Hyderabad Campus), Shweta Yadav (Knoesis), Raminta Daniulaityte (Wright State University), Francois Lamy (Mahidol University), Krishnaprasad Thirunarayan (Knoesis), Usha Lokala (Knoesis) and Amit Sheth (Knoesis).
AbstractDarknet crypto markets are online marketplaces that use cryptocurrencies (e.g., Bitcoin, Monero) and advanced encryption techniques to offer anonymity to vendors and consumers trading illegal goods or services. The exact volume of substances advertised and sold through these crypto markets is difficult to assess, at least partially because vendors tend to maintain multiple accounts (or Sybil accounts) within and across different crypto markets. Linking these different accounts allows us to accurately evaluate the volume of substances advertised across the different crypto markets by each vendor. In this paper, we present a multi-view unsupervised framework (eDarkFind) that helps model vendor characteristics and facilitates Sybil account detection. We employ a multi-view learning paradigm to generalize and improve performance by exploiting diverse views from multiple rich sources such as BERT, stylometric, and location representations. Our model is further tailored to take advantage of domain-specific knowledge such as the Drug Abuse Ontology to take the substance information into consideration. We performed extensive experiments and demonstrated that the multiple views obtained from diverse sources can be effective in linking Sybil accounts. Our proposed eDarkFind model achieves an accuracy of 98% on three real-world datasets, which shows the generality of the approach. |
17:00-17:30 |
#Outage: Detecting Power and Communication Outages from Social Network Udit Paul (University of California, Santa Barbara), Alexander Ermakov (University of California, Santa Barbara), Michael Nekrasov (University of California, Santa Barbara), Vivek Adarsh (University of California, Santa Barbara) and Elizabeth Belding (University of California, Santa Barbara).
AbstractNatural disasters are increasing worldwide at an alarming rate. To aid relief operations during and after disasters, humanitarian organizations rely on various types of situational information, such as missing, trapped or injured people and damaged infrastructure in an area. Crucial and timely identification of infrastructure and utility damage is critical to properly plan and execute search and rescue operations. However, in the wake of natural disasters, real-time identification of this information becomes challenging. This research investigates the use of tweets posted on the Twitter social media platform to detect power and communication outages during natural disasters. We first curate a data set of 18,097 tweets based on domain-specific keywords obtained using Latent Dirichlet Allocation. We annotate the gathered data set to separate the tweets into different types of outage-related events: power outage, communication outage, and combined power-communication outage. We analyze the tweets to identify information such as popular words, length of words, and hashtags, as well as sentiments associated with tweets in these outage-related categories. Furthermore, we apply machine learning algorithms to classify these tweets into their respective categories. Our results show that simple classifiers such as the boosting algorithm are able to separate outage-related tweets from unrelated tweets with close to 100% F1-score. Additionally, we observe that the transfer learning model BERT is able to classify different categories of outage-related tweets with close to 90% accuracy in less than 90 seconds of training and testing time, demonstrating that tweets can be mined in real time to assist first responders during natural disasters. |
17:30-17:45 |
Voice-based Reformulation of Community Answers Simone Filice (Amazon), Nachshon Cohen (Amazon) and David Carmel (Amazon).
AbstractCommunity Question Answering (CQA) websites, such as Stack Exchange or Quora, allow users to freely ask questions and obtain answers from other users, i.e., the community. Personal assistants, such as Amazon Alexa or Google Home, can also exploit CQA data to answer a broader range of questions and increase customer engagement. However, voice-based interaction poses new challenges to the Question Answering scenario. Even assuming that we are able to retrieve a previously asked question that perfectly matches the user's query, we cannot simply read its answer to the user. A major limitation is the answer length: reading these answers to the user is cumbersome and boring. Furthermore, many answers contain non-voice-friendly parts, such as images or URLs. In this paper, we define the Answer Reformulation task and propose a novel solution to automatically reformulate a community-provided answer to make it suitable for voice interaction. Results on a manually annotated dataset extracted from Stack Exchange show that our models improve over strong baselines. |
17:45-18:00 |
VRoC: Variational Autoencoder-aided Multi-Task Rumor Classifier Based on Text Mingxi Cheng (University of Southern California), Shahin Nazarian (University of Southern California) and Paul Bogdan (University of Southern California).
AbstractSocial media have evolved to be popular and applicable to almost every aspect of our lives. The convenience of posting online not only benefits individual users but also fosters various fast-spreading rumors. The rapid and wide spread of rumors can cause persistent adverse impacts, and researchers have therefore put great effort into reducing the negative impacts of rumors. A rumor classification system is designed to detect, track, and verify rumors on social media. It typically includes four components, namely a rumor detector, a rumor tracker, a stance classifier, and a veracity classifier. Prior work has tackled some of these components either individually or jointly, but an efficient, high-performance framework that realizes all four functions is still needed. To address this, we propose VRoC, a tweet-level variational autoencoder based rumor classification system. VRoC includes a co-train engine that trains variational autoencoders (VAEs) and the rumor classification components together, which helps the VAEs tune their latent representations to be classifier-friendly. We also show that VRoC is able to classify unseen rumors with high accuracy. On the PHEME dataset, VRoC consistently outperforms several state-of-the-art techniques, on both observed and unobserved rumors, by up to 26.9% in terms of macro-F1 score. |
Social Network-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Efficient Maximal Balanced Clique Enumeration in Signed Networks Zi Chen (ECNU), Long Yuan (Nanjing University of Science and Technology), Xuemin Lin (UNSW), Lu Qin (UTS) and Jianye Yang (Hunan University).
AbstractThe clique is one of the most fundamental models for cohesive subgraph mining in network analysis. Existing clique models mainly focus on unsigned networks. In the real world, however, many applications are modeled as signed networks with positive and negative edges. As signed networks have properties different from unsigned networks, the existing clique model is inapplicable to signed networks. Motivated by this, we propose the balanced clique model, which incorporates structural balance theory, the most fundamental and dominant theory for signed networks, and study the maximal balanced clique enumeration problem, which computes all the maximal balanced cliques in a given signed network. We show that the maximal balanced clique enumeration problem is NP-hard. A straightforward solution is to treat the signed network as two unsigned networks and leverage off-the-shelf techniques for unsigned networks. However, such a solution is inefficient for large signed networks. To address this problem, we first propose a new maximal balanced clique enumeration algorithm that exploits the unique properties of signed networks. Based on the new algorithm, we devise two optimization strategies to further improve the efficiency of the enumeration. We conduct extensive experiments on large real and synthetic datasets. The experimental results demonstrate the efficiency, effectiveness, and scalability of our proposed algorithms. |
16:30-17:00 |
Clustering in graphs and hypergraphs with categorical edge labels Ilya Amburg (Cornell University), Nate Veldt (Cornell University) and Austin Benson (Cornell University).
AbstractModern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for richer representations that can capture, for instance, edges of different types as well as so-called "higher-order interactions" that involve more than two nodes at a time, and in turn for methods that can meaningfully analyze data with such rich structure. Here, we develop a scalable computational framework for the fundamental problem of clustering graphs with categorical edge labels, targeting the setting where clusters correspond to groups of nodes that tend to participate in the same type or category of interaction. Our approach seamlessly generalizes to hypergraphs, enabling analysis of higher-order interactions with categorical hyperedges, and our objective functions can be optimized in polynomial time in the special case of two categorical labels. Although minimizing our objective becomes NP-hard in the multi-label case, we develop effective approximation algorithms based on linear programming and multiway-cut techniques. We show that our algorithms readily outperform competitive baselines on both synthetic and real-world data. |
17:00-17:30 |
Flowless: Extracting Densest Subgraphs Without Flow Computations Digvijay Boob (Georgia Institute of Technology), Yu Gao (Georgia Institute of Technology), Richard Peng (Georgia Institute of Technology), Saurabh Sawlani (Georgia Institute of Technology), Charalampos Tsourakakis (Boston University), Di Wang (Georgia Institute of Technology) and Junxing Wang (CMU).
AbstractThe problem of finding dense components of a graph is a major primitive in graph mining and data analysis. The densest subgraph problem (DSP), which asks for a subgraph with maximum average degree, forms a basic primitive in dense subgraph discovery with applications ranging from community detection to unsupervised discovery of biological network modules [gionis2015dense]. The DSP is exactly solvable in polynomial time using maximum flows [goldberg1984finding, gallo1989fast, khuller2009finding]. Due to the high computational cost of maximum flows, Charikar's greedy approximation algorithm is usually preferred in practice because of its linear time and linear space complexity [asahiro2000greedily, charikar2000greedy]. It constitutes a key algorithmic idea in scalable solutions for large-scale dynamic graphs [bahmani2012densest, bhattacharya2015space]. However, its output density can be a factor of 2 off the optimal solution. In this paper we design Greedy++, an iterative peeling algorithm that significantly improves upon the performance of Charikar's greedy algorithm. Our iterative greedy algorithm is able to output near-optimal and optimal solutions fast by adding a few more passes to Charikar's greedy algorithm. Furthermore, Greedy++ is more robust against structural heterogeneities (e.g., skewed degree distributions) in real-world datasets. An additional property of our algorithm is that it is able to assess quickly, without computing maximum flows, whether Charikar's approximation quality on a given graph instance is closer to the worst-case theoretical guarantee of $1/2$ or to optimality. We also demonstrate that our method has a significant efficiency advantage over the maximum-flow-based exact optimal algorithm. For example, our algorithm achieves $\sim$145$\times$ speedup on average across a variety of real-world graphs while finding subgraphs that are at least 90\% as dense as the densest subgraph. |
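For context, here is a sketch of Charikar's greedy peeling, the well-known baseline that Greedy++ iterates on: repeatedly remove a minimum-degree vertex and keep the intermediate subgraph with the best density (edges divided by vertices). The Greedy++ refinement of carrying vertex "loads" across passes is not shown; the toy graph is ours.

```python
# Hedged sketch: Charikar-style greedy peeling for the densest subgraph.
import heapq

def charikar_peel(adj):
    """adj: dict vertex -> set of neighbours (undirected, no self-loops)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    m = sum(deg.values()) // 2
    n = len(adj)
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    best_density, best_cut = m / n, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                          # stale heap entry
        removed.add(v); order.append(v)
        m -= deg[v]; n -= 1                   # drop v and its remaining edges
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
        if n and m / n > best_density:
            best_density, best_cut = m / n, len(order)
    dense = set(adj) - set(order[:best_cut])
    return dense, best_density

# 4-clique {1,2,3,4} plus a pendant vertex 5: densest part is the clique (6/4 = 1.5).
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(charikar_peel(adj))
```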
17:30-17:45 |
How Much and When Do We Need Higher-order Information in Hypergraphs? A Case Study on Hyperedge Prediction Se-eun Yoon (Korea Advanced Institute of Science and Technology), Hyungseok Song (Korea Advanced Institute of Science and Technology), Kijung Shin (Korea Advanced Institute of Science and Technology) and Yung Yi (Korea Advanced Institute of Science and Technology).
AbstractHypergraphs provide a natural way of representing interactions that occur in groups. Different downstream tasks and computational convenience motivate an extensive array of prior work to adopt some form of abstraction and simplification of complex higher-order group interactions in hypergraphs, showing the value of using hypergraphs in many graph tasks. However, the following question has yet to be addressed: How much abstraction of group interactions is sufficient in solving a hypergraph task, and how different such results become across different datasets? This question, if properly answered, provides a useful engineering guideline on how to appropriately trade off between complexity in representation of higher-order group interactions and accuracy of solving a task involving hypergraphs. To this end, we propose a method of incrementally representing group interactions using a notion of n-projected graph whose accumulation contains the information on up to n-way interactions, and quantify the accuracy of solving a given task as n grows for various datasets. As a downstream task, we consider hyperedge prediction, an extension of link prediction, which, we believe, is a canonical task for evaluating graph models. Through extensive experiments on 15 real-world datasets, we draw the following messages: (a) Diminishing returns: small n is enough to achieve accuracy comparable with near-perfect approximations, (b) Troubleshooter: as the task becomes more challenging, higher n brings more benefit, and (c) Irreducibility: datasets whose pairwise interactions do not tell much about higher-order interactions lose much accuracy when reduced to pairwise abstractions. |
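As a hedged illustration of the simplest abstraction discussed above, the sketch below builds a pairwise (2-way) projection of a hypergraph by expanding every hyperedge into the node pairs it contains, weighted by co-occurrence counts. The paper's n-projected graphs generalize this idea to record up to n-way interactions; the exact construction there may differ, and the toy hyperedges are ours.

```python
# Hedged sketch: pairwise projection (clique expansion) of a hypergraph.
from itertools import combinations
from collections import Counter

def pairwise_projection(hyperedges):
    """Return (u, v) -> number of hyperedges containing both u and v."""
    weights = Counter()
    for he in hyperedges:
        for u, v in combinations(sorted(he), 2):
            weights[(u, v)] += 1
    return weights

hyperedges = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "d"}]
print(pairwise_projection(hyperedges))   # ('b','c') appears in two hyperedges
```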
17:45-18:00 |
Deconstruct Densest Subgraphs Lijun Chang (The University of Sydney) and Miao Qiao (The University of Auckland).
AbstractIn this paper, we aim to understand the distribution of the densest subgraphs of a given graph under the density notion of average-degree. We show that the structures, the relationships and the distributions of all the densest subgraphs of a graph G can be encoded in O(L) space in an index called the ds-Index. Here L denotes the maximum output size of a densest subgraph of G. More importantly, ds-Index can report all the minimal densest subgraphs of G collectively in O(L) time and can enumerate all the densest subgraphs of G with an O(L) delay. Besides, the construction of ds-Index costs no more than finding a single densest subgraph using the state-of-the-art approach. Our empirical study shows that for web-scale graphs with billions of edges, the ds-Index can be constructed in several minutes on an ordinary commercial machine. |
User Modeling-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Frozen Binomials on the Web: Word Ordering and Language Conventions in Online Text Katherine Van Koevering (Cornell University), Austin Benson (Cornell University) and Jon Kleinberg (Cornell University).
AbstractThere is inherent information captured in the order in which we write words in a list. The ordering of binomials - lists of two words separated by "and" or "or" - has been studied for more than a century. These binomials are common across many areas of speech, in both formal and informal text. Over the last century, numerous explanations have been given for the order people use in these binomials, from differences in semantics to differences in phonology. These rules primarily describe "frozen" binomials that exist in exactly one ordering, and they have lacked large-scale trials to determine their efficacy. Text in online social media such as Reddit provides a unique opportunity to study these lists in the context of informal text at a very large scale. In this work, we expand the view of binomials to include a large-scale, quantitative analysis of both frozen and non-frozen binomials. Using this data, we demonstrate that most previously proposed rules are ineffective at predicting binomial ordering. By tracking the order of these binomials across time and communities, we are able to establish additional, unexplored dimensions central to these predictions and demonstrate the global structure of binomials across communities. Expanding beyond the question of individual binomials, we then explore the global structure of binomials in various communities, establishing a new model for these lists and analyzing this structure for non-frozen and frozen binomials. Additionally, a novel analysis of trinomials - lists of length three - suggests that none of the analysis of binomials applies in these cases. Finally, we demonstrate how large data sets gleaned from the web can be used in conjunction with older theories and work to expand and improve on old questions. |
16:30-17:00 |
Condition Aware and Revise Transformer for Question Answering Xinyan Zhao (University of Science and Technology of China), Feng Xiao (University of Science and Technology of China), Haoming Zhong (WeBank.com), Jun Yao (WeBank.com) and Huanhuan Chen (University of Science and Technology of China).
AbstractThe study of question answering has received increasing attention in recent years. This work focuses on providing an answer that is compatible with both user intent and the conditioning information corresponding to the question, such as delivery status and stock information in e-commerce. However, these conditions may be wrong or incomplete in real-world applications. Although existing question answering systems have considered external information, such as categorical attributes and triples in knowledge bases, they all assume that the external information is correct and complete. To alleviate the effect of defective condition values, this paper proposes the condition aware and revise Transformer (CAR-Transformer). CAR-Transformer (1) revises each condition value based on the whole conversation and the original condition values, and (2) encodes the revised conditions and utilizes the condition embeddings to select an answer. Experimental results on a real-world customer service dataset demonstrate that CAR-Transformer can still select an appropriate reply when conditions corresponding to the question have wrong or missing values, and it substantially outperforms baseline models on automatic and human evaluations. The proposed CAR-Transformer can be extended to other NLP tasks that need to consider conditioning information. |
17:00-17:30 |
Conversational Contextual Bandit: Algorithm and Application Xiaoying Zhang (The Chinese University of Hong Kong), Hong Xie (College of Computer Science, Chongqing University), Hang Li (Bytedance Inc.) and John C.S. Lui (The Chinese University of Hong Kong).
AbstractContextual bandit algorithms provide principled online learning solutions to balance the exploration-exploitation trade-off in various applications such as recommender systems. However, the learning speed of traditional contextual bandit algorithms is often slow due to the need for extensive exploration. This poses a critical issue in applications like recommender systems, since users may need to provide feedback on many items they are not interested in. To accelerate learning, we generalize the contextual bandit to the conversational contextual bandit. The conversational contextual bandit leverages not only behavioral feedback on arms (e.g., articles in news recommendation), but also occasional conversational feedback on key-terms from the user. Here, a key-term can relate to a subset of arms, for example, a category of articles in news recommendation. We design a new bandit algorithm, which we call the Conversational UCB algorithm (ConUCB), to address two challenges in the conversational contextual bandit: (1) which key-terms to select for conversation, and (2) how to leverage conversational feedback to accelerate bandit learning. We theoretically prove that ConUCB can achieve a smaller regret upper bound than the traditional contextual bandit algorithm LinUCB, which implies a faster learning speed. Experiments on synthetic data, as well as real datasets from Yelp and Toutiao, demonstrate the efficacy of the ConUCB algorithm. |
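For readers new to the baseline being compared against, the following is a sketch of a single step of a shared-parameter LinUCB variant: pick the arm maximizing predicted reward plus an exploration bonus, then update the design matrix and response vector. ConUCB additionally elicits key-term feedback to sharpen these confidence bounds; that part is not shown, and alpha and the toy dimensions are our own choices.

```python
# Hedged sketch: one LinUCB selection and update step (non-conversational baseline).
import numpy as np

def linucb_select(A, b, arm_features, alpha=1.0):
    """A: d x d design matrix, b: d-vector, arm_features: list of d-vectors."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                                 # ridge-style estimate of reward weights
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in arm_features]
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    return A + np.outer(x, x), b + reward * x

d = 5
A, b = np.eye(d), np.zeros(d)
arms = [np.random.rand(d) for _ in range(10)]
chosen = linucb_select(A, b, arms)
A, b = linucb_update(A, b, arms[chosen], reward=1.0)
print("chose arm", chosen)
```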
17:30-17:45 |
IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems Liu Yang (Google; University of Massachusetts Amherst), Minghui Qiu (Alibaba), Chen Qu (University of Massachusetts Amherst), Cen Chen (Ant Financial Services Group), Jiafeng Guo (Institute of Computing Technology), Yongfeng Zhang (Rutgers University), Bruce Croft (University of Massachusetts Amherst) and Haiqing Chen (Alibaba Group).
AbstractPersonal assistant systems, such as Apple Siri, Google Now, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used. Understanding user intent such as clarification questions, potential answers and user feedback in information-seeking conversations is critical for retrieving good responses. In this paper, we analyze user intent patterns in information-seeking conversations and propose an intent-aware neural response ranking model IART, which refers to ''Intent-Aware Ranking with Transformers''. IART is built on top of the integration of user intent modeling and the recent breakthroughs in language representation learning with the Transformer architecture that relies entirely on a self-attention mechanism instead of recurrent nets. It incorporates intent-aware utterance attention to derive an importance weighting scheme of utterances in conversation context with the aim of better conversation history understanding. We conduct extensive experiments with three information-seeking conversation data sets including both standard benchmarks and commercial data. Our proposed model outperforms all baseline methods with respect to a variety of metrics. We also perform case studies and analysis of learned user intent and its impact on response ranking in information-seeking conversations to provide interpretation of results. We will open source the code of our model. |
17:45-18:00 |
PARS: Peers-aware Recommender System Huiqiang Mao (Alibaba Group), Yanzhi Li (City University of Hong Kong), Chenliang Li (Wuhan University), Di Chen (Alibaba Group), Xiaoqing Wang (Alibaba Group) and Yuming Deng (Alibaba Group).
AbstractThe presence or absence of one item in a recommendation list affects the demand for other items, because customers are often willing to switch to other items if their most preferred items are not available. This cross-item influence, called “peer effects”, has been largely ignored in the literature. In this paper, we develop a peers-aware recommender system, named PARS. We apply a ranking-based choice model to capture the cross-item influence and solve the resulting MaxMin problem with a decomposition algorithm. The MaxMin model solves for the recommendation decision while simultaneously estimating users' preferences towards the items, which yields high-quality recommendations that are robust to input data variation, as our theoretical analysis shows. Experimental results illustrate that PARS outperforms several frequently used methods in practice. An online evaluation in a flash sales scenario at Taobao also shows that PARS delivers significant improvements in terms of both conversion rates and user value. |
Crowdsourcing (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Crowd Teaching with Imperfect Labels Yao Zhou (University of Illinois at Urbana-Champaign), Arun Reddy Nelakurthi (Samsung Electronics), Ross Maciejewski (Arizona State University), Wei Fan (Tencent America) and Jingrui He (University of Illinois at Urbana-Champaign).
AbstractThe need for annotated labels to train machine learning models led to a surge in crowdsourcing. Given a noisy labeled set, how can we leverage the label information obtained from amateur crowd workers to denoise the data? Also, is it possible to teach the crowd workers using the noisy labeled set and improve their performance? In this paper, we answer both questions via a novel adaptive and interactive teaching framework, which uses visual explanations to simultaneously teach and gauge the confidence level of the crowd workers. In particular, the teacher performs teaching using an empirical risk minimizer learned from a noisy labeled set; the workers are assumed to have a forgetting behavior and their learning rate depends on the interpretation difficulty of the teaching item. Furthermore, we also show that the empirical risk minimizer used by the teacher is a reliable and realistic substitute for the unknown target concept by utilizing the unbiased surrogate loss. Finally, the performance of the proposed framework is demonstrated through experiments on multiple real-world image and text data sets. |
16:30-17:00 |
Modeling and Aggregation of Complex Annotations via Annotation Distances Alexander Braylan (The University of Texas at Austin) and Matthew Lease (The University of Texas at Austin).
AbstractModeling annotators and their labels is useful for ensuring data quality. However, while many models have been proposed to handle binary or categorical labels, prior methods do not generalize to complex annotation tasks (e.g., open-ended text, multivariate, or structured responses) without devising new models for each specific task. To obviate the need for task-specific modeling, we propose to model distances between labels, rather than the labels themselves. Our models are agnostic to the distance function; we leave it to the requesters to specify an appropriate distance function for their given annotation task. We propose three models, including a Bayesian hierarchical extension of multidimensional scaling. Results show the generality and effectiveness of our models across four diverse, complex annotation tasks: sequence labeling, translation, syntactic parsing, and element ranking. |
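A hedged toy sketch of the distance-based view: given any requester-supplied distance function, one simple aggregate is the "medoid" annotation, the one closest on average to all other annotations for the item. The paper's Bayesian models go well beyond this, but the distance function plays the same role; the token-overlap distance and example labels below are our own.

```python
# Hedged sketch: medoid-style aggregation of complex annotations via a distance function.
def medoid_label(annotations, distance):
    def avg_dist(a):
        return sum(distance(a, other) for other in annotations) / len(annotations)
    return min(annotations, key=avg_dist)

def token_distance(a, b):
    # Crude Jaccard-style distance over word tokens, for open-ended text labels.
    ta, tb = set(a.split()), set(b.split())
    return 1 - len(ta & tb) / len(ta | tb)

labels = ["the cat sat on the mat", "a cat sat on a mat", "dogs bark loudly"]
print(medoid_label(labels, token_distance))
```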
17:00-17:30 |
Inferring Passengers’ Interactive Choices on Public Transits via MA-AL: Multi-Agent Apprenticeship Learning Mingzhou Yang (Xi'an Jiaotong University), Yanhua Li (Worcester Polytechnic Institute), Xun Zhou (The University of Iowa), Hui Lu (Guangzhou University), Zhihong Tian (Guangzhou University) and Jun Luo (Lenovo Research, Hong Kong).
AbstractPublic transit systems, such as subway lines and buses, offer affordable ride-sharing services and reduce road network traffic. Extracting people's preferences from their public transit choices is non-trivial. When people travel by public transit, they make sequences of transit choices, and their rewards are usually influenced by other people's choices, so this process can be seen as a Markov Game (MG). In this paper, we make the first effort to model travelers' preferences in making transit choices using MGs. Based on the observation that passengers rarely change their policies, we propose novel algorithms to extract the reward functions from the observed, deterministic equilibrium joint policy of all agents in a general-sum MG in order to infer travelers' preferences. First, we assume we have access to the entire joint policy. We characterize the set of all reward functions for which the given joint policy is a Nash equilibrium policy. To remove the degeneracy of the solution, we then pick reward functions that maximize the deviation from the observed policy to the sub-optimal policy of each agent. This results in an efficiently solvable linear programming formulation of the multi-agent inverse reinforcement learning (MA-IRL) problem. We then deal with the case where we have access to the equilibrium joint policy only through an actual trajectory, and propose an iterative algorithm inspired by single-agent apprenticeship learning algorithms and the cyclic coordinate descent approach. We validate our algorithms on a simple discrete problem. Finally, under the assumption that the actual joint policy is a Nash equilibrium and the passengers' reward functions are linear in the decision-making features, we use the proposed algorithms on a unique real-world dataset (from Shenzhen, China) to extract passengers' preferences. |
Health (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Leveraging Sentiment Distributions to Distinguish Figurative From Literal Health Reports on Twitter Rhys Biddle (University of Technology, Sydney), Aditya Joshi (CSIRO), Shaowu Liu (University of Technology, Sydney), Cecile Paris (CSIRO) and Guandong Xu (University of Technology, Sydney).
AbstractHarnessing data from social media to monitor health events is a promising avenue for public health surveillance. A key step is the detection of reports of a disease (referred to as 'health mention classification') amongst tweets that mention disease words. Prior work shows that figurative usage of disease words can be challenging for health mention classification. Since the experience of a disease is associated with a negative sentiment, we present a method that utilises sentiment information to improve health mention classification. Specifically, our classifier for health mention classification combines pre-trained contextual word representations with sentiment distributions of words in the tweet. For our experiments, we extend a benchmark dataset of tweets for health mention classification by adding over 14k manually annotated tweets across existing and new diseases. We additionally annotate each tweet with a label that indicates whether the disease words are used in a figurative sense. Our classifier outperforms current SOTA approaches in detecting both health-related and figurative tweets that mention disease words. We also show that tweets containing disease words are used figuratively more often than in a health-related context, which proves challenging for classifiers targeting health-related tweets. |
16:30-17:00 |
Domain-Guided Task Decomposition with Self-Training for Detecting Personal Events in Social Media Payam Karisani (Emory University), Eugene Agichtein (Emory University) and Joyce Ho (Emory University).
AbstractMining social media content for tasks such as detecting personal experiences or events suffers from lexical sparsity, insufficient training data, and inventive lexicons. To reduce the burden of creating extensive labeled data and improve classification performance, we propose to perform these tasks in two steps: 1. decomposing the task into domain-specific sub-tasks by identifying key concepts, thus utilizing human domain understanding; and 2. combining the results of learners for each key concept using co-training to reduce the requirements for labeled training data. We empirically show the effectiveness and generality of our approach, Co-Decomp, using three representative social media mining tasks, namely Personal Health Mention detection, Crisis Report detection, and Adverse Drug Reaction monitoring. The experiments show that our model is able to outperform state-of-the-art text classification models--including those using the recently introduced BERT model--when small amounts of training data are available. |
17:00-17:15 |
Bursts of Activity: Temporal Patterns of Help-Seeking and Support in Online Mental Health Forums Taisa Kushner (University of Colorado - Boulder) and Amit Sharma (Microsoft).
AbstractRecent years have seen a rise in technology-based platforms for mental health, in particular social media platforms which seek to provide peer-to-peer support to individuals suffering from mental distress. Studies of the impact of these platforms have historically tracked interactions on a single post thread, or longitudinally over months or years of usage; however, it is often not clear how an individual's mental health changes across this time. We show a unique characteristic of activity on one such mental health platform, Talklife: people engage with this platform in ``bursts'' and ``breaks'' of activity, similar to online search behavior for health. We formalize the notion of bursts based on the median activity of each user and propose bursts as a natural unit of analysis for tracking and understanding change in psychosocial well-being in an online mental health community. We then study the characteristics of a burst that lead to positive outcomes for an individual, based on a definition of positive cognitive change. We find that users who undergo a positive cognitive change over a burst of activity are more likely to engage with others at a higher rate through posting replies to others' posts, participate in more complex support and less simple support when replying to others, and have increased post diversity while maintaining similarity between the categories in which they post replies and original posts. We also study how a user's behavior changes before and after they experience a moment of change. Lastly, features that correlate with users experiencing moments of cognitive change are robustly tested against self-reported changes in mood to determine two actionable suggestions for improving user experience: persistence within a burst, and giving complex emotional support to others. This work has implications for how we think about user interactions with online mental health platforms, user churn, and retention. |
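A hedged illustration of burst segmentation: the paper defines bursts with respect to each user's median activity, and one simple instantiation of that idea is to start a new burst whenever the gap since the previous event exceeds the user's median inter-event gap. The rule and toy timestamps below are our own, not the paper's exact formalization.

```python
# Hedged sketch: splitting a user's activity timestamps into bursts via a median-gap rule.
from statistics import median

def split_into_bursts(timestamps):
    ts = sorted(timestamps)
    if len(ts) < 3:
        return [ts]
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    threshold = median(gaps)                 # per-user median inter-event gap
    bursts, current = [], [ts[0]]
    for prev, t in zip(ts, ts[1:]):
        if t - prev > threshold:
            bursts.append(current)
            current = []
        current.append(t)
    bursts.append(current)
    return bursts

# Activity times (in hours) with two visible clusters of posts.
print(split_into_bursts([0, 1, 2, 50, 51, 52, 53]))
```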
17:15-17:45 |
DyCRS: Dynamic Interpretable Postoperative Complication Risk Scoring Wen Wang (Carnegie Mellon University), Han Zhao (Carnegie Mellon University), Honglei Zhuang (University of Illinois at Urbana-Champaign), Rema Padman (Carnegie Mellon University) and Nirav Shah (NorthShore University HealthSystem and University of Chicago Pritzker School of Medicine).
AbstractEarly identification of patients at risk for postoperative complications can facilitate timely workups and treatments and improve health outcomes. Currently, a widely used online surgical risk calculator developed by the American College of Surgeons (ACS) uses patients' static features, e.g. gender and age, to assess the risk of postoperative complications. However, the most crucial signals reflecting the actual postoperative physical condition of patients are usually real-time dynamic signals, including the vital signs of patients (e.g., heart rate, blood pressure) collected from postoperative monitoring. In this paper, we develop a dynamic postoperative complication risk scoring framework (DyCRS) to detect "at-risk" patients in real time based on postoperative sequential vital signs and static features. DyCRS is based on adaptations of the Hidden Markov Model (HMM) that capture hidden states as well as observable states to generate a real-time, probabilistic complication risk score. Evaluating our model using electronic health record (EHR) data on elective colectomy surgery from a major health system, we show that DyCRS significantly outperforms the state-of-the-art ACS calculator and real-time predictors, with a 50.16% gain in area under the precision-recall curve (AUCPRC) on average in terms of detection effectiveness. In terms of earliness, DyCRS can predict complications 15 hours 55 minutes earlier on average than clinicians' diagnoses, with a recall of 60% and a precision of 55%. Furthermore, DyCRS can extract interpretable patient stages, which are consistent with previous medical studies of postoperative complications. We believe that our contributions demonstrate significant promise for developing a more accurate, robust and interpretable postoperative complication risk scoring system, which can benefit the more than 50 million annual surgeries in the US by substantially lowering adverse events and healthcare costs. |
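A minimal sketch of the HMM machinery such a score could rest on (toy parameters of our own, not the fitted DyCRS model): a forward pass over discretized vital-sign observations, reporting the posterior probability of a hidden "at-risk" state after each new reading.

```python
# Hedged sketch: HMM forward filtering over discretized vital-sign readings.
import numpy as np

def forward_risk(pi, A, B, obs, risk_state=1):
    """pi: initial state probs, A: transition matrix, B: emission matrix,
    obs: sequence of observation indices. Returns P(at-risk | readings so far)."""
    alpha = pi * B[:, obs[0]]
    scores = []
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()                 # normalize to a filtered posterior
        scores.append(alpha[risk_state])
    return scores

pi = np.array([0.9, 0.1])                    # states: 0 = stable, 1 = at-risk
A  = np.array([[0.95, 0.05], [0.20, 0.80]])  # transition probabilities
B  = np.array([[0.7, 0.2, 0.1],              # emissions: normal / elevated / abnormal
               [0.1, 0.3, 0.6]])
print(forward_risk(pi, A, B, obs=[0, 1, 2, 2]))  # risk rises as readings worsen
```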
Economics (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Designing for Trust: A Behavioral Framework for Sharing Economy Platforms Natã M. Barbosa (University of Illinois at Urbana-Champaign), Emily Sun (Airbnb Inc.), Judd Antin (Airbnb Inc.) and Paolo Parigi (Airbnb Inc.).
AbstractTrust is a fundamental prerequisite in the growth and sustainability of sharing economy platforms. Many of such platforms rely on transactions that require trust actions to take place, such as entering a stranger's car or sleeping at a stranger's place. For this reason, understanding, measuring, and tracking trust can be of great benefit to such platforms, enabling them to identify trust behaviors, both online and offline, and identify groups which may benefit from trust-building interventions. In this work, we present the design and evaluation of a behavioral framework to measure a user's propensity to trust others on a sharing economy platform. We conducted an online experiment with 4,499 Airbnb users in the form of an investment game in order to capture users' propensity to trust other users on Airbnb. Then, we used the experimental data to generate both explanatory and predictive models of trust propensity. Our contribution is a framework that can be used to measure trust propensity in sharing economy platforms like Airbnb via online and offline signals. We discuss which affordances need to be in place so that sharing economy platforms can get signals of trust, in addition to how such a framework can be used to inform design around trust in the short and long term. |
16:30-17:00 |
HTML: Hierarchical Transformer-based Multi-task Learning for Volatility Prediction Linyi Yang (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland), Riuhai Dong (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland), Tin Lok James Ng (School of Mathematics and Applied Statistics, University of Wollongong, Australia) and Barry Smyth (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland).
AbstractThe volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction one – using historical pricing as a guide to future risk forecasting – recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. This includes a comprehensive comparison to a variety of baselines, which demonstrates very significant improvements in prediction accuracy, in the range 17% - 49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy. |
17:00-17:15 |
One Picture Is Worth a Thousand Words? The Pricing Power of Images in e-Commerce Christof Naumzik (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
AbstractIn e-commerce, product presentations, and particularly images, are known to provide important information for user decision-making, and yet the relationship between images and prices has not been studied. To close this research gap, we suggest a tailored web mining framework, since one must quantify the relative contribution of image content in describing prices ceteris paribus. That is, one must account for the fact that such images inherently depict heterogeneous products. In order to isolate the pricing power of image content, we suggest a three-stage framework involving deep learning and statistical inference. Our empirical evaluation draws upon a comprehensive dataset of more than 20,000 real estate listings from Craigslist. We find that the image content describes a large portion of the variance in prices, even when controlling for location and common characteristics of apartments. A one-standard-deviation increase in the image variable is associated with a 14.45% increase in price. By utilizing a carefully designed instrumental variables estimation, we further set out to obtain causal estimates. Our empirical findings contribute to theory by quantifying the hedonic value of images and thus establishing a link between visual appearance and product pricing. Even though a positive relationship seems intuitive, we provide an empirical confirmation for the first time. Based on our large-scale computational study, we further find evidence of a picture superiority effect: simply put, a beneficial image corresponds to the same price change as 2856.03 additional words in the textual description. In sum, images capture valuable information for users that goes beyond narrative explanations. As a direct implication, we help online platforms and their users assess and improve the multi-modal presentation of their product offerings. Finally, we contribute to web mining by highlighting the importance of visual information. |
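A hedged sketch of the final inference stage in such a hedonic setup: regress log price on an image-content score plus standard controls and read off the image coefficient. The variable names and synthetic data are ours; the paper additionally uses a deep-learning stage to produce the image variable and an instrumental-variables estimation for causal claims.

```python
# Hedged sketch: OLS of log price on an image-content score with a control variable.
import numpy as np

rng = np.random.default_rng(0)
n = 500
sqft        = rng.normal(900, 200, n)          # control: apartment size
image_score = rng.normal(0, 1, n)              # standardized image-content variable
log_price   = 7.0 + 0.0008 * sqft + 0.14 * image_score + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), sqft, image_score])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)
print("image coefficient approx.", round(beta[2], 3))   # ~0.14 in this synthetic example
```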
17:15-17:30 |
Multimodal Post Attentive Profiling for Influencer Marketing Seungbae Kim (University of California, Los Angeles), Jyun-Yu Jiang (University of California, Los Angeles), Masaki Nakada (University of California, Los Angeles), Jinyoung Han (Sungkyunkwan University) and Wei Wang (University of California, Los Angeles).
AbstractInfluencer marketing has become a key marketing method for brands in recent years. Hence, brands have been increasingly utilizing influencers' social networks to reach niche markets, and researchers have been studying various aspects of influencer marketing. However, brands have often suffered from searching and hiring the right influencers with specific interests/topics for their marketing due to a lack of available influencer data and/or limited capacity of marketing agencies. This paper proposes a multimodal deep learning model that uses text and image information from social media posts (i) to classify influencers into specific interests/topics (e.g., fashion, beauty) and (ii) to classify their posts into certain categories. We use the attention mechanism to select more relevant posts to influencers' topics thereby generating useful representations of influencers. We conduct experiments on the data from Instagram which is the most popular social media for influencer marketing. The experimental results show that our proposed model achieves 98\% and 96\% accuracy in classifying influencers and their posts, respectively. Our model significantly outperforms existing user profiling methods. By applying our proposed model to our dataset, which had been collected for 92 days from October 1st, 2018 to January 1st, 2019, we analyze the behavior characteristics of influencers in terms of their topics, size of potential customers, and their posting behaviors. We plan to release our influencer dataset that contains 33,935 influencers (labeled with specific topics) with their 10,180,500 posting information, which can be used in future research. |
Systems (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
The Fast and The Frugal: Tail Latency Aware Provisioning for Coping with Load Variations Adithya Kumar (The Pennsylvania State University), Iyswarya Narayanan (The Pennsylvania State University), Timothy Zhu (The Pennsylvania State University) and Anand Sivasubramaniam (The Pennsylvania State University).
AbstractSmall and medium-sized enterprises use the cloud for running online, user-facing, tail-latency-sensitive applications with well-defined fixed monthly budgets. For these applications, adequate system capacity must be provisioned to extract maximal performance despite the challenges of uncertainty in load and request sizes. In this paper, we address the problem of capacity provisioning under fixed budget constraints with the goal of minimizing tail latency. To tackle this problem, we propose building systems using a heterogeneous mix of low-latency, expensive resources and cheap resources that provide high throughput per dollar. As load changes through the day, we use more of the faster resources to reduce tail latency during low-load periods and more of the cheaper resources to handle high-load periods. To achieve these tail latency benefits, we introduce novel heterogeneity-aware scheduling and autoscaling algorithms designed to minimize tail latency. Using software prototypes and experiments on the public cloud, we show that our approach can outperform existing capacity provisioning systems by reducing tail latency by as much as 45\% under fixed-budget settings. |
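A toy illustration of the budget split described above (the instance types, prices, and capacities are entirely made up, and the real scheduling and autoscaling algorithms are far richer): under a fixed hourly budget, first reserve just enough cheap high-throughput capacity to absorb the offered load, then spend the remainder on fast but expensive instances that protect tail latency.

```python
# Hedged sketch: splitting a fixed budget between cheap and fast instance types by load.
def provision(load_rps, budget, fast=(1.00, 300), cheap=(0.25, 150)):
    """fast/cheap: (price per hour, requests-per-second capacity per instance)."""
    fast_price, fast_cap = fast
    cheap_price, cheap_cap = cheap
    # Cover the load with cheap, high-throughput-per-dollar instances first...
    n_cheap = -(-load_rps // cheap_cap)                 # ceiling division
    n_cheap = min(n_cheap, int(budget // cheap_price))
    # ...then spend whatever budget remains on fast instances for tail latency.
    n_fast = int((budget - n_cheap * cheap_price) // fast_price)
    return n_cheap, n_fast

for load in (200, 2000):                                # low-load vs high-load period
    print(load, provision(load, budget=10.0))
```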
16:30-17:00 |
De-Kodi: Understanding the Kodi Ecosystem Marc Warrior (Northwestern University), Yunming Xiao (Northwestern University), Matteo Varvello (Brave Software) and Aleksandar Kuzmanovic (Northwestern University).
AbstractFree and open source media centers are currently experiencing a boom in popularity for the convenience and flexibility they offer users seeking to remotely consume digital content. This newfound fame is matched by increasing notoriety — for their potential to serve as hubs for illegal content — and a presumably ever-increasing network footprint. It is fair to say that a complex ecosystem has developed around Kodi, composed of millions of users, thousands of “add-ons” – Kodi extensions from 3rd-party developers – and content providers. Motivated by these observations, this paper aims at conducting the first analysis of the Kodi ecosystem. Our rationale is to build “crawling” software around Kodi which can automatically install an addon, explore its menu, and locate (video) content. This is challenging for many reasons. First, Kodi largely relies on visual information and user input, which intrinsically complicates automation. Second, no central aggregators for Kodi addons exist. Third, the potential sheer size of this ecosystem requires a highly scalable crawling solution. We address these challenges with de-Kodi, a full-fledged crawling system capable of discovering and crawling large cross-sections of Kodi’s decentralized ecosystem at tunable levels of depth and breadth. With de-Kodi, we discovered and tested over 9,000 distinct Kodi addons. Our results demonstrate de-Kodi, which we make available to the general public, to be an essential asset in studying one of the largest multimedia platforms in the world. Our work further serves as the first ever transparent and repeatable analysis of the Kodi ecosystem at large. |
17:00-17:15 |
Natural Language Annotations for Search Engine Optimization Porter Jenkins (The Pennsylvania State University), Jennifer Zhao (Pinterest, Inc.), Heath Vinicombe (Pinterest, Inc.) and Anant Subramanian (Pinterest, Inc.).
AbstractUnderstanding content at scale is a difficult but important problem for many platforms. Many previous studies focus on content understanding to optimize engagement with existing users. However, little work studies how to leverage better content understanding to attract new users. In this work, we build a framework for generating natural language content annotations and show how they can be used for search engine optimization. The proposed framework relies on an XGBoost model that labels "pins" with high probability phrases, and a logistic regression layer that learns to rank aggregated annotations for groups of content. The pipeline identifies keywords that are descriptive and contextually meaningful. We perform a large-scale production experiment deployed on the Pinterest platform and show that natural language annotations cause a 1-2% increase in traffic from leading search engines. This increase is statistically significant. Finally, we explore and interpret the characteristics of our annotations framework. |
17:15-17:30 |
Improved Touch-screen Inputting Using Sequence-level Prediction Generation Xin Wang (Baidu Research), Xu Li (Baidu Research), Jinxing Yu (Baidu Research), Mingming Sun (Baidu Research) and Ping Li (Baidu Research).
AbstractRecent years have witnessed the continuing growth of people's dependence on touchscreen devices. As a result, input speed with the onscreen keyboard has become crucial to communication efficiency and user experience. In this work, we formally discuss the general problem of input expectation prediction with a touch-screen input method editor. Taking input efficiency as the optimization target, we propose a neural end-to-end candidate generation solution that handles automatic correction, reordering, insertion, deletion, and completion. Evaluation metrics are also discussed based on real use scenarios. For a more thorough comparison, we also provide a statistical strategy for mapping touch coordinate sequences to text input candidates. The proposed model and baselines are evaluated on a real-world dataset. The experiments show that the proposed model outperforms all the baselines. |
Semantics (4)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
High Quality Candidate Generation and Sequential Graph Attention Network for Entity Linking Zheng Fang (Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences), Yanan Cao (Institute of Information Engineering, Chinese Academy of Sciences), Ren Li (Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences), Zhenyu Zhang (Institute of Information Engineering, Chinese Academy of Sciences), Yanbing Liu (Institute of Information Engineering, Chinese Academy of Sciences) and Shi Wang (Institute of Computing Technology, Chinese Academy of Sciences).
AbstractEntity Linking (EL) is the task of mapping mentions in text to corresponding entities in a knowledge base (KB). This task usually includes candidate generation (CG) and entity disambiguation (ED) stages. Recent EL systems based on neural network models have achieved good performance, but they still face two challenges: (i) Previous studies evaluate their models without considering the differences between candidate entities. In fact, the quality (gold recall in particular) of candidate sets has an effect on the EL results. So, how to improve the quality of candidates needs more attention. (ii) In order to utilize the topical coherence among the referred entities, many graph and sequence models have been proposed for collective ED. However, graph-based models treat all candidate entities equally, which may introduce much noisy information. On the contrary, sequence models can only observe previously referred entities, ignoring the relevance between the current mention and its subsequent entities. To address the first problem, we propose a multi-strategy based CG method to generate high-recall candidate sets. For the second problem, we design a sequential Graph Attention Network (SeqGAT) which combines the advantages of graph and sequence methods. In our model, mentions are dealt with in a sequential manner. Given the current mention, SeqGAT dynamically encodes both its previously referred entities and subsequent ones, and assigns different importance to these entities. In this way, it not only makes full use of topical consistency, but also reduces noise interference. We conduct experiments on different types of datasets and compare our method with previous EL systems on an open evaluation platform. The comparison results show that our model achieves significant improvements over the state-of-the-art methods. |
16:30-16:45 |
Multi-Context Attention for Entity Matching Dongxiang Zhang (Zhejiang University), Yuyang Nie (University of Science and Technology of China), Sai Wu (Zhejiang University), Yanyan Shen (Shanghai Jiao Tong University) and Kian-Lee Tan (National University of Singapore).
AbstractEntity matching (EM) is a classic research problem that identifies data instances referring to the same real-world entity. A recent technical trend in this area is to take advantage of deep learning (DL) to automatically extract discriminative features. DeepER and DeepMatcher have emerged as two pioneering DL models for EM. However, these two state-of-the-art solutions simply incorporate vanilla RNNs and straightforward attention mechanisms. In this paper, we fully exploit the semantic context of embedding vectors for the pair of entity text descriptions. In particular, we propose an integrated multi-context attention framework that takes into account self-attention, pair-attention and global-attention from three types of context. The idea is further extended to incorporate attribute attention in order to support structured datasets. We conduct extensive experiments with 7 benchmark datasets that are publicly accessible. The experimental results clearly establish our superiority over DeepER and DeepMatcher on all the datasets. |
16:45-17:15 |
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction Paolo Rosso (University of Fribourg), Dingqi Yang (eXascale Infolab, University of Fribourg) and Philippe Cudre-Mauroux (eXascale Infolab, University of Fribourg).
AbstractKnowledge Graph (KG) embeddings are a powerful tool for predicting missing links in KGs. Existing embedding techniques typically represent a KG as a set of triplets, where each triplet (h, r, t) links two entities h and t through a relation r, and learn entity/relation embeddings from such triplets while preserving such a structure. However, this triplet representation oversimplifies the complex nature of the data stored in the KG, in particular for hyper-relational facts, where each fact contains not only a base triplet (h, r, t), but also the associated key-value pairs (k, v). Even though a few recent techniques tried to learn from such data by transforming a hyper-relational fact into an n-ary representation (i.e., a set of key-value pairs only without triplets), they result in suboptimal models as they are unaware of the triplet structure, which serves as the fundamental data structure in modern KGs and indeed preserves the essential information for link prediction. To address this issue, we propose HINGE, a hyper-relational KG embedding model, which directly learns from hyper-relational facts in a KG. HINGE captures not only the primary structural information of the KG encoded in the triplets, but also the correlation between each triplet and its associated key-value pairs. Our extensive evaluation shows the superiority of HINGE on various link prediction tasks over KGs. In particular, HINGE consistently outperforms not only the KG embedding methods learning from triplets only (by 0.81-41.45% depending on the link prediction tasks and settings), but also the methods learning from hyper-relational facts using the n-ary representation (by 13.2-84.1%). |
17:15-17:45 |
Expanding Taxonomies with Implicit Edge Semantics Emaad Manzoor (Carnegie Mellon University), Dhananjay Shrouty (Pinterest), Rui Li (Pinterest) and Jure Leskovec (Stanford).
AbstractCurated taxonomies enhance the performance of machine-learning systems via high-quality structured knowledge. However, manually curating a large and rapidly-evolving taxonomy is infeasible. In this work, we propose Arborist, an approach to automatically expand textual taxonomies by predicting the parents of new taxonomy nodes. Unlike previous work, Arborist handles the more challenging scenario of taxonomies with heterogeneous edge semantics that are unobserved. Arborist learns latent representations of the edge semantics along with embeddings of the taxonomy nodes to measure taxonomic relatedness between node pairs. Arborist is then trained by optimizing a large-margin ranking loss with a dynamic margin function. We propose a principled formulation of the margin function, which theoretically guarantees that Arborist minimizes an upper-bound on the shortest-path distance between the predicted parents and actual parents in the taxonomy. Via extensive evaluation on a curated taxonomy at Pinterest and several public datasets, we demonstrate that Arborist outperforms the state-of-the-art, achieving up to 59% in mean reciprocal rank and 83% in recall at 15. We also explore the ability of Arborist to infer nodes' taxonomic roles without explicit supervision on this task. |
Social Network-B (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Domain Adaptive Multi-Modality Neural Attention Network for Financial Forecasting Dawei Zhou (University of Illinois at Urbana-Champaign), Lecheng Zheng (University of Illinois at Urbana-Champaign), Jianbo Li (Three Bridges Capital), Yada Zhu (IBM) and Jingrui He (University of Illinois at Urbana-Champaign).
AbstractFinancial time series analysis plays a central role in optimizing investment decisions and hedging market risks. This is a challenging task as the problems are always accompanied by dual-level (i.e., data-level and task-level) heterogeneity. For instance, in stock price forecasting, a successful portfolio with bounded risks usually consists of a large number of stocks from diverse domains (e.g., utility, information technology, healthcare, etc.), and forecasting stocks in each domain can be treated as one task; within a portfolio, each stock is characterized by temporal data collected from multiple modalities (e.g., finance, weather, and news), which corresponds to the data-level heterogeneity. Furthermore, the finance industry follows highly regulated processes, which require prediction models to be interpretable and their output results to meet compliance. Therefore, a natural research question is how to build a model that can achieve satisfactory performance on such multi-modality multi-task learning problems, while being able to provide comprehensive explanations for the end-users. To answer this question, in this paper, we propose a generic time series forecasting framework named Dandelion, which leverages the consistency of multiple modalities and explores the relatedness of multiple tasks using a deep neural network. In addition, to ensure the interpretability of the framework, we integrate a novel trinity attention mechanism, which allows the end-users to investigate the variable importance over three dimensions (i.e., tasks, modalities and time). Extensive empirical results demonstrate that Dandelion achieves superior performance for financial market prediction across 396 stocks from 4 different domains over the past 15 years. In particular, two interesting case studies show the efficacy of Dandelion in terms of its profitability performance, and the interpretability of its output results to end-users. |
16:30-16:45 |
Active Domain Transfer on Network Embedding Lichen Jin (Peking University), Yizhou Zhang (USC; work done mainly at PKU), Guojie Song (Peking University) and Yilun Jin (The Hong Kong University of Science and Technology).
AbstractRecent works show that end-to-end, (semi-)supervised network embedding models can generate satisfactory vectors to represent network topology, and are even applicable to unseen graphs by inductive learning. However, domain mismatch between the training and test networks for inductive learning, as well as a lack of labeled data, often compromises the outcome of such methods. To make matters worse, while transfer learning and active learning techniques, which are able to solve such problems respectively, have been well studied on regular i.i.d. data, relatively little attention has been paid to networks. Consequently, we propose in this paper a method for active domain transfer on networks, named Active-Transfer Network Embedding (ATNE). In ATNE we jointly consider the influence of each node on the network from the perspectives of transfer and active learning, and hence design novel and effective influence scores combining both aspects in the training process to facilitate node selection. We demonstrate that ATNE is efficient and decoupled from the actual model used. Further extensive experiments show that ATNE outperforms state-of-the-art active node selection methods and shows versatility in different situations. |
16:45-17:00 |
On the Robustness of Cascade Diffusion under Node Attacks Alvis Logins (Aarhus University), Yuchen Li (Singapore Management University) and Panagiotis Karras (Aarhus University).
AbstractHow can we assess the ability of a network defined in probabilistic terms to maintain its functionality under failures? Network robustness has been studied extensively in the case of deterministic networks under threats to their connectivity. However, applications such as the online diffusion of information and the behavior of networked publics raise the question of robustness in a probabilistic network. In this paper, we propose three novel robustness measures for networks hosting a stochastic diffusion process under the Independent Cascade (IC) model, which is susceptible to node failures. The outcome of such a process depends on the selection of its initiators, or seeds, by the seeder, as well as on two parameters not at the seeder's discretion: the attack strategy and the probabilistic diffusion outcome. In an abstraction, we consider three levels of seeder awareness regarding these two uncontrolled parameters, and evaluate the network's viability aggregated over all possible extents of node failures. We introduce novel algorithms from building blocks found in previous works to evaluate the proposed measures. A thorough experimental study with synthetic and real, scale-free and homogeneous networks establishes that the proposed algorithms are effective and efficient, while the proposed measures highlight differences among networks in terms of their robustness and the surprise they can furnish under attack. Last, we devise a new measure of diffusion entropy that can inform the design of probabilistically robust networks. |
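As a toy illustration of the kind of quantity this abstract studies, the sketch below estimates the expected Independent Cascade spread from a fixed seed set when non-seed nodes fail independently; the failure model, parameters, and random graph are assumptions for illustration and do not reproduce the paper's three measures or algorithms.

```python
import random
import networkx as nx

def independent_cascade(G, seeds, p=0.1, rng=random):
    """Simulate one IC diffusion; every live edge activates with probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in G.successors(u):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def expected_spread_under_failures(G, seeds, fail_prob=0.2, p=0.1, runs=500):
    """Average spread when each non-seed node independently fails (is removed)."""
    rng = random.Random(0)
    total = 0.0
    for _ in range(runs):
        failed = {v for v in G if v not in seeds and rng.random() < fail_prob}
        H = G.subgraph(set(G) - failed)
        surviving_seeds = [s for s in seeds if s in H]
        total += independent_cascade(H, surviving_seeds, p, rng)
    return total / runs

G = nx.gnp_random_graph(200, 0.03, seed=1, directed=True)
print(expected_spread_under_failures(G, seeds=[0, 1, 2]))
```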
17:00-17:15 |
Certified Robustness of Community Detection against Adversarial Structural Perturbation via Randomized Smoothing Jinyuan Jia (Duke University), Binghui Wang (Duke University), Xiaoyu Cao (Duke University) and Neil Zhenqiang Gong (Duke University).
AbstractCommunity detection plays a key role in understanding graph structure. However, several recent studies showed that community detection is vulnerable to adversarial structural perturbation. In particular, via adding or removing a small number of carefully selected edges in a graph, an attacker can manipulate the detected communities. However, to the best of our knowledge, there are no studies on certifying robustness of community detection against such adversarial structural perturbation. In this work, we aim to bridge this gap. Specifically, we develop the first certified robustness guarantee of community detection against adversarial structural perturbation. Given an arbitrary community detection method, we build a new smoothed community detection method via randomly perturbing the graph structure. We theoretically show that the smoothed community detection method provably groups a given arbitrary set of nodes into the same community (or different communities) when the number of edges added/removed by an attacker is bounded. Moreover, we show that our certified robustness is tight. We also empirically evaluate our method on multiple real-world graphs with ground truth communities. |
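To make the smoothing construction concrete, the following sketch samples randomly perturbed versions of a graph, runs an off-the-shelf community detector on each, and estimates how often two nodes are grouped together; the base detector, noise model, and sample count are placeholder choices, and the paper's certification analysis is not reproduced.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def perturb(G, beta, rng):
    """Flip each node pair (edge <-> non-edge) independently with probability beta."""
    H = G.copy()
    nodes = list(G.nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if rng.random() < beta:
                if H.has_edge(u, v):
                    H.remove_edge(u, v)
                else:
                    H.add_edge(u, v)
    return H

def same_community_frequency(G, u, v, beta=0.05, samples=50, seed=0):
    """Fraction of perturbed graphs in which u and v fall into the same community."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        comms = greedy_modularity_communities(perturb(G, beta, rng))
        membership = {n: i for i, c in enumerate(comms) for n in c}
        hits += membership[u] == membership[v]
    return hits / samples

G = nx.karate_club_graph()
print(same_community_frequency(G, 0, 1))   # nodes 0 and 1 are usually co-clustered
```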
User Modeling-B (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
16:00-16:30 |
Correcting for Selection Bias in Learning-to-rank Systems Zohreh Ovaisi (University of Illinois at Chicago), Ragib Ahsan (University of Illinois at Chicago), Yifan Zhang (Sun Yat-sen University), Kathryn Vasilaky (California Polytechnic State University) and Elena Zheleva (University of Illinois at Chicago).
AbstractClick data collected by modern recommendation systems are an important source of observational data that can be utilized to train learning-to-rank (LTR) systems. However, these data suffer from a number of biases that can result in poor performance for LTR systems. Recent methods for bias correction in such systems mostly focus on position bias, the fact that higher-ranked results (e.g., top search engine results) are more likely to be clicked even if they are not the most relevant results given a user's query. Less attention has been paid to correcting for selection bias, which occurs because clicked documents are reflective of what documents have been shown to the user in the first place. Here, we propose new counterfactual approaches which adapt Heckman's two-stage method and account for selection and position bias in LTR systems. Our empirical evaluation shows that our proposed methods have better accuracy compared to existing unbiased LTR algorithms under moderate position bias assumptions and are more robust to noise overall. |
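For readers unfamiliar with the two-stage idea, here is a textbook Heckman-style correction on synthetic data: stage one fits a probit selection model and stage two adds the resulting inverse Mills ratio as an extra regressor. This is only the classical recipe on a toy regression, not the paper's adaptation to learning-to-rank; all data and coefficients are made up.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
X = sm.add_constant(x)

# Correlated errors induce selection bias: outcomes are only observed
# (e.g. clicks only happen on shown documents) when the selection index > 0.
e = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n)
select = (X @ np.array([0.2, 1.0, -0.5]) + e[:, 0]) > 0
y = X @ np.array([1.0, 2.0, 0.5]) + e[:, 1]

# Stage 1: probit model of selection, then the inverse Mills ratio.
probit = sm.Probit(select.astype(int), X).fit(disp=0)
idx = X @ probit.params
mills = norm.pdf(idx) / norm.cdf(idx)

# Stage 2: regress observed outcomes on features plus the correction term.
Xs = np.column_stack([X[select], mills[select]])
coef, *_ = np.linalg.lstsq(Xs, y[select], rcond=None)
print(coef[:3])   # close to [1.0, 2.0, 0.5] despite non-random selection
```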
16:30-17:00 |
Early Detection of User Exits from Clickstream Data: A Markov Modulated Marked Point Process Model Tobias Hatt (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
AbstractMost users leave e-commerce websites with no purchase. Hence, it is important for website owners to detect users at risk of exiting and intervene early (e.g., via price promotions). Prior approaches make widespread use of clickstream data; however, state-of-the-art algorithms only model the sequence of web pages visited and not the time spent on them. In this paper, we develop a novel Markov modulated marked point process (M3PP) model for predicting user exits from clickstream data. It accommodates clickstream data in a holistic manner: our proposed M3PP models both the sequence of pages visited and the temporal dynamics between them (i.e., the time spent on pages). This is achieved by a continuous-time marked point process. Different from previous Markovian clickstream models, our M3PP is the first model in which the continuous nature of time is considered. The marked point process is modulated by a continuous-time Markov process in order to account for different latent shopping phases. As a secondary contribution, we suggest a risk assessment framework. Rather than predicting future page visits, we compute a user's risk of exiting with no purchase. For this purpose, we build upon sequential hypothesis testing in order to suggest a risk score for user exits. Our computational experiments draw upon real-world clickstream data provided by a large online retailer. Based on it, we find that state-of-the-art algorithms are consistently outperformed by our M3PP model in terms of both AUROC (+6.24 percentage points) and so-called time of early warning (+12.93%). Accordingly, our M3PP model allows for timely detection of user exits and thus provides sufficient time for website owners to trigger dynamic online interventions (e.g., adapting website content or price promotions). |
17:00-17:30 |
An End-to-end Topic-Enhanced Self-Attention Network for Social Emotion Classification Chang Wang (Huazhong University of Science and Technology (HUST), Wuhan, China) and Bang Wang (Huazhong University of Science and Technology (HUST), Wuhan, China).
AbstractSocial emotion classification aims to predict the distribution of different emotions evoked by an article among its readers. Prior studies have shown that document semantic and topical features can help improve classification performance. However, how to effectively extract and jointly exploit such features has not been well researched. In this paper, we propose an end-to-end topic-enhanced self-attention network (TESAN) that jointly encodes document semantics and extracts document topics. In particular, TESAN first constructs a neural topic model to learn topical information and generates a topic embedding for a document. We then propose a topic-enhanced self-attention mechanism to encode semantic and topical information into a document vector. Finally, a fusion gate is used to compose the document representation for emotion classification by integrating the document vector and the topic embedding. The entire TESAN is trained in an end-to-end manner. Experimental results on three public datasets reveal that TESAN outperforms the state-of-the-art schemes in terms of higher classification accuracy and higher average Pearson correlation coefficient. Furthermore, TESAN is computationally efficient and can generate more coherent topics. |
17:30-17:45 |
Large-scale Causal Approaches to Debiasing Post-click Conversion Rate Estimation with Multi-task Learning Wenhao Zhang (Dept. of Computer Science; University of California, Los Angeles), Wentian Bao (Alibaba Group), Keping Yang (Alibaba Group), Quan Lin (Alibaba Group), Xiao-Yang Liu (Columbia University), Hong Wen (Alibaba Group) and Ramin Ramezani (Dept. of Computer Science; University of California, Los Angeles).
AbstractPost-click conversion rate (CVR) estimation is a critical task in e-commerce recommender systems. This task is deemed quite challenging in industrial settings, with two major issues: 1) selection bias caused by user self-selection, and 2) data sparsity due to limited click events. A successful conversion typically follows the sequence of events "exposure→click→conversion". Conventional CVR estimators are trained in the click space, but inference is done in the entire exposure space. The unclicked data are intentionally excluded in the training phase, as we have no explicit conversion feedback for items that are not clicked by customers. This information is typically missing not at random due to user self-selection. Conventional CVR estimators fail to account for the causes of the missing data and treat them as missing at random. Hence, their estimations are highly likely to deviate from the real values by a large margin. In addition, the data sparsity issue can also handicap many industrial CVR estimators, which usually have a large parameter space. In this paper, we propose two principled, efficient and highly effective CVR estimators for industrial CVR prediction tasks, namely Multi-IPW and Multi-DR. The proposed models approach the CVR estimation task from a causal perspective and account for the cause of the missing-not-at-random data. In addition, our methods are based on the multi-task learning framework and mitigate the data sparsity issue. Extensive experiments on industrial-level datasets demonstrate that the proposed methods outperform other state-of-the-art CVR models. |
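A minimal numerical sketch of the inverse-propensity-weighting idea behind estimators such as Multi-IPW: the conversion loss on clicked impressions is reweighted by the inverse of the click propensity so that, in expectation, it matches the loss over the whole exposure space. The propensities, labels, and constant predictor below are simulated assumptions, not the paper's multi-task model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

propensity = rng.uniform(0.05, 0.9, size=n)        # P(click | exposure), assumed known here
cvr_true = 0.05 + 0.25 * propensity                # items clicked more often also convert more
clicked = rng.random(n) < propensity
would_convert = rng.random(n) < cvr_true           # conversion outcome had the item been clicked
converted = clicked & would_convert                # what the training log actually records

pred = np.full(n, 0.15)                            # a constant CVR predictor, for illustration

def log_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

oracle = log_loss(would_convert, pred).mean()                    # ideal loss over all exposures
naive = log_loss(converted[clicked], pred[clicked]).mean()       # biased: clicked impressions only
ipw = (clicked * log_loss(converted, pred) / propensity).mean()  # reweighted, unbiased for oracle

print(f"oracle={oracle:.4f}  naive={naive:.4f}  ipw={ipw:.4f}")
```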
Research Tracks (7)
Web Mining-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Large-Scale Talent Flow Embedding for Company Competitive Analysis Le Zhang (University of Science and Technology of China), Tong Xu (University of Science and Technology of China), Hengshu Zhu (Baidu Inc.), Chuan Qin (University of Science and Technology of China), Qingxin Meng (Rutgers-the State University of New Jersey), Hui Xiong (Rutgers-the State University of New Jersey) and Enhong Chen (University of Science and Technology of China).
AbstractRecent years have witnessed growing interest in investigating the competition among companies. Existing studies of company competitive analysis generally rely on subjective survey data and inferential analysis. Instead, in this paper, we aim to develop a new paradigm for studying the competition among companies through the analysis of talent flows. The rationale behind this is that the competition among companies usually leads to talent movement. Along this line, we first build a Talent Flow Network based on large-scale job transition records of talents, and formulate the concept of "competitiveness" for companies with consideration of their bi-directional talent flows in the network. Then, we propose a Talent Flow Embedding (TFE) model to learn the bi-directional talent attractions of each company, which can be leveraged for measuring the pairwise competitive relationships between companies. Specifically, we employ a random-walk based model in the original and transposed networks respectively to learn representations of companies by preserving their competitiveness as well as the in/out-degree distribution of the network. Furthermore, we design a multi-task strategy to refine the learning results from a fine-grained perspective, which can jointly embed multiple talent flow networks by assuming that company features remain stable but play different roles in the networks of different job positions. Finally, extensive experiments on a large-scale real-world dataset clearly validate the effectiveness of our TFE model in terms of company competitive analysis and reveal some interesting rules of competition based on the derived insights on talent flows. |
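A small sketch of the graph-and-walks preparation this abstract alludes to: build a weighted directed talent-flow graph from job transitions, then generate random walks on both the original graph and its transpose so the two flow directions can be embedded separately (the word2vec-style embedding step is omitted). The records and parameters are toy assumptions.

```python
import random
import networkx as nx

# Toy job-hop records: (previous company, next company).
transitions = [("CompanyA", "CompanyB"), ("CompanyB", "CompanyC"),
               ("CompanyA", "CompanyC"), ("CompanyC", "CompanyA")]

G = nx.DiGraph()
for src, dst in transitions:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

def random_walks(graph, walk_len=10, walks_per_node=5, seed=0):
    """Weighted random walks following outgoing edges from every node."""
    rng = random.Random(seed)
    walks = []
    for start in graph.nodes:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = list(graph.successors(node))
                if not nbrs:
                    break
                weights = [graph[node][n]["weight"] for n in nbrs]
                node = rng.choices(nbrs, weights=weights)[0]
                walk.append(node)
            walks.append(walk)
    return walks

out_walks = random_walks(G)              # follow outgoing talent flow
in_walks = random_walks(G.reverse())     # same procedure on the transposed network
print(out_walks[0], in_walks[0])
```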
11:00-11:30 |
OutfitNet: Fashion Outfit Recommendation with Attention-Based Multiple Instance Learning Yusan Lin (Visa Research), Maryam Moosaei (Visa Research) and Hao Yang (Visa Research).
AbstractRecommending fashion outfits to users presents several challenges. First of all, an outfit consists of multiple fashion items, and each user emphasizes different parts of an outfit when considering whether they like it or not. Secondly, a user's liking for a fashion outfit depends not only on the aesthetics of each item but also on the compatibility among them. Lastly, fashion outfit data are often sparse in terms of the relationship between users and fashion outfits. Moreover, we can only observe what users like, not what they dislike. To address the above challenges, in this paper, we formulate the fashion outfit recommendation problem as a multiple-instance-learning (MIL) problem. We propose OutfitNet, a fashion outfit recommendation framework that includes two stages. The first stage is a Fashion Item Relevancy network (FIR), which learns the compatibility between fashion items and further generates relevancy embeddings of fashion items. In the second stage, an Outfit Preference network (OP) learns the users' tastes for fashion outfits using visual information. OutfitNet takes multiple fashion items in a fashion outfit as input and, with the attention mechanism, learns the compatibility among fashion items, the users' tastes toward each item, as well as the users' attention on different items in the outfit. Quantitatively, our experiments show that OutfitNet outperforms state-of-the-art models in two tasks: fill-in-the-blank (FITB) and personalized outfit recommendation. Qualitatively, we demonstrate that the learned personalized item scores and attention scores capture the users' fashion tastes well, and the learned fashion item embeddings capture the compatibility relationships among fashion items well. We also leverage the learned fashion item embeddings and propose a simple fashion outfit generation framework, which is shown to produce high-quality fashion outfit combinations. |
11:30-12:00 |
Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network Xi Tong Lee (Nanyang Technological University), Arijit Khan (Nanyang Technological University), Sourav Sen Gupta (Nanyang Technological University), Yu Hann Ong (Nanyang Technological University) and Xuan Liu (Nanyang Technological University).
AbstractBlockchains are increasingly becoming popular due to the prevalence of cryptocurrencies and decentralized applications. Ethereum is a distributed public blockchain network that focuses on running code (smart contracts) for decentralized applications. More simply, it is a platform for sharing information in a global state that cannot be manipulated or changed. The Ethereum blockchain introduces a novel ecosystem of human users and autonomous agents (smart contracts). In this network, we are interested in all possible interactions: user-to-user, user-to-contract, contract-to-user, and contract-to-contract. This requires us to construct interaction networks from the entire Ethereum blockchain data, where vertices are accounts (users, contracts) and arcs denote interactions. Each interaction network provides us with a different perspective on the Ethereum blockchain, and our analyses on the networks reveal new insights by combining information from the four networks. We perform an in-depth study of these networks based on several graph properties, consisting of both local and global properties, discuss their similarities and differences with social networks and the Web, draw interesting conclusions, and highlight important future research directions. |
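The four interaction networks described here can be sketched in a few lines: given transaction records and a lookup of whether each account is a user or a contract, route every transaction into the corresponding directed multigraph. The record format below is invented for illustration.

```python
import networkx as nx

# Toy transaction records (from_account, to_account) and account types.
txs = [("u1", "u2"), ("u1", "c1"), ("c1", "u2"), ("c1", "c2"), ("u2", "c2")]
kind = {"u1": "user", "u2": "user", "c1": "contract", "c2": "contract"}

# One directed multigraph per interaction type.
networks = {pair: nx.MultiDiGraph() for pair in
            [("user", "user"), ("user", "contract"),
             ("contract", "user"), ("contract", "contract")]}

for src, dst in txs:
    networks[(kind[src], kind[dst])].add_edge(src, dst)

for (a, b), g in networks.items():
    print(f"{a}-to-{b}: {g.number_of_nodes()} accounts, {g.number_of_edges()} interactions")
```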
12:00-12:15 |
ShapeVis: High-dimensional Data Visualization at Scale Siddarth R (Adobe Inc), Nupur Kumari (Adobe Inc), Akash Rupela (Adobe Inc), Piyush Gupta (Adobe Inc) and Balaji Krishnamurthy (Adobe Inc).
AbstractWe present ShapeVis, a visualization technique for point cloud data inspired by topological data analysis. Our method captures the underlying geometric and topological structure of the data in a compressed graphical representation. Much success has been reported for the graph-based data compression technique Mapper, which discretely approximates the Reeb graph of a filter function on the data. However, when using standard dimensionality reduction algorithms as the filter function, Mapper suffers from considerable computational cost. This makes it difficult to scale to high-dimensional data. Our proposed technique relies on finding a subset of points called landmarks along the data manifold to construct a weighted witness-graph over it. This graph captures the structural characteristics of the point cloud, and its weights are determined using a Finite Markov Chain. We further compress this graph by applying induced maps from standard community detection algorithms. Using techniques borrowed from manifold tearing, we prune and reinstate edges in the induced graph based on their modularity to summarize the shape of data. We empirically demonstrate how our technique captures the structural characteristics of real and synthetic data sets. Further, we compare our approach with Mapper using various filter functions like t-SNE, UMAP, and LargeVis, and show that our algorithm scales to millions of data points while preserving the quality of data visualization. |
12:15-12:30 |
NCVis: Noise Contrastive Approach for Scalable Visualization Aleksandr Artemenkov (Skoltech) and Maxim Panov (Skoltech).
AbstractModern methods for data visualization, such as t-SNE, usually have performance issues that prohibit their application to large amounts of high-dimensional data. In this work, we propose NCVis -- a high-performance visualization method built on a sound statistical basis of noise contrastive estimation. We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods. In particular, the proposed approach successfully processes a large dataset of more than 1 million news headlines in several minutes and presents the underlying structure in a human-readable way. Moreover, it provides results consistent with classical methods like t-SNE on more straightforward datasets like images of hand-written digits. We believe that the broader usage of such software can significantly simplify web data analysis and lower the entry barrier for large-scale applications. |
Social Network-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Seeding Network Influence in Biased Networks and the Benefits of Diversity Ana-Andreea Stoica (Columbia University), Jessy Xinyi Han (Columbia University) and Augustin Chaintreau (Columbia University).
AbstractThe problem of social influence maximization is widely applicable in designing viral campaigns, news dissemination, or medical aid. State-of-the-art algorithms often select "early adopters" that are most central in a network, unfortunately mirroring or exacerbating embedded historical biases in human networks and leaving under-represented communities out of the loop. In this paper, we aim at a rigorous foundation for fair influence maximization. Through a theoretical model of biased networks, we characterize the intricate relationship between diversity and efficiency, which sometimes may be at odds but may also reinforce each other. Most importantly, we prove an analytical condition under which more equitable choices of early adopters lead simultaneously to fairer outcomes and larger outreach. Analysis of data on DBLP confirms our condition is often met. We design and test a set of algorithms leveraging networks to optimize the diffusion of a message while avoiding the creation of disparate impact among participants based on gender or race. |
11:00-11:30 |
Efficient Algorithms towards Network Intervention Hui-Ju Hung (The Pennsylvania State University), Wang-Chien Lee (The Pennsylvania State University), De-Nian Yang (Academia Sinica), Chih-Ya Shen (National Tsing Hua University), Zhen Lei (The Pennsylvania State University) and Sy-Miin Chow (The Pennsylvania State University).
AbstractResearch suggests that social relationships have substantial impacts on individuals' health outcomes. Network intervention, through careful planning, can assist a network of users in building healthy relationships. However, most previous work is not designed to assist such planning by carefully examining and improving multiple network characteristics. In this paper, we propose and evaluate algorithms that facilitate network intervention planning through simultaneous optimization of network degree, closeness, betweenness, and local clustering coefficient, under scenarios involving Network Intervention with Limited Degradation - for Single target (NILD-S) and Network Intervention with Limited Degradation - for Multiple targets (NILD-M). We prove that NILD-S and NILD-M are NP-hard and cannot be approximated within any ratio in polynomial time unless P=NP. We propose the Candidate Re-selection with Preserved Dependency (CRPD) algorithm for NILD-S, and the Objective-aware Intervention edge Selection and Adjustment (OISA) algorithm for NILD-M. Various pruning strategies are designed to boost the efficiency of the proposed algorithms. Extensive experiments on various real social network datasets collected from public primary schools and the Web, together with an empirical study, are conducted to show that CRPD and OISA outperform the baselines in both efficiency and effectiveness. |
11:30-12:00 |
Finding Large Balanced Subgraphs in Signed Networks Bruno Ordozgoiti (Aalto University), Antonis Matakos (Aalto University) and Aristides Gionis (Aalto University).
AbstractSigned networks are graphs whose edges are labelled with either a positive or a negative sign, and are able to capture nuances in interactions that are missed by their unsigned counterparts. The concept of balance in signed graph theory determines whether or not a network can be partitioned into two perfectly opposing subsets, and is therefore useful for modelling phenomena such as the existence of polarized communities in social networks. While determining whether a graph is balanced is easy, finding a large balanced subgraph is hard. The few heuristics available in the literature for this purpose are either ineffective or non-scalable. In this paper we propose an efficient algorithm for finding balanced subgraphs in signed networks. The algorithm relies on signed spectral theory and a novel bound for perturbations of the graph Laplacian. In a wide variety of experiments on real data we show that our algorithm can find balanced subgraphs much larger than those detected by existing methods, and is in addition faster. We test its scalability on graphs of up to 18 million edges. |
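The following toy heuristic is in the spirit of the signed spectral approach the abstract mentions, though it is not the paper's algorithm or bound: partition nodes by the sign of the signed Laplacian's bottom eigenvector, then greedily drop the node incident to the most balance-violating edges until none remain.

```python
import numpy as np

# Signed adjacency matrix: +1 friendly, -1 hostile, 0 no edge (toy example).
A = np.array([
    [ 0,  1,  1, -1,  0],
    [ 1,  0,  1, -1, -1],
    [ 1,  1,  0, -1,  1],   # the positive (2,4) edge frustrates the natural split
    [-1, -1, -1,  0,  1],
    [ 0, -1,  1,  1,  0],
])

D = np.diag(np.abs(A).sum(axis=1))
L = D - A                                   # signed Laplacian; eigenvalue 0 iff balanced
vals, vecs = np.linalg.eigh(L)
side = np.sign(vecs[:, 0])                  # tentative two-sided partition

def frustrated(A, side, nodes):
    """Count, per node, edges violating balance: + across sides or - within a side."""
    bad = {}
    for i in nodes:
        cnt = 0
        for j in nodes:
            if A[i, j] > 0 and side[i] != side[j]:
                cnt += 1
            if A[i, j] < 0 and side[i] == side[j]:
                cnt += 1
        bad[i] = cnt
    return bad

nodes = set(range(len(A)))
while True:
    bad = frustrated(A, side, nodes)
    worst = max(bad, key=bad.get)
    if bad[worst] == 0:
        break
    nodes.remove(worst)                     # drop the most conflicted node and retry

print("balanced subgraph on nodes:", sorted(nodes), "split:", side[sorted(nodes)])
```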
12:00-12:15 |
LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew Cyrus Rashtchian (UCSD), Aneesh Sharma (Google) and David Woodruff (Carnegie Mellon University).
AbstractAll-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work has highlighted the importance of discovering pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, it is also common that some dimensions are highly-skewed, because they are very popular. Together, these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. In particular, our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs. |
12:15-12:30 |
Higher-Order Label Homogeneity and Spreading in Graphs Dhivya Eswaran (Carnegie Mellon University), Srijan Kumar (Georgia Institute of Technology) and Christos Faloutsos (Carnegie Mellon University).
AbstractDo higher-order network structures aid graph semi-supervised learning? Given a graph and a few labeled vertices, labeling the remaining vertices is a high-impact problem with applications in several tasks, such as recommender systems, fraud detection and protein identification. However, traditional methods rely on edges for spreading labels, which is limited by the fact that not all edges are equal. Vertices with stronger connections participate in higher-order structures in graphs, which calls for methods that can leverage these structures in semi-supervised learning tasks. Our contributions are three-fold. First, we create an information-theoretic metric to quantify the homogeneity of labels in higher-order structures in graphs. We show that across four diverse real-world networks, higher-order structures exhibit more homogeneity of labels compared to edges. Second, we create an algorithm, HOLS, for label spreading using higher-order structures. HOLS has strong theoretical guarantees and reduces to standard label spreading in the base case. Third, we conduct extensive experiments to compare HOLS to several traditional and recent state-of-the-art methods. We show that higher-order label spreading using triangles in addition to edges is up to 4.7% better than label spreading using edges alone. Compared to the baselines, HOLS leads to statistically significantly higher accuracy in all but one case. HOLS is also fast and scalable to large graphs, running in under 2 minutes on graphs with over 21 million edges. All the code and datasets are available at http://bit.ly/www2020hols for reproducibility. |
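A compact sketch of higher-order spreading under simplifying assumptions: mix the ordinary adjacency with a triangle-count adjacency, normalize, and run standard label spreading from a few seeds. The mixing weight, normalization, and iteration count are illustrative choices, not HOLS's exact formulation.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()
A = nx.to_numpy_array(G)

# Triangle co-membership: T[i, j] = number of triangles containing edge (i, j).
T = A * (A @ A)

lam, alpha = 1.0, 0.9
W = A + lam * T                                   # edge + triangle evidence
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))                   # symmetric normalization

Y = np.zeros((n, 2))                              # two classes, a few labeled seeds
Y[0, 0] = Y[33, 1] = 1.0                          # node 0 vs. node 33 as seeds

F = Y.copy()
for _ in range(50):                               # standard label-spreading iteration
    F = alpha * S @ F + (1 - alpha) * Y

pred = F.argmax(axis=1)
print(pred)                                       # predicted class for every vertex
```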
User Modeling-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Open Intent Extraction from Natural Language Interactions Nikhita Vedula (The Ohio State University), Nedim Lipka (Adobe), Pranav Maneriker (The Ohio State University) and Srinivasan Parthasarathy (The Ohio State University).
AbstractAccurately discovering user intents from their written or spoken language plays a critical role in natural language understanding and automated dialog response. Most existing research models this as a classification task with a single intent label per utterance, by grouping user utterances into a single intent type from a set of categories known beforehand. Going beyond this formulation, we define and investigate a new problem of open intent discovery. It involves discovering one or more generic intent types from text utterances, that may not have been encountered during training. We propose a novel domain-agnostic approach, OPINE, which formulates the problem as a sequence tagging task under an open-world setting. It employs a CRF on top of a bidirectional LSTM to extract intents in a consistent format, subject to constraints among intent tag labels. We apply a multi-head self-attention mechanism to effectively learn dependencies between distant words. We further use adversarial training to improve performance and robustly adapt our model across varying domains. Finally, we curate and plan to release an open intent annotated dataset of 25K real-life utterances spanning diverse domains. Extensive experiments show that our approach outperforms state-of-the-art baselines by 5-15% F1 score points. We also demonstrate the efficacy of OPINE in recognizing multiple, diverse domain intents with limited (or zero) training examples per unique domain. |
11:00-11:30 |
Dynamic Composition for Conversational Domain Exploration Idan Szpektor (Google), Deborah Cohen (Google), Gal Elidan (Google), Michael Fink (Google), Avinatan Hassidim (Google), Orgad Keller (Google), Sayali Kulkarni (Google), Eran Ofek (Google), Sagie Pudinsky (Google), Asaf Revach (Google), Shimi Salant (Google) and Yossi Matias (Google).
AbstractWe study conversational exploration and discovery (CoED), where the user's goal is to enrich her knowledge of a given domain by conversing with an informative bot. Such conversations should be well grounded in high-quality domain knowledge as well as engaging and open-ended. A CoED bot should be proactive and introduce relevant information even if not directly asked by the user. The bot should also appropriately pivot the conversation to undiscovered regions of the domain. To address these dialogue characteristics, we introduce a novel approach termed dynamic composition. This approach decouples candidate content generation from the flexible composition of bot responses. This allows the bot to control the source, correctness and quality of the offered content, while achieving flexibility via a dialogue manager that selects the most appropriate contents in a compositional manner. We implemented a CoED bot based on dynamic composition and integrated it into the Google Assistant. As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with the CoED bot. Experimental results are positive and offer insights into what makes a good conversation. To the best of our knowledge, this is the first real user experiment of open-ended dialogues as part of a commercial assistant system. |
11:30-12:00 |
Broccoli: Sprinkling Lightweight Vocabulary Learning into Everyday Information Diets Roland Aydin (Institute of Materials Research, Helmholtz-Zentrum Geesthacht), Lars Klein (Ecole Polytechnique Fédérale de Lausanne), Arnaud Miribel (Ecole Polytechnique Fédérale de Lausanne) and Robert West (Ecole Polytechnique Fédérale de Lausanne).
AbstractThe learning of a new language remains to this date a cognitive task that requires considerable diligence and willpower, recent advances and tools notwithstanding. In this paper, we propose Broccoli, a new paradigm aimed at significantly reducing the required effort by seamlessly embedding vocabulary learning into users' everyday information diets. This is achieved by inconspicuously switching chosen words encountered by the user for their translation in the target language. Thus, by seeing words in context, the user can assimilate new vocabulary without much conscious effort. We validate our approach in a careful user study, finding that the efficacy of the lightweight Broccoli approach is competitive with traditional, memorization-based vocabulary learning. The low cognitive overhead is manifested in a pronounced decrease in learners' usage of mnemonics and other learning strategies, as compared to traditional learning. Finally, we establish that language patterns in typical information diets are compatible with spaced-repetition strategies, enabling an efficient use of the Broccoli paradigm. Overall, our work establishes the feasibility of a novel and powerful "install-and-forget" approach for embedded language acquisition. |
12:00-12:15 |
MetaSelector: Meta-Learning for Recommendation with User-Level Adaptive Model Selection Mi Luo (Huawei Noah's Ark Lab), Fei Chen (Huawei Noah’s Ark Lab), Pengxiang Cheng (Huawei Noah’s Ark Lab), Zhenhua Dong (Huawei Noah’s Ark Lab), Xiuqiang He (Huawei Noah’s Ark Lab), Jiashi Feng (National University of Singapore) and Zhenguo Li (Huawei Noah's Ark Lab).
AbstractRecommender systems often face heterogeneous datasets containing highly personalized historical data of users, where no single model could give the best recommendation for every user. We observe this ubiquitous phenomenon on both public and production datasets and address the issue of model selection in pursuit of optimizing the quality of recommendation for each user. We propose a meta-learning framework to facilitate user-level adaptive model selection in a hybrid recommender system. In this framework, a collection of recommenders is trained with data from all users, on top of which the meta-learning module trains a model selector that aims to select the best model for each user using the user-specific historical data. We conduct extensive experiments on two public datasets and a real-world production dataset, demonstrating that our proposed framework achieves significant improvements over single model baselines and sample-level model selector in terms of AUC and LogLoss. In particular, the improvement over the production dataset may lead to huge profit gain when deployed in online recommender systems. |
12:15-12:30 |
Latent Linear Critiquing for Conversational Recommender Systems Kai Luo (University of Toronto), Scott Sanner (University of Toronto), Ga Wu (University of Toronto), Hanze Li (University of Toronto) and Hojin Yang (University of Toronto).
AbstractCritiquing is a method for conversational recommendation that iteratively adapts recommendations in response to user preference feedback. In this setting, a user is iteratively provided with an item recommendation and attribute description for that item; a user may either accept the recommendation, or critique the attributes in the item description to generate a new recommendation. Historical critiquing methods were largely based on explicit constraint- and utility-based methods for modifying recommendations w.r.t. critiqued item attributes. In this paper, we revisit the critiquing approach in the era of recommendation methods based on latent embeddings with subjective item descriptions (i.e., keyphrases from user reviews). Two critical research problems arise: (1) how to co-embed keyphrase critiques with user preference embeddings to update recommendations, and (2) how to modulate the strength of multi-step critiquing feedback, where critiques are not necessarily independent, nor of equal importance. To address (1), we build on an existing state-of-the-art linear embedding recommendation algorithm to align review-based keyphrase attributes with user preference embeddings. To address (2), we exploit the linear structure of the embeddings and recommendation prediction to formulate a linear program (LP) based optimization problem to determine optimal weights for incorporating critique feedback. We evaluate the proposed framework on two recommendation datasets containing user reviews. Empirical results compared to a standard approach of averaging critique feedback show that our approach reduces the number of interactions required to find a satisfactory item and increases the overall success rate. |
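To illustrate step (2) in miniature, the sketch below solves a deliberately simplified linear program: choose per-critique weights in [0, 1] (with a small budget) that push the critiqued keyphrases' scores down after updating the user embedding. The objective, budget, and random embeddings are assumptions and do not reproduce the paper's LP formulation.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, k = 16, 3
user = rng.normal(size=d)                  # current user preference embedding
critiques = rng.normal(size=(k, d))        # embeddings of the critiqued keyphrases

# Updated user vector: user - critiques.T @ w.  The score of critique j after the
# update is critiques[j] @ user - (critiques @ critiques[j]) @ w, which is linear
# in w, so minimizing the summed critique scores is an LP over w in [0, 1]^k.
c = -(critiques @ critiques.T).sum(axis=0)   # coefficients of w in the summed score
res = linprog(c, A_ub=np.ones((1, k)), b_ub=[2.0],   # budget: total weight at most 2
              bounds=[(0, 1)] * k, method="highs")

w = res.x
updated_user = user - critiques.T @ w
print("critique weights:", w.round(3))
print("scores before:", (critiques @ user).round(3))
print("scores after: ", (critiques @ updated_user).round(3))
```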
Society (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
A Kernel of Truth: Determining Rumor Veracity on Twitter by Diffusion Pattern Alone Nir Rosenfeld (Harvard University), David Parkes (Harvard University) and Aron Szanto (Harvard University).
AbstractRecent work in the domain of misinformation identification has leveraged rich signals in the text and user identities associated with content on social media to discriminate between true and false information. But text can be strategically manipulated and accounts reopened under different aliases, suggesting that this approach is inherently brittle. In this work, we explore an alternative signal that is naturally robust: the pattern in which information propagates. Our goal is to answer the following question: can the veracity of an unverified rumor spreading through social media be predicted solely on the basis of its pattern of diffusion through the social network? Using graph kernels to extract topological information from Twitter cascade structures, we train models that are surprisingly accurate given that they are blind to language, user identities, and time, demonstrating that "sanitized" diffusion patterns can be highly informative of content. Our results suggest that, with the proper form of aggregation, the collective sharing pattern of the crowd can reveal powerful signals of rumor veracity, even in the early stages of propagation. |
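The paper uses graph kernels; as a loose stand-in, the sketch below extracts a few purely structural features from toy cascade trees and trains an off-the-shelf classifier, just to show what a structure-only (no text, no user identity, no time) pipeline looks like. The cascade generator and labels are synthetic.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_features(T, root=0):
    """Purely structural features of a diffusion tree: no text, user IDs, or timestamps."""
    depths = nx.single_source_shortest_path_length(T, root)
    degs = [d for _, d in T.degree()]
    leaves = sum(1 for n in T if T.degree(n) == 1)
    return [T.number_of_nodes(), max(depths.values()),
            float(np.mean(degs)), max(degs), leaves / T.number_of_nodes()]

rng = np.random.default_rng(0)

def toy_cascade(chainy):
    """Toy retweet tree: 'chainy' cascades mostly extend the latest node, others attach broadly."""
    T = nx.Graph()
    T.add_node(0)
    n = int(rng.integers(20, 60))
    for i in range(1, n):
        parent = i - 1 if (chainy and rng.random() < 0.8) else int(rng.integers(0, i))
        T.add_edge(parent, i)
    return T

X, y = [], []
for label in (0, 1):                    # two synthetic "veracity" classes with different shapes
    for _ in range(100):
        X.append(cascade_features(toy_cascade(bool(label))))
        y.append(label)

clf = RandomForestClassifier(random_state=0).fit(X, y)
print("training accuracy on toy cascades:", clf.score(X, y))
```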
11:00-11:30 |
Stop tracking me Bro! Differential Tracking of User Demographics on Hyper-Partisan Websites Pushkal Agarwal (King’s College London), Sagar Joglekar (King’s College London), Panagiotis Papadopoulos (Brave Software Inc.), Nishanth Sastry (King’s College London) and Nicolas Kourtellis (Telefonica Research).
AbstractWebsites with a hyper-partisan, left- or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age. In this paper, we take a first step to shed light on and measure such potential differences in the tracking imposed on users when visiting specific party-line websites. For this, we design and deploy a methodology to systematically probe such websites and measure differences in user tracking. This methodology allows us to create user personas with specific attributes like gender and age and automate their browsing behavior in a consistent and repeatable manner. Thus, we systematically study how personas are being tracked by these websites and their third parties, especially if they exhibit particular demographic properties. Overall, we test 9 personas on 556 hyper-partisan websites and find that right-leaning sites tend to track users more intensely than left-leaning ones, always depending on user demographics, and using both cookies and cookie synchronization methods, leading to more costly ads being delivered. |
11:30-12:00 |
Characterizing Search-Engine Traffic to Internet Research Agency Web Properties Alexander Spangher (University of Southern California), Adam Fourney (Microsoft), Besmira Nushi (Microsoft), Gireeja Ranade (University of California, Berkeley) and Eric Horvitz (Microsoft).
AbstractThe Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. In this paper, we focus on IRA activities that received exposure through search engines, by joining data from Facebook and Twitter with logs from a major internet company's web browsers and search engine. We find that a substantial volume of Russian content was apolitical and emotionally-neutral in nature. Our observations demonstrate that such content gave IRA web properties considerable exposure through search engines and brought readers to websites hosting inflammatory content and engagement hooks. Our findings show that, like social media, web search also directed traffic to IRA-generated web content, and the resultant traffic patterns are distinct from those of social media. |
Security (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Don’t Count Me Out: On the Relevance of IP Address in the Tracking Ecosystem Vikas Mishra (INRIA), Pierre Laperdrix (CNRS / Univ. Lille / Inria), Antoine Vastel (Univ. Lille / Inria), Walter Rudametkin (Univ. Lille / Inria), Romain Rouvoy (Univ. Lille / Inria / IUF) and Martin Lopatka (Mozilla).
AbstractTargeted online advertising has become an inextricable part of the way Web content and applications are monetized. Historical online advertising consisted of simple ad-banners broadly shown to website visitors; the current evolution is a complex ecosystem responsible for tracking users to learn their habits and show them targeted, personalized ads. To protect users against tracking, several countermeasures have been proposed, ranging from browser extensions that leverage filter lists, to features natively integrated into popular browsers like Firefox and Brave. Nevertheless, few browsers offer protections against IP address-based tracking techniques. Notably, the most popular browsers, Chrome, Firefox, Safari and Edge, do not protect users against IP address tracking. Indeed, while IP addresses assigned to mobile devices tend to be reassigned more frequently, residential IP addresses remain stable for long periods of time, despite being dynamic. In this paper, we study the stability of the public IP addresses a user's device uses to communicate with our server. The public IP addresses we obtain could be those that are directly assigned to the users' devices or, more commonly, the users' devices are behind a gateway, such as a residential router, in which case our server obtains the IP addresses of the routers. Over time, the same device communicates with our server using a set of distinct IP addresses, but we find that devices reuse some of their previous IP addresses for long periods of time. We call this IP address retention, and the duration for which an IP address is retained by a device we call the IP address retention period. We use a dataset collected over a period of 111 days with 5,443 users and 41,566 unique IP addresses to study the retention period of IP addresses, and show that 87% of users have at least one IP address that was retained for more than a month. We also present variations in the retention period based on country, and we show that, even in cases where long-lived IP addresses do change, more often than not only the least significant octet changes. Apart from being stable, we also show that 93% of users can be uniquely identified based on a set of long-lived IP addresses, thus showing both uniqueness and stability over time. Furthermore, we also detect the presence of cycles of IP addresses, showing their potential to be used in inferring traits of the user's behaviour, as well as mobility traces. Finally, we discuss different defence solutions that users could take advantage of to protect their privacy. |
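The retention-period computation described here reduces to a simple group-by, sketched below on a toy log: for each (device, IP) pair, retention is the span between first and last sighting. Column names and the example records are assumptions.

```python
import pandas as pd

log = pd.DataFrame({
    "device": ["d1", "d1", "d1", "d2", "d2", "d1"],
    "ip":     ["1.2.3.4", "1.2.3.4", "5.6.7.8", "9.9.9.9", "9.9.9.9", "1.2.3.4"],
    "day":    pd.to_datetime(["2020-01-01", "2020-01-20", "2020-01-25",
                              "2020-01-02", "2020-03-15", "2020-02-28"]),
})

# Retention period of an IP for a device = last sighting - first sighting.
retention = (log.groupby(["device", "ip"])["day"]
                .agg(lambda s: s.max() - s.min())
                .rename("retention"))
print(retention)

# Share of devices with at least one IP retained for more than 30 days.
per_device = retention.groupby(level="device").max()
print((per_device > pd.Timedelta(days=30)).mean())
```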
11:00-11:30 |
How Do We Create a Fantabulous Password? Simon Woo (skku).
AbstractAlthough pronounceability can improve password memorability, most existing password generation approaches have not properly integrated the pronounceability of passwords in their designs. In this work, we demonstrate several shortfalls of current pronounceable password generation approaches, and then propose ProSemPass, a new method of generating passwords that are pronounceable and semantically meaningful. In our approach, users supply initial input words and our system improves the pronounceability and meaning of the user-provided words by automatically creating a portmanteau. To measure the strength of our approach, we use attacker models, where attackers have complete knowledge of our password generation algorithms. We measure strength in guess numbers and compare those with other existing password generation approaches. Using a large-scale IRB-approved user study with 1,563 Amazon MTurkers over 9 different conditions, our approach achieves a 30% higher recall than those from current pronounceable password approaches, and is stronger than the offline guessing attack limit. |
11:30-12:00 |
Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text Vinayshekhar Bannihatti Kumar (Carnegie Mellon University), Roger Iyengar (Carnegie Mellon University), Namita Nisal (University of Michigan), Yuanyuan Feng (Carnegie Mellon University), Hana Habib (Carnegie Mellon University), Peter Story (Carnegie Mellon University), Sushain Cherivirala (Carnegie Mellon University), Margaret Hagan (Stanford University), Lorrie Cranor (Carnegie Mellon University), Shomir Wilson (The Pennsylvania State University), Florian Schaub (University of Michigan) and Norman Sadeh (Carnegie Mellon University).
AbstractWebsite privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many policies bury these instructions deep in their text, and few users of the web have the time or skill necessary to discover them. We describe an effort to automate the detection of opt-out choices in privacy policy text and to present them to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable training classifiers to flexibly identify opt-outs in privacy policies. Our overall approach to extracting and classifying opt-out choices combines simple heuristics to identify a small set of commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our overall approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We discuss results of a user study to evaluate the usability of our browser extension. The paper also presents results of a large-scale analysis of opt-outs found in the text of several thousand of the most popular websites. |
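To illustrate the heuristic half of such a pipeline (the supervised classifier is not shown), a hyperlink can be flagged when its URL or anchor text matches common opt-out patterns. The patterns below are illustrative assumptions, not the rules used in the paper.

import re

# Illustrative patterns only; not the paper's actual heuristics.
OPT_OUT_URL_PATTERNS = [r"aboutads\.info/choices", r"optout\.networkadvertising\.org", r"youronlinechoices"]
OPT_OUT_TEXT_PATTERNS = [r"\bopt[- ]?out\b", r"\bunsubscribe\b", r"\bdo not sell\b"]

def looks_like_opt_out(href, anchor_text):
    # Flag a hyperlink if its URL or anchor text matches a known opt-out pattern.
    if any(re.search(p, href, re.I) for p in OPT_OUT_URL_PATTERNS):
        return True
    return any(re.search(p, anchor_text, re.I) for p in OPT_OUT_TEXT_PATTERNS)

print(looks_like_opt_out("https://optout.networkadvertising.org/", "click here"))  # True
print(looks_like_opt_out("https://example.com/blog", "Read our latest post"))      # False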
12:00-12:30 |
Dark Matter: Uncovering the DarkComet RAT Ecosystem Brown Farinholt (University of California San Diego), Mohammad Rezaeirad (George Mason University), Damon McCoy (NYU) and Kirill Levchenko (University of Illinois at Urbana-Champaign).
AbstractRemote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real-time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day, many of whose keystrokes have been captured, actions recorded, and webcams monitored during this time. Our methodologies for more precisely identifying campaigns and victims could potentially be useful for improving the efficiency and efficacy of victim cleanup efforts and prioritization of law enforcement investigations. |
Search (4)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning Wei Ye (Peking University), Rui Xie (Peking University), Jinglei Zhang (Peking University), Tianxiang Hu (Peking University), Xiaoyin Wang (University of Texas at San Antonio) and Shikun Zhang (Peking University).
AbstractCode summarization generates a brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have not yet been able to effectively leverage the intrinsic connections between the two tasks, as they train these tasks in a separate or pipelined manner, which means their performance cannot be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task. |
11:00-11:30 |
Metric Learning with Equidistant and Equidistributed Triplet-based Loss for Product Image Search Furong Xu (UCAS), Wei Zhang (Ant Financial Services Group), Yuan Cheng (Ant Financial Services Group) and Wei Chu (Ant Financial Services Group).
AbstractProduct image search in E-commerce systems is a challenging task, because of a huge number of product classes, low intra-class similarity and high inter-class similarity. Deep metric learning, based on paired distances independent of the number of classes, aims to minimize intra-class variances and inter-class similarity in feature embedding space. Most existing approaches strictly restrict the distance between samples with fixed values to distinguish different classes of samples. However, the distance of paired samples has various magnitudes during different training stages. Therefore, it is difficult to directly restrict absolute distances with fixed values. In this paper, we propose a novel Equidistant and Equidistributed Triplet-based (EET) loss function to adjust the distance between samples with relative distance constraints. By optimizing the loss function, the algorithm progressively maximizes intra-class similarity and inter-class variances. Specifically, 1) the equidistant loss pulls the matched samples closer by adaptively constraining two samples of the same class to be equally distant from another one of a different class in each triplet, 2) the equidistributed loss pushes the mismatched samples farther away by guiding different classes to be uniformly distributed while keeping intra-class structure compact in the embedding space. Extensive experimental results on product search benchmarks verify the improved performance of our method. We also achieve improvements on other retrieval datasets, which show superior generalization capacity of our method in image search. |
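One way to read the "equidistant" idea is as a penalty on the gap between the anchor-negative and positive-negative distances within a triplet. The PyTorch sketch below shows such a relative-distance term added to a standard triplet margin loss; it is an illustrative reading only, not the paper's exact EET formulation.

import torch
import torch.nn.functional as F

def equidistant_term(anchor, positive, negative):
    # Encourage the two same-class samples (anchor, positive) to be equally
    # distant from the different-class sample (negative) in embedding space.
    d_an = F.pairwise_distance(anchor, negative)
    d_pn = F.pairwise_distance(positive, negative)
    return (d_an - d_pn).pow(2).mean()

# Toy batch of 8 triplets with 32-dimensional embeddings.
a, p, n = (torch.randn(8, 32) for _ in range(3))
loss = F.triplet_margin_loss(a, p, n, margin=1.0) + equidistant_term(a, p, n)
print(loss.item())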
11:30-12:00 |
Adversarial Multimodal Representation Learning for Click-Through Rate Prediction Xiang Li (Alibaba Group), Chao Wang (Alibaba Group), Jiwei Tan (Alibaba Group), Xiaoyi Zeng (Alibaba Group), Dan Ou (Alibaba Group) and Bo Zheng (Alibaba Group).
AbstractFor better user experience and business effectiveness, Click-Through Rate (CTR) prediction has been one of the most important tasks in E-commerce. Although extensive CTR prediction models have been proposed, learning good representations of items from multimodal features is still less investigated, considering that an item in E-commerce usually contains multiple heterogeneous modalities. Previous works either concatenate the multiple modality features, which is equivalent to giving a fixed importance weight to each modality, or learn dynamic weights of different modalities for different items through techniques like attention mechanisms. However, a problem is that there usually exists common redundant information across multiple modalities. The dynamic weights of different modalities computed by using the redundant information may not correctly reflect the different importance of each modality. To address this, we explore the complementarity and redundancy of modalities by considering modality-specific and modality-invariant features differently. We propose a novel Multimodal Adversarial Representation Network (MARN) for the CTR prediction task. A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features. Then a multimodal adversarial network learns modality-invariant representations where a double-discriminators strategy is introduced. Finally, we achieve the multimodal item representations by combining both modality-specific and modality-invariant representations. We conduct extensive experiments on both public and industrial datasets, and the proposed method consistently achieves remarkable improvements over state-of-the-art methods. Moreover, the approach has been deployed in an operational E-commerce system and online A/B testing further demonstrates the effectiveness. |
12:00-12:30 |
Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation David Carmel (Amazon), Elad Haramaty (Amazon), Arnon Lazerson (Amazon) and Liane Lewin-Eytan (Amazon).
AbstractLearning a ranking model in product search involves satisfying many requirements such as maximizing the relevance of retrieved products with respect to the user query, as well as maximizing the purchase likelihood of these products. Multi-Objective Ranking Optimization (MORO) is the task of learning a ranking model from training examples while optimizing multiple objectives simultaneously. Label aggregation is a popular solution approach for multi-objective optimization, which reduces the problem into a single objective optimization problem, by aggregating the multiple labels of the training examples, related to the different objectives, to a single label. In this work we explore several label aggregation methods for MORO in product search. We show that a ranking model that is optimized for the reduced single objective problem, using a deterministic label aggregation approach, does not necessarily reach an optimal solution for the original multi-objective problem. We propose a novel stochastic label aggregation method which randomly selects a label per training example according to a given distribution over the labels. We provide a theoretical proof showing that stochastic label aggregation is superior to alternative aggregation approaches, in the sense that any optimal solution of the MORO problem can be generated by a proper parameter setting of the stochastic aggregation process. We experiment on three different datasets: two from the voice product search domain, and one publicly available dataset from the Web product search domain. We demonstrate empirically over these three datasets that MORO with stochastic label aggregation provides a family of ranking models that fully dominates the set of MORO models built using deterministic label aggregation. |
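The stochastic aggregation step itself is easy to illustrate: for every training example, sample which objective's label to use according to a fixed distribution over objectives. The sketch below shows only this sampling idea; the objective names and weights are hypothetical.

import random

def stochastic_label(labels, weights, rng=random):
    # labels:  objective name -> label value for one training example
    # weights: objective name -> probability of using that objective's label
    objectives = list(labels)
    chosen = rng.choices(objectives, weights=[weights[o] for o in objectives], k=1)[0]
    return labels[chosen]

# Hypothetical query-product example labeled for two objectives.
example = {"relevance": 1.0, "purchase": 0.0}
print(stochastic_label(example, {"relevance": 0.7, "purchase": 0.3}))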
Mobile (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Client Insourcing: Bringing Ops In-House for Seamless Re-engineering of Full-Stack JavaScript Applications Kijin An (Virginia Tech) and Eli Tilevich (Virginia Tech).
AbstractModern web applications are distributed across a browser-based client and a cloud-based server. The distributed nature of web applications complicates their inspection and evolution. Also, mature program analysis and transformation tools work only with centralized software. Inspired by business process re-engineering, in which remote operations can be insourced back in house to restructure and outsource anew, we bring an analogous approach to the re-engineering of full-stack JavaScript applications. We designed and implemented the Client Insourcing automatic refactoring to create a distributed application’s centralized variant to inspect, modify, and redistribute to meet new requirements. We demonstrate the utility and value of Client Insourcing to address changes in privacy, reliability, and performance requirements. By streamlining the required non-trivial program inspections and modifications, our approach can become a helpful aid in the re-engineering of web applications. |
11:00-11:30 |
When Recommender Systems Meet Fleet Management: Practical Study in Online Driver Repositioning System Zhe Xu (DiDi Chuxing), Chang Men (DiDi Chuxing), Peng Li (Didi Chuxing), Bicheng Jin (Didi Chuxing), Ge Li (Didi Chuxing), Yue Yang (Didi Chuxing), Chunyang Liu (Didi Chuxing), Ben Wang (Didi Chuxing) and Xiaohu Qie (Didi Chuxing).
AbstractE-hailing platforms have become an important component of public transportation in recent years. The supply (online drivers) and demand (passenger requests) are intrinsically imbalanced because of the pattern of human behavior, especially in time and locations such as peak hours and train stations. Hence, how to balance supply and demand is one of the key problems to satisfy passengers and drivers and increase social welfare. As an intuitive and effective approach to address this problem, driver repositioning has been employed by some real-world e-hailing platforms. In this paper, we describe a novel framework of driver repositioning system, which meets various requirements in practical situations, including robust driver experience satisfaction and multi-driver collaboration. We introduce an effective and user-friendly driver interaction design called "driver repositioning task". A novel modularized algorithm is developed to generate the repositioning tasks in real time. To our knowledge, this is the first industry-level application of driver repositioning. We evaluate the proposed method in real-world experiments, achieving a 2% improvement of driver income. Our framework has been fully deployed online and repositions millions of drivers every day. |
11:30-12:00 |
Hierarchically Structured Transformer Networks for Fine-Grained Spatial Event Prediction Chao Huang (JD Digits), Xian Wu (University of Notre Dame), Chuxu Zhang (University of Notre Dame) and Nitesh Chawla (University of Notre Dame).
AbstractSpatial event forecasting is challenging and crucial for urban sensing scenarios, and it benefits a wide spectrum of spatial-temporal mining applications, ranging from traffic management and public safety to environmental policy making. Although significant progress has been made in solving the spatial-temporal prediction problem, most existing deep learning based methods rely on a coarse-grained spatial setting, and their success largely depends on data sufficiency. In many real applications, predicting events at a fine-grained spatial resolution plays a critical role in providing high discernibility of spatial-temporal data distributions. However, in such cases, applying existing methods results in weak performance, since they may not capture high-quality spatial-temporal representations when training triple instances are highly imbalanced across locations and time. To tackle this challenge, we develop a hierarchically structured Spatial-Temporal Transformer network (STtrans) which leverages a main embedding space to capture the inter-dependencies across time and space for alleviating the data imbalance issue. In our STtrans framework, the first-stage transformer module discriminates the types of region and time-wise relations. To make the latent spatial-temporal representations reflective of the relational structure between categories, we further develop a cross-category fusion transformer network to endow STtrans with the capability to preserve the semantic signals in a fully dynamic manner. Finally, an adversarial training strategy is introduced to yield robust spatial-temporal learning under data imbalance. Extensive experiments on two real-world imbalanced spatial-temporal datasets from NYC and Chicago demonstrate the superiority of our method over various state-of-the-art baselines. |
12:00-12:30 |
Sub-Linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data Benjamin Coleman (Rice University) and Anshumali Shrivastava (Rice University).
AbstractKernel density estimates are important for many machine learning applications in the streaming setting. Unfortunately, they have a prohibitive memory and computation cost for large, high-dimensional datasets. Recent sampling algorithms for high-dimensional densities can reduce the computation cost but cannot operate online, while streaming algorithms currently suffer from the curse of dimensionality. Even though the problem is well-studied, all existing methods suffer from a high memory storage cost which is prohibitive for many internet of things (IoT) and mobile applications. We propose an online sketching algorithm to compress a set of N high dimensional vectors into a small array of integer counters. This sketch is sufficient to estimate the kernel density for a large class of useful kernels. Our method is practical to implement and comes with strong theoretical guarantees. Our sketches are mergeable, parallel and ideal for distributed computation settings. We evaluate our method on datasets with hundreds to thousands of dimensions and show that our sketch provides a 10x compression improvement over competing methods at a similar computational cost. We expect that our dataset compression method will enable a variety of applications in resource-constrained settings such as mobile and IoT. |
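The underlying data structure is easy to picture: a small 2-D array of integer counters indexed by locality-sensitive hashes, where adding a point increments one counter per row and a query averages the counters it lands in. The sketch below uses signed random projections and only illustrates this general idea, not the paper's exact construction or kernel.

import numpy as np

class RaceSketch:
    # Rough illustration of a RACE-style counter array for kernel density
    # estimation with a signed-random-projection LSH (details are assumptions).
    def __init__(self, dim, rows=50, bits=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((rows, bits, dim))  # one LSH per row
        self.counts = np.zeros((rows, 2 ** bits), dtype=np.int64)
        self.n = 0

    def _bucket(self, x):
        signs = (self.planes @ x > 0).astype(int)          # rows x bits sign pattern
        return signs @ (1 << np.arange(signs.shape[1]))    # pack bits into a bucket index

    def add(self, x):
        self.counts[np.arange(self.counts.shape[0]), self._bucket(x)] += 1
        self.n += 1

    def query(self, q):
        idx = self._bucket(q)
        return self.counts[np.arange(self.counts.shape[0]), idx].mean() / self.n

sketch = RaceSketch(dim=16)
for x in np.random.default_rng(1).standard_normal((1000, 16)):
    sketch.add(x)
print(sketch.query(np.zeros(16)))  # density estimate under the LSH collision kernel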
Web Mining-B (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Estimate the Implicit Likelihoods of GANs with Application to Anomaly Detection Shaogang Ren (Baidu Research, USA), Dingcheng Li (Baidu Research, USA), Zhixin Zhou (Baidu Research, USA) and Ping Li (Baidu Research, USA).
AbstractThe thriving of deep models and generative models provides approaches to model high dimensional distributions. The fact that GANs can generate amazing realistic images implies that they can learn the data manifolds well. In this manuscript, we propose an approach to estimate the implicit likelihoods of GAN models. A stable regularized inverse function of the generator can be learned with the help of a variance network of the generator. The local variance of the sample distribution can be approximated by the normalized distance in the latent space. Simulation studies, anomaly detection, and likelihood testing on real-world datasets validate the proposed method, which outperforms some baseline methods in these tasks. |
11:00-11:30 |
Real-Time Clustering for Large Sparse Online Visitor Data Gromit Yeuk-Yin Chan (New York University), Fan Du (Adobe), Ryan Rossi (Adobe), Anup Rao (Adobe), Eunyee Koh (Adobe), Claudio Silva (New York University) and Juliana Freire (NYU Poly).
AbstractOnline visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To explore clusters in different hierarchies and discover useful customer segments, marketers often need to explore different splits of the data. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which current clustering techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. The algorithm first pre-processes the input points to compute annotations for cluster assignment. While the cluster assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for any moderately sized data. To address this challenge, we leverage a weighted Jaccard similarity metric and propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation of this approach on Spark, which effectively deals with data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20x speedup in the end-to-end clustering pipeline, and (3) can maintain computations within a small memory footprint. Our scalable clustering algorithm enables data scientists and marketers to interactively explore and discover customer segments from millions of online visitor records in real time. |
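For intuition on the MinHash building block, here is the plain, unweighted form (the paper uses a weighted Jaccard generalization with LSH on top, which is not reproduced here).

import random

def minhash_signature(items, num_hashes=128, seed=0):
    # One salted hash per signature position; the minimum over the set's items
    # is the MinHash value for that position.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, item)) for item in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"page:home", "page:pricing", "page:blog"}
b = {"page:home", "page:pricing", "page:careers"}
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # roughly 0.5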
11:30-11:45 |
What Sparks Joy: The AffectVec Emotion Database Shahab Raji (Rutgers University) and Gerard de Melo (Rutgers University).
AbstractAffective analysis of textual data is instrumental in understanding human communication in the modern era of social media. A number of semantic resources have been proposed as attempts to capture the emotional associations of words. In this work, we show that we can obtain a resource that goes beyond the common binary association scores for emotion classification by using elegant techniques that draw on lexical associations as well as existing emotion lexicons. In a series of statistical and machine learning experiments, we show that these simple techniques outperform previous state-of-the-art approaches by substantial margins. |
11:45-12:00 |
Visual Concept Naming: Discovering Well-Recognized Textual Expressions of Visual Concepts Masayasu Muraoka (IBM Research - Tokyo), Tetsuya Nasukawa (IBM Research - Tokyo), Rudy Raymond (IBM Research - Tokyo) and Bishwaranjan Bhattacharjee (IBM Thomas J. Watson Research Center).
AbstractSignificant advances in deep learning for Computer Vision have enabled object recognition systems to recognize a large number of visual concepts in images at almost the same level as humans. Given an image, such recognition systems output labels for representing visual concepts expressed in text. However, a visual concept may be represented by various expressions, not only by the typical expressions used for labels (e.g., car, automobile, or vehicle) but also by other expressions including casual expressions such as auto and specific expressions (e.g., BMW, Jeep, or Suzuki) in real-world textual data, such as SNS data. The expressions can also be expressed in various languages (e.g., Wagen, macchina, and mobil meaning car in German, Italian, and Indonesian, respectively). Yet, an object recognition system does not deal with this association because the system does not consider textual data. Associating textual expressions with the corresponding visual objects is essential for bridging the gap between vision and language because they are tightly linked. To this end, we propose a task called Visual Concept Naming to associate diverse textual expressions written by humans who have different background knowledge in different languages. The goal of the task is to extract textual expressions, i.e., names of visual concepts, from real-world multimodal data, consisting of textual data combined with visual data. To tackle the task, we create a dataset consisting of 3.4 million tweets in total in three languages. We also propose a method for extracting candidate names of visual concepts and validating them by exploiting Web-based knowledge obtained through image search. To demonstrate the capability of our method, we conduct an experiment with the dataset we create and evaluate names obtained by our method through crowdsourcing, where we establish an evaluation method to verify the names. The experimental results indicate that the proposed method can identify a wide variety of names of visual concepts. The names we obtained also show interesting insights regarding languages and cultures. |
Semantics (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
10:30-11:00 |
Multiple Knowledge Syncretic Transformer for Natural Dialogue Generation Xiangyu Zhao (College of Intelligence and Computing, Tianjin University), Longbiao Wang (College of Intelligence and Computing, Tianjin University), Ruifang He (College of Intelligence and Computing, Tianjin University), Ting Yang (College of Intelligence and Computing, Tianjin University), Jinxin Chang (College of Intelligence and Computing, Tianjin University) and Ruifang Wang (College of Intelligence and Computing, Tianjin University).
AbstractKnowledge is essential for intelligent conversation systems to generate informative responses. This knowledge comprises a wide range of diverse modalities such as knowledge graphs (KGs), grounding documents and conversation topics. However, limited abilities in understanding language and utilizing different types of knowledge still challenge existing approaches. Some researchers try to enhance models' language comprehension ability by employing pre-trained language models, but they neglect the importance of external knowledge in specific tasks. In this paper, we propose a novel universal transformer-based architecture for dialogue systems, the Multiple Knowledge Syncretic Transformer (MKST), which fuses multi-knowledge in open-domain conversation. Firstly, the model is pre-trained on a large-scale corpus to learn commonsense knowledge. Then during fine-tuning, we divide the types of knowledge into two specific categories that are handled in different ways by our model. While the encoder is responsible for encoding dialogue contexts together with multifarious knowledge, the decoder with a knowledge-aware mechanism attentively reads the fusion of multi-knowledge to promote better generation. This is the first attempt to fuse multi-knowledge in one conversation model. The experimental results demonstrate that our model achieves significant improvements over state-of-the-art baselines on knowledge-driven dialogue generation tasks. Meanwhile, our new benchmark could facilitate further study in this research area. |
11:00-11:30 |
Novel Entity Discovery from Web Tables Shuo Zhang (University of Stavanger), Edgar Meij (Bloomberg L.P.), Krisztian Balog (University of Stavanger) and Ridho Reinanda (Bloomberg LP).
AbstractWhen working with any sort of knowledge base (KB) one has to make sure it is as complete and also as up-to-date as possible. Both tasks are non-trivial as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such they require a significant amount of labor. Tables on the Web on the other hand are abundant and have the distinct potential to assist with these tasks. In particular, we can leverage the rich content in such tables to discover new entities, properties, and relationships. This paper addresses two main tasks in this context: table-to-KB matching and novel entity discovery. The first task aims to infer table semantics by linking table cells and heading columns to elements of a KB. We propose a novel, feature-based method for this task and on two public test collections, we demonstrate substantial improvements over the state-of-the-art in terms of precision whilst also improving recall. We further apply our method to annotate a corpus of 3M tables, which will be released as a public resource. The second task is novel and targets the discovery of new entities and relationships, where we differentiate different types including in-KB ("known") and out-of-KB ("novel") information. When evaluated using three purpose-built test collections, we find that our proposed approaches obtain a marked improvement on precision whilst keeping recall stable. |
11:30-12:00 |
Stable Model Semantics for Recursive SHACL Medina Andresel (Vienna University of Technology), Julien Corman (Free University of Bozen-Bolzano), Magdalena Ortiz (Vienna University of Technology), Juan L. Reutter (Pontificia Universidad Católica), Ognjen Savkovic (Free University of Bolzano) and Mantas Simkus (Vienna University of Technology).
AbstractSHACL (Shapes Constraint Language) is a W3C recommendation for validating graph-based data against a set of conditions. Among the interesting features of SHACL is the ability to define recursive shapes, to state, for example, that children of persons must be persons. Although the recommendation left open the semantics of recursive shapes, there have already been proposals to extend the official semantics to the case of recursion. However, they are based on the idea of possibility: a graph will be valid against a schema as long as one can find a way to assign shapes to nodes in such a way that all constraints are satisfied. However, this definition is not constructive, as it does not give any guidelines on how one is to obtain such an assignment, and it may lead to unfounded assignments, where the only reason to state that a node has a certain shape is because it serves to validate the graph. In this paper we propose a stricter semantics for SHACL that is based on the idea of stable models in logic programming: instead of allowing any possible assignment, we only allow those where each shape assignment is justified by a given constraint. We further exploit the relation between our semantics and logic programming, and show that the validation problem for a graph and a SHACL schema can be encoded into an ASP program. This also gives us a constructive semantics for a special type of SHACL schemas that are based on the idea of stratified negation. Finally, we also extend our semantics in the context of partial assignments, which have been used to define a more relaxed notion of validation that is tolerant to certain faults in the schema. In this case, we show that the stable semantics with partial assignments can be captured by the same ASP translation, this time working with well-founded ASP models. |
12:00-12:30 |
NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network Jingbo Shang (University of Illinois at Urbana-Champaign), Xinyang Zhang (University of Illinois at Urbana-Champaign), Liyuan Liu (University of Illinois at Urbana-Champaign), Sha Li (University of Illinois at Urbana-Champaign) and Jiawei Han (University of Illinois at Urbana-Champaign).
AbstractThe automated construction of topic taxonomies can benefit numerous applications, including web search, recommendation, and knowledge discovery. One of the major advantages of automatic taxonomy construction is the ability to capture corpus-specific information and adapt to different scenarios. To better reflect the characteristics of a corpus, we take the meta-data of documents into consideration and view the corpus as a text-rich network. In this paper, we propose NetTaxo, a novel automatic topic taxonomy construction framework, which goes beyond the existing paradigm and allows text data to collaborate with network structure. Specifically, we learn term embeddings from both text and network as contexts. Network motifs are adopted to capture appropriate network contexts. We conduct an instance-level selection for motifs, which further refines term embedding according to the granularity and semantics of each taxonomy node. Clustering is then applied to obtain sub-topics under a taxonomy node. Extensive experiments on two real-world datasets demonstrate the superiority of our method over the state-of-the-art, and further verify the effectiveness and importance of instance-level motif selection. |
Research Tracks (8)
Web Mining-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Complex Factoid Question Answering with a Free-Text Knowledge Graph Chen Zhao (University of Maryland), Chenyan Xiong (Microsoft), Xin Qian (University of Maryland) and Jordan Boyd-Graber (University of Maryland).
AbstractWe introduce DELFT, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. DELFT builds a free-text knowledge graph from Wikipedia, with entities as nodes, and sentences in which entities co-occur as edges. For each question, DELFT finds the subgraph linking question entity nodes to candidates using text sentences as edges, yielding a dense and high-coverage semantic graph. A novel graph neural network reasons over the free-text graph, combining evidence on the nodes via information along edge sentences, to select a final answer. Experiments on three question answering datasets show DELFT can answer entity-rich questions better than machine reading based models, BERT-based answer ranking and memory networks, by large margins. DELFT's strong advantage comes from both the high coverage of its free-text knowledge graph (more than double the coverage of DBpedia relations) and the novel graph neural network model, which conducts accurate structural reasoning on the rich but also noisy free-text evidence. |
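The graph construction itself can be pictured in a few lines: entities become nodes, and every sentence in which two entities co-occur becomes an edge carrying that sentence as evidence. The toy sketch below uses exact string matching in place of a real entity linker and a two-sentence corpus in place of Wikipedia.

from collections import defaultdict

sentences = [
    "Marie Curie won the Nobel Prize in Physics with Pierre Curie.",
    "Pierre Curie was a professor at the University of Paris.",
]
entities = ["Marie Curie", "Pierre Curie", "Nobel Prize in Physics", "University of Paris"]

# Edge (entity_a, entity_b) -> list of evidence sentences in which both appear.
graph = defaultdict(list)
for sent in sentences:
    present = [e for e in entities if e in sent]
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            graph[(present[i], present[j])].append(sent)

for edge, evidence in graph.items():
    print(edge, "->", evidence)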
14:00-14:30 |
Abstractive Snippet Generation Wei-Fan Chen (Paderborn University), Shahbaz Syed (Leipzig University), Benno Stein (Bauhaus-Universität Weimar), Matthias Hagen (Martin-Luther-Universität Halle-Wittenberg) and Martin Potthast (Leipzig University).
AbstractAn abstractive snippet is an originally created piece of text to summarize a web page on a search engine results page. Compared to conventional extractive snippets, which are generated by extracting literal phrases and sentences from a web page, abstractive snippets circumvent copyright issues; even more interesting is the fact that they open the door for personalization. Abstractive snippets have been evaluated as equally powerful in terms of user acceptance and expressiveness, but the key question remains: can abstractive snippets be automatically generated with sufficient quality? This paper introduces a competitive approach to abstractive snippet generation, supported by a thorough evaluation. We identify new sources that can be exploited via distant supervision to serve as ground truth data for this kind of summarization task: web directories (a hierarchical list of websites with descriptions organized by subject) and anchor contexts (the sentences around hyperlinks). Regarding the former, we utilize the DMOZ Open Directory Project, which is one of the largest human-edited directories on the web. Regarding the latter, we mine the entire ClueWeb09 and ClueWeb12 corpora. Altogether, we compile more than 3 million triples of the form as training examples, where the anchor context is used in lieu of a genuine and query-biased abstractive snippet for the target web page. Utilizing the new dataset, we propose a bidirectional generation model to generate query-biased snippets. We assess the quality of both our training data and the generated abstractive snippets with standard measures, with crowdsourcing, and against two existing abstractive snippet generation approaches. The evaluation shows that our novel data sources along with the proposed bidirectional model are suited to produce usable, query-biased, and abstractive snippets while minimizing text reuse. |
14:30-15:00 |
Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus Bang Liu (University of Alberta), Haojie Wei (Tencent), Di Niu (University of Alberta), Haolan Chen (Tencent) and Yancheng He (Tencent).
AbstractLearning to ask questions is critical to both human and machine intelligence. It helps knowledge acquisition, improves machine reading comprehension and question-answering tasks, and helps to continue a conversation in chatbots. Existing answer-aware question generation models are ineffective at generating a large number of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping problem. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), a novel system aimed at automatically generating diverse and high-quality question-answer pairs from unlabeled text corpora at scale by mimicking the way a human asks questions. Our system consists of: i) an information extractor, which samples multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions about a passage, utilizing the extracted assistive information as input; and iii) a neural quality controller, which filters out low-quality generated data based on text entailment. We compare our question generation models with existing approaches and perform pilot user studies to evaluate the quality of the generated question-answer pairs. The evaluation results show that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while being scalable at the same time. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences in Wikipedia. |
15:00-15:15 |
Distant Supervision for Multi-Stage Fine-Tuning in Retrieval-Based Question Answering Yuqing Xie (University of Waterloo), Wei Yang (RSVP.ai), Luchen Tan (RSVP.ai), Kun Xiong (RSVP.ai), Nicholas Jing Yuan (HUAWEI Cloud & AI), Baoxing Huai (HUAWEI Cloud & AI), Ming Li (University of Waterloo) and Jimmy Lin (University of Waterloo).
AbstractWe tackle the problem of question answering directly on a large document collection, combining simple "bag of words" passage retrieval with a BERT-based reader for extracting answer spans. In the context of this architecture, we present a data augmentation technique using distant supervision to automatically annotate paragraphs as either positive or negative examples to supplement existing training data, which are then used together to fine-tune BERT. We explore a number of details that are critical to achieving high accuracy: the proper sequencing of different datasets during fine-tuning, the balance between "difficult" vs. "easy" examples, and different approaches to gathering negative examples. Experimental results show that, with the appropriate settings, we can achieve large gains in effectiveness on two English and two Chinese QA datasets. We are able to achieve state-of-the-art results without any modeling advances, which once again affirms the cliché "there's no data like more data". |
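A minimal sketch of the distant-supervision labeling step described above: paragraphs that contain the known answer string become positive examples and the rest become negatives (the paper's strategies for sequencing datasets and picking "difficult" negatives are not shown).

def label_paragraphs(question, answer, paragraphs):
    # (question, paragraph, label) triples: 1 if the paragraph contains the answer.
    return [(question, p, 1 if answer.lower() in p.lower() else 0) for p in paragraphs]

paragraphs = [
    "The Amazon river flows through Brazil, Peru and Colombia.",
    "The Nile is often cited as the longest river in Africa.",
]
for example in label_paragraphs("Which river flows through Brazil?", "Amazon", paragraphs):
    print(example)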
15:15-15:30 |
TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis Zilong Wang (Peking University), Zhaohong Wan (Peking University) and Xiaojun Wan (Peking University).
AbstractMultimodal sentiment analysis is an important research area that predicts speaker's sentiment tendency through features extracted from textual, visual and acoustic modalities. The central challenge is the fusion method of the multimodal information. A variety of fusion methods have been proposed, but few of them adopt end-to-end translation models to mine the subtle correlation between modalities. Enlightened by recent success of Transformer in the area of machine translation, we propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis. We assume that translation between modalities contributes to a better joint representation of speaker's utterance. With Transformer, the learned features embody the information both from the source modality and the target modality. We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP. The experiments show that our proposed method achieves the state-of-the-art performance. |
Social Network-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Edge Formation In Social Networks To Nurture Content Creators Chun Lo (LinkedIn), Emilie de Longueau (LinkedIn), Ankan Saha (LinkedIn), Shaunak Chatterjee (LinkedIn) and Ye Tu (LinkedIn).
AbstractSocial networks act as major content marketplaces where creators and consumers come together to share and consume various kinds of content. Popular content ranking applications (e.g., newsfeed, moments, notifications, ads) and edge recommendations (e.g., connect to members, follow celebrities or groups or hashtags) on such platforms aim at improving the consumer experience. In this work, we focus on the creator experience and specifically on improving edge recommendations to better serve creators in such ecosystems. The audience and reach of creators (individuals, celebrities, publishers and companies) are critically shaped by these edge recommendation products. Hence, incorporating creator utility in such recommendations can have a material impact on their success, and in turn, on the marketplace. Our proposed solution involves edge-level creator utility estimation (for currently unformed edges) and an experiment design that accounts for the network effect. We also discuss the implementation of our proposal at scale on LinkedIn, a professional network with 645M+ members, and report our findings. |
14:00-14:30 |
Friend or Faux: Graph-Based Early Detection of Fake Accounts on Social Networks Adam Breuer (Harvard University), Roee Eilat (Facebook) and Udi Weinsberg (Facebook).
AbstractIn this paper, we study the problem of early detection of fake user accounts on social networks based solely on their network connectivity with other users. Removing such accounts is a core task for maintaining the integrity of social networks, and early detection helps to reduce the harm that such accounts inflict. However, new fake accounts are notoriously difficult to detect via graph-based algorithms, as their small number of connections are unlikely to reflect a significant structural difference from those of new real accounts. We present the SybilEdge algorithm, which determines whether a new user is a fake ("sybil") account by aggregating over (I) her choices of friend request targets and (II) these targets' respective responses. SybilEdge performs this aggregation giving more weight to a user's choices of targets to the extent these targets are preferred by other fake versus real users, and also to the extent these targets respond differently to fake versus real users. We show that this algorithm rapidly detects new fake accounts at scale on the Facebook network, and also that it performs well compared to state-of-the-art alternatives on simulated networks designed to capture a variety of sybil attack strategies. To our knowledge, this is the first time a graph-based algorithm has been shown to achieve high accuracy (AUC > 0.9) on new users who have only sent a small number of friend requests. |
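One way to picture the kind of aggregation the abstract describes is a log-likelihood-ratio sum over a new user's friend requests, where each request contributes evidence based on how differently fake and real users choose that target and how differently the target responds to them. The per-target rates below are hypothetical, and this sketch is only illustrative, not the actual SybilEdge estimator.

import math

def sybil_score(requests, target_stats):
    # requests: list of (target, was_accepted); target_stats: hypothetical per-target rates.
    score = 0.0
    for target, accepted in requests:
        s = target_stats[target]
        score += math.log(s["chosen_by_fake"] / s["chosen_by_real"])
        p_fake, p_real = s["accept_from_fake"], s["accept_from_real"]
        score += math.log(p_fake / p_real) if accepted else math.log((1 - p_fake) / (1 - p_real))
    return score  # larger values mean the new account looks more like a fake

stats = {"alice": {"chosen_by_fake": 0.02, "chosen_by_real": 0.05,
                   "accept_from_fake": 0.1, "accept_from_real": 0.6}}
print(sybil_score([("alice", False)], stats))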
14:30-15:00 |
Searching for polarization in signed graphs: a local spectral approach Han Xiao (Aalto University), Bruno Ordozgoiti (Aalto University) and Aristides Gionis (Aalto University).
AbstractSigned graphs have been used to model interactions in social networks, which can be either positive (friendly) or negative (antagonistic). The model has been used to study polarization and other related phenomena in social networks, which can be harmful to the process of democratic deliberation in our society. An interesting and challenging task in this application domain is to detect polarized communities in signed graphs. A number of different methods have been proposed for this task [chu2016finding, lo2013mining]; however, existing methods aim at finding globally optimal solutions. Instead, in this paper we are interested in finding polarized communities that are related to a small set of seed nodes provided as input. Seed nodes may consist of two sets, which constitute the two sides of a polarized structure. In this paper we formulate the problem of finding local polarized communities in signed graphs as a locally-biased eigen-problem [mahoney2012local]. By viewing the eigenvector associated with the smallest eigenvalue of the Laplacian matrix as the solution of a constrained optimization problem, we are able to incorporate the local information as an additional constraint. In addition, we show that the locally-biased vector can be used to find communities with an approximation guarantee with respect to a local analogue of the Cheeger constant on signed graphs [atay2014cheeger]. By exploiting the sparsity in the input graph, an indicator vector for the polarized communities can be found in time linear in the graph size. Our experiments on real-world networks validate the proposed algorithm and demonstrate its usefulness at finding local structures in this semi-supervised manner. |
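For readers unfamiliar with the spectral machinery, the unconstrained version of the problem already fits in a few lines: build the signed Laplacian and inspect the eigenvector of its smallest eigenvalue, whose sign pattern hints at the two sides of a polarized structure. The locality constraint that makes the method local is not shown, and the toy matrix is made up.

import numpy as np

# Toy signed adjacency matrix: +1 friendly, -1 antagonistic, 0 no edge.
A = np.array([[0,  1, -1,  0],
              [1,  0, -1,  0],
              [-1, -1,  0,  1],
              [0,  0,  1,  0]], dtype=float)

# Signed Laplacian L = D - A, with D the diagonal matrix of absolute degrees.
D = np.diag(np.abs(A).sum(axis=1))
L = D - A

# Eigenvector of the smallest eigenvalue; its signs suggest the two sides.
eigenvalues, eigenvectors = np.linalg.eigh(L)
print(eigenvalues[0], np.sign(eigenvectors[:, 0]))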
15:00-15:15 |
Just SLaQ When You Approximate: Accurate Spectral Distances for Web-Scale Graphs Anton Tsitsulin (Google), Marina Munkhoeva (Google) and Bryan Perozzi (Google AI).
AbstractGraph comparison is a fundamental operation in many tasks in data mining and information retrieval. Because of the combinatorial nature of graphs, it is hard to balance the expressiveness of the similarity measure and its scalability. Spectral graph analysis provides quintessential tools for mining information from networks, as the spectrum of a graph reflects its multi-scale structure and, thus, is a well-suited foundation for reasoning about differences between graphs. However, computing the full spectrum of a graph is computationally prohibitive, and spectral methods for graph comparison therefore must rely on rough approximation techniques with few error guarantees. In addition to approximation error, scalability is a bottleneck for most graph comparison methods. Few distance measures between unaligned graphs can handle graphs with more than ten thousand nodes, and those that can sacrifice approximation guarantees and accuracy for the sake of scalability. In this work, we propose SLaQ, an efficient and effective approximation technique for computing two distances between graphs with millions of nodes and billions of edges. We derive the corresponding error bounds and demonstrate that accurate computation is possible in time linear in the number of graph edges. In a thorough experimental evaluation we show that SLaQ outperforms existing approximation methods, sometimes by several orders of magnitude in accuracy, while maintaining comparable performance and allowing accurate comparison of million-scale graphs in a matter of minutes on a single machine. |
15:15-15:30 |
P-Simrank: Extending Simrank to Scale-Free Bipartite Networks Prasenjit Dey (Microsoft), Kunal Goel (Microsoft) and Rahul Agrawal (Microsoft).
AbstractThe measure of similarity between nodes in a graph is a useful tool in many areas of computer science. SimRank, proposed by Jeh and Widom, is a classic measure of the similarity of nodes in a graph that has both theoretical and intuitive properties and has been extensively studied and used in many applications such as query rewriting, link prediction, collaborative filtering and so on. Existing works based on Simrank primarily focus on preserving the microscopic structure, such as the second and third order proximity of the vertices, while the macroscopic scale-free property is largely ignored. The scale-free property is a critical property of real-world web graphs, where vertex degrees follow a heavy-tailed distribution. In this paper, we introduce P-Simrank, which extends the idea of Simrank to scale-free bipartite networks. To study the efficacy of the proposed solution on a real-world problem, we tested it on the well-known query-rewriting problem in bipartite click graphs, similar to Simrank++, which acts as our baseline. We show that Simrank++ produces sub-optimal similarity scores in the case of bipartite graphs where the degree distribution of vertices follows a power law. We also show how P-Simrank can be optimized for real-world large graphs. Finally, we experimentally evaluate the P-Simrank algorithm against Simrank++, using actual click graphs obtained from Bing, and show that P-Simrank outperforms Simrank++ in a variety of metrics. |
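As background, plain SimRank can be computed by a simple fixed-point iteration over in-neighbor similarities, as in the sketch below; the modifications P-Simrank makes for scale-free bipartite graphs are not reproduced here.

import numpy as np

def simrank(adj, c=0.8, iterations=10):
    # adj[u, v] = 1 means a directed edge u -> v; returns the similarity matrix.
    n = adj.shape[0]
    sim = np.eye(n)
    in_neighbors = [np.nonzero(adj[:, v])[0] for v in range(n)]
    for _ in range(iterations):
        new = np.eye(n)
        for a in range(n):
            for b in range(n):
                na, nb = in_neighbors[a], in_neighbors[b]
                if a == b or len(na) == 0 or len(nb) == 0:
                    continue
                new[a, b] = c * sim[np.ix_(na, nb)].sum() / (len(na) * len(nb))
        sim = new
    return sim

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]])
print(np.round(simrank(adj), 3))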
User Modeling-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Discovering Strategic Behaviors for Collaborative Content-Production in Social Networks Yuxin Xiao (University of Illinois at Urbana-Champaign), Adit Krishnan (University of Illinois at Urbana-Champaign) and Hari Sundaram (University of Illinois at Urbana-Champaign).
AbstractSome online social networks provide an explicit mechanism to allocate rewards based on users' actions, while the mechanism is more opaque in other types of social networks. Nonetheless, there are always individuals who are able to obtain higher reputations than their peers in those networks. An intuitive yet important question to ask is whether they employ strategic behaviors to become influential. It might appear that the influencers in those networks "have gamed the system" and the rest have not figured out the mechanism. However, it remains difficult to draw conclusions on the rationality of those winning individuals due to factors like the combinatorial strategy space, the inability to determine the payoffs and the resource limitation faced by individuals. The challenging nature of this question draws long-term attention from both the theory and data mining communities. Therefore, in this paper, we are motivated to investigate whether resource-limited individuals are able to discover strategic behaviors associated with high payoffs when producing contents in social networks. To properly tackle this question, we propose a novel framework of Dynamic Dual Attention Networks (DDAN) which models individuals' content production strategies under the influence of social interactions involved in the process. Extensive experimental results illustrate the model's effectiveness in user behavior modeling. Furthermore, we make three strong empirical findings: first, different strategies give rise to different payoffs; second, the best performing individuals exhibit stability in their preferential orders over strategies, which indicates the emergence of strategic behaviors; third, the stability of preference is correlated with high payoffs. To the best of our knowledge, this is the first attempt to formally identify strategic behaviors from empirical data. |
14:00-14:30 |
Learning from Cross-Modal Behavior Dynamics with Graph-Regularized Neural Contextual Bandit Xian Wu (university of notre dame), Suleyman Cetintas (Yahoo Research), Deguang Kong (Google), Miao Lu (Yahoo Research), Jian Yang (Yahoo Research) and Nitesh Chawla (University of Notre Dame).
AbstractContextual multi-armed bandit algorithms have received significant attention in modeling users' preferences for online personalized recommender systems in a timely manner. While significant progress has been made along this direction, a few major challenges have not been well addressed yet: (i) a vast majority of the literature is based on linear models that cannot capture complex non-linear inter-dependencies of user-item interactions; (ii) existing literature mainly ignores the latent relations between users and non-recommended items, and hence may not properly reflect users' preferences in the real world; (iii) current solutions are mainly based on historical data and are prone to cold-start problems for new users who have no interaction history. To address the above challenges, we develop a Graph Regularized Cross-modal (GRC) learning model, a general framework to exploit transferable knowledge learned from multi-modal user-item interactions as well as the external features of users and items in online personalized recommendations. In particular, the GRC framework seamlessly combines the linearity of the contextual bandit framework and the non-linearity of neural networks in modeling the complex inherent structure of user-item interactions. We further augment GRC with the cooperation of the metric learning technique and a graph-constrained embedding module, to map the units from different dimensions (temporal, social and semantic) into the same latent space. An extensive set of experiments conducted on two benchmark datasets as well as a large scale proprietary dataset from a major search engine demonstrates the power of the proposed GRC model in effectively capturing users' dynamic preferences under different settings by outperforming all baselines by a large margin. |
14:30-15:00 |
Modeling Users’ Behavior Sequences with Hierarchical Explainable Network for Cross-domain Fraud Detection Yongchun Zhu (Institute of Computing Technology, Chinese Academy of Sciences), Dongbo Xi (Institute of Computing Technology, Chinese Academy of Sciences), Bowen Song (Ant Financial Services Group), Fuzhen Zhuang (Institute of Computing Technology, Chinese Academy of Sciences), Shuai Chen (Ant Financial Services Group), Xi Gu (Ant Financial Services Group) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
AbstractWith the explosive growth of the e-commerce industry, detecting online transaction fraud in real-world applications has become increasingly important to the development of e-commerce platforms. The sequential behavior history of users provides useful information in differentiating fraudulent payments from regular ones. Recently, some approaches have been proposed to solve this sequence-based fraud detection problem. However, these methods usually suffer from two problems: the prediction results are difficult to explain and the exploitation of the internal information of behaviors is insufficient. To tackle the above two problems, we propose a Hierarchical Explainable Network (HEN) to model users' behavior sequences, which could not only improve the performance of fraud detection but also make the inference process interpretable. Meanwhile, as e-commerce business expands to new domains, e.g. new countries or new markets, one major problem for modeling user behavior in fraud detection systems is the limitation of data collection, e.g., very few data/labels available. Thus, in this paper, we further propose a transfer framework to tackle the cross-domain fraud detection problem, which aims to transfer knowledge from existing domains (source domains) with enough and mature data to improve the performance in the new domain (target domain). Our proposed method is a general transfer framework that could not only be applied upon HEN but also various existing models in the Embedding & MLP paradigm. By utilizing data from a world-leading cross-border e-commerce platform, we conduct extensive experiments in detecting card-stolen transaction frauds in different countries to demonstrate the superior performance of HEN. Besides, based on 90 transfer task experiments, we also demonstrate that our transfer framework could not only contribute to the cross-domain fraud detection task with HEN, but also be universal and expandable for various existing models. Moreover, HEN and the transfer framework form three-level attention which greatly increases the explainability of the detection results. |
15:00-15:15 |
Beyond Clicks: Modeling Multi-Relational Item Graph for Session-Based Target Behavior Prediction Wen Wang (East China Normal University), Wei Zhang (East China Normal University), Shukai Liu (Search Product Center, WeChat Search Application Department, Tencent), Qi Liu (Search Product Center, WeChat Search Application Department, Tencent), Bo Zhang (Search Product Center, WeChat Search Application Department, Tencent), Leyu Lin (Search Product Center, WeChat Search Application Department, Tencent) and Hongyuan Zha (Georgia Institute of Technology).
AbstractSession-based target behavior prediction is the task of predicting the next item to be interacted with in the current anonymous behavior sequence under a specific type of user behavior (e.g., clicking an item). Although existing methods for session-based behavior prediction leverage powerful representation learning approaches to encode items' sequential relevance in a low-dimensional space, they suffer from several limitations. Firstly, they focus on only using the same type of user behavior as input for prediction, and ignore the potential of leveraging other types of behavior as auxiliary information, which is particularly crucial when the target behavior is sparse but important (e.g., buying or sharing an item). Secondly, item-to-item relations in different sequences are modeled separately and locally, and they lack a principled way to globally encode these relations more effectively. To overcome these limitations, we propose a novel Multi-relational Graph Neural Network model for Session-based target behavior Prediction, namely MGNN-SPred for short. Specifically, we build a Multi-Relational Item Graph (MRIG) based on all behavior sequences from all sessions, involving target and auxiliary behavior types. MGNN-SPred learns global item-to-item relations based on MRIG and further obtains local representations for current target and auxiliary behavior sequences, respectively. In the end, MGNN-SPred leverages a gating mechanism to adaptively fuse different types of local representations for predicting the next item interacted with under the target behavior. The extensive experiments on two real-world datasets demonstrate the superiority of our proposed model by comparing with state-of-the-art session-based prediction methods, validating the benefits of leveraging auxiliary behavior and learning item-to-item relations over MRIG. |
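To make the notion of the multi-relational item graph concrete, here is a minimal sketch of assembling such a graph from behavior sequences with networkx. The session format and relation labels are illustrative assumptions, not the paper's code; MGNN-SPred additionally learns representations over this graph, which is not shown.

```python
import networkx as nx

def build_mrig(sessions):
    """Build a multi-relational item graph from behavior sequences.

    sessions: list of (behavior_type, [item, item, ...]) pairs, e.g.
              ("click", [3, 7, 9]) or ("buy", [7, 2]) -- assumed format.
    Consecutive items in a sequence are connected by a directed edge
    whose 'relation' attribute records the behavior type.
    """
    graph = nx.MultiDiGraph()
    for behavior, items in sessions:
        for src, dst in zip(items, items[1:]):
            if src != dst:
                graph.add_edge(src, dst, relation=behavior)
    return graph

mrig = build_mrig([("click", [3, 7, 9]), ("buy", [7, 2])])
print(mrig.number_of_edges())  # 3 edges: two click relations, one buy relation
```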
15:15-15:30 |
To be Tough or Soft: Measuring the Impact of Counter-Ad-blocking Strategies on User Engagement Shuai Zhao (New Jersey Institute of Technology), Achir Kalra (Forbes Media), Cristian Borcea (New Jersey Institute of Technology) and Yi Chen (New Jersey Institute of Technology).
AbstractThe fast-growing ad-blocker usage results in a large revenue decrease for ad-supported online websites. Facing this problem, many online publishers choose either to cooperate with ad-blocker software companies to show acceptable ads or to build a wall that requires users to whitelist the site for content access. However, there is a lack of studies on the impact of these two counter-ad-blocking strategies on user behaviors. To address this issue, we conduct a randomized field experiment on the website of Forbes Media, a major US media publisher. The ad-blocker users are divided into a treatment group, which receives the wall strategy, and a control group, which receives the acceptable ads strategy. We utilize the difference-in-differences method to estimate the causal effects. Our study shows that the wall strategy has an overall negative impact on user engagement. It has no statistically significant effect on highly-engaged users as they would view the pages no matter what strategy is used. It has a big impact on low-engaged users, who have no loyalty to the site. Our study also shows that revisiting behavior decreases over time, but the ratio of session whitelisting increases over time as the remaining users have relatively high loyalty and high engagement. The paper concludes with discussions of managerial insights for publishers when determining counter-ad-blocking strategies. |
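As background for the estimation strategy mentioned above, the following is a minimal sketch of the two-by-two difference-in-differences computation. The column names, the pandas layout, and the toy numbers are illustrative assumptions; the study's actual regression specification is not reproduced here.

```python
import pandas as pd

def diff_in_diff(df):
    """Two-by-two difference-in-differences estimate.

    Assumed columns: 'treated' (1 = wall strategy), 'post' (1 = after the
    treatment started), 'engagement' (outcome). Returns
    (treated post - treated pre) - (control post - control pre).
    """
    means = df.groupby(["treated", "post"])["engagement"].mean()
    treated_change = means.loc[(1, 1)] - means.loc[(1, 0)]
    control_change = means.loc[(0, 1)] - means.loc[(0, 0)]
    return treated_change - control_change

toy = pd.DataFrame({
    "treated":    [1, 1, 0, 0],
    "post":       [0, 1, 0, 1],
    "engagement": [10.0, 7.0, 10.0, 9.5],
})
print(diff_in_diff(toy))  # -2.5: engagement drops under the wall strategy
```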
Crowdsourcing (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Becoming The Super Turker: Increasing Wages Via A Strategy From High Earning Workers Saiph Savage (Universidad Nacional Autonoma de Mexico (UNAM)), Chun Wei Chiang (West Virginia University), Susumu Saito (Waseda University), Carlos Toxtli (West Virginia University) and Jeffrey Bigham (Carnegie Mellon University).
AbstractCrowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is due to this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most workers use these tools, they still earn less than minimum wage. We argue that the missing element is guidance on how to use transparency information. In this paper, we explore how novice workers can improve their earnings by following the transparency criteria of Super Turkers, i.e., crowd workers who earn higher salaries on Amazon Mechanical Turk (MTurk). We believe that Super Turkers have developed effective processes for using transparency information. Therefore, by having novices follow a Super Turker criteria (one that is simple and popular among Super Turkers), we can help novices increase their wages. For this purpose, we: (i) conducted a survey and data analysis to computationally identify a simple yet common criteria that Super Turkers use for handling transparency tools; (ii) deployed a two-week field experiment with novices who followed this Super Turker criteria to find better work on MTurk. Novices in our study viewed over 25,000 tasks by 1,394 requesters. We found that novices who utilized this Super Turkers' criteria earned better wages than other novices. Our results highlight that tool development to support crowd workers should be paired with educational opportunities that teach workers how to effectively use the tools and their related metrics (e.g., transparency values). We finish with design recommendations for empowering crowd workers to earn higher salaries. |
14:00-14:30 |
Towards Hybrid Human-AI Workflows for Unknown Unknown Detection Anthony Liu (University of Michigan), Santiago Guerra (Universidad de Monterrey), Isaac Fung (University of Michigan), Gabriel Matute (University of Michigan), Ece Kamar (Microsoft) and Walter Lasecki (University of Michigan, Computer Science & Engineering).
AbstractPredictive models are susceptible to errors called unknown unknowns, in which the model assigns incorrect labels to instances with high confidence. These commonly arise when training data does not represent variations of a class encountered at model deployment. Prior work showed that crowd workers can identify instances of unknown unknowns, but asking the crowd to identify a sufficient number of individual instances can be costly to acquire. Instead, this paper presents an approach that leverages people's ability to find patterns that can be used to retrain classifiers more effectively with fewer examples. Our approach asks crowd workers to suggest and verify patterns in unknown unknowns. We then use these patterns to train a secondary classifier that is used to identify additional examples from existing data that the primary classifier has encountered (and potentially mis-classified) in the past. Our experiments show that using this approach outperforms existing unknown unknown detection methods for improving classifier performance. This work is the first to leverage crowds to identify error patterns in large datasets to improve the training of machine learning classifiers. |
14:30-14:45 |
A Multi-task Learning Framework for Road Attribute Updating via Joint Analysis of Map Data and GPS Traces Yifang Yin (National University of Singapore), Jagannadan Varadarajan (Grab), Guanfeng Wang (GrabTaxi Research and Development Centre), Xueou Wang (National University of Singapore), Dhruva Sahrawat (iiitd), Roger Zimmermann (National University of Singapore) and See-Kiong Ng (National University of Singapore).
AbstractThe quality of a digital map is of utmost importance for geo-aware services. However, maintaining an accurate and up-to-date map is a highly challenging task that usually involves a substantial amount of manual work. To reduce the manual efforts, methods have been proposed to automatically derive road attributes by mining GPS traces. However, previous methods always modeled each road attribute separately based on intuitive hand-crafted features extracted from GPS traces. This observation motivates us to propose a machine learning based method to learn joint features not only from GPS traces but also from map data. To model the relations among the target road attributes, we extract low-level shared feature embeddings via multi-task learning, while still being able to generate task-specific fused representations by applying attention-based feature fusion. To model the relations between the target road attributes and other contextual information that is available from a digital map, we propose to leverage map tiles at road centers as visual features that capture the information of the surrounding geographic objects around the roads. We perform extensive experiments on the OpenStreetMap where state-of-the-art classification accuracy has been obtained compared to existing road attribute detection approaches. |
14:45-15:00 |
Crowdsourcing Detection of Sampling Biases in Image Datasets Xiao Hu (Purdue University), Haobo Wang (Purdue University), Anirudh Vegesana (Purdue University), Somesh Dube (Purdue University), Kaiwen Yu (Purdue University), Gore Kao (Purdue University), Shuo-Han Chen (Purdue University), Yung-Hsiang Lu (Purdue University), George Thiruvathukal (Loyola University) and Ming Yin (Purdue University).
AbstractDespite many exciting innovations in computer vision, recent studies reveal a number of risks in existing computer vision systems, suggesting results of such systems may be unfair and untrustworthy. Many of these risks can be partly attributed to the use of a training image dataset that exhibits sampling biases and thus does not accurately reflect the real visual world. Being able to detect potential sampling biases in the visual dataset prior to model development is thus essential for mitigating the fairness and trustworthy concerns in computer vision. In this paper, we propose a three-step crowdsourcing workflow to get humans into the loop for facilitating bias discovery in image datasets. Through two sets of evaluation studies, we find that the proposed workflow can effectively organize the crowd to detect sampling biases in both datasets that are artificially created with designed biases and real-world image datasets that are widely used in computer vision research and system development. |
Health (4)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
CLARA: Clinical Report Auto-completion Siddharth Biswal (Georgia Institute of Technology), Cao Xiao (IQVIA), Lucas Glass (IQVIA), Brandon Westover (MGH) and Jimeng Sun (Georgia Institute of Technology).
AbstractGenerating clinical reports from raw recordings such as X-rays and electroencephalograms (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate whole reports from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) it does not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors' preference. We propose CLinicAl Report Auto-completion (CLARA), an interactive method that generates reports in a sentence-by-sentence fashion based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation, which is up to 35% improvement over the best baseline. Also via our qualitative evaluation, CLARA is shown to produce reports which have a higher level of approval by doctors. |
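The retrieval step described above (finding the most relevant existing sentences given the doctor's anchor words) can be pictured with a simple TF-IDF nearest-sentence search. This is only a stand-in sketch with an assumed toy corpus; CLARA's actual retrieval and sequential modification components are not specified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed toy corpus of sentences taken from previously written reports.
corpus = [
    "no epileptiform discharges were observed",
    "diffuse slowing consistent with encephalopathy",
    "normal sleep architecture with vertex waves",
]
vectorizer = TfidfVectorizer()
sentence_vecs = vectorizer.fit_transform(corpus)

def retrieve_template(anchor_words, k=1):
    """Return the k corpus sentences most similar to the anchor words."""
    query = vectorizer.transform([" ".join(anchor_words)])
    scores = cosine_similarity(query, sentence_vecs).ravel()
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

print(retrieve_template(["slowing", "encephalopathy"]))
```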
14:00-14:30 |
The Automated Copywriter: Algorithmic Rephrasing of Health-Related Advertisements to Improve their Performance Brit Youngmann (Microsoft Research), Elad Yom-Tov (Microsoft Research), Ran Gilad-Bachrach (Microsoft Research) and Danny Karmon (Microsoft Healthcare NExT).
AbstractSearch advertising is one of the most commonly-used methods of advertising. Past work has shown that search advertising can be employed to improve health by eliciting positive behavioral change. However, writing effective advertisements requires expertise and (possibly expensive) experimentation, both of which may not be available to public health authorities wishing to elicit such behavioral changes, especially when dealing with a public health crisis such as an epidemic outbreak. Here we develop an algorithm which builds on past advertising data to train a sequence-to-sequence Deep Neural Network which “translates” advertisements into optimized ads that are more likely to be clicked. The network is trained using more than 114 thousand ads shown on Microsoft Advertising. We apply this translator to two health-related domains: Medical Symptoms (MS) and Preventative Healthcare (PH) and measure the improvements in click-through rates (CTR). Our experiments show that the generated ads are predicted to have higher CTR in 81% of MS ads and 76% of PH ads. To understand the differences between the generated ads and the original ones, we develop estimators for the affective attributes of the ads. We show that the generated ads contain more calls-to-action and that they reflect higher valence (36% increase) and higher arousal (87%) on a sample of 1000 ads. Finally, we run an advertising campaign where 10 random ads and their rephrased versions from each of the domains are run in parallel. We show an average improvement in CTR of 68% for the generated ads compared to the original ads. Our results demonstrate the ability to automatically optimize advertisements for the health domain. We believe that our work offers health authorities an improved ability to help nudge people towards healthier behaviors while saving the time and cost needed to optimize advertising campaigns. |
14:30-15:00 |
Adversarial Cooperative Imitation Learning for Dynamic Treatment Regimes Wenchao Yu (University of California, Los Angeles), Lu Wang (Georgia Institute of Technology), Wei Cheng (NEC Labs), Martin Renqiang Ren (NEC Labs), Bo Zong (NEC Labs), Xiaofeng He (East China Normal University), Hongyuan Zha (Georgia Institute of Technology), Wei Wang (University of California, Los Angeles) and Haifeng Chen (NEC Labs).
AbstractRecent developments in discovering dynamic treatment regimes (DTRs) have heightened the importance of deep reinforcement learning (DRL), which is used to recover doctors' treatment policies. However, existing DRL-based methods expose the following limitations: 1) supervised methods based on behavior cloning suffer from compounding errors; 2) the self-defined reward signals in reinforcement learning models are either too sparse or need clinical guidance; 3) only positive trajectories (e.g. survived patients) are considered in current imitation learning models, with negative trajectories (patient samples with negative outcomes, e.g. deceased patients) being largely ignored, even though they are examples of what not to do and could help the learned policy avoid repeating mistakes. To address these limitations, in this paper, we propose the adversarial cooperative imitation learning model, ACIL, to deduce optimal dynamic treatment regimes that mimic the positive trajectories while differing from the negative trajectories. Specifically, two discriminators are used to help achieve this goal: an adversarial discriminator is designed to minimize the discrepancies between the trajectories generated from the policy and the positive trajectories, and a cooperative discriminator is used to distinguish the negative trajectories from the positive and generated trajectories. The reward signals from the discriminators are utilized to refine the policy for dynamic treatment regimes. Experiments on publicly available real-world medical data demonstrate that ACIL improves the likelihood of patient survival and provides better dynamic treatment regimes with the exploitation of information from both positive and negative trajectories. |
15:00-15:15 |
Quantifying Community Characteristics of Maternal Mortality Using Social Media Rediet Abebe (Harvard University), Salvatore Giorgi (University of Pennsylvania), Anna Tedijanto (Cornell University), Anneke Buffone (University of Pennsylvania) and H. Andrew Schwartz (Stony Brook University).
AbstractThe United States has the highest rate of maternal mortality of any developed nation. Mortality rates have more than doubled in the past 25 years and nearly 60,000 women face near-fatal complications every year. The experiences of Black and Latina mothers are notably worse: mortality rates for these groups can be 3 to 4 times higher than the mortality rates for white women. Despite extensive public health research, there remains a lot to be understood about contributing factors to pregnancy-related deaths and what characterizes communities with relatively high or low maternal mortality rates; indeed, standard socio-demographic and risk-factor variables do not adequately capture maternal experiences and disparities by race. Here, we explore the role that social media language can play in providing insights into community characteristics of maternal mortality. First, by analyzing pregnancy-related tweets generated in US counties, we reveal a diverse set of topics discussed on the platform including Morning Sickness, Celebrity Pregnancies, and Abortion Rights. We find that these topics predict maternal mortality rates with higher accuracy than standard socioeconomic and risk-related variables such as income, employment rates, access to healthcare, and race. We then select six topics -- Maternal Studies, Teen Pregnancy, and Congratulatory Remarks, in addition to the above three -- chosen for their interpretability and connections to known health and maternal risk factors. We show that these six topics have nearly as much predictive power as all the topics combined. We also investigate psychological aspects of communities to find that the use of less trustful, more stressed, and more negative language is significantly associated with higher mortality rates; even more notably, Trust and Affect explained a significant portion of the racial disparities in maternal mortality. We believe these findings provide further insights related to the intricate and urgent issues surrounding maternal health and can help inform actionable items at the community level. |
Economics (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
Designing Fairly Fair Classifiers Via Economic Fairness Notions Safwan Hossain (University of Toronto, Vector Institute), Andjela Mladenovic (Independent Scientist) and Nisarg Shah (University of Toronto).
AbstractThe past decade has witnessed a rapid growth of research on fairness in machine learning. In contrast, fairness has been formally studied for almost a century in microeconomics in the context of resource allocation, during which many general-purpose notions of fairness have been proposed. This paper explores the applicability of two such notions --- envy-freeness and equitability --- in machine learning. We propose novel relaxations of these fairness notions which apply to groups rather than individuals, and are compelling in a broad range of settings. Our approach provides a unifying framework by incorporating several recently proposed fairness definitions as special cases. We provide generalization bounds for our approach, and theoretically and experimentally evaluate the tradeoff between loss minimization and our fairness guarantees. |
14:00-14:30 |
Traveling the token world: A graph analysis of Ethereum ERC20 token ecosystem Weili Chen (Sun Yat-sen University), Tuo Zhang (Sun Yat-sen University), Zhiguang Chen (Sun Yat-sen University), Zibin Zheng (Sun Yat-sen University) and Yutong Lu (Sun Yat-sen University).
AbstractThe birth of Bitcoin ushered in the era of cryptocurrency, which has now become a financial market that has attracted extensive attention worldwide. The phenomenon of startups launching Initial Coin Offerings (ICOs) to raise capital led to thousands of tokens being distributed on blockchains. Many studies have analyzed this phenomenon from an economic perspective. However, little is known about the characteristics of participants in the ecosystem. To fill this gap, and considering that over 80% of ICOs are launched based on ERC20 tokens on Ethereum, in this paper we conduct a systematic investigation on the whole Ethereum ERC20 token ecosystem to characterize the token creator, holder, and transfer activity. By downloading the whole blockchain and parsing the transaction records and event logs, we construct three graphs, namely the token creator graph, token holder graph, and token transfer graph. We obtain many observations and findings by analyzing these graphs. Besides, we propose an algorithm to discover potential relationships between tokens and other accounts. The reported case shows that our algorithm can effectively reveal entities and the complex relationship between various accounts in the token ecosystem. |
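To illustrate what one of the three graphs looks like once the on-chain logs have been decoded, here is a minimal sketch of building a token transfer graph with networkx. The record format, addresses, and amounts are illustrative assumptions; decoding real ERC20 Transfer events from the blockchain is a separate step not shown here.

```python
import networkx as nx

# Assumed, already-decoded transfer records: (token_contract, sender, receiver, amount).
transfers = [
    ("0xTokenA", "0xalice", "0xbob",   100),
    ("0xTokenA", "0xbob",   "0xcarol",  40),
    ("0xTokenB", "0xalice", "0xcarol",   7),
]

transfer_graph = nx.MultiDiGraph()
for token, sender, receiver, amount in transfers:
    transfer_graph.add_edge(sender, receiver, token=token, amount=amount)

# Out-degree as a crude measure of how actively an account sends tokens.
print(sorted(transfer_graph.out_degree(), key=lambda pair: -pair[1]))
```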
14:30-14:45 |
A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback Shota Yasui (CyberAgent.inc.), Gota Morishita (CyberAgent.inc.), Fujita Komei (CyberAgent.inc.) and Masashi Shibata (CyberAgent.inc.).
AbstractIn display advertising, predicting the conversion rate, that is, the probability that a user takes a predefined action on an advertiser's website such as purchasing goods, is fundamental in estimating the value of showing a user an advertisement. However, there is a relatively long time delay between a click of a display advertisement and its resultant conversion. Because of this delayed feedback, some positive instances are labeled as negative when training data is gathered because some conversions that will occur in the future have not yet occurred. As a result, the conditional label distribution of the training data is different from that of the test data in the production environment because these are tracked for a sufficiently long period to be correctly labeled. This situation is referred to as a feedback shift. We address this problem by using an importance weight approach typically used for covariate shift correction. We prove its consistency for the feedback shift. Moreover, the importance weight approach can be applied to a wide variety of models and learning algorithms. Finally, two different experiments were conducted. The first experiment was conducted to prove the effectiveness of our proposed method from two different perspectives: performance and time efficiency. The results show that our proposed approach outperforms the existing method in terms of both. During the second experiment, we implemented a Field-aware Factorization Machine (FFM) with importance weight (FFMIW) to incorporate our proposed method into our production environment. The normal FFM and FFMIW were evaluated on an offline dataset. In addition, we conducted an online A/B test in the production system. In both settings, it was shown that FFMIW is superior to the normal FFM. |
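The core of the correction is training with importance weights so that the biased training distribution matches the fully observed test distribution. Below is a minimal sketch of an importance-weighted logistic loss; the generic weight definition and the toy numbers are assumptions for illustration, not the paper's exact feedback-shift weights.

```python
import numpy as np

def weighted_logloss(y_true, p_pred, weights):
    """Importance-weighted negative log-likelihood for binary labels.

    weights[i] approximates p_test(x_i, y_i) / p_train(x_i, y_i): examples
    under-represented at training time (e.g. conversions that had not yet
    been observed when the data was gathered) are up-weighted so the
    objective matches the test distribution.
    """
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)
    log_likelihood = y_true * np.log(p) + (1 - y_true) * np.log(1 - p)
    return -np.average(log_likelihood, weights=weights)

y = np.array([1, 0, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.1])
w = np.array([2.0, 1.0, 2.0, 1.0])  # up-weight the (delayed) positive class
print(weighted_logloss(y, p, w))
```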
14:45-15:00 |
Predicting Drug Demand with Wikipedia Views: Evidence from Darknet Markets. Sam Miller (University of Warwick & Alan Turing Institute), Abeer El-Bahrawy (City University London), Martin Dittus (Oxford Internet Institute, University of Oxford & Alan Turing Institute), Mark Graham (University of Oxford) and Joss Wright (University of Oxford).
AbstractRapid changes in illicit drug demand, such as the Fentanyl epidemic, are a major public health issue. Policymakers currently rely on annual surveys to monitor public consumption, which are arguably too infrequent to detect rapid shifts in drug use. We present a novel method to predict drug use based on high-frequency sales data from darknet markets. We show that models based on historic trades alone cannot accurately predict drug demand. However, augmenting these models with data on Wikipedia page views for each drug greatly improves predictive accuracy. These results hold out-of-sample at high time frequency, across a range of drugs and countries. We find that Wikipedia page views most improve predictive accuracy for less popular drugs, suggesting our model may be particularly useful for detecting newly emerging substances. Therefore Wikipedia data may enable us to build a high frequency measure of drug demand, which could help policymakers respond more quickly to future drug crises. |
Systems (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-13:45 |
ResQueue: A Smarter Datacenter Flow Scheduler Hamed Rezaei (University of Illinois at Chicago) and Balajee Vamanan (University of Illinois at Chicago).
AbstractDatacenters host a mix of applications: foreground applications that perform distributed lookups in order to service user queries; and background applications that perform other background tasks such as data reorganization, data backup, and data replication. While background flows produce the most load, foreground applications produce the largest number of flows. Because flows (packets) from both types of applications compete at switches for network bandwidth, datacenter networks' performance is highly dependent on underlying flow scheduling mechanisms. Existing flow schedulers use flow size to distinguish critical flows from non-critical flows. However, recent studies on important datacenter workloads reveal that most flows are quite small (e.g., most flows consist of only a handful of packets). In light of these findings, we make the key observation that because flow size is not sufficient to distinguish critical flows from non-critical flows, existing flow schedulers do not achieve the desired prioritization. In this paper, we introduce ResQueue, which uses a combination of flow size and packet history to calculate the priority of each flow. Our analysis shows that ResQueue improves tail flow completion times of short flows by up to 60% over the state-of-the-art flow scheduling mechanisms. |
13:45-14:15 |
JSCleaner: De-Cluttering Mobile Webpages Through JavaScript Cleanup Moumena Chaqfeh (NYUAD), Yasir Zaki (NYUAD), Jacinta Hu (NYUAD) and Lakshmi Subramanian (NYU).
AbstractA significant fraction of webpages suffer from the excessive usage of JavaScript. Based on analyzing popular webpages, we observe that a reasonable fraction of JavaScript utilized by these pages are not truly essential for many of the functional and visual features of the page. In this paper, we propose JSCleaner, a JavaScript de-cluttering engine that aims at simplifying webpages without compromising the page content or functionality. JSCleaner uses a classification algorithm that classifies JavaScript into three main categories: non-critical, replaceable, and critical JavaScript. JSCleaner removes the non-critical scripts from a webpage, translates the replaceable scripts with their HTML outcomes, and preserves the critical scripts. Our quantitative evaluation of 500 popular webpages shows that JSCleaner achieves around 30% reduction in page load times coupled with a 50% reduction in requested objects and page size. In addition, our qualitative user study of 103 evaluators shows that JSCleaner preserves 95% of the page content similarity, while maintaining about 88% of the page functionality. |
14:15-14:45 |
PG2S+: Stack Distance Construction Using Popularity, Gap and Machine Learning Jiangwei Zhang (National University of Singapore) and Y.C. Tay (National University of Singapore).
AbstractStack distance characterizes the temporal locality of workloads and has played a vital role in cache analysis since the 1970s. However, the most efficient implementations of exact stack distance calculation are too costly, and impractical for online use. Hence, much work has been done to optimize the exact computation, or to approximate it through sampling or modeling. This paper introduces a new approximation technique, PG2S, that is based on reference popularity and gap distance. This approximation is exact under the Independent Reference Model (IRM). The technique is further extended, using machine learning, to PG2S+ for non-IRM reference patterns. Extensive experiments show that PG2S+ is much more accurate and robust than other state-of-the-art algorithms for determining stack distance. PG2S+ is the first technique to exploit the strong correlation among reference popularity, gap distance and stack distance. |
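For context, the stack (reuse) distance of a reference is the number of distinct items accessed since the previous reference to the same item. The following minimal sketch computes it exactly with the textbook LRU-stack method, which illustrates why the exact computation is costly for long traces; it is background only and not the PG2S+ technique.

```python
def stack_distances(trace):
    """Exact LRU stack distances for a reference trace.

    Returns inf for first-time references. The linear scan of the stack
    for every reference makes this O(n * m) for m distinct items, which
    is why sampling- or model-based approximations are attractive online.
    """
    stack, distances = [], []
    for ref in trace:
        if ref in stack:
            pos = stack.index(ref)
            distances.append(len(stack) - 1 - pos)  # distinct items since last use
            stack.pop(pos)
        else:
            distances.append(float("inf"))          # cold reference
        stack.append(ref)                            # ref becomes most recently used
    return distances

print(stack_distances(["a", "b", "c", "b", "a"]))  # [inf, inf, inf, 1, 2]
```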
14:45-15:00 |
Scaling PageRank to 100 Billion Pages Stergios Stergiou (Google).
AbstractDistributed graph frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by transmitting messages over the graph edges. PageRank's communication pattern is identical across supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art. |
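The communication pattern being exploited is that every PageRank superstep sends the same kind of message along each out-edge. A minimal single-machine sketch of one such superstep is shown below for orientation; it is illustrative only and says nothing about the paper's distributed communication scheme (dangling vertices are also ignored here).

```python
def pagerank_step(out_edges, rank, damping=0.85):
    """One synchronous PageRank superstep on a small in-memory graph.

    out_edges: dict vertex -> list of successor vertices
    rank:      dict vertex -> current PageRank value
    Every vertex sends rank / out_degree along each out-edge; the new rank
    is the damped sum of incoming messages plus the teleport term.
    """
    n = len(rank)
    incoming = {v: 0.0 for v in rank}
    for v, successors in out_edges.items():
        if successors:
            share = rank[v] / len(successors)
            for u in successors:
                incoming[u] += share
    return {v: (1 - damping) / n + damping * incoming[v] for v in rank}

edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1 / 3 for v in edges}
for _ in range(20):
    rank = pagerank_step(edges, rank)
print(rank)
```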
Semantics (6)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Time |
Title and Authors (Presenter) |
13:30-14:00 |
SMART-KG: Hybrid Shipping for SPARQL Querying on the Web Amr Azzam (Vienna University of Business and Economics), Javier D. Fernández (Vienna University of Economics and Business), Maribel Acosta (Karlsruhe Institute of Technology), Martin Beno (Vienna University of Economics and Business) and Axel Polleres (Vienna University of Economics and Business - WU Wien).
AbstractWhile Linked Data (LD) provides standards for publishing (RDF) and querying (SPARQL) Knowledge Graphs (KGs) on the Web, serving, accessing and processing such open, decentralized KGs is often practically impossible, as query timeouts on publicly available SPARQL endpoints show. Alternative solutions such as Triple Pattern Fragments (TPF) attempt to tackle the problem of availability by pushing query processing workload to the client side, but suffer from unnecessary transfer of irrelevant data on complex queries with large intermediate results. In this paper we present smart-KG, a novel approach to share the load between servers and clients, while significantly reducing data transfer volume, by combining TPF with shipping compressed KG partitions. Our evaluations show that smart-KG outperforms state-of-the-art client-side solutions and increases server-side availability towards more cost-effective and balanced hosting of open and decentralized KGs. |
14:00-14:30 |
Adaptive Low-level Storage of Very Large Knowledge Graphs Jacopo Urbani (Vrije Universiteit Amsterdam) and Ceriel Jacobs (Vrije Universiteit Amsterdam).
AbstractThe increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structures. We propose KGSYS, a novel storage architecture for very large KGs on centralized systems. KGSYS uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that KGSYS can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple |
14:30-15:00 |
Differentially Private Stream Processing for the Semantic Web Daniele Dell'Aglio (University of Zurich) and Abraham Bernstein (University of Zurich).
AbstractData often contains sensitive information, which poses a major obstacle to publishing it. Some suggest to obfuscate the data or only releasing some data statistics. These approaches have, however, been shown to provide insufficient safeguards against de-anonymisation. Recently, differential privacy (DP) - an approach that injects noise into the query answers to provide statistical privacy guarantees - has emerged as a solution to release sensitive data. This study investigates how to continuously release privacy-preserving histograms (or distributions) from a continuous stream of sensitive data by combining DP and semantic web technologies. We focus on distributions, as they are the basis for many analytic applications. Specifically, we propose SihlQL, a query language that processes RDF streams in a privacy-preserving fashion. SihlQL builds on top of SPARQL and the w-event DP framework. We show how some peculiarities of w-event privacy constrains the expressiveness of SihlQL queries. Addressing these constraints, we propose an extension of w-event privacy that provides answers to more general queries while preserving their privacy. To evaluate SihlQL, we implemented a prototype engine that compiles queries to Apache Flink topologies and studied its privacy properties using real-world data from an IPTV provider and an online e-commerce web site. |
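The basic primitive behind releasing private histograms from a stream is the Laplace mechanism. A minimal sketch for a single window is shown below; the sensitivity of 1 assumes each user contributes to at most one bin per window, and the w-event budget allocation that SihlQL builds on is not shown.

```python
import numpy as np

def dp_histogram(counts, epsilon, sensitivity=1.0, seed=None):
    """Release histogram counts under epsilon-differential privacy.

    Adds independent Laplace(sensitivity / epsilon) noise to each bin,
    which is the standard Laplace mechanism for count queries.
    """
    rng = np.random.default_rng(seed)
    noise = rng.laplace(scale=sensitivity / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

true_counts = [120, 45, 8, 0]  # e.g. per-category view counts in one window
print(dp_histogram(true_counts, epsilon=0.5, seed=42))
```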
15:00-15:30 |
Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion Jiaxin Huang (University of Illinois Urbana-Champaign), Yiqing Xie (The Hong Kong University of Science and Technology), Yu Meng (University of Illinois Urbana-Champaign), Jiaming Shen (University of Illinois Urbana-Champaign), Yunyi Zhang (University of Illinois Urbana-Champaign) and Jiawei Han (University of Illinois Urbana-Champaign).
AbstractGiven a small set of seed entities (e.g., “USA”, “Russia”), corpus-based set expansion is to induce an extensive set of entities which share the same semantic class (Country in this example) from a given corpus. Set expansion benefits a wide range of downstream applications in knowledge discovery, such as web search, taxonomy construction, and query suggestion. Existing corpus-based set expansion algorithms typically bootstrap the given seeds by incorporating lexical patterns and distributional similarity. However, due to no negative sets provided explicitly, these methods suffer from semantic drift caused by expanding the seed set freely without guidance. We propose a new framework, Set-CoExpan, that automatically generates auxiliary sets as negative sets that are closely related to the target set of user's interest, and then performs multiple sets co-expansion that extracts discriminative features by comparing target set and auxiliary sets, to form multiple cohesive sets that are distinctive from one another, thus resolving the semantic drift issue. In this paper we demonstrate that by generating auxiliary sets, we can guide the expansion process of target set to avoid touching those ambiguous areas around the border with auxiliary sets, and we show that Set-CoExpan outperforms strong baseline methods significantly. |
Research Tracks (1)
Web Mining-A (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Jyun-Yu Jiang (University of California, Los Angeles), Patrick H. Chen (University of California, Los Angeles), Cho-Jui Hsieh (University of California, Los Angeles) and Wei Wang (University of California, Los Angeles).
Abstract
Top-K recommender systems aim to generate few but satisfactory personalized recommendations for various practical applications, such as item recommendation for e-commerce and link prediction for social networks. However, the numbers of users and items can be enormous, thereby leading to myriad potential recommendations as well as the bottleneck in evaluating and ranking all possibilities. Existing Maximum Inner Product Search (MIPS) based methods treat the item ranking problem for each user independently and the relationship between users has not been explored. In this paper, we propose a novel model for clustering and navigating for top-K recommenders (CANTOR) to expedite the computation of top-K recommendations based on latent factor models. A clustering-based framework is first presented to leverage user relationships to partition users into affinity groups, each of which contains users with similar preferences. CANTOR then derives a coreset of representative vectors for each affinity group by constructing a set cover with a theoretically guaranteed difference to user latent vectors. Using these representative vectors in the coreset, approximate nearest neighbor search is then applied to obtain a small set of candidate items for each affinity group to be used when computing recommendations for each user in the affinity group. This approach can significantly reduce the computation without compromising the quality of the recommendations. Extensive experiments are conducted on six publicly available large-scale real-world datasets for item recommendation and personalized link prediction. The experimental results demonstrate that CANTOR significantly speeds up matrix factorization models with high precision. For instance, CANTOR can achieve 355.1x speedup for inferring recommendations in a million-user network with 99.5% precision@1 to the original system while the state-of-the-art method can only obtain 93.7x speedup with 99.0% precision@1.
Yin Zhang (Texas A&M University), Yun He (Texas A&M University), Jianling Wang (Texas A&M University) and James Caverlee (Texas A&M University).
Abstract
Existing sequential recommenders mainly focus on modeling sequential patterns by using user activity sequences. However, purely sequence-based recommendation usually faces challenges in capturing general item relations that are not easily discovered from highly-personalized user sequences. Hence, we propose a novel adaptive hierarchical translation-based recommendation called HierTrans. Specifically, HierTrans first extends traditional item-level relations to the category-level, to help capture dynamic sequence patterns that can generalize across users and time. Then unlike the item-level relation based methods, we build a novel hierarchical temporal graph that contains item multi-relations at the category-level and user dynamic sequences at the item-level to facilitate capturing item multi-relations inside user dynamic sequences. Based on the graph, HierTrans adaptively aggregates the high-order multi-relations among items and dynamic user preferences to capture the dynamic joint influence for next-item recommendation. Specifically, different from traditional translation-based recommenders that assume a user's translation vector is static and identical, the user translation vector in HierTrans can adaptively change based on both a user's previous interacted items and the item relations inside the user's sequences, as well as the user's personal dynamic preference. Experiments on public datasets demonstrate the proposed model consistently outperforms state-of-the-art sequential recommendation methods and uncovers meaningful patterns in user sequences.
Suyu Ge (Tsinghua University), Chuhan Wu (Tsinghua University), Fangzhao Wu (Microsoft), Tao Qi (Tsinghua University) and Yongfeng Huang (Department of Electronic Engineering; Tsinghua University).
Abstract
With the explosion of online news, personalized news recommendation becomes increasingly important for online news platforms to help their users find information of interest. Existing news recommendation methods achieve personalization by building accurate news representations from news content and user representations from their direct interactions with news (e.g., click), while ignoring the high-order relatedness between users and news. Here we propose a news recommendation method which can enhance the representation learning of users and news by modeling their relatedness in a graph setting. In our method, users and news are both viewed as nodes in a bipartite graph constructed from historical user click behaviors. For news representations, a transformer architecture is first exploited to build news semantic representations. Then we combine it with the information from neighbor news in the graph via a graph attention network. For user representations, we not only represent users from their historically clicked news, but also attentively incorporate the representations of their neighbor users in the graph. Experiments were conducted on a large-scale real-world dataset. The improved performances validate the effectiveness of our proposed method.
Social Network-A (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Dongmin Park (Korea Advanced Institute of Science and Technology), Hwanjun Song (Korea Advanced Institute of Science and Technology), Minseok Kim (Korea Advanced Institute of Science and Technology) and Jae-Gil Lee (Korea Advanced Institute of Science and Technology).
Abstract
Finding low-dimensional embeddings of sparse high-dimensional data objects is important in many applications such as recommendation, graph mining, and natural language processing (NLP). Recently, autoencoder (AE)-based embedding approaches have achieved state-of-the-art performance in many tasks, especially in top-k recommendation tasks with user embedding or node classification tasks with node embedding. However, we find that many real-world data follow the power-law distribution with respect to the data object sparsity. When learning AE-based embeddings of these data, dense inputs move away from sparse inputs in an embedding space even when they are highly correlated. As a result, the embedding is distorted, which we call the polarization problem. In this paper, we propose TRAP that leverages two-level regularizers to effectively alleviate this problem. (i) The macroscopic regularizer adds a regularization term in the loss function to generally prevent dense input objects from being distant from other sparse input objects. (ii) The microscopic regularizer introduces a new object-wise parameter to individually entice each object to correlated neighbor objects rather than uncorrelated ones. Importantly, TRAP can be easily coupled with existing AE-based embedding methods with a simple modification. In extensive experiments on two representative embedding tasks using six real-world datasets, TRAP boosted the performance of the state-of-the-art algorithms by up to 31.53% and 94.99%, respectively.
Zhen Peng (Xi'an Jiaotong University), Wenbing Huang (Tsinghua University), Minnan Luo (Xi'an Jiaotong University), Qinghua Zheng (Xi'an Jiaotong University), Yu Rong (Tencent AI Lab), Tingyang Xu (Tencent AI Lab) and Junzhou Huang (University of Texas at Arlington).
Abstract
The richness in the content of various information networks such as social networks and communication networks provides the unprecedented potential for learning high-quality expressive representations without external supervision. This paper investigates how to preserve and extract the abundant information from graph-structured data into embedding space in an unsupervised manner. To this end, we propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computations from vector space to the graph domain where measuring mutual information from two aspects of node features and topological structure is indispensable. GMI exhibits several benefits: First, it is invariant to the isomorphic transformation of input graphs--an inevitable constraint in many existing graph representation learning algorithms; Besides, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE; Finally, our theoretical analysis confirms its correctness and rationality. With the aid of GMI, we develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder. Considerable experiments on transductive as well as inductive node classification and link prediction demonstrate that our method outperforms state-of-the-art unsupervised counterparts, and even sometimes exceeds the performance of supervised ones.
Zhuoren Jiang (Sun Yat-sen University), Zheng Gao (Indiana University Bloomington), Jinjiong Lan (alibaba), Hongxia Yang (Alibaba Group), Yao Lu (Sun Yat-sen University) and Xiaozhong Liu (Indiana University Bloomington).
Abstract
The recent success of deep graph embedding innovates the graphical information characterization methodologies. However, in real-world applications, such a method still struggles with the challenges of heterogeneity, scalability, and multiplex. To address these challenges, in this study, we propose a novel solution, Genetic hEterogeneous gRaph eMbedding (GERM), which enables flexible and efficient task-driven vertex embedding in a complex heterogeneous graph. Unlike prior efforts for this track of studies, we employ a task-oriented genetic activation strategy to efficiently generate the “Edge Type Activated Vector” (ETAV) over the edge types in the graph. The generated ETAV can not only reduce the incompatible noise and navigate the heterogeneous graph random walk at the graph-schema level, but also activate an optimized subgraph for efficient representation learning. By revealing the correlation between the graph structure and task information, the model interpretability can be enhanced as well. Meanwhile, an activated heterogeneous skip-gram framework is proposed to encapsulate both topological and task-specific information of a given heterogeneous graph. Through extensive experiments on both scholarly and e-commerce datasets, we demonstrate the efficacy and scalability of the proposed methods via various search/recommendation tasks. GERM not only outperforms the state-of-the-art models, but also significantly reduces the running time.
Amin Javari (University of Illinois at Urbana-Champaign), Tyler Derr (Michigan State University), Pouya Esmalian (Sharif University), Jiliang Tang (Michigan State University) and Kevin Chang (University of Illinois at Urbana-Champaign).
Abstract
In real-world networks, nodes might have more than one type of relation. Signed networks are an important class of such networks consisting of two types of relations: positive and negative. Recently, embedding signed networks has attracted increasing attention. In general, existing models rely on a path-based closeness measure defined based on social theories. However, this strategy is associated with major drawbacks including the incompleteness of such theories in explaining real-world signed networks. We propose a new approach for embedding signed networks that addresses these shortcomings by relying on a network transformation based strategy. The main idea is that rather than finding the similarities of two nodes based on the complex relationships/paths between them, we can find their similarities through simple paths/relationships between different roles carried by them. Based on this idea, the model can be described in three steps: (1) the input directed signed network is transformed into an undirected, unsigned bipartite network where each node is mapped to a set of nodes denoted as role-nodes. Each role-node captures a certain role that a node in the original network plays. (2) The network of role-nodes is embedded. (3) The original network is encoded by aggregating the embedding vectors of role-nodes. According to our experiments, the proposed technique substantially outperforms the existing models on link prediction and label prediction tasks.
Zhen Zhang (Zhejiang University), Jiajun Bu (Zhejiang University), Martin Ester (Simon Fraser University), Jianfeng Zhang (Alibaba Group), Chengwei Yao (Zhejiang University), Zhao Li (Alibaba Group) and Can Wang (Zhejiang University).
Abstract
With the increasing demand of mining rich knowledge in graph structured data, graph embedding has become the research focus in both academic and industrial communities due to its powerful capability. The majority of existing work overwhelmingly learn node embeddings in the context of static, plain or attributed, homogeneous graphs. However, many real-world applications frequently involve bipartite graphs with temporal and attributed interaction edges, called temporal interaction graphs. The temporal interactions usually imply different facets of interest and might even evolve over time, thus putting forward huge challenges in learning effective node representations. Furthermore, most existing graph embedding models embed all the information of each node into a single vector representation, which is insufficient to characterize the node's multifaceted properties. In this paper, we propose a novel framework named TigeCMN to learn node representations from a sequence of temporal interactions. Specifically, we devise two coupled memory networks to store and update node embeddings in the external matrices explicitly and dynamically, which forms deep matrix representations and could enhance the expressiveness of the node embeddings. Then, we generate node embedding from two parts: a static embedding that encodes its stationary properties and a dynamic embedding induced from memory matrix that models its temporal interaction patterns. We conduct extensive experiments on various real-world datasets covering the tasks of node classification, recommendation and visualization. The experimental results empirically demonstrate that TigeCMN can outperform state-of-the-art methods with different gains.
User Modeling-A (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Qiaoyu Tan (Texas A&M University), Ninghao Liu (Texas A&M University), Xing Zhao (Texas A&M University), Hongxia Yang (Alibaba), Jingren Zhou (Alibaba) and Xia Hu (Texas A&M University).
Abstract
Graph representation learning has been extensively studied for recommender systems in recent years. Despite its effectiveness in generating continuous embeddings for objects in user-item interaction networks, the computational cost to infer users’ preferences toward a large corpus of items is tremendous. To overcome the computational barriers, hashing is often adopted to facilitate efficient approximations of k-nearest-neighbors search. However, such approaches may not yield optimal hash codes to support high quality retrieval due to the separate learning of embedding and hashing. The joint learning of effective hashing and embedding still remains an open challenge. In this paper, we focus on the problem of hashing with graph neural networks (GNNs) for high-quality retrieval. We propose a simple yet effective discrete representation learning framework for jointly learning continuous and discrete codes. Specifically, a deep hashing with GNNs (HashGNN) is presented, which consists of two components, a GNN encoder for learning node representations, and a hash layer for encoding representations to hash codes. The whole architecture is trained end-to-end by jointly optimizing two losses, i.e., reconstruction loss from reconstructing observed links, and ranking loss from preserving the relative ordering of hash codes. A novel discrete optimization strategy based on straight through estimator (STE) with guidance is proposed. The key idea is to avoid gradient magnification in the back-propagation of STE with continuous embedding guidance, in which we begin from learning an easier network that mimics the continuous embedding and let it evolve during the training, until it finally goes back to STE. Comprehensive experiments over three publicly available and one real-world A+ 1 company datasets demonstrate that our model not only can achieve comparable performance compared with its continuous counterpart but also runs multiple times faster during inference.
Farhan Khawar (The Hong Kong University of Science and Technology), Leonard Poon (The Education University of Hong Kong) and Nevin L. Zhang (The Hong Kong University of Science and Technology).
Abstract
Autoencoder based recommenders have recently shown state-of-the-art performance in the recommendation task due to their ability to model non-linear item relationships effectively. However, existing autoencoder based recommenders use fully-connected neural network layers and do not employ structure learning. This can lead to inefficient training, especially when the data is sparse as commonly found in collaborative filtering. This inefficiency results in lower generalization ability and reduced performance. In this paper, we introduce structure learning for autoencoder recommenders by taking advantage of the inherent item groups present in the collaborative filtering domain. Due to the nature of items in general, we know that certain items are more related to each other than to other items. Based on this, we propose a method that first learns groups of related items and then uses this information to determine the connectivity structure of an auto-encoding neural network. This results in a network that is sparsely connected. This sparse structure can be viewed as a prior that guides the network training. Empirically we demonstrate that the proposed structure learning enables the autoencoder to converge to a local optimum with a much smaller spectral norm and generalization error bound than the fully-connected network. The resultant sparse network considerably outperforms the state-of-the-art methods like Mult-vae/Mult-dae on multiple benchmarked datasets. In particular, our method achieves more than 13% improvement over Mult-vae across all metrics on the MSD dataset when the same number of parameters and flops are used. It also has a better cold-start performance.
Xueqi Li (Hunan University), Wenjun Jiang (Hunan University), Weiguang Chen (Hunan University), Jie Wu (Temple University), Guojun Wang (Guangzhou University) and Kenli Li (Hunan University).
Abstract
Serendipity recommendation has attracted more and more attention in recent years. It commits to providing recommendations which not only cater to users' preferences but also broaden their horizons. However, existing approaches usually measure user-item relevance with a scalar instead of a vector, ignoring user preference directionality, which increases the risk of unrelated recommendations. To address this limitation, we propose a user-preference-aware and explainable serendipity recommendation method. Specifically, we (1) extract users' long-term preferences (which we call preference directions) with an unsupervised model, the GMM (Gaussian mixture model), and capture their short-term demands (which we call current demands) with a capsule network; (2) generate recommendations by combining preference directions with current demands; and (3) make the first attempt to provide explanations for serendipitous recommendations via a back-routing scheme. Extensive experiments on real-world datasets show that our approach effectively improves serendipity and explainability, and also improves diversity, compared with existing serendipity-based methods.
Md Mehrab Tanjim (University of California San Diego), Congzhe Su (Etsy), Ethan Benjamin (Etsy), Diane Hu (Etsy), Liangjie Hong (Etsy) and Julian McAuley (University of California San Diego).
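The first step of the serendipity method in the abstract above (extracting long-term preference directions with a Gaussian mixture model) could look roughly like the scikit-learn snippet below; the feature matrix and the number of components are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder features: one row per user, e.g. an aggregate of item embeddings.
rng = np.random.default_rng(0)
user_histories = rng.normal(size=(500, 32))

# Unsupervised GMM: each component mean can be read as one "preference
# direction", and the posterior as a user's affinity to each direction.
gmm = GaussianMixture(n_components=5, random_state=0).fit(user_histories)
preference_directions = gmm.means_                    # shape (5, 32)
user_affinities = gmm.predict_proba(user_histories)   # shape (500, 5)
print(preference_directions.shape, user_affinities.shape)
```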
Abstract
Users exhibit different intents across e-commerce services (e.g., discovering new items, purchasing gifts, etc.) which drive them to interact with a wide variety of items in multiple ways (e.g., click, add-to-cart, add-to-favorite, purchase). To give better recommendations, it is important to capture user intent in addition to considering their historical interactions. However, these intents are by definition latent, as we observe only a user's interactions and not their underlying intent. To discover such latent intents, and use them effectively for recommendation, in this paper we propose an Attentive Sequential model of latent intent. Our model first learns item similarities from users' interaction histories via a self-attention layer, then uses a Temporal Convolutional Network layer to obtain a latent representation of the user's intent from her actions on a particular category. We use this representation to guide an attentive model to predict the next item. Results from our experiments show that our model can capture the dynamics of user behavior and preferences, leading to state-of-the-art performance across datasets from two major e-commerce platforms, namely Etsy and Alibaba.
Jian Liu (Soochow University), Pengpeng Zhao (Soochow University), Fuzhen Zhuang (Chinese Academy of Sciences), Yanchi Liu (Rutgers University), Victor S. Sheng (Texas Tech University), Jiajie Xu (Soochow University), Xiaofang Zhou (The University of Queensland) and Hui Xiong (Rutgers University New Jersey).
Abstract
Visual aesthetics of products plays an important role in the decision process when purchasing appearance-first products, e.g., clothes. Indeed, a user's aesthetic preference, which serves as a personality trait and a basic requirement, is domain-independent and can be used as a bridge between domains for knowledge transfer. However, existing work has rarely considered the aesthetic information in product images for cross-domain recommendation. To this end, we propose deep Aesthetic Cross-Domain Networks (ACDN), in which parameters characterizing personal aesthetic preferences are shared across networks to transfer knowledge between domains. Specifically, we first leverage an aesthetic network to extract aesthetic features. Then, we integrate these features into a cross-domain network to transfer users' domain-independent aesthetic preferences. Moreover, network cross-connections are introduced to enable dual knowledge transfer across domains. Finally, the experimental results on real-world datasets show that our proposed model ACDN outperforms benchmark methods in terms of recommendation accuracy. The results also show that users' aesthetic preferences are effective in alleviating the data sparsity issue in cross-domain recommendation.
Society (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Márcio Silva (Universidade Federal de Mato Grosso do Sul), Lucas Santos de Oliveira (Universidade Estadual do Sudoeste da Bahia), Athanasios Andreou (INSTITUTE EURECOM), Pedro Olmo Vaz de Melo (UFMG), Oana Goga (UPMC) and Fabricio Benevenuto (Federal University of Minas Gerais (UFMG)).
Abstract
The 2016 United States presidential election was marked by the abuse of targeted advertising on Facebook. Concerned about the risk of the same kind of abuse happening in the 2018 Brazilian elections, we designed and deployed an independent auditing system to monitor political ads on Facebook in Brazil. To do that, we first adapted a browser plugin to gather ads from the timelines of volunteers using Facebook. We managed to convince more than 2000 volunteers to help our project and install our tool. Then, we used a Convolutional Neural Network (CNN) to detect political Facebook ads using word embeddings. To evaluate our approach, we manually labeled a collection of 20,000 ads as political or non-political and then provide an in-depth evaluation of the proposed approach for identifying political ads by comparing it with classic supervised machine learning methods. Finally, we deployed a real system that shows the ads identified as related to politics. We then compared the detected political ads with an archive of all political ads provided by Facebook and noticed that not all political ads were tagged as such on the platform. Our results imply that the decision of what is and is not a political ad should not be made by one platform alone, and they emphasize the need for independent auditing platforms.
Daniele Rama (ISI Foundation), Kyriaki Kalimeri (ISI Foundation), Yelena Mejova (ISI Foundation), Michele Tizzoni (ISI Foundation) and Ingmar Weber (Qatar Computing Research Institute).
Abstract
In the global move toward urbanization, making sure the people remaining in rural areas are not left behind in terms of development and policy considerations is a priority for governments worldwide. However, it is increasingly challenging to track important statistics concerning this sparse, geographically dispersed population, resulting in a lack of reliable, up-to-date data. In this study, we examine the usefulness of the Facebook Advertising platform, which offers a digital "census" of over two billion of its users, in measuring potential rural-urban inequalities. We focus on Italy, a country where about 30% of the population lives in rural areas. First, we show that the population statistics that Facebook produces suffer from instability across time and incomplete coverage of sparsely populated municipalities. To overcome these limitations, we propose an alternative methodology for estimating Facebook Ads audiences that nearly triples the coverage of rural municipalities from 19% to 55% and makes fine-grained sub-population analysis feasible. Using official national census data, we evaluate our approach and confirm known significant urban-rural divides in terms of educational attainment and income. Extending the analysis to Facebook-specific user "interests" and behaviors, we provide further insights on the divide, for instance, finding that rural areas show a higher interest in gambling. Notably, we find that the most predictive features of income in rural areas differ from those for urban centres, suggesting researchers need to consider a broader range of attributes when examining rural wellbeing. The findings of this study illustrate the necessity of improving existing tools and methodologies to include under-represented populations in digital demographic studies: the failure to do so could result in misleading observations, conclusions, and, most importantly, policies.
Giovanni Quattrone (Middlesex University), Antonino Nocera (University of Pavia), Licia Capra (University College London) and Daniele Quercia (King's College London).
Abstract
Airbnb is one of the most successful examples of sharing economy marketplaces. With rapid and global market penetration, understanding its attractiveness and evolving growth opportunities is key to plan business decision making. There is ongoing debate, for example, about whether Airbnb is an hospitality service that fosters social exchanges between hosts and guests, as the sharing economy manifesto originally stated, or whether it is (or is evolving into) a purely business transaction platform, the way hotels have traditionally operated. To answer these questions, a scalable market analysis approach is needed, affording platform owners the ability to easily examine their market over time and across different locations. In this paper, we propose to do so by means of a novel market analysis approach that exploits customers' reviews. Using a combination of thematic analysis and machine learning techniques, we first build a platform specific dictionary of themes and sub-themes discussed in guests' reviews. Using quantitative linguistic analysis based on this dictionary, we then illustrate how to answer a variety of market research questions, at fine levels of thematic, temporal and spatial granularity.Alexandre Maros (UFMG), Jussara Almeida (UFMG), Fabrício Benevenuto (UFMG) and Marisa Vasconcelos (IBM).
Abstract
WhatsApp is a free messaging app with more than one billion monthly active users that has become one of the main communication platforms in many countries, including Saudi Arabia, Germany, and Brazil. In addition to allowing the direct exchange of messages among pairs of users, the app also enables group conversations, where multiple people can interact with one another. A number of recent studies have shown that WhatsApp groups play an important role as an information dissemination platform, especially during important social mobilization events. In this paper, we build upon those prior efforts by taking a first look into the use of audio messages in WhatsApp groups, a type of content that is becoming increasingly important on the platform. We present a methodology to analyze audio messages shared in WhatsApp groups, characterizing content properties (e.g., topics and language characteristics), their propagation dynamics, and the impact of different types of audio (e.g., speech versus music) on such dynamics.
Carolina Vieira (UFMG), Filipe Ribeiro (UFOP), Pedro Olmo Vaz de Melo (UFMG), Fabricio Benevenuto (Federal University of Minas Gerais (UFMG)) and Emilio Zagheni (Max Planck Institute for Demographic Research).
Abstract
Measuring the affinity to a particular culture has been an active area of research. Regions can be characterized in terms of cultural attributes such as clothing, music, art, and food. As one of the central attributes, the cuisine of a country can effectively reflect a dominant aspect of its culture, so the number of people interested in a typical national dish can be used to estimate the prevalence of that culture inside a host region. In this study, we measure the global spread of Brazilian food culture across countries by exploring Facebook users' preferences for typical Brazilian dishes on the Facebook Advertising Platform. First, to decide which dishes should be considered typically Brazilian, we use spatial analysis to understand the distribution of interests around the world and to quantify how typical each dish is in Brazil and among Brazilian immigrants. This methodology can be generalized to other countries to infer the cultural elements that immigrants carry with them during the migration process. The interest in typical Brazilian dishes can then be used to characterize countries in terms of their exposure to Brazilian culture. While evaluating the cultural distance between Brazil and the countries most preferred by Brazilian immigrants, we explore several measures of distance by comparing them in the context of affinity to Brazilian cuisine in different parts of the world. These measures of distance between countries, evaluated in terms of cultural preferences, can complement other metrics of distance applied to gravity-type models, for example, in order to explain flows of people between countries.
Security (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Bertil Chapuis (UNIL-HEC Lausanne), Olamide Omolola (TU Graz), Mauro Cherubini (UNIL-HEC Lausanne), Mathias Humbert (armasuisse S+T) and Kévin Huguenin (UNIL-HEC Lausanne).
Abstract
Web developers can (and do) include subresources such as scripts, stylesheets and images in their webpages. Such subresources might be stored on remote servers such as content delivery networks (CDNs). This practice creates security and privacy risks, should a subresource be corrupted, as was recently the case for the British Airways websites. The subresource integrity (SRI) recommendation, released in mid-2016 by the W3C, enables developers to include digests in their webpages in order for web browsers to verify the integrity of subresources before loading them. In this paper, we conduct the first large-scale longitudinal study of the use of SRI on the Web by analyzing massive crawls (~3B unique URLs) of the Web over the last 3.5 years. Our results show that the adoption of SRI is modest (~3.40%), but grows at an increasing rate and is highly influenced by the practices of popular library developers (e.g., Bootstrap) and CDN operators (e.g., jsDelivr). We complement our analysis of SRI with a survey of web developers (N = 227): it shows that a substantial proportion of developers know SRI and understand its basic functioning, but most of them ignore important aspects of the specification, such as the case of malformed digests. The results of the survey also show that the integration of SRI by developers is mostly manual, hence not scalable and error-prone. This calls for a better integration of SRI in build tools.
Yen-Hao Huang (National Tsing Hua University), Ting-Wei Liu (National Tsing Hua University), Ssu-Rui Lee (National Tsing Hua University), Fernando Henrique Calderon Alvarado (National Tsing Hua University) and Yi-Shin Chen (National Tsing Hua University).
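For readers unfamiliar with SRI, the digest discussed in the abstract above is simply a base64-encoded cryptographic hash of the subresource that the browser recomputes before using it. A small sketch of how a developer or build tool might produce the value; the file name is a placeholder.

```python
import base64
import hashlib

def sri_digest(path, algo="sha384"):
    """Compute an SRI integrity value of the form '<algo>-<base64 digest>'."""
    with open(path, "rb") as f:
        digest = hashlib.new(algo, f.read()).digest()
    return f"{algo}-{base64.b64encode(digest).decode('ascii')}"

# Usage (the path is illustrative):
#   print(sri_digest("jquery.min.js"))
# The resulting value goes into the tag's integrity attribute, e.g.
#   <script src="https://cdn.example.com/jquery.min.js"
#           integrity="sha384-..." crossorigin="anonymous"></script>
```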
Abstract
False information on the Internet has caused severe damage to society. Researchers have proposed methods to determine the credibility of news and have obtained good results. As different media sources (publishers) have different content generators (writers) and may focus on different topics or aspects, the word/topic distribution of each media source diverges from the others. We identify a generalizability challenge for existing content-based methods: they do not perform consistently on news from media sources that are not in the training set, which we call cross-source failure. A cross-source setting can cause a 15-19% decrease in accuracy for current methods; content-sensitive features are considered one of the major causes of cross-source failure for content-based approaches. To overcome this challenge, we propose a credibility pattern embedding neural network (CPENN), which focuses on function words and syntactic structure to learn generalizable representations for credibility analysis and further reinforce cross-source robustness across different media. Experiments with cross-validation on 194 real-world media sources showed that the proposed method could learn generalizable features and outperformed state-of-the-art methods on unseen media sources. Extensive analysis of the embedding representations highlights a strength of the proposed method over current content-embedding approaches. We envision that CPENN is more robust for real-life unreliable-news detection due to its good generalizability.
Pubali Datta (University of Illinois at Urbana-Champaign), Prabuddha Kumar (Stony Brook University), Tristan Morris (Silicon Valley Bank), Michael Grace (Samsung Electronics), Amir Rahmati (Stony Brook University) and Adam Bates (University of Illinois at Urbana-Champaign).
Abstract
Serverless Computing has quickly emerged as a dominant cloud computing paradigm, allowing developers to rapidly prototype event-driven applications using a composition of small functions that each perform a single logical task. However, many such application workflows are based in part on publicly-available functions written by third parties, creating the potential for functions to behave in unexpected, or even malicious, ways. At present, developers are not in total control of where and how their data is flowing, creating significant security and privacy risks in growth markets that have embraced serverless (e.g., IoT). As a practical means of addressing this problem, we present Valve, a serverless platform that enables developers to exert complete and fine-grained control of information flows in their applications. Valve enables workflow developers to reason about function behaviors, and specify restrictions, through auditing of network-layer information flows. By proxying network requests and propagating taint labels across network flows, Valve is able to restrict function behavior without code modification. We demonstrate that Valve is able to defend against known serverless attack behaviors, including container reuse-based persistence and data exfiltration over cloud platform APIs, with less than 10% runtime overhead, 4.7% deployment overhead and 8.28% teardown overhead.
Nikesh Joshi (Boise State University), Francesca Spezzano (Boise State University), Mayson Green (Boise State University) and Elijah Hill (Boise State University).
Abstract
Wikipedia, the free and open-collaboration based online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia's fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained for free by volunteer editors following well-defined guidelines intended to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing it. In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework using a set of features based on both the content of the articles and the edit-history patterns of the users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring tool currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.
Search (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Zhuyun Dai (Carnegie Mellon University) and Jamie Callan (Carnegie Mellon University).
Abstract
Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDTerm, a hierarchical document term weighting framework for document indexing and retrieval. It first estimates the semantic importance of a term at the passage level. The deep and fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored in a standard inverted index for efficient retrieval. This paper also proposes two approaches that enable training HDTerm without relevance labels. Experiments show that an index using HDTerm weights significantly improves retrieval accuracy over a standard term-frequency-based index and a state-of-the-art embedding-based index.
Khoa Doan (Virginia Tech) and Chandan K. Reddy (Virginia Tech).
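The aggregation step in the HDTerm abstract above (passage-level term weights collapsed into one document-level bag of words that an inverted index can store) can be pictured with a small sketch. Taking the maximum over passages is an assumption made here for illustration; the paper's exact aggregation function may differ.

```python
from collections import defaultdict

def aggregate_term_weights(passage_weights):
    """passage_weights: list of {term: weight} dicts, one per passage.
    Returns a document-level {term: weight} bag of words (max over passages)."""
    doc_weights = defaultdict(float)
    for weights in passage_weights:
        for term, w in weights.items():
            doc_weights[term] = max(doc_weights[term], w)
    return dict(doc_weights)

# Toy usage with made-up passage-level weights.
passages = [{"neural": 0.9, "ranking": 0.4}, {"ranking": 0.7, "index": 0.2}]
print(aggregate_term_weights(passages))
# {'neural': 0.9, 'ranking': 0.7, 'index': 0.2}
```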
Abstract
Searching for documents with semantically similar content is a fundamental problem in the information retrieval domain with various challenges, primarily, in terms of efficiency and effectiveness. Despite the promise of modeling structured dependencies in documents, several existing text-hashing methods lack an efficient mechanism to incorporate such vital information. Additionally, the desired characteristics of an ideal hash function, such as robustness to noise, low quantization error and bit balance/uncorrelation, are not effectively learned in existing methods. This is because of the requirement to either tune additional hyper-parameters or optimize additional non-trivial cost functions. In this paper, we propose a Denoising Adversarial Binary Autoencoder (DABA) model which presents a novel representation learning framework that captures structured representation of text documents in the learned hash function. Also, adversarial training provides an alternative direction to implicitly learn a hash function that captures all the desired characteristics of an ideal hash function. Essentially, DABA adopts a novel single-optimization adversarial training procedure that minimizes the Wasserstein distance in its primal domain to regularize the encoder's output of either a recurrent neural network or a convolutional autoencoder. We empirically demonstrate the effectiveness of our proposed method in capturing the intrinsic semantic manifold of the related documents. The proposed method outperforms the current state-of-the-art shallow and deep unsupervised hashing methods for the document retrieval task on several prominent document collections.Shuguang Han (Google), Michael Bendersky (Google), Przemek Gajda (Google), Sergey Novikov (Google), Marc Najork (Google), Bernhard Brodowsky (Google) and Alexandrin Popescul (Pinterest).
Abstract
The rapid growth of commercial web content has driven the development of shopping search services to facilitate users seeking product information. Due to the dynamic nature of commercial content, an optimal recrawl policy is a key component in a shopping search service; it ensures that users have access to the most up-to-date product details. Prior studies proposed various strategies to maximize content freshness; however, they often relied on simple heuristics and overlooked crawling resource budgets. To address this, Azar et al. [5] recently proposed a joint optimization strategy, LambdaCrawl, aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future change rates can be estimated. Therefore, we adopt a state-of-the-art deep learning model for change rate prediction, which results in a substantial improvement of content freshness over the common LambdaCrawl implementation with change rates estimated from past history. Moreover, we demonstrate that while LambdaCrawl is a significant advance over existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we employ a K-armed adversarial bandits algorithm that can provably optimize the overall content freshness by combining multiple strategies. Empirical results over a large-scale production dataset demonstrate that the proposed adversarial bandits approach outperforms LambdaCrawl by a large margin, especially under tight resource budgets.
Roee Shraga (Technion - Israel Institute of Technology), Haggai Roitman (IBM), Guy Feigenblat (IBM) and Mustafa Canim (IBM).
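The K-armed adversarial bandit mentioned in the recrawl abstract above belongs to the EXP3 family; a generic EXP3 sketch over a set of recrawl strategies is shown below. The reward function, horizon and gamma are placeholders, and the paper's exact algorithm and parameters may well differ.

```python
import math
import random

def exp3(n_arms, reward_fn, rounds=1000, gamma=0.1):
    """Generic EXP3. Rewards must lie in [0, 1]; reward_fn(arm) returns the
    (possibly adversarial) reward of pulling `arm` in the current round."""
    weights = [1.0] * n_arms
    for _ in range(rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm)              # e.g., freshness gained this round
        estimate = reward / probs[arm]       # importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return weights

# Toy usage: three hypothetical recrawl strategies with noisy freshness payoffs.
payoffs = [0.2, 0.6, 0.4]
final = exp3(3, lambda a: min(1.0, max(0.0, random.gauss(payoffs[a], 0.1))))
print(final)  # the second strategy should end up with the largest weight
```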
Abstract
Given a keyword query, the ad hoc table retrieval task aims at retrieving a ranked list of the top-k most relevant tables in a given table corpus. Previous works have primarily focused on designing table-centric lexical and semantic features, which could be utilized for learning-to-rank (LTR) tables. In this work, we make a novel use of intrinsic (passage-based) and extrinsic (manifold-based) table similarities for enhanced retrieval. Using the WikiTables benchmark, we study the merits of utilizing such similarities for this task. To this end, we combine both similarity types via a simple, yet an effective, cascade re-ranking approach. Overall, our proposed approach results in a significantly better table retrieval quality, which even transcends that of strong semantically-rich baselines.Matteo Lissandrini (Aalborg University), Davide Mottin (Aarhus University), Themis Palpanas (Paris Descartes University) and Yannis Velegrakis (Utrecht University).
Abstract
We consider the task of exploratory search through graph queries on knowledge graphs. We propose to assist the user by expanding the query with intuitive suggestions to provide a more informative (full) query that can retrieve more detailed and relevant answers. To achieve this result, we propose a model that bridges graph search paradigms with well-established techniques for information retrieval. Our approach does not require any additional knowledge from the user and builds on principled language modelling approaches. We empirically show the effectiveness and efficiency of our approach on a large knowledge graph and how our suggestions are able to help build more complete and informative queries.
Mobile (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Xi Chen (Samsung Electronics Canada), Hang Li (Samsung Electronics Canada), Chenyi Zhou (Samsung Electronics Canada), Xue Liu (Samsung Electronics Canada), Di Wu (Samsung Electronics Canada) and Gregory Dudek (Samsung Electronics Canada).
Abstract
Emerging location-aware applications, such as cashier-less shopping, mobile ads targeting and geo-based Augmented Reality (AR), are changing people's lives fundamentally. In order to fully support these new applications, location information with meter-level resolution (or even higher) is required anytime and anywhere. Unfortunately, most of the current location sources (e.g., check-in data and GPS) are either unavailable indoors or provide only house-level resolution. To fill the gap, this paper utilizes ubiquitous WiFi signals to establish a meter-level localization system, which employs WiFi propagation characteristics as location fingerprints. However, an unsolved issue of these WiFi fingerprints is their inconsistency across different users. In other words, WiFi fingerprints collected for one user may not be used to localize another user. To address this issue, we propose a WiFi-based domain-adaptive system, FiDo, which is able to localize many different users with labelled data from only one or two example users. FiDo contains two modules: 1) a data augmenter that introduces data diversity using a Variational Autoencoder (VAE); and 2) a domain-adaptive classifier that adjusts itself to newly collected unlabelled data using a joint classification-reconstruction structure. Compared to the state of the art, FiDo increases the average F1 score by 11.8% and improves the worst-case accuracy by 20.2%.
Suining He (University of Michigan--Ann Arbor & The University of Connecticut) and Kang G. Shin (University of Michigan--Ann Arbor).
Abstract
Accurate bike-flow prediction at the individual station level is essential for bike sharing service. Due to the spatial and temporal complexities of traffic networks and the lack of data-driven design for bike stations, existing methods cannot predict the fine-grained bike flows to/from each station. To remedy this problem, we propose a novel data-driven spatiotemporal Graph attention convolutional neural network for Bike station-level flow prediction (GBikes). We develop data-driven and spatio-temporal designs, and model bike stations (nodes) and interstation bike rides (edges) as a graph. In particular, we design a novel graph attention convolutional neural network (GACNN) with attention mechanisms capturing and differentiating station-to-station correlations. Multi-level temporal closeness, spatial distances and other external factors (e.g., weather and points of interest) are jointly considered for comprehensive learning and accurate prediction of bike flows at each station. Extensive experiments upon a total of over 11 million trips collected from three large-scale bike-sharing systems in New York City, Chicago, and Los Angeles have corroborated GBikes’s significant improvement of accuracy, robustness and effectiveness over prior work.Suining He (University of Michigan--Ann Arbor & University of Connecticut) and Kang G. Shin (University of Michigan--Ann Arbor).
Abstract
Thanks to recent progresses in mobile payment, IoT, electric motors, batteries and location-based services, Dockless E-scooter Sharing (DES) has become a popular means of last-mile commute for a growing number of (smart) cities. As e-scooters are getting deployed dynamically and flexibly across city regions that expand and/or shrink, with subsequent social, commercial and environmental evaluation, accurate prediction of the distribution of e-scooters given reconfigured regions becomes essential for the city planners and service providers.To meet this need, we propose GCScoot, a novel dynamic flow distribution prediction for reconfiguring urban DES systems. Based on the real-world datasets with reconfiguration, we analyze the mobility features of the e-scooter distribution and flow dynamics for the data-driven designs. To adapt to dynamic reconfiguration of DES deployment, we propose a novel spatio-temporal graph capsule neural network within GCScoot to predict the future dockless e-scooter flows given the reconfigured regions. GCScoot preprocesses the historical spatial e-scooter distributions into flow graph structures, where discretized city regions are considered as nodes and their mutual flows as edges. Given data-driven designs regarding distance, ride flows and region connectivity, the dynamic region-to-region correlations embedded within the temporal flow graphs are captured through the graph capsule neural network which accurately predicts the DES flows.We have conducted extensive empirical studies upon three different e-scooter datasets (>2.8 million rides in total) in populous US cities including Austin TX, Louisville KY and Minneapolis MN. The evaluation results have corroborated the accuracy and effectiveness of GCScoot in predicting dynamic distribution of dockless e-scooters’ mobility.Zhihao Wang (Institute of Information Engineering, Chinese Academy of Sciences), Qiang Li (School of Computer and Information Technology, Beijing Jiaotong University), Jinke Song (School of Computer and Information Technology, Beijing Jiaotong University), Haining Wang (Virginia Tech) and Limin Sun (Institute of Information Engineering, Chinese Academy of Sciences).
Abstract
IP-based geolocation is essential for various location-aware Internet applications, such as online advertisement, content delivery, and online fraud prevention. Achieving accurate geolocation relies heavily on the number of high-quality (i.e., fine-grained and stable over time) landmarks. However, previous efforts to garner landmarks have been impeded by the limited number of visible landmarks on the Internet and the cost of manual effort. In this paper, we leverage the availability of numerous online webcams that are used to monitor physical surroundings as a rich source of promising high-quality landmarks for serving IP-based geolocation. In particular, we present a new framework called GeoCAM, which is designed to automatically generate qualified landmarks from online webcams, providing IP-based geolocation services with high accuracy and wide coverage. GeoCAM periodically monitors websites that are hosting live webcams and uses natural language processing techniques to extract the IP addresses and latitude/longitude of webcams for generating landmarks at large scale. We develop a prototype of GeoCAM and conduct real-world experiments to validate its efficacy. Our results show that GeoCAM can detect 282,902 live webcams hosted in webpages with 94.2% precision and 90.4% recall, and then generate 16,863 stable and fine-grained landmarks, which are two orders of magnitude more than the landmarks used in prior works. Thus, by correlating a large set of landmarks, GeoCAM is able to provide a geolocation service with high accuracy and wide coverage.
Web Mining-B (1)
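As a toy illustration of the landmark extraction idea in the GeoCAM abstract above, the sketch below pulls an IP address and a latitude/longitude pair out of a webcam page with regular expressions. The real system relies on NLP over many page layouts; the patterns and the sample snippet here are purely illustrative.

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
LATLON_RE = re.compile(r"lat(?:itude)?\D{0,3}(-?\d{1,2}\.\d+)\D+"
                       r"lon(?:gitude)?\D{0,3}(-?\d{1,3}\.\d+)", re.I)

def extract_landmark(html):
    """Return an (IP, lat, lon) landmark candidate from raw page text, if any."""
    ip = IP_RE.search(html)
    coords = LATLON_RE.search(html)
    if ip and coords:
        return {"ip": ip.group(0),
                "lat": float(coords.group(1)),
                "lon": float(coords.group(2))}
    return None

# Made-up page snippet.
page = 'stream at http://203.0.113.42/live ... "latitude": 48.8584, "longitude": 2.2945'
print(extract_landmark(page))
# {'ip': '203.0.113.42', 'lat': 48.8584, 'lon': 2.2945}
```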
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Christof Naumzik (ETH Zurich), Patrick Zoechbauer (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
Abstract
Points-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data: they greatly explain the spatial distributions of urban phenomena. The conventional modeling approach relies upon feature engineering, yet it ignores the spatial structure among POIs. In order to overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization in order to model the influence of POIs on a given outcome variable. Specifically, we accommodate the spatial distributions of both the outcome and the POIs. In our case, this is modeled by a sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering; instead, we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, we even show that an analytical expression can be obtained. This allows fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases.
Zhengjie Miao (Duke University), Yuliang Li (Megagon Labs), Xiaolan Wang (Megagon Labs) and Wang-Chiew Tan (Megagon Labs).
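A small numerical sketch of the distance-weighted kernel idea from the POI abstract above: the influence of nearby POIs on an outcome at a location is a sum of kernel contributions that decay with distance. The squared-exponential kernel and all parameter values below are illustrative assumptions; in the paper these parameterizations are learned.

```python
import numpy as np

def poi_influence(location, poi_coords, weights, length_scale=0.5):
    """Sum of distance-weighted kernel contributions from each POI.
    location: (2,), poi_coords: (n, 2), weights: (n,) e.g. one per POI type."""
    d2 = np.sum((poi_coords - location) ** 2, axis=1)
    return float(np.sum(weights * np.exp(-d2 / (2.0 * length_scale ** 2))))

# Toy example: two restaurants and one bar around a query location (km grid).
pois = np.array([[0.2, 0.1], [1.5, -0.4], [0.0, 0.9]])
w = np.array([1.0, 1.0, -0.5])   # e.g., bars could contribute negatively
print(poi_influence(np.array([0.0, 0.0]), pois, w))
```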
Abstract
Online services are interested in solutions to opinion mining, which is the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data.In this paper, we study the problem of how to significantly reduce the amount of labeled training data required in fine-tuning language models for opinion mining. We describe Snippext, an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data. A novelty of Snippext is its clever use of a two-prong approach to achieve state-of-the-art (SOTA) performance with little labeled training data through: (1) data augmentation to automatically generate more labeled training data from existing ones, and (2) a semi-supervised learning technique to leverage the massive amount of unlabeled data in addition to the (limited amount of) labeled data. We show with extensive experiments that Snippext performs comparably and can even exceed previous SOTA results on several opinion mining tasks with only half the training data required. Furthermore, it achieves new SOTA results when all training data are leveraged. By comparison to a baseline pipeline, we found that Snippext extracts significantly more fine-grained opinions which enable new opportunities of downstream applications.Wenhao Yu (University of Notre Dame), Wei Peng (Zhejiang University), Yu Shu (Sichuan University), Qingkai Zeng (University of Notre Dame) and Meng Jiang (University of Notre Dame).
Abstract
Data Science has been one of the most popular fields in higher education and research activities. It takes an enormous amount of time to read the experimental sections of thousands of papers and figure out the performance of data science techniques. In this work, we build an experimental evidence extraction system to automate the integration of tables (in the paper PDFs) into a database of experimental results. First, it crops the tables and recognizes the templates. Second, it classifies the column names and row names into "method", "dataset", or "evaluation metric", and then unifies all the table cells into (method, dataset, metric, score) quadruples. We propose hybrid features, including structural and semantic table features, as well as an ensemble learning approach for column/row name classification and table unification. SQL statements can then be used to answer questions such as whether a method is the state of the art or whether the reported numbers are conflicting.
Weichao Wang (Northeastern University), Shi Feng (Northeastern University, China), Wei Gao (Victoria University of Wellington), Daling Wang (Northeastern University, China) and Yifei Zhang (Northeastern University, China).
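The last sentence of the abstract above (answering state-of-the-art and conflict questions with SQL over the extracted quadruples) can be pictured with a tiny sketch using Python's built-in sqlite3; the rows below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (method TEXT, dataset TEXT, metric TEXT, score REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [("BERT-base", "SQuAD", "F1", 88.5),    # made-up numbers
     ("BiDAF",     "SQuAD", "F1", 77.3),
     ("BERT-base", "SQuAD", "F1", 90.9)],   # a conflicting report
)

# Which method is the state of the art on a given dataset/metric?
best = conn.execute(
    "SELECT method, MAX(score) FROM results WHERE dataset=? AND metric=?",
    ("SQuAD", "F1")).fetchone()

# Do different papers report conflicting numbers for the same cell?
conflicts = conn.execute(
    "SELECT method, dataset, metric, COUNT(DISTINCT score) AS n FROM results "
    "GROUP BY method, dataset, metric HAVING n > 1").fetchall()

print(best, conflicts)
```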
Abstract
In open-domain dialogue systems, dialogue cues such as emotion, persona, and emoji can be incorporated into conversation models to strengthen the semantic relevance of generated responses. Existing neural response generation models either incorporate the dialogue cue into the decoder's initial state or embed the cue indiscriminately into the state of every generated word, which may cause the gradients of the embedded cue to vanish or disturb the semantic relevance of generated words during back-propagation. In this paper, we propose a Cue Adaptive Decoder (CueAD) that aims to dynamically determine the involvement of a cue at each generation step in the decoding. For this purpose, we extend the Gated Recurrent Unit (GRU) network with an adaptive cue representation for facilitating cue incorporation, in which an adaptive gating unit is utilized to decide when to incorporate cue information so that the cue can provide useful clues for enhancing the semantic relevance of generated words. Experimental results show that CueAD outperforms state-of-the-art baselines with large margins.
Semantics (1)
(UTC/GMT +8) 11:00-13:00, April, 22, Wednesday
Meeting rooms are not available now
Ningyu Zhang (Alibaba Group & AZFT Joint Lab for Knowledge Engine), Shumin Deng (Zhejiang University & AZFT Joint Lab for Knowledge Engine), Zhanlin Sun (Carnegie Mellon University), Jiaoyan Chen (University of Oxford), Wei Zhang (Alibaba Group & AZFT Joint Lab for Knowledge Engine) and Huajun Chen (Zhejiang University & AZFT Joint Lab for Knowledge Engine).
Abstract
Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is the low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.Mohamed H Gad-Elrab (Max Planck Institute for Informatics), Evgeny Kharlamov (Bosch Center for Artificial Intelligence), Daria Stepanova (Bosch Center for Artificial Intelligence), Jannik Stroetgen (Bosch Center for Artificial Intelligence) and Trung-Kien Tran (Bosch Center for Artificial Intelligence).
Abstract
Knowledge graphs (KGs) are essential resources for many applications including Web search and Question Answering. As KGs are often automatically constructed (e.g., from the web) and enriched (e.g., using embedding-based completion), they may contain incorrect facts. Detecting them is a crucial, yet extremely expensive task. Prominent solutions detect and explain inconsistencies in KGs with respect to accompanying ontologies that describe the KG domain of interest. Compared to machine learning methods they are more reliable and human-interpretable but scale poorly on large KGs. In this paper, we present a novel approach to dramatically speed up the process of detecting and explaining inconsistencies in large KGs by exploiting KG abstractions that capture prominent data patterns. Though much smaller in size, KG abstractions preserve inconsistency and their explanations. Our experiments with large-scale KGs (e.g., DBpedia and Yago) demonstrate the feasibility of our approach and show that it significantly outperforms the popular baseline. The discovered inconsistency explanations in these large-scale KGs further help in making the results interpretable.Caleb Belth (University of Michigan), Xinyi Zheng (University of Michigan), Jilles Vreeken (Helmholtz Center for Information Security (CISPA)) and Danai Koutra (University of Michigan).
Abstract
Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG.In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node based on its type and information in the KG. Stepping away from the traditional support/confidence-based rule mining techniques, we propose KGIST, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle—a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGIST outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying up to 92.88% of missing entities—at least 10% more than baselines), while also being efficient for large knowledge graphs.Unmesh Joshi (Vrije University) and Jacopo Urbani (Vrije University).
Abstract
Embedding-based models of Knowledge Graphs (KGs) can be used to predict the existence of missing links in the KG by ranking entities according to their likelihood scores computed using the embeddings. An exhaustive computation of all likelihood scores is very expensive if the KG is large. To counter this problem, we propose a technique to reduce the search space by identifying smaller subsets of promising entities. Our technique first creates embeddings of subgraphs using the embeddings from the model. Then, it ranks the subgraphs, based on the metrics and considers only the entities in the top k subgraphs. Our empirical evaluation shows that our technique is able to reduce the search space significantly while maintaining a good recall.Qi Zhu (University of Illinois Urbana-Champaign), Hao Wei (Amazon Inc.), Bunyamin Sisman (Amazon Inc.), Da Zheng (Amazon Inc.), Christos Faloutsos (Carnegie Mellon University), Xin Luna Dong (Amazon Inc.) and Jiawei Han (University of Illinois Urbana-Champaign).
Abstract
A knowledge graph (e.g., Freebase, YAGO) is a multi-relational graph representing rich factual information among entities of various types. Entity alignment is the key step towards knowledge graph integration from multiple sources. It aims to identify entities across different knowledge graphs that refer to the same real-world entity. However, current entity alignment systems overlook the sparsity of different knowledge graphs and cannot align multi-type entities with one single model. In this paper, we present a Collective Graph neural network for Multi-type entity Alignment, called CG-MuAlign. Different from previous work, CG-MuAlign jointly aligns multiple types of entities, collectively leverages the neighborhood information and generalizes to unlabeled entity types. Specifically, we propose a novel collective aggregation function tailored for this task that (1) relieves the incompleteness of knowledge graphs via both cross-graph and self attentions, and (2) scales up efficiently with a mini-batch training paradigm and an effective neighborhood sampling strategy. We conduct experiments on real-world knowledge graphs with millions of entities and observe superior performance over existing methods. In addition, the running time of our approach is much less than that of the current state-of-the-art deep learning methods.
Research Tracks (2)
Web Mining-A (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Wenhao Yu (University of Notre Dame), Mengxia Yu (Peking Univerisity), Tong Zhao (University of Notre Dame) and Meng Jiang (University of Notre Dame).
Abstract
Citing, quoting, and forwarding & commenting behaviors are widely seen in academia, news media, and social media. Existing behavior modeling approaches focused on mining content and describing preferences of authors, speakers, and users. However, behavioral intention plays an important role in generating content on the platforms. In this work, we propose to identify the referential intention which motivates the action of using the referred (e.g., cited, quoted, and retweeted) source and content to support their claims. We adopt a theory in sociology to develop a schema of four types of intentions. The challenge lies in the heterogeneity of observed contextual information surrounding the referential behavior, such as referred content (e.g., a cited paper), local context (e.g., the sentence citing the paper), neighboring context (e.g., the former and latter sentences), and network context (e.g., the academic network of authors, affiliations, and keywords). We propose a new neural framework with Interactive Hierarchical Attention (IHA) to identify the intention of referential behavior by properly aggregating the heterogeneous contexts. Experiments demonstrate that the proposed method can effectively identify the type of intention of citing behaviors (on academic data) and retweeting behaviors (on Twitter). And learning the heterogeneous contexts collectively can improve the performance. This work opens a door for understanding content generation from a fundamental perspective of behavior sciences.Zhen Guo (North Carolina State University), Zhe Zhang (IBM) and Munindar Singh (North Carolina State University).
Abstract
Understanding how people change their views during argumentative discussions is important in applications that involve human communication, e.g., in social media and education. Existing research focuses on lexical features of individual comments, dynamics of discussions, or the personalities of participants but deemphasizes a challenging factor: cumulative influence of the discussion on a participant's mindset that is exerted by the interplay of comments by different participants during the discussion.We make the following contributions. (1) We demonstrate the necessity of considering an individual's perception of comments from other participants for predicting persuasiveness through a human study. (2) We tackle the challenging task of predicting the points where a user's view changes considering the whole discussion, which includes massive noise and plausible alternatives. (3) We present a sequential model for cumulative influence that captures the interplay between comments as both local and nonlocal dependencies, and demonstrate its capability of selecting the most effective information for changing views. (4) We identify contextual and interactive features and propose corresponding sequence structures to incorporate these features. Our empirical evaluation using a Reddit Change My View dataset shows that contextual and interactive features are valuable in predicting view changes, and a sequential model notably outperforms the nonsequential baseline models.Zhuoyi Wang (The University of Texas at Dallas), Yigong Wang (The University of Texas at Dallas), Yu Lin (The University of Texas at Dallas), Evan Delord (The University of Texas at Dallas) and Khan Latifur (The University of Texas at Dallas).
Abstract
Deep Neural Networks (DNNs) have been widely demonstrated to be effective for closed-world classification problems where the number of categories is fixed. However, DNNs notoriously fail at label prediction over non-stationary data streams, where unknown or novel classes (categories not in the training set) continuously emerge. To meet this challenge, the DNN should not only be able to detect novel classes, but also incrementally learn new concepts from a few examples over time. Little existing work addresses both problems simultaneously. In this paper, we focus on improving not only the ability of DNNs to generalize to novel classes, but also the effectiveness of continuously learning novel categories from only a few instances of a data stream. Unlike existing approaches that rely heavily on abundant labeled instances to train/update the model, our proposed Few Sample and Adversarial Representation Learning (FSAR) framework first trains a joint learning model to achieve an intra-class compact and inter-class separated representation and to recognize novel classes; next, through active annotation requests, we collect a few samples belonging to such new categories and utilize episode training to exploit the intrinsic features for few-shot learning. Specifically, for the first step we implement a metric learning approach with an adversarial confusion term, which encourages robustness and generalization by reducing over-confidence on the seen classes. Once trained, FSAR is able to extract discriminative features for novel categories and is incorporated with the joint representation model to facilitate few-sample learning in the stream. We evaluated FSAR on several different datasets (CUB-200, EMNIST, FASHION-MNIST and CIFAR-10); extensive experimental results on various simulated benchmark streams show that FSAR effectively outperforms current state-of-the-art approaches.
Guiliang Liu (Baidu), Xu Li (Baidu), Jiakang Wang (Baidu), Mingming Sun (Baidu) and Ping Li (Baidu).
Abstract
Extracting knowledge from general web text requires building a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes to apply Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete Reinforcement Learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search for the optimal predictions during inference. Empirical evaluation demonstrates that the MCTS inference substantially improves prediction accuracy (by more than 10%) and achieves leading performance over other state-of-the-art comparison models.
Leon Bornemann (Hasso Plattner Institute), Tobias Bleifuß (Hasso Plattner Institute), Dmitri V. Kalashnikov (AT&T Labs - Research), Felix Naumann (Hasso Plattner Institute) and Divesh Srivastava (AT&T Labs-Research).
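For readers unfamiliar with MCTS as used in the OIE abstract above, the heart of the search is a selection rule (commonly UCT) that trades off the simulator's reward estimates against exploration. A generic sketch of that selection step follows; it is not the paper's extractor, and the statistics are placeholders.

```python
import math

def uct_select(children, c=1.4):
    """children: list of dicts with visit count 'n' and total reward 'w'.
    Returns the index of the child maximizing the UCT score."""
    total_visits = sum(ch["n"] for ch in children)
    def score(ch):
        if ch["n"] == 0:
            return float("inf")          # always try unvisited actions first
        exploit = ch["w"] / ch["n"]      # average reward from the simulator
        explore = c * math.sqrt(math.log(total_visits) / ch["n"])
        return exploit + explore
    return max(range(len(children)), key=lambda i: score(children[i]))

# Toy usage: three candidate next tokens with placeholder statistics.
candidates = [{"n": 10, "w": 6.0}, {"n": 3, "w": 2.4}, {"n": 0, "w": 0.0}]
print(uct_select(candidates))  # 2, since the unvisited action is tried first
```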
Abstract
Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time. To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table's version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research.
Social Network-A (2)
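The defining property behind the natural keys discussed in the abstract above, namely that a key uniquely identifies each row of a table, can be checked directly once a candidate column set is proposed; the paper's classifier then decides which unique column sets are natural. A small plain-Python sketch on an illustrative table:

```python
def is_candidate_key(rows, columns):
    """rows: list of dicts; columns: tuple of column names.
    True if no two rows share the same values on `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

# Toy Wikipedia-style table of race results.
table = [
    {"Year": 2008, "Event": "100m", "Athlete": "Bolt"},
    {"Year": 2012, "Event": "100m", "Athlete": "Bolt"},
    {"Year": 2012, "Event": "200m", "Athlete": "Bolt"},
]
print(is_candidate_key(table, ("Athlete",)))        # False
print(is_candidate_key(table, ("Year", "Event")))   # True
```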
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Pengyang Wang (University of Central Florida), Jiaping Gui (NEC Laboratories America, Inc.), Zhengzhang Chen (NEC Laboratories America, Inc.), Junghwan Rhee (NEC Laboratories America, Inc.), Haifeng Chen (NEC Laboratories America, Inc.) and Yanjie Fu (University of Central Florida).
Abstract
Graph Convolutional Networks (GCNs) have been shown to be a powerful tool for analyzing graph-structured data. Most previous GCN methods focus on learning a good node representation by aggregating the representations of neighboring nodes, while largely ignoring edge information. Although a few recent methods have been proposed to integrate edge attributes into GCNs to initialize edge embeddings, these methods do not work when edge attributes are (partially) unavailable. Can we develop a generic edge-empowered framework to exploit node-edge enhancement, regardless of the availability of edge attributes? In this paper, we propose a novel framework, EE-GCN, that achieves node-edge enhancement. In particular, the framework EE-GCN includes three key components: (i) Initialization: this step initializes the embeddings of both nodes and edges. Unlike node embedding initialization, we propose a line-graph-based method to initialize the embedding of edges regardless of edge attributes. (ii) Feature space alignment: we propose a translation-based mapping method to align the edge embedding space with the node embedding space, and the objective function is penalized by a translation loss when the two spaces are not aligned. (iii) Node-edge mutually enhanced updating: node embeddings are updated by aggregating the embeddings of neighboring nodes and associated edges, while edge embeddings are updated from the embeddings of the associated nodes and the edge itself. Through the above improvements, our framework provides a generic strategy for all spatial-based GCNs to allow edges to participate in embedding computation and exploit node-edge mutual enhancement. Finally, we present extensive experimental results to validate the improved performance of our method in terms of node classification, link prediction, and graph classification.
Man Wu (Florida Atlantic University), Shirui Pan (Monash University), Chuan Zhou (Chinese Academy of Sciences), Xiaojun Chang (Monash University) and Xingquan Zhu (Florida Atlantic University).
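The line-graph-based edge initialization mentioned in the EE-GCN abstract above can be illustrated with networkx: every edge of the original graph becomes a node of the line graph, so edge embeddings can be initialized by running any node-embedding routine on it even when edge attributes are missing. A minimal sketch on a toy graph (not from the paper):

```python
import networkx as nx

# Toy attribute-free graph.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

# In the line graph L(G), each node corresponds to an edge of G, and two
# nodes are adjacent iff the underlying edges share an endpoint.
L = nx.line_graph(G)
print(L.number_of_nodes(), L.number_of_edges())  # 4 5

# Any node-embedding or aggregation scheme run over L then yields an
# initialization for the edge embeddings of G, with no edge attributes needed.
```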
Man Wu (Florida Atlantic University), Shirui Pan (Monash University), Chuan Zhou (Chinese Academy of Sciences), Xiaojun Chang (Monash University) and Xingquan Zhu (Florida Atlantic University).
Abstract
Graph convolutional networks (GCNs) have achieved impressive success in many graph-related analytics tasks. However, most GCNs only work in a single domain (graph), incapable of transferring knowledge from/to other domains (graphs), due to the challenges in both graph representation learning and domain adaptation over graph structures. In this paper, we present a novel approach, unsupervised domain adaptive graph convolutional networks (UDA-GCN), for domain adaptation learning on graphs. To enable effective graph representation learning, we first develop a dual graph convolutional network component, which jointly exploits local and global consistency for feature aggregation. An attention mechanism is further used to produce a unified representation for each node in different graphs. To facilitate knowledge transfer between graphs, we propose a domain adaptive learning module to optimize three different loss functions, namely the source classifier loss, the domain classifier loss, and the target classifier loss, as a whole, so that our model can respectively differentiate class labels in the source domain, distinguish samples from different domains, and differentiate class labels in the target domain. Experimental results on real-world datasets in the node classification task validate the performance of our method, compared to state-of-the-art graph neural network algorithms.
Xinyu Fu (The Chinese University of Hong Kong), Jiani Zhang (The Chinese University of Hong Kong), Ziqiao Meng (The Chinese University of Hong Kong) and Irwin King (The Chinese University of Hong Kong).
Abstract
A large number of real-world graphs or networks are inherently heterogeneous, involving a diversity of node types and relationships between nodes. Heterogeneous graph embedding aims to embed the rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing models usually define multiple metapaths in a heterogeneous graph to capture the composite relations and guide neighbor selection. However, these models either omit node content features, discard intermediate nodes along the metapath, or only consider one metapath. To address these three limitations, we propose a new model named Metapath Aggregated Graph Neural Network (MAGNN) to boost the final performance. Specifically, MAGNN employs three major components, i.e., the node-type-specific transformation part to encapsulate input node content, the node-level metapath instance aggregation part to incorporate semantic intermediate nodes, and the metapath-level embedding fusion part to combine messages from multiple paths. Extensive experiments on three real-world heterogeneous graph datasets for node classification, node clustering, and link prediction show that MAGNN achieves more accurate prediction results than state-of-the-art baselines.
Liang Qu (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology), Huaisheng Zhu (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology), Qiqi Duan (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology) and Yuhui Shi (Shenzhen Key Laboratory of Computational Intelligence, Southern University of Science and Technology).
Abstract
Recently, graph neural networks (GNNs) have been shown to be an effective tool for learning the node representations of networks and have achieved good performance on the semi-supervised node classification task. However, most existing GNN methods fail to take networks' temporal information into account and therefore cannot be well applied to dynamic network applications such as the continuous-time link prediction task. To address this problem, we propose a Temporal Dependent Graph Neural Network (TDGNN), a simple yet effective dynamic network representation learning framework that incorporates the network's temporal information into GNNs. TDGNN introduces a novel Temporal Aggregator (TDAgg) to aggregate the neighbor nodes' features and the edges' temporal information to obtain the target node representations. Specifically, it assigns aggregation weights to neighbor nodes using an exponential distribution to bias different edges' temporal information. The performance of the proposed method has been validated on six real-world dynamic network datasets for the continuous-time link prediction task. The experimental results show that the proposed method outperforms several state-of-the-art baselines.
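The exponential temporal weighting described above can be sketched in a few lines. This is a minimal illustration assuming that more recent edges receive larger weights and that `lam` is a hypothetical decay rate; the exact parameterization TDGNN uses may differ:

```python
import numpy as np

def temporal_aggregation(neighbor_feats, edge_times, t_now, lam=0.1):
    """Aggregate neighbor features with exponentially decaying weights:
    edges observed more recently (smaller t_now - t_j) contribute more."""
    ages = t_now - np.asarray(edge_times, dtype=float)
    weights = np.exp(-lam * ages)
    weights = weights / weights.sum()     # normalize into a distribution
    return weights @ np.asarray(neighbor_feats, dtype=float)

feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three neighbors, 2-d features
times = [1.0, 5.0, 9.0]                        # timestamps of the three edges
print(temporal_aggregation(feats, times, t_now=10.0))
```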
Flavio Chierichetti (Sapienza University of Rome), Ravi Kumar (Google) and Andrew Tomkins (Google).
Abstract
In this paper we study the limiting dynamics of a sequential process that generalizes Pólya's urn. This process has also been studied in the context of language generation, discrete choice, repeat consumption, and models for the web graph. The process we study generates future items by copying from past items. It is parameterized by a sequence of weights describing how much to prefer copying from recent versus more distant locations. We show that, if the weight sequence follows a power law with exponent $\alpha \in [0,1)$, then the sequences generated by the model tend toward a limiting behavior in which the eventual frequency of each token in the alphabet attains a limit. Moreover, in the case $\alpha > 2$, we show that the sequence converges to a token being chosen infinitely often, and each other token being chosen only constantly many times.
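The copying process described above is simple to simulate. The sketch below is illustrative only: the seed sequence, the value of $\alpha$, and the absence of any innovation step are assumptions made for the example, not part of the paper's model statement:

```python
import random

def generate(alpha, steps, seed_seq=("a", "b", "c"), rng_seed=0):
    """Each new item is copied from a past position chosen with probability
    proportional to d ** (-alpha), where d = 1 is the most recent position."""
    rng = random.Random(rng_seed)
    seq = list(seed_seq)                  # a short seed sequence to copy from
    for _ in range(steps):
        n = len(seq)
        weights = [d ** (-alpha) for d in range(1, n + 1)]
        d = rng.choices(range(1, n + 1), weights=weights)[0]
        seq.append(seq[n - d])
    return seq

s = generate(alpha=0.5, steps=2000)
# Empirical token frequencies; for alpha in [0, 1) these tend to stabilize.
print({tok: round(s.count(tok) / len(s), 3) for tok in sorted(set(s))})
```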
User Modeling-A (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Peijie Sun (Hefei University of Technology), Le Wu (Hefei University of Technology), Kun Zhang (University of Science and Technology of China), Yanjie Fu (University of Central Florida), Richang Hong (Hefei University of Technology) and Meng Wang (Hefei University of Technology).
Abstract
In many recommender systems, users express item opinions through two kinds of behaviors: giving rating preferences and writing detailed reviews. As both kinds of behaviors reflect users' assessment of items, review-enhanced recommender systems leverage these two kinds of user behaviors to boost recommendation performance. On the one hand, researchers proposed to better model the user and item embeddings with additional review information for enhancing preference prediction accuracy. On the other hand, some recent works focused on automatically generating item reviews for recommendation explanations with related user and item embeddings. We argue that, while the task of preference prediction with the accuracy goal is well recognized in the community, the task of generating reviews for explainable recommendation is also important to gain user trust and increase conversion rate. Some preliminary attempts have considered jointly modeling these two tasks, with the user and item embeddings shared. These studies empirically showed that these two tasks are correlated, and jointly modeling them would benefit the performance of both tasks. In this paper, we make a further study of unifying these two tasks for explainable recommendation. Instead of simply correlating these two tasks with shared user and item embeddings, we argue that these two tasks are presented in dual forms. In other words, the input of the primal preference prediction task $p(R|C)$ is exactly the output of the dual review generation task $p(C|R)$, where $R$ and $C$ denote the preference value space and the review space. Therefore, we could explicitly model the probabilistic correlation between these two dual tasks with $p(R,C)=p(R|C)p(C)=p(C|R)p(R)$. We design a unified dual framework that injects the probabilistic duality of the two tasks in the training stage. Furthermore, as the detailed rating and review information is not available for each user-item pair in the test stage, we propose a transfer learning based model for preference prediction and review generation. Finally, extensive experimental results on two real-world datasets clearly show the effectiveness of our proposed model for both user preference prediction and review generation.
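The probabilistic duality stated above can be spelled out as an equality that must hold for every observed user-item pair, and one common way to inject such a duality at training time is a squared penalty on its violation. The loss form below is a hedged sketch of that idea, not necessarily the exact objective used in the paper:

```latex
p(R, C) = p(R \mid C)\, p(C) = p(C \mid R)\, p(R)
\;\;\Longrightarrow\;\;
\log p(R) + \log p(C \mid R) = \log p(C) + \log p(R \mid C)

% A duality regularizer added to the two task losses during training:
\mathcal{L}_{\mathrm{dual}} =
  \bigl( \log p(R) + \log p(C \mid R) - \log p(C) - \log p(R \mid C) \bigr)^{2}
```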
Jibang Wu (University of Virginia), Renqin Cai (University of Virginia) and Hongning Wang (University of Virginia).
Abstract
Predicting users' preferences based on their sequential behaviors in history is challenging and crucial for modern recommender systems. Most existing sequential recommendation algorithms focus on the transitional structure among the sequential actions, but largely ignore the temporal and context information when modeling the influence of a historical event on the current prediction. In this paper, we argue that the influence from past events on a user's current action should also vary over the course of time and under different context. Thus, we propose a Contextualized Temporal Attention Mechanism that learns to weigh the influence of historical actions based not only on what the action is, but also on when and how the action took place. More specifically, to dynamically calibrate the relative input dependence from the self-attention mechanism, we deploy multiple parameterized kernel functions to learn various temporal dynamics, and then use the context information to determine which of these reweighing kernels to follow for each input. In empirical evaluations on two large public recommendation datasets, our model consistently outperforms an extensive set of state-of-the-art sequential recommendation methods.
Mengyue Yang (University of Chinese Academy of Sciences), Qingyang Li (Didi Research America), Zhiwei Qin (Didi Research America) and Jieping Ye (Didi Chuxing).
Abstract
Contextual multi-armed bandit (MAB) algorithms achieve cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems and online advertising, however, it is essential to take the resource consumption of exploration into consideration when maximizing the reward of bandit algorithms. In practice, there is typically a non-zero cost associated with executing a recommendation (arm) in the environment, and hence, the policy should be learned with a fixed exploration cost constraint. It is challenging to learn a global optimal policy directly, since this is an NP-hard problem and significantly complicates the exploration and exploitation trade-off of bandit algorithms. Existing approaches focus on solving the problem by adopting a greedy policy which estimates the expected rewards and costs and uses a greedy selection based on each arm's expected reward/cost ratio using historical observations, until the exploration resource is exhausted. However, existing methods are hard to extend to an infinite time horizon, since the learning process is terminated when there is no more resource. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint. HATCH adopts an adaptive method to allocate the exploration resource based on the remaining resource/time and the estimation of reward distribution among different user contexts. In addition, we make full use of contextual feature information to find the best personalized recommendation. Finally, in order to prove the theoretical guarantee of the proposed method, we present a regret bound analysis and prove that HATCH achieves a regret bound as low as $O(\sqrt{T})$. The experimental results demonstrate the effectiveness and efficiency of the proposed method on both a synthetic dataset and real-world applications.
Xiaoya Chong (City University of Hong Kong), Qing Li (The Hong Kong Polytechnic University), Howard Leung (City University of Hong Kong), Qianhui Men (City University of Hong Kong) and Xianjin Chao (City University of Hong Kong).
Abstract
Personalized recommendation aims at ranking a set of items according to the learnt preference of the user. Existing methods that directly optimize for ranking sample a negative item that the user has not bought yet and assume that the user prefers the positive item that they have bought over the negative item. The strategy is to exclude irrelevant items from the dataset to narrow down the set of potential positive items to improve ranking accuracy. However, it conflicts with the goal of recommendation from the seller's point of view, which aims to enlarge that set for each user. In this paper, we diminish this limitation by proposing a novel learning method called Hierarchical Visual-aware Minimax Ranking (H-VMMR), in which a new concept of predictive sampling is proposed to sample items in a close relationship with the positive items (e.g., substitutes, complements). We set up the problem by maximizing the preference discrepancy between positive and negative items, as well as minimizing the gap between positive and predictive items based on visual features. We also build a hierarchical learning model based on co-purchase data to solve the data sparsity problem. Our method can enlarge the set of potential positive items as well as true negative items during ranking. The experimental results show that our H-VMMR can outperform the state-of-the-art learning methods.
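The minimax intuition above (push positives above negatives while pulling positives and predictive items together) can be written as a single objective. The form below is only a sketch under assumed notation ($x_{u,i}$ is user $u$'s predicted score for item $i$; $p$, $n$, $q$ are the positive, negative, and predictive items; $\sigma$ is the logistic sigmoid; $\lambda$ and $\mu$ are hypothetical trade-off weights), not the paper's exact formulation:

```latex
\max_{\Theta} \;\; \sum_{(u,\, p,\, n,\, q)}
  \Bigl[ \ln \sigma\bigl(x_{u,p} - x_{u,n}\bigr)
         \;-\; \lambda \,\bigl(x_{u,p} - x_{u,q}\bigr)^{2} \Bigr]
  \;-\; \mu \,\lVert \Theta \rVert^{2}
```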
Xing Zhao (Texas A&M University), Ziwei Zhu (Texas A&M University), Majid Alfifi (Texas A&M University) and James Caverlee (Texas A&M University).
Abstract
Predicting the potential target customers for a product is essential. However, traditional recommender systems typically aim to optimize an engagement metric without considering the overall distribution of target customers, thereby leading to serious distortion problems. In this paper, we conduct a data-driven study to reveal several distortions that arise from conventional recommenders. Toward overcoming these issues, we propose a target customer re-ranking algorithm to adjust the population distribution and composition in the Top-k target customers of an item while maintaining recommendation quality. By applying the proposed algorithm to a real-world dataset, we find that the proposed method can effectively bring the class distribution of items' target customers close to the desired distribution, thereby mitigating distortion.
Society (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Zitao Liu (TAL AI Lab), Guowei Xu (TAL AI Lab), Tianqiao Liu (TAL AI Lab), Weiping Fu (TAL AI Lab), Yubi Qi (TAL AI Lab), Wenbiao Ding (TAL AI Lab), Yujia Song (TAL AI Lab), Chaoyou Guo (TAL AI Lab), Cong Kong (TAL AI Lab), Songfan Yang (TAL AI Lab) and Gale Yan Huang (TAL AI Lab).
Abstract
Verbal fluency is critically important for children's growth and personal development [cohen1999verbal, berninger1992gender]. Due to the limited and imbalanced educational resources in China, elementary students barely have chances to improve their oral language skills in classes. Verbal fluency tasks (VFTs) were invented to let the students practice their oral language skills after school. VFTs are simple but concrete math-related questions that ask students not only to report answers but also to speak out the entire thinking process. In spite of the great success of VFTs, they bring a heavy grading burden to elementary teachers. To alleviate this problem, we develop Dolphin, a verbal fluency evaluation system for Chinese elementary education. Dolphin is able to automatically evaluate both the phonological fluency and the semantic relevance of students' answers to their VFT assignments. We conduct a wide range of offline and online experiments to demonstrate the effectiveness of Dolphin. In our offline experiments, we show that Dolphin improves both phonological fluency and semantic relevance evaluation performance when compared to state-of-the-art baselines on real-world educational data sets. In our online A/B experiments, we test Dolphin with 183 teachers from 2 major cities (Hangzhou and Xi'an) in China for 10 weeks and the results show that VFT assignment grading coverage is improved by 22%. To encourage reproducible results, we make our code public in an anonymous git repository: this https URL.
Shan Jiang (Northeastern University), Simon Baumgartner (Google), Abe Ittycheriah (Google) and Cong Yu (Google).
Abstract
Fact-checking, which investigates claims made in public to arrive at a verdict supported by evidence and logical reasoning, has long been a significant form of journalism to combat misinformation in the news ecosystem. Most of the fact-checks share common structured information (called factors) such as claim, claimant, and verdict. In recent years, the emergence of ClaimReview as the standard schema for annotating those factors within fact-checking articles has led to wide adoption of fact-checking features by online platforms (e.g., Google, Bing). However, annotating fact-checks is a tedious process for fact-checkers and distracts them from their core job of investigating claims. As a result, less than half of the fact-checkers worldwide have adopted ClaimReview as of mid-2019. In this paper, we propose the task of factoring fact-checks for automatically extracting structured information from fact-checking articles. Exploring a public dataset of fact-checks, we empirically show that factoring fact-checks is a challenging task, especially for fact-checkers that are under-represented in the dataset. We then formulate the task as a sequence tagging problem and fine-tune the pre-trained BERT models with a modification made from our observations to approach the problem. Through extensive experiments, we demonstrate the performance of our models for well-known fact-checkers and promising initial results for under-represented fact-checkers.
Meike Zehlike (MPI Software Systems) and Carlos Castillo (Universitat Pompeu Fabra).
Abstract
Ranked search results have become the main mechanism by which we find content, products, places, and people online. Therefore, their ordering contributes not only to the satisfaction of the searcher but also to career and business opportunities, educational placement, and even the social success of those searched. Over the past decade, data mining researchers have become increasingly concerned with systematic biases in data-driven ranking models, and various methods have been proposed to mitigate discrimination and inequality of opportunity. Most of those post-process a ranking and reorder its items subject to predefined fairness constraints. This procedure, however, has the disadvantage that it still allows an unfair ranking model to be trained and later deployed. In this paper we explore a new in-processing approach: DELTR, a learning-to-rank framework that addresses potential issues of discrimination and unequal opportunity in rankings at training time. We measure these problems in terms of discrepancies in the average group exposure and design a ranker that optimizes search results in terms of relevance and in terms of reducing such discrepancies. We perform an extensive experimental study showing that being “colorblind”, i.e., ignoring protected attributes such as race or gender, can be among the best or the worst choices from the perspective of relevance and exposure, depending on how much and which kind of bias is present in the training set. We show that our in-processing method performs better in terms of relevance and equality of exposure than a pre-processing and a post-processing method across all tested scenarios.
Kai Wei (Amazon.com), Yu-Ru Lin (University of Pittsburgh) and Muheng Yan (University of Pittsburgh).
Abstract
There has been growing concern about online users using social media as a tool to spread hate and racist speech. While previous work has extensively studied online hate speech, how to effectively reduce online prejudice still remains a challenge. Over the past several decades, protests have been a frequently used intervention for countering prejudice. However, research to date has not specifically examined the effects of social protest on online prejudice. In this work, we examine the relationship between protest and online prejudice. Using panel data collected from Twitter, we focus on the changes in users' prejudice against immigrants following recent immigrant protests. The findings of this work show that protest is related to a decrease in online users' prejudice, suggesting the possibility of using protests to mitigate online prejudice.
Wenjie Hu (Zhejiang University), Yang Yang (Zhejiang University), Jianbo Wang (State Grid Taizhou Power Supply Co. Ltd.), Xuanwen Huang (Zhejiang University) and Ziqiang Cheng (Zhejiang University).
Abstract
Electricity theft, the behavior that involves users conducting illegal operations on electrical meters to avoid individual electricity bills, is a common phenomenon in developing countries. Considering its harmfulness to both power grids and the public, several mechanized methods have been developed to automatically recognize electricity-theft behaviors. However, these methods, which mainly assess users' electricity usage records, can be insufficient due to the diversity of theft tactics and the irregularity of user behaviors. Moreover, one cannot fully understand the user behaviors that lurk in the massive volume of data using such mechanized methods. To address the abovementioned concerns, in this paper, we propose to recognize electricity-theft behavior via multi-source data. In addition to users' electricity usage records, we analyze user behaviors by means of regional factors (non-technical loss) and climatic factors (temperature) in the corresponding transformer area. By conducting analytical experiments, we unearth several interesting patterns and thereby derive insights into how these different types of information influence users' electricity usage. For instance, electricity thieves are likely to consume much more electrical power than normal users, especially under extremely high or low temperatures. Motivated by these empirical observations, we further design a novel hierarchical framework for identifying electricity thieves. Intuitively, it uniformly leverages multi-source information to extract hierarchical correlations between this information and electricity-theft behavior. Experimental results based on a real-world dataset demonstrate that our proposed model achieves the best performance in electricity-theft detection (e.g., at least +3.0% in terms of F0.5) compared with several baselines. Last but not least, our work has been deployed by the State Grid of China and used to successfully catch electricity thieves in Hangzhou with a precision of 15% during monthly on-site investigations (an improvement over the 0% precision attained by several other models that the company had employed and continuously tested online for years).
Security (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Hengtong Zhang (SUNY at Buffalo), Yaliang Li (Alibaba Group), Bolin Ding (Alibaba Group) and Jing Gao (University at Buffalo).
Abstract
Online recommendation systems make use of a variety of information sources to provide users with the items that they are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks, where malicious data samples are injected into the training set of the recommendation system by controlled users to promote or demote specific items. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendation approaches. The former often suffers from unsatisfactory performance, while the latter requires strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against blackbox recommendation systems. The proposed LOKI utilizes a reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted. Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods and is effective even when the recommendation system is equipped with an anomaly detector.
Yiwei Sun (The Pennsylvania State University), Suhang Wang (The Pennsylvania State University), Xianfeng Tang (The Pennsylvania State University), Tsung-Yu Hsieh (The Pennsylvania State University) and Vasant Honavar (The Pennsylvania State University).
Abstract
In recent years, Graph Neural Networks (GNNs) have achieved immense success for node classification with their power to explore the topological structure in graph data. They are widely adopted in various domains including social media, E-commerce, and FinTech applications. However, recent studies show that GNNs are vulnerable to attacks aimed at adversely impacting the node classification accuracy. Previous studies of graph adversarial attacks mainly focus on manipulating existing graph structures, which usually requires a larger budget to modify the existing connections in most real-world applications. In contrast, it is more practical to inject adversarial nodes into existing graphs, which can also potentially reduce the performance of the GNNs on existing nodes. Taking social networks as an example, injecting fake profiles with forged links to mislead the predicted labels on existing accounts is much easier than directly modifying the existing graph. Motivated by such observations, in this paper, we study a novel problem of node injection poisoning on graph data. Since establishing links between the injected adversarial nodes and existing nodes can naturally be formulated as a Markov Decision Process, we propose a reinforcement learning method, namely NIPA, to sequentially modify the labels and adjacent edges of those injected nodes, without changing the link structure between existing nodes. Specifically, we introduce a hierarchical Q-learning network to manipulate the labels of the adversarial nodes and their links with other nodes in the graph, and design a steering reward function to guide the RL agent so as to reduce the GNNs' accuracy. NIPA consistently outperforms state-of-the-art methods on three benchmark datasets, demonstrating its efficacy in poisoning graph data via node injection.
Aviad Elyashar (Ben-Gurion University of the Negev), Abigail Paradise (Ben-Gurion University of the Negev), Sagi Uziel (Ben-Gurion University of the Negev) and Rami Puzis (Ben-Gurion University of the Negev).
Abstract
Online social networks (OSNs) are ubiquitous, attracting millions of users all over the world. Being a popular communication medium, OSNs are exploited in a variety of cyberattacks. In this article, we discuss the Chameleon attack technique, a new type of OSN-based trickery where malicious posts and profiles change the way they are displayed to OSN users to conceal themselves before the attack or avoid detection. Using this technique, adversaries can, for example, avoid censorship by concealing true content when it is about to be inspected; acquire social capital to promote new content while piggybacking on a trending one; or cause embarrassment and serious reputation damage by tricking a victim into liking, retweeting, or commenting on a message that they would not normally endorse, without any indication of the trickery within the OSN. An experiment performed with closed Facebook groups of sports fans shows that (1) Chameleon pages can get past the moderation filters by changing the way their posts are displayed and (2) moderators do not distinguish between regular and Chameleon pages. We list the OSN weaknesses that facilitate the Chameleon attack and propose a set of mitigation guidelines.
Sheng Tian (Ant Financial Services Group; Electronic Information School, Wuhan University) and Tao Xiong (Ant Financial Services Group).
Abstract
Although there are many alternative captcha schemes available, text-based captchas are still one of the most popular security mechanisms to maintain Internet security and prevent malicious attacks, due to user preferences and ease of design. Over the past decade, different methods of breaking captchas have been proposed, which has helped captchas keep evolving and become more robust. However, these previous works generally require heavy expert involvement and gradually become ineffective with the introduction of new security features. This paper proposes a generic solver combining unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches, which contain a large number of unlabeled hard examples, to improve the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully-supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms the state of the art by delivering higher accuracy on various captcha schemes. We provide further discussions of potential applications of the proposed unified framework. We hope that our work can inspire the community to enhance the security of text-based captchas.
Health (1)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Ping Wang (Virginia Tech), Tian Shi (Virginia Tech) and Chandan K. Reddy (Virginia Tech).
Abstract
Electronic health record (EHR) data contains comprehensive patient information and is typically stored in a relational database with multiple tables. Effective and efficient patient information retrieval from EHR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database, and then executing the query against the database. However, most of the existing approaches have not been adapted to the healthcare domain due to a lack of healthcare Question-to-SQL datasets for model parameter inference. Moreover, the wide use of abbreviated terminology and possible typos in questions introduce additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep learning based TRanslate-Edit Model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question, and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables. Based on a widely used, publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. Extensive experiments are conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative experimental results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos.
Harrisen Scells (The University of Queensland), Guido Zuccon (The University of Queensland), Bevan Koopman (CSIRO) and Justin Clark (Institute for Evidence-Based Healthcare, Bond University).
Abstract
Formulating Boolean queries for systematic review literature search is a challenging task. Commonly, queries are formulated by information specialists using the protocol specified in the review and interactions with the research team. Information specialists have in-depth experience in how to formulate queries in this domain, but may not have in-depth knowledge about the reviews' topics. Query formulation requires a significant amount of time and effort, and is performed interactively; specialists repeatedly formulate queries, attempt to validate their results, and reformulate specific Boolean clauses. In this paper, we investigate the possibility of automatically formulating a Boolean query from the systematic review protocol. We propose a novel five-step approach to automatic query formulation, specific to Boolean queries in this domain, which approximates the process by which information specialists formulate queries. In this process, we use syntax parsing to derive the logical structure of high-level concepts in a query, automatically extract and map concepts to entities in order to perform entity expansion, and finally apply post-processing operations (such as stemming and search filters). Automatic query formulation for systematic review literature search has several benefits: (i) it can provide reviewers with an indication of the types of studies that will be retrieved, without the involvement of an information specialist, (ii) it can provide information specialists with an initial query to begin the formulation process, and (iii) it can provide researchers who perform rapid reviews with a method to quickly perform searches.
Sebastian Arnold (Beuth University of Applied Sciences Berlin), Betty van Aken (Beuth University of Applied Sciences Berlin), Paul Grundmann (Beuth University of Applied Sciences Berlin), Felix A. Gers (Beuth University of Applied Sciences Berlin) and Alexander Löser (Beuth University of Applied Sciences Berlin).
Abstract
We present Contextual Discourse Vectors (CDV), a distributed document representation for efficient answer retrieval from long healthcare documents. Our approach is based on structured query tuples of entities and aspects from free text and medical taxonomies. Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse. We use our continuous representations to resolve queries with short latency using approximate nearest neighbor search at the sentence level. We apply the CDV model to retrieving coherent answer passages from ten English public health resources from the Web, addressing both patients and medical professionals. Because there is no end-to-end training data available for all application scenarios, we train our model with self-supervised data from Wikipedia. We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking and is able to adapt to heterogeneous domains without additional fine-tuning.
Harrisen Scells (The University of Queensland), Guido Zuccon (The University of Queensland), Mohamed Sharaf (The University of Queensland) and Bevan Koopman (CSIRO).
Abstract
Searching medical literature for synthesis in a systematic review is a complex and labour-intensive task. In this context, expert searchers construct lengthy Boolean queries. The universe of possible query variations can be massive: a single query can be composed of hundreds of field-restricted search terms/phrases or ontological concepts, each grouped by logical operators nested sometimes five or more levels deep. With the many choices about how to construct a query, it is difficult to both formulate and recognise effective queries. To address this challenge, automatic methods have recently been explored for generating and selecting effective Boolean query variations for systematic reviews. The limiting factor of these methods is that it is computationally infeasible to process all query variations. To overcome this, we propose novel query variation sampling methods for training Learning to Rank models to rank queries. Our results show that query sampling methods do directly impact the ability of a Learning to Rank model to effectively identify good query variations. Thus, selecting good query sampling methods is a key problem for the automatic reformulation of effective Boolean queries for systematic review literature search. We find that the best sampling strategies are those which balance the diversity of queries with the quantity of queries.
Mobile (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
Chris Xiaoxuan Lu (University of Liverpool), Yang Li (New York University), Yuanbo Xiangli (The Chinese University of Hong Kong) and Zhengxiong Li (University at Buffalo, SUNY).
Abstract
Along with the benefits of the Internet of Things (IoT) come potential privacy risks, since billions of connected devices are granted permission to sense information about their users and communicate it to other parties over the Internet. Of particular interest to the adversary is the user identity, which, once obtained, can subsequently be used for many malicious attacks. While the exposure of a particular type of physical biometrics or device IDs is extensively studied, the compound leakage interwoven by both sides remains unknown to users in IoT-rich environments. In this work, we explore the feasibility of compound identity leakage across cyber-physical spaces and unveil that co-located smart device IDs (e.g., smartphone MAC addresses) and physical biometrics (e.g., facial/vocal samples) are side channels to each other. Based on these side channels in combination, our presented approach enables an attacker to automatically compromise users' biometrics and device IDs in tandem. We show that our method is robust to cross-modal mismatch and various observation noise in the wild, comprehensively profiling victims with nearly zero analysis effort from the attacker. Two real-world experiments on different biometrics and WiFi MAC addresses validate the new type of privacy leakage. We show that in extreme cases, the presented approach can compromise more than 70% of device IDs and harvest multiple biometric clusters of ~94% purity at the same time.
Ammar Tahir (Lahore University of Management Sciences (LUMS)), Muhammad Tahir Munir (Lahore University of Management Sciences (LUMS)), Shaiq Munir Malik (Lahore University of Management Sciences (LUMS)), Zafar Ayyub Qazi (Lahore University of Management Sciences (LUMS)) and Ihsan Ayyub Qazi (Lahore University of Management Sciences (LUMS)).
Abstract
Web Light is a transcoding service introduced by Google to show lighter and faster webpages to users searching on slow mobile clients. The service detects slow clients (e.g., users on 2G) and converts webpages on the fly into a version optimized for these clients. The service promises improved mobile web browsing experience, in particular for users from developing countries where slow networks can be common. However, there are several concerns around this service, including its effectiveness in preserving relevant content on a page, its impact on user performance, how third-party advertisements are shown, as well as privacy concerns. In this paper, we perform the first independent, empirical analysis of Google's Web Light service to shed light on these concerns. Through extensive experiments over thousands of real Web Light pages as well as controlled experiments with synthetic Web Light pages, we (i) deconstruct how Web Light modifies webpages, (ii) investigate how ads are shown on Web Light and which ad networks are supported, (iii) measure and compare Web Light's page load performance, (iv) discuss privacy concerns for users and publishers, and (v) investigate the potential use of Web Light as a censorship circumvention tool.
Faysal Hossain Shezan (University of Virginia), Hang Hu (Virginia Tech), Jiamin Wang (Virginia Tech), Gang Wang (University of Illinois at Urbana-Champaign) and Yuan Tian (University of Virginia).
Abstract
Voice Personal Assistant (VPA) systems such as Amazon Alexa and Google Home have been used by tens of millions of households. Recent work demonstrated proof-of-concept attacks against their voice interface to invoke unintended applications or operations. However, there is still a lack of empirical understanding of what types of third-party applications VPA systems support, and what consequences these attacks may cause. In this paper, we perform an empirical analysis of the third-party applications of Amazon Alexa and Google Home to systematically assess the attack surfaces. A key methodology is to characterize a given application by classifying the sensitive voice commands it accepts. We develop a natural language processing tool that classifies a given voice command along two dimensions: (1) whether the voice command is designed to insert an action or retrieve information; (2) whether the command is sensitive or nonsensitive. The tool combines a deep neural network and a keyword-based model, and uses Active Learning to reduce the manual labeling effort. The sensitivity classification is based on a user study (N=404) where we measure the perceived sensitivity of voice commands. A ground-truth evaluation shows that our tool achieves over 95% accuracy for both types of classifications. We apply this tool to analyze 77,957 Amazon Alexa applications and 4,813 Google Home applications (198,199 voice commands from Amazon Alexa, 13,644 voice commands from Google Home) over two years (2018-2019). In total, we identify 19,263 sensitive "action injection" commands and 5,352 sensitive "information retrieval" commands. These commands are from 4,596 applications (5.55% of all applications), most of which belong to the "smart home" category. While the percentage of sensitive applications is small, we show that the percentage is increasing over time from 2018 to 2019.
Vasudevan Nagendra (Stony Brook University), Arani Bhattacharya (Stony Brook University), Vinod Yegneswaran (SRI International), Amir Rahmati (Stony Brook University) and Samir Das (Stony Brook University).
Abstract
Consumer IoT is characterized by heterogeneous devices with diverse functionality and programming interfaces. This lack of homogeneity makes the integration and security management of IoT infrastructures a daunting task for users and administrators. In this paper, we introduce VISCR, a Vendor-Independent policy Specification and Conflict Resolution engine that enables conflict-free policy specification and enforcement in IoT environments. VISCR converts the topology of the IoT infrastructure into a tree-based abstraction and translates existing policies from heterogeneous vendor-specific programming languages, such as Groovy-based SmartThings, OpenHAB, IFTTT-based templates, and MUD-based profiles, into a vendor-independent graph-based specification. Using the two, VISCR can automatically detect rogue policies, conflicts, and bugs for coherent automation. Upon detection, VISCR infers new policies and proposes them to users as alternatives to existing policies for fine-tuning and conflict-free enforcement. We evaluated VISCR using a dataset of 907 IoT apps, programmed using heterogeneous automation specifications in a simulated smart-building IoT infrastructure. In our experiments, among the 907 IoT apps, VISCR exposed 342 IoT apps as exhibiting one or more violations. VISCR detected 100% of the violations reported by the existing state-of-the-art tool, while detecting new types of violations in an additional 266 apps. In terms of performance, VISCR can generate 400 abstraction trees (used in specifying policies) with 100K leaf nodes in less than 1.2 seconds. In our experiments, VISCR took 80.7 seconds to analyze our infrastructure of 907 apps, a 14.2× reduction compared to the state of the art. After the initial analysis, VISCR is capable of adopting new policies with sub-second latency to handle changes.
Web Mining-B (2)
(UTC/GMT +8) 13:30-15:30, April, 22, Wednesday
Meeting rooms are not available now
David Zeber (Mozilla), Sarah Bird (Mozilla), Camila Oliveira (Mozilla), Walter Rudametkin (INRIA), Ilana Segall (Mozilla), Fredrik Wollsen (Mozilla) and Martin Lopatka (Mozilla).
Abstract
Large-scale web crawls have emerged as the state of the art for studying characteristics of the Web, such as the prevalence of online tracking and browser fingerprinting. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the validity of using crawls as a proxy for human browsing data has not been well studied. Crawls may fail to capture the diversity of user environments, including operating systems, geolocation, cookies, as well as content in authenticated sessions, advertisement campaigns, and other dynamic content. Moreover, the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We observe noticeable variation between simultaneous crawls, run from different operating systems on both residential personal computers and cloud services, relative to the baseline variation measured across simultaneous crawls run from a single common environment. Additionally, we note substantial variation across a collection of crawls run sequentially over time, with the specific scripts loaded, fingerprinting resources encountered, and third party resources all becoming increasingly diverse over time. We also assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals clear differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.
Syed Suleman Ahmad (University of Wisconsin-Madison), Muhammad Daniyal Dar (University of Iowa), Rishab Nithyanand (University of Iowa), Narseo Vallina-Rodriguez (IMDEA Networks/ICSI) and Muhammad Fareed Zaffar (LUMS).
Abstract
Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on different types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the reproducibility of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between different crawlers. We use this framework to show how the choice of crawler can impact our understanding of the web.
Wanyue Xu (Fudan University), Yibin Sheng (Fudan University), Zuobai Zhang (Fudan University), Haibin Kan (Fudan University) and Zhongzhi Zhang (Fudan University).
Abstract
The mean hitting time from a node $i$ to a node $j$ selected randomly according to the stationary distribution of random walks is called the Kemeny constant, which has found various applications. It was proved that over all graphs with $N$ vertices, complete graphs have the exact minimum Kemeny constant, growing linearly with $N$. Here we study, numerically or analytically, the Kemeny constant on many sparse real-world and model networks with scale-free small-world topology, and show that their Kemeny constant also behaves linearly with $N$. Thus, sparse networks with scale-free and small-world topology are favorable architectures with optimal scaling of the Kemeny constant. We then present a theoretically guaranteed estimation algorithm, which approximates the Kemeny constant for a graph in nearly linear time with respect to the number of edges. Extensive numerical experiments on model and real networks show that our approximation algorithm is both efficient and accurate.
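For small graphs, the Kemeny constant can be computed exactly from the eigenvalues of the random-walk transition matrix as $K = \sum_{i \ge 2} 1/(1-\lambda_i)$. The sketch below uses this brute-force $O(N^3)$ route purely to illustrate the quantity and its linear growth on complete graphs; it is not the nearly linear-time estimator proposed in the paper:

```python
import numpy as np
import networkx as nx

def kemeny_constant(G):
    """Exact Kemeny constant of the random walk on a connected graph G,
    computed as the sum of 1 / (1 - lam) over the non-unit eigenvalues
    of the row-stochastic transition matrix P = D^{-1} A."""
    A = nx.to_numpy_array(G)
    P = A / A.sum(axis=1, keepdims=True)
    eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]   # eigenvalue 1 first
    return float(np.sum(1.0 / (1.0 - eigvals[1:])))

print(kemeny_constant(nx.complete_graph(100)))              # roughly N - 2
print(kemeny_constant(nx.barabasi_albert_graph(100, 3, seed=0)))
```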
Andrey Gusev (Pinterest) and Jiajing Xu (Pinterest).
Abstract
Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, this task is challenging for a web-scale image corpus containing billions of images. In this paper, we present an efficient system for detecting near duplicate images over 7 billion images. Our system consists of three stages: candidate generation, candidate selection, and clustering. We also demonstrate that this system can be used to greatly improve the accuracy of recommendations and search results across a number of real-world applications. In addition, we describe the evolution of the system over the course of six years, sharing experiences and lessons on how new systems are designed to accommodate organic content growth as well as the latest technology. Finally, we are releasing a human-labeled dataset of ~53,000 pairs of images introduced in this paper.
Research Tracks (3)
Web Mining-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Jichuan Zeng (The Chinese University of Hong Kong), Jing Li (The Hong Kong Polytechnic University), Yulan He (The University of Warwick), Cuiyun Gao (The Chinese University of Hong Kong), Michael Lyu (The Chinese University of Hong Kong) and Irwin King (The Chinese University of Hong Kong).
Abstract
In a world full of uncertainty, debates and argumentation contribute to the progress of science and society. Despite the increasing attention to characterizing human arguments, most progress made so far focuses on the debate outcome, largely ignoring the dynamic patterns in argumentation processes. This paper presents a study that automatically analyzes the key factors in argument persuasiveness, beyond simply predicting who will persuade whom. Specifically, we propose a novel neural model that is able to dynamically track the changes of latent topics and discourse in argumentative conversations, allowing the investigation of their roles in influencing the outcomes of persuasion. Extensive experiments have been conducted on argumentative conversations from both social media and the Supreme Court. The results show that our model outperforms state-of-the-art models in identifying persuasive arguments via explicitly exploring dynamic factors of topic and discourse. We further analyze the effects of topics and discourse on persuasiveness, and find that they are both useful: topics provide concrete evidence, while superior discourse styles may bias participants, especially in social media arguments. In addition, we draw some findings from our empirical results, which will help people better engage in future persuasive conversations.
Yiyan Qi (Xi'an Jiaotong University), Pinghui Wang (Xi'an Jiaotong University), Yuanming Zhang (Xi'an Jiaotong University), Junzhou Zhao (Xi'an Jiaotong University), Guangjian Tian (Huawei Noah's Ark Lab) and Xiaohong Guan (Xi'an Jiaotong University).
Abstract
The well-known Gumbel-Max Trick for sampling from a categorical distribution (or more generally a nonnegative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element $i$ (or a Gumbel-Max variable $i$) in proportion to its positive weight $v_i$, the Gumbel-Max Trick first computes a Gumbel random variable $g_i$ for each positive-weight element $i$, and then samples the element $i$ with the largest value of $g_i + \ln v_i$. Recently, applications including similarity estimation and graph embedding require generating $k$ independent Gumbel-Max variables from the elements of high dimensional vectors. However, it is computationally expensive for a large $k$ (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, that reduces the time complexity from $O(k n^+)$ to $O(k \ln k + n^+)$, where $n^+$ is the number of positive elements in the vector of interest. Instead of computing $k$ independent Gumbel random variables directly, we find that there exists a technique to generate these variables in descending order. Using this technique, our method FastGM computes the variables $g_i + \ln v_i$ for all positive elements $i$ in descending order. As a result, FastGM significantly reduces the computation time because we can stop computing Gumbel random variables early for many elements, especially those with small weights. Experiments on a variety of real-world datasets show that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy or incurring additional expenses.
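The baseline procedure that FastGM accelerates is easy to state in code. The sketch below is the standard $O(k \cdot n^+)$ Gumbel-Max sampler described above (it is not FastGM itself, and the weight vector is only an example):

```python
import numpy as np

def gumbel_max_samples(v, k, seed=0):
    """Draw k independent Gumbel-Max variables for a nonnegative weight
    vector v: each draw is argmax_i (g_i + ln v_i) over the positive-weight
    elements, with the g_i being i.i.d. standard Gumbel random variables."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v, dtype=float)
    pos = np.flatnonzero(v > 0)            # only positive-weight elements matter
    log_v = np.log(v[pos])
    samples = []
    for _ in range(k):
        g = rng.gumbel(size=pos.size)      # one fresh Gumbel variable per element
        samples.append(int(pos[np.argmax(g + log_v)]))
    return samples

weights = [0.0, 2.0, 1.0, 0.0, 4.0]
print(gumbel_max_samples(weights, k=10))
# Over many draws, index 4 appears about twice as often as index 1.
```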
Han Zhang (Tsinghua University), Wenhao Zheng (Alibaba Youku), Charley Chen (Tsinghua University), Kevin Gao (Tsinghua University), Yao Hu (Alibaba Youku Cognitive and Intelligent Lab), Ling Huang (AHI Fintech) and Wei Xu (Tsinghua University).
Abstract
Since label collection is prohibitively expensive and time-consuming, unsupervised methods are preferred in applications such as fraud detection. Meanwhile, such applications usually require modeling the intrinsic clusters in high-dimensional data, which usually displays heterogeneous statistical patterns, as the patterns of different clusters may appear in different dimensions. Existing methods propose to model the data clusters on selected dimensions, yet omitting any dimension globally may damage the pattern of certain clusters. In order to address the above issues, we propose a novel unsupervised generative framework called FIRD, which utilizes adversarial distributions to fit and disentangle the heterogeneous statistical patterns. When applied to discrete spaces, FIRD effectively distinguishes the synchronized fraudsters from normal users. In addition, FIRD also provides superior performance on anomaly detection datasets compared with SOTA anomaly detection methods (over 5% average AUC improvement). The significant experimental results on various datasets verify that the proposed method can better model the heterogeneous statistical patterns in high-dimensional data and benefit downstream applications.
Xingwen Zhang (Ant Financial Services Group), Feng Qi (Ant Financial Services Group), Zhigang Hua (Ant Financial Services Group) and Shuang Yang (Ant Financial Services Group).
Abstract
Real-world resource allocation tasks are often approached by solving knapsack problems (KPs), which are NP-hard and have been tractable only at a relatively small scale. This paper examines KPs in a slightly generalized form, and shows that large-scale KPs can be solved nearly optimally in a scalable distributed paradigm via synchronous coordinate descent (SCD). The proposed algorithm can be implemented with off-the-shelf distributed computing frameworks (e.g., MPI, Hadoop, Spark) fairly easily. As an example, our implementation leads to one of the most efficient KP solvers known to date, and it is capable of solving resource allocation problems at an unprecedented scale (e.g., KPs with 1 billion decision variables and 1 billion constraints can be solved within 1 hour). Both synthetic tests and live A/B experiments were conducted to analyze the performance of our approach. The system has been deployed to production and called on a daily basis, yielding significant business impact.
Peng Yang (Baidu US) and Ping Li (Baidu).
Abstract
Conventional multi-task models restrict the task structure to be linearly related, which may not be suitable when the data are not linearly separable. To remedy this issue, we propose a kernel algorithm for online multi-task classification, as the large approximation space provided by reproducing kernel Hilbert spaces often contains an accurate function. Specifically, it maintains a local-global Gaussian distribution over each task model that guides the direction and scale of parameter updates. Nonetheless, optimizing over this space is computationally expensive. Moreover, most multi-task learning methods require access to the entire dataset for the learning algorithm, which is a luxury unavailable in large-scale streaming settings. To address this issue, we propose a random sampling technique across multiple tasks for adaptive sketching. Instead of requiring labels for all inputs, the proposed algorithm determines whether to learn an input or not by considering the confidence of its related tasks in the label prediction. Theoretically, the algorithm learned on actively sampled labels can achieve results comparable to one learned on all labels. Empirically, the proposed algorithm is able to achieve promising learning efficacy, while reducing the computational complexity and labeling cost simultaneously.
Social Network-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Xiaoyang Wang (Beijing Jiaotong University), Yao Ma (Michigan State University), Yiqi Wang (Michigan State University), Wei Jin (Michigan State University), Xin Wang (Changchun Institute of Technology), Jiliang Tang (Michigan State University), Caiyan Jia (Beijing Jiaotong University) and Jian Yu (Beijing Jiaotong University).
Abstract
Traffic flow analysis, prediction and management are keystones for building smart cities in the new era. With the help of deep neural networks and big traffic data, we can better understand the latent patterns hidden in complex transportation networks. The dynamics of traffic flow not only depend on the sequential patterns in the temporal dimension but also rely on other roads in the spatial dimension. Although there are existing works on predicting future traffic flow dynamics, the majority of them have certain limitations on modeling both spatial and temporal dependencies. In this paper, we propose a novel spatial temporal graph neural network for traffic flow prediction, which can comprehensively capture spatial and temporal patterns. In particular, the framework offers a learnable positional attention mechanism to effectively aggregate information from adjacent roads. Meanwhile, it provides a sequential component to model the traffic flow dynamics which can exploit both local and global temporal dependencies. Experimental results on various real traffic datasets demonstrate the effectiveness of the proposed framework.
Liang Yang (Hebei University of Technology), Yuanfang Guo (Beihang University), Xiaochun Cao (Chinese Academy of Sciences), Junhua Gu (Hebei University of Technology), Di Jin (Tianjin University), Fan Wu (Hebei University of Technology) and Chuan Wang (Chinese Academy of Sciences).
Abstract
To alleviate the overfitting issue of Probabilistic Latent Semantic Indexing (pLSI), Latent Dirichlet Allocation (LDA) introduces Dirichlet priors for latent variables. Many subsequent correlated topic modeling approaches have been proposed to capture the rich topical correlations among topics, which the independent occurrence assumption of the introduced Dirichlet priors fails to model. However, they usually possess the drawback of high inference complexity. In this paper, we open up a new way to overcome the overfitting issue of pLSI by using amortized inference with word embeddings as input instead of introducing a Dirichlet prior as in LDA. For generative topic models, the large number of free latent variables is the root of overfitting. To reduce the number of parameters, amortized inference replaces the inference of latent variables with a function which possesses the shared (amortized) learnable parameters. The number of the shared parameters is fixed and independent of the scale of the corpus. In order to overcome the limited application of amortized inference to independent and identically distributed (i.i.d.) data, a novel graph neural network, Graph Attention TOpic Network (GATON), is introduced to model the topic structure of non-i.i.d. documents according to the following two findings. First, pLSI is interpreted as the stochastic block model (SBM) on a specific bi-partite graph. Second, the graph attention network (GAT) is explained as the semi-amortized inference of SBM, which relaxes the i.i.d. data assumption of vanilla amortized inference. GATON provides a novel way, i.e., the graph convolution operation, to integrate word similarity and word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as the bi-partite graph topology, while word embedding, which captures the word similarity, is modeled as the attribute of the word node and the term frequency vector is treated as the attribute of the document node. By the weighted (attention) graph convolution operation, the word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits document classification, but also significantly refines the input word embeddings.
Kai Zhao (Beijing University of Posts and Telecommunications), Ting Bai (Beijing University of Posts and Telecommunications), Bin Wu (Beijing University of Posts and Telecommunications), Bai Wang (Beijing University of Posts and Telecommunications), Youjie Zhang (Beijing University of Posts and Telecommunications), Yuanyu Yang (Beijing University of Posts and Telecommunications) and Jian-Yun Nie (University of Montreal).
Abstract
Heterogeneous information network (HIN) contains multiple types of entities and relations. Most of existing HIN embedding methods learn the semantic information based on the heterogeneous structures between different entities, which are implicitly assumed to be complete. However, in real world, it is common that some relations are partially observed due to privacy or other reasons, resulting in a sparse network, in which the structure may be incomplete, and the "unseen" links may also be positive due to the missing relations in data collection. To address this problem, we propose a novel and principled approach: a Multi-View Adversarial Completion Model (MV-ACM). Each relation space is characterized in a single viewpoint, enabling us to use the topological structural information in each view. Based on the multi-view architecture, an adversarial learning process is utilized to learn the reciprocity (i.e. complementary information) between different relations: In the generator, MV-ACM generates the complementary views by computing the similarity of the semantic representation of the same node in different views; while in the discriminator, MV-ACM discriminates whether the view is complementary by the topological structural similarity. Then we update the node's semantic representation by aggregating neighborhoods information from the syncretic views. We conduct systematical experiments on six real-world networks from varied domains: AMiner, PPI, YouTube, Twitter, Amazon and Alibaba. Empirical results show that MV-ACM significantly outperforms the state-of-the-art approaches for both link prediction and node classification tasks.Ziniu Hu (University of California, Los Angeles), Yuxiao Dong (Microsoft), Kuansan Wang (Microsoft) and Yizhou Sun (University of California, Los Angeles).
Abstract
Recent years have witnessed the emergent success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous networks, in which all nodes or edges have the same feature space and representation distribution, making them infeasible for representing evolving heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous and dynamic graphs. To model heterogeneity, we design node- and edge-type dependent parameters to model the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To capture graph dynamics, rather than slicing the graph based on time, we keep the whole graph with each edge/node associated with its timestamp and propose the relative temporal encoding strategy to capture the dynamic dependency with arbitrary durations. To handle Web-scale data, we design the heterogeneous mini-batch graph sampling algorithm with an inductive timestamp assignment method for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 14.6%-24.0% on various downstream tasks.
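As a rough illustration of the relative temporal encoding idea mentioned in the abstract, the sketch below maps the time gap between two nodes to a fixed-length sinusoidal vector that can be added to node features. The dimension and base constant are assumptions made for the example, not HGT's exact parameterization.

```python
import math

def relative_temporal_encoding(delta_t, dim=8, base=10000.0):
    # Sinusoids of geometrically spaced frequencies applied to the time gap delta_t.
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        enc.append(math.sin(delta_t * freq))
        enc.append(math.cos(delta_t * freq))
    return enc[:dim]

# Example: encode an edge whose endpoints are 3 time steps apart; the resulting
# vector would then be added to the source node's features (not shown).
print(relative_temporal_encoding(3.0))
```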
Liang Zhang (Xidian University), Xudong Wang (Xidian University), Hongsheng Li (Xidian University), Guangming Zhu (Xidian University), Peiyi Shen (Xidian University), Ping Li (ShangHai BNC), Xiaoyuan Lu (ShangHai BNC), Syed Afaq Ali Shah (The University of Western Australia) and Mohammed Bennamoun (The University of Western Australia).
Abstract
Various methods to deal with graph data have been proposed in recent years. However, most of these methods focus on graph feature aggregation rather than graph pooling. Besides, the existing top-k selection graph pooling methods have a few problems. First, to construct the pooled graph topology, current top-k selection methods evaluate the importance of the node from a single perspective only, which is simplistic and unobjective. Second, the feature information of unselected nodes is directly lost during the pooling process, which inevitably leads to a massive loss of graph feature information. To solve the problems mentioned above, we propose a novel graph self-adaptive pooling method with the following objectives: (1) to construct a reasonable pooled graph topology, structure and feature information of the graph are considered simultaneously, which provides additional veracity and objectivity in node selection; and (2) to make the pooled nodes contain sufficiently effective graph information, node feature information is aggregated before discarding the unimportant nodes; thus, the selected nodes contain information from neighbor nodes, which can enhance the use of features of the unselected nodes. Experimental results on four different datasets demonstrate that our method is effective in graph classification and outperforms state-of-the-art graph pooling methods.
User Modeling-A (3)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Amin Javari (University of Illinois at Urbana-Champaign), Zhankui He (UCSD), Zijie Huang (University of California, Los Angeles), Jeetu Raj (University of Illinois at Urbana-Champaign) and Kevin Chang (University of Illinois at Urbana-Champaign).
Abstract
Personalized trending hashtag recommendation for users could substantially promote user engagement in microblogging websites: users can easily discover recent microblogs aligned with their interests and information needs. However, user profiling and making personalized recommendations on microblogging websites is challenging because most users tend not to generate content data. Our core idea to address the problem is to build a network-based interest profile of users and incorporate it into hashtag recommendation. Indeed, user's followee/follower connections implicitly indicate their interests. Considering that microblogging networks are scale-free networks, to maintain the efficiency and effectiveness of the model, rather than analyzing the entire network, we model users by focusing on their links towards popular/hub nodes. That is, hashtags and hub nodes in the network are projected into a shared latent space. To predict the relevance of a user to a hashtag, a projection of the user is built by aggregating the embeddings of her hub neighbors guided by an attention model and then compared with the target hashtag. Classically, attention models with low complexity can be trained in an end to end manner. However, due to the high complexity of our problem, we propose a novel weak supervision model for the attention component, which significantly improves the effectiveness of the model. We performed extensive experiments on two datasets collected from Twitter and Weibo, and the results confirm that our method substantially outperforms the baseline methods.Xueliang Guo (School of Computer Science, Beijing Institute of Technology), Chongyang Shi (Beijing Institute of Technology School of Computer Science) and Chuanming Liu (Computer Science and Information Engineering, National Taipei University of Technology).
Abstract
Recently, sequential recommendation has attracted substantial attention from researchers due to its status as an essential service for e-commerce. Accurately understanding user intention is an important factor in improving the performance of recommender systems. However, user intention is highly time-dependent and flexible, so it is very challenging to learn the latent dynamic intention of users for sequential recommendation. To this end, in this paper, we propose a novel intention modeling from ordered and unordered facets (IMfOU) for sequential recommendation. Specifically, the global and local item embedding (GLIE) we propose can comprehensively capture the sequential context information in the sequences and highlight the important features that users care about. We further design ordered preference drift learning (OPDL) and unordered purchase motivation learning (UPML) to capture the user's preference drift process and purchase motivation, respectively. By combining users' dynamic preferences and current motivations, the model considers not only sequential dependencies between items but also flexible dependencies, and models user purchase intention more accurately from ordered and unordered facets. Evaluation results on three real-world datasets demonstrate that our proposed approach achieves better performance than state-of-the-art sequential recommendation methods, improving AUC by an average of 2.26%.
Chao Wang (University of Science and Technology of China), Hengshu Zhu (Baidu Inc.), Chen Zhu (Baidu Talent Intelligence Center), Xi Zhang (College of Management and Economics, Tianjin University), Enhong Chen (University of Science and Technology of China) and Hui Xiong (Rutgers University).
Abstract
As a major component of strategic talent management, learning and development (L&D) aims at improving individual and organizational performance through planning tailored training for employees to increase and improve their skills and knowledge. While many companies have developed learning management systems (LMSs) for facilitating the online training of employees, a long-standing important issue is how to achieve personalized training recommendations with consideration of their needs for future career development. To this end, in this paper, we propose an explainable personalized online course recommender system for enhancing employee training and development. A unique perspective of our system is to jointly model both the employees' current competencies and their career development preferences in an explainable way. Specifically, the recommender system is based on a novel end-to-end hierarchical framework, namely the Demand-aware Collaborative Bayesian Variational Network (DCBVN). In DCBVN, we first extract the latent interpretable representations of the employees' competencies from their skill profiles with autoencoding variational inference based topic modeling. Then, we develop an effective demand recognition mechanism for learning the personal demands of career development for employees. In particular, all the above processes are integrated into a unified Bayesian inference view for obtaining both accurate and explainable recommendations. Finally, extensive experimental results on real-world data clearly demonstrate the effectiveness and the interpretability of DCBVN, as well as its robustness in sparse and cold-start scenarios.
Xuhai Xu (University of Washington), Ahmed Hassan Awadallah (Microsoft), Susan T. Dumais (Microsoft), Farheen Omar (Microsoft), Bogdan Popp (Microsoft), Robert Rounthwaite (Microsoft) and Farnaz Jahanbakhsh (Massachusetts Institute of Technology).
Abstract
Personalized document recommendation systems aim to provide users with a quick shortcut to the documents they may want to access next, usually with an explanation about why the document is recommended. Previous work explored various methods on better recommendations and better explanations for different domains including news, movies, products, etc. However, there are few efforts that closely study how users react to the recommended items in a document recommendation scenario. We conducted a large-scale log study of users' interaction behavior with the explainable recommendation on one of the largest cloud document platforms. Our analysis reveals a number of factors, including display position, file type, authorship, recency of last access, and most importantly, the recommendation explanations, that are associated with whether users will recognize or open the recommended documents. Moreover, we specifically focus on explanations and conducted an online experiment to investigate the influence of different explanations on user behavior. Our analysis indicates that the recommendations help users access their documents significantly faster, but sometimes users miss a recommendation and resort to other more complicated methods to open the documents. Our results suggest opportunities to improve explanations and more generally the design of systems that provide and explain recommendations for documents.Wonbin Kweon (Pohang University of Science and Technology), Seongku Kang (Pohang University of Science and Technology), Junyoung Hwang (Pohang University of Science and Technology) and Hwanjo Yu (Pohang University of Science and Technology).
Abstract
Recent recommender systems have started to use rating elicitation, which asks new users to rate a small set of seed items for inferring their preferences, to improve the quality of initial recommendations. The key challenge of rating elicitation is to choose the most ''representative'' seed items that best infer the new users' preferences. The state-of-the-art approaches have two critical limitations: 1) they cannot capture the non-linear characteristics of collaborative filtering (CF) information; 2) they cannot fully consider the interactions among the whole set of seed items at a time, because they select the seed items in a greedy fashion. This paper proposes a novel end-to-end deep learning framework, called DRE, which chooses all the seed items at a time with consideration of the non-linear interactions. To this end, it first defines categorical distributions to sample seed items from the entire itemset, then it trains both the categorical distributions and a neural reconstruction network to infer users' preferences on the remaining items from the CF information of the sampled items. Through the end-to-end training, the categorical distributions are learned to select the most representative seed items while reflecting the complex non-linear interactions. Experimental results show that DRE outperforms the state-of-the-art methods in recommendation quality by accurately inferring the new users' preferences, and its seed itemset better represents the latent space than the seed itemsets obtained by the other methods.
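One common way to make "sample seed items from learnable categorical distributions" differentiable end to end is the Gumbel-softmax trick; the numpy sketch below illustrates that general technique with made-up logits and temperature. It is an illustration of the trick itself, not the DRE training code.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.5, seed=0):
    # Gumbel perturbation + softmax yields a differentiable, nearly one-hot sample
    # from the categorical distribution defined by the logits.
    rng = np.random.default_rng(seed)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

logits = np.array([0.2, 1.5, -0.3, 0.7])   # one learnable distribution over 4 items
soft = gumbel_softmax_sample(logits)
print(soft, "-> sampled seed item:", int(soft.argmax()))
```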
Crowdsourcing (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Carlos Toxtli (West Virginia University), Angela Richmond (Universidad Nacional Autonoma de Mexico) and Saiph Savage (West Virginia University).
Abstract
Gig markets rely on reviews to help customers or employers identify the workers they want to hire. However, gig markets have been plagued with unfair assessments containing inaccurate reputation signals about workers that can not only limit workers’ future job opportunities, but can also result in workers not getting paid or even being terminated from the marketplace. Unfair reviews are generally created because employers have a hard time differentiating the factors within the workers' control and the ones that have little to do with their performance (e.g., when they complain about an Uber driver getting stuck in traffic). However, because market power is typically placed in the hands of employers, a bad worker review can result in the worker losing her entire livelihood. To address this problem, we present Reputation Agent, a review validation system that helps employers to generate fair reviews. Reputation Agent implements an intelligent interface that: (1) uses deep learning to automatically detect when an individual has included unfair factors into her review (factors that are outside the control of the gig worker, according to the policies of the market); and (2) prompts the individual to reconsider her review if she has incorporated unfair factors. To study the effectiveness of Reputation Agent, we conducted a controlled experiment over different gig markets. Our experiment illustrates that across markets, Reputation Agent, in contrast with traditional approaches, motivates customers and employers to review gig workers' performance more fairly. We discuss how tools that bring more transparency to employers about the policies of a gig market can help build empathy, spark discussions around the established gig market rules, and could be used to help platform maintainers identify potential injustices towards workers generated by their interfaces. Our vision is that with truth and transparency we can bring fairer treatment of gig workers.Weiping Pei (Colorado School of Mines), Arthur Mayer (Colorado School of Mines), Kaylynn Tu (Colorado School of Mines) and Chuan Yue (Colorado School of Mines).
Abstract
Attention check questions have become commonly used in online surveys published on popular crowdsourcing platforms as a key mechanism to filter out inattentive respondents and improve data quality. However, little research considers the vulnerabilities of this important quality control mechanism that can allow attackers including irresponsible and malicious respondents to automatically answer attention check questions for efficiently achieving their goals. In this paper, we perform the first study to investigate such vulnerabilities, and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model, that combines convolutional neural network and weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset that consists of both original and augmented questions, and demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods, adding adversarial sentences and adding typos, for survey designers to mitigate the risks posed by AC-EasyPass; however, these methods are fragile due to their limitations from both technical and usability perspectives, underlining the challenging nature of defense. We hope our work will raise sufficient attention of the research community towards developing more robust attention check mechanisms. More broadly, our work intends to prompt the research community to seriously consider the emerging risks posed by the malicious use of machine learning techniques to the quality, validity, and trustworthiness of crowdsourcing and social computing.Ines Arous (University of Fribourg), Jie Yang (Amazon Research), Mourad Khayati (University of Fribourg) and Philippe Cudre-Mauroux (University of Fribourg).
Abstract
Finding social influencers is a fundamental task in many online applications ranging from brand marketing to opinion mining. Existing methods heavily rely on the availability of expert labels, whose collection is usually a laborious process even for domain experts. Using open-ended questions, crowdsourcing provides a cost-effective way to find a large number of social influencers in a short time. Individual crowd workers, however, only possess fragmented knowledge that is often of low quality. To tackle those issues, we present OpenCrowd, a unified Bayesian framework that seamlessly incorporates supervised learning and crowdsourcing for effectively finding social influencers. To infer a set of influencers, OpenCrowd bootstraps the learning process using a small number of expert labels and then jointly learns a feature-based answer quality model and the reliability of the workers. Model parameters and worker reliability are updated iteratively, allowing their learning processes to benefit from each other until an agreement on the quality of the answers is reached. We derive a principled optimization algorithm based on variational inference with efficient update rules for learning OpenCrowd parameters. Experimental results on finding social influencers in different domains show that our approach substantially improves the state of the art by 11.5% AUC.
Health (2)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Junyi Gao (IQVIA), Cao Xiao (IQVIA), Yasha Wang (Peking University), Wen Tang (Peking University Health Science Center), Lucas Glass (IQVIA) and Jimeng Sun (Georgia Institute of Technology).
Abstract
Deep learning has demonstrated success in health risk prediction, especially for patients with chronic and progressing conditions. Most existing works focus on learning chronic disease patterns from longitudinal patient data, but pay little attention to the disease progression stage itself. To fill the gap, we propose a Stage-aware neural Network (StageNet) model to extract disease stage information from patient data and integrate it into risk prediction. StageNet is enabled by (1) a stage-aware long short-term memory (LSTM) module that extracts health stage variations in an unsupervised manner; (2) a stage-adaptive convolutional module that incorporates stage-related variation patterns into risk prediction. We evaluate StageNet on two real-world datasets and show that StageNet outperforms state-of-the-art models in both the risk prediction and patient subtyping tasks. Compared to the best baseline model, StageNet achieves up to 12% higher AUPRC for the risk prediction task on two real-world patient datasets. StageNet also achieves over 58% higher Calinski-Harabasz score (a cluster quality metric) for the patient subtyping task.
Xingyao Zhang (Tsinghua University), Cao Xiao (IQVIA), Lucas Glass (IQVIA) and Jimeng Sun (Georgia Institute of Technology).
Abstract
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How to represent longitudinal patient EHR? How to extract complex logical rules from EC? Most existing works rely on manual rule-based extraction, which is time consuming and inflexible for complex inference. To address these challenges, we propose DeepEnroll, a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. DeepEnroll applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent longitudinal patient EHR. In addition, DeepEnroll is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated DeepEnroll on the patient-trial matching task on real-world datasets. DeepEnroll outperformed the best baseline by up to 12.4% in average F1.
Rahul Duggal (Georgia Institute of Technology), Scott Freitas (Georgia Institute of Technology), Cao Xiao (IQVIA), Duen Horng Chau (Georgia Institute of Technology) and Jimeng Sun (Georgia Institute of Technology).
Abstract
In recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise, and (2) compact and energy-efficient. We propose REST, a new method that simultaneously tackles both issues via (1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization, while (2) enforcing sparsity on whole filters. We demonstrate that REST produces highly-robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), REST achieves a macro-F1 score of 0.69 vs. 0.33 for the Vanilla model in the presence of adversarial noise, while obtaining 19x parameter reduction and 15x MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that REST allows models to achieve up to 17x energy reduction and 9x faster inference.
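The Lipschitz-control ingredient described in the abstract amounts to penalizing each layer's largest singular value (its spectral norm). The sketch below estimates that norm for a weight matrix with a few steps of power iteration; it illustrates the generic idea of spectral regularization and is not the REST training code.

```python
import numpy as np

def spectral_norm(weight, iters=20, seed=0):
    # Power iteration: alternate multiplying by W and W^T to converge on the top
    # singular pair; the product u^T W v then estimates the spectral norm.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=weight.shape[1])
    for _ in range(iters):
        u = weight @ v
        u /= np.linalg.norm(u)
        v = weight.T @ u
        v /= np.linalg.norm(v)
    return float(u @ weight @ v)

w = np.array([[2.0, 0.0], [0.0, 0.5]])
penalty = spectral_norm(w)      # a regularizer would add coefficient * penalty to the loss
print(penalty)                  # ~2.0, the layer's Lipschitz constant in the 2-norm
```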
Economics (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Yuan Deng (Duke University), Sébastien Lahaie (Google), Vahab Mirrokni (Google) and Song Zuo (Google).
Abstract
An incentive-compatible auction incentivizes buyers to truthfully reveal their private valuations. However, many ad auction mechanisms deployed in practice are not incentive-compatible, such as first-price auctions (for display advertising) and the generalized second-price auction (for search advertising). We introduce a new metric to quantify incentive compatibility in both static and dynamic environments. Our metric is data-driven and can be computed directly through black-box auction simulations without relying on reference mechanisms or complex optimizations. We provide interpretable characterizations of our metric and prove that it is monotone in auction parameters for several mechanisms used in practice, such as soft floors and dynamic reserve prices. We empirically evaluate our metric on ad auction data from a major ad exchange and a major search engine to demonstrate its broad applicability in practice.
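As a toy illustration of the black-box simulation idea, the sketch below estimates a bidder's IC regret (how much expected utility could be gained by shading the bid) by re-running a stand-in first-price auction with counterfactual bids. The auction, the rivals' value distribution, and the shading grid are invented; this is not the paper's metric.

```python
import random

random.seed(0)

def first_price(bids):                         # stand-in black-box auction
    winner = max(range(len(bids)), key=lambda i: bids[i])
    return winner, bids[winner]                # winner pays its own bid

def ic_regret(value, rival_draws, auction, shades=(1.0, 0.9, 0.8, 0.7), trials=2000):
    def expected_utility(bid):
        total = 0.0
        for _ in range(trials):
            bids = [bid] + [draw() for draw in rival_draws]
            winner, price = auction(bids)
            total += (value - price) if winner == 0 else 0.0
        return total / trials
    truthful = expected_utility(value)
    best_deviation = max(expected_utility(s * value) for s in shades)
    return best_deviation - truthful           # ~0 for IC mechanisms, > 0 otherwise

rivals = [lambda: random.uniform(0.0, 1.0) for _ in range(2)]
print(ic_regret(value=0.8, rival_draws=rivals, auction=first_price))
```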
Geoffrey Ramseyer (Stanford University), Ashish Goel (Stanford University) and David Mazieres (Stanford University).
Abstract
In order to scale transaction rates for deployment across the global web, many cryptocurrencies have deployed so-called "Layer-2" networks of private payment channels. An idealized payment network behaves like a Credit Network, a model for transactions across a network of bilateral trust relationships. Credit Networks capture many aspects of traditional currencies as well as new virtual currencies and payment mechanisms. In the traditional credit network model, if an agent defaults, every other node that trusted it is vulnerable to loss. In a cryptocurrency context, trust is manufactured by capital deposits, and thus there arises a natural tradeoff between network liquidity (i.e., the fraction of transactions that succeed) and the cost of capital deposits. In this paper, we introduce constraints that bound the total amount of loss that the rest of the network can suffer if an agent (or a set of agents) were to default; equivalently, how the network changes if agents can support limited solvency guarantees. We show that these constraints preserve the analytical structure of a credit network. Furthermore, we show that aggregate borrowing constraints greatly simplify the network structure and, in the payment network context, achieve the optimal tradeoff between liquidity and the amount of escrowed capital.
Renato Paes Leme (Google), Balasubramanian Sivan (Google) and Yifeng Teng (University of Wisconsin-Madison).
Abstract
We consider a setting in which bidders participate in multiple auctions run by different sellers, and optimize their bids for the aggregate auction. We analyze this setting by formulating a game between sellers, where a seller's strategy is to pick an auction to run. Our analysis aims to shed light on the recent change in the Display Ads market landscape: here, ad exchanges (sellers) were mostly running second price auctions earlier and over time they switched to variants of the first price auction, culminating in Google's Ad Exchange moving to a first price auction in 2019. Our model and results offer an explanation for why the first price auction occurs as a natural equilibrium in such competitive markets.
Riccardo Colini Baldeschi (Facebook, Core Data Science), Stefano Leonardi (Sapienza University of Rome), Okke Schrijvers (Facebook) and Eric Sodomka (Facebook, Core Data Science).
Abstract
Incentive compatibility (IC) is a desirable property for any auction mechanism, including those used in online advertising. However, in real-world applications, practical constraints and complex environments often result in mechanisms that lack incentive compatibility. Recently, several papers investigated the problem of deploying black-box statistical tests to determine if an auction mechanism is incentive compatible. Unfortunately, most of those methods are costly, since they require the execution of many counterfactual experiments. In this work, we show that similar results can be obtained using the notion of IC-Envy. The advantage of IC-Envy is its efficiency: it can be computed using only the auction's outcome. In particular, we focus on two relevant environments: position auctions and Ad Types auctions. For position auctions, we show that for a large class of pricing schemes (which includes e.g. VCG and GSP), IC-Envy >= IC-Regret (and IC-Envy = IC-Regret under mild supplementary conditions). Next, we consider non-separable CTRs in the Ad Types environment. In this setting, we show that for a generalization of the GSP mechanism, IC-Envy >= IC-Regret holds as well. Our theoretical results are complemented by showing that, in the position auction environment, IC-Envy can be used to bound the loss in social welfare due to advertisers' untruthful behavior. Finally, we show experimentally that IC-Envy can be used as a feature to predict IC-Regret in settings not covered by the theoretical results. In particular, using IC-Envy yields better results than training models using only price and value features.
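The appeal of IC-Envy is that it can be read directly off a single auction outcome: bidder i envies outcome j if taking j's allocation and payment would raise i's utility. The sketch below computes that quantity for a made-up position-auction outcome; it illustrates the definition only, not the paper's bounds or experiments.

```python
def ic_envy(values, alloc, prices):
    # envy_i = max_j (v_i * x_j - p_j) - (v_i * x_i - p_i), all taken from one outcome.
    envy = []
    for i, v in enumerate(values):
        own = v * alloc[i] - prices[i]
        best = max(v * alloc[j] - prices[j] for j in range(len(values)))
        envy.append(best - own)
    return envy

values = [1.0, 0.8, 0.5]      # advertisers' per-click values (invented)
alloc = [0.6, 0.4, 0.2]       # click-through rates of the assigned positions
prices = [0.45, 0.25, 0.05]   # payments charged by the auction
print(ic_envy(values, alloc, prices))
```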
Xiaotie Deng (Peking University), Tao Lin (Peking University) and Tao Xiao (Shanghai Jiao Tong University).
Abstract
The sponsored search auction, introduced some 20 years ago, was the first successful mechanism to commercialize an Internet service. It is a market between a seller of online advertisement slots and many buyers of the slots to place their adverts. Conceptually, the auction is repeated billions of times, making it an atypical auction in which the market maker is able to observe buyers' data and to adapt the auction protocol to its knowledge of buyers' value distributions. We formulate the auction under the above scenario as a Private Data Manipulation game between the seller and buyers: the seller first announces an auction whose allocation and payment rules are based on the buyers' distributions; then every buyer submits a value distribution for the auction (implemented by its submitted data following this distribution); finally the allocation and payment rules are carried out. We are interested in whether and how rational buyers would submit value distributions. Taking this consideration into account, we re-evaluate the theory, methodology and techniques that have been the most intensively studied in Internet economics.
Systems (1)
(UTC/GMT +8) 16:00-18:00, April, 22, Wednesday
Meeting rooms are not available now
Meng Ma (Peking University), Ping Wang (Peking University), Jing Min Xu (IBM Research - China), Yuan Wang (IBM CRL), Pengfei Chen (Sun Yat-sen University) and Zonghua Zhang (IMT Lille Douai, Institut Mines-Télécom).
Abstract
The high complexity and dynamics of the microservice architecture make its application diagnosis extremely challenging. In this study, we design a novel tool, named AutoMAP, which enables dynamic generation of service correlations and automated diagnosis leveraging multiple types of metrics. In AutoMAP, we propose the concept of an anomaly behavior graph to describe the correlations between services associated with different types of metrics. Two binary operations, as well as a similarity function on the behavior graph, are defined to help AutoMAP choose the appropriate diagnosis metric in any particular scenario. Following the behavior graph, we design a heuristic investigation algorithm using forward, self, and backward random walks, with the objective of identifying the root cause services. To demonstrate the strengths of AutoMAP, we develop a prototype and evaluate it in both a simulated environment and a real-world enterprise cloud system. Experimental results clearly indicate that AutoMAP achieves over 90% precision, which significantly outperforms other selected baseline methods. AutoMAP can be quickly deployed in a variety of microservice-based systems without any system knowledge. It also supports the introduction of expert knowledge to improve accuracy.
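A highly simplified version of the random-walk ingredient is sketched below: edges are weighted by how strongly each callee's metric correlates with the anomalous front-end metric, and the services the walker visits most often are ranked as root-cause candidates. The service graph, weights, and walk length are invented, and the real AutoMAP additionally uses self and backward steps over multiple metric types.

```python
import random

def rank_root_causes(edges, entry, steps=10000, seed=0):
    rng = random.Random(seed)
    visits, node = {}, entry
    for _ in range(steps):
        out = edges.get(node, [])
        if not out:                                  # dead end: restart at the entry point
            node = entry
            continue
        callees, weights = zip(*out)
        node = rng.choices(callees, weights=weights)[0]
        visits[node] = visits.get(node, 0) + 1
    return sorted(visits, key=visits.get, reverse=True)

# service -> [(callee, correlation of callee's metric with the front-end anomaly)]
edges = {
    "frontend": [("orders", 0.9), ("search", 0.2)],
    "orders":   [("db", 0.8), ("cache", 0.1)],
    "search":   [("cache", 0.3)],
}
print(rank_root_causes(edges, "frontend"))
```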
Austin Hounsel (Princeton University), Kevin Borgolte (Princeton University), Paul Schmitt (Princeton University), Jordan Holland (Princeton University) and Nick Feamster (University of Chicago).
Abstract
Nearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user's browsing history and even what smart devices they are using at home. In response to these privacy concerns, two new protocols have been proposed: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT). Instead of sending DNS queries and responses in the clear, DoH and DoT establish encrypted connections between users and resolvers. By doing so, these protocols provide privacy and security guarantees that traditional DNS (Do53) lacks. In this paper, we measure the effect of Do53, DoT, and DoH on query response times and page load times from five global vantage points. We find that although DoH and DoT response times are generally higher than Do53, both protocols can perform better than Do53 in terms of page load times. However, as throughput decreases and substantial packet loss and latency are introduced, web pages load fastest with Do53. Additionally, web pages successfully load more often with Do53 and DoT than DoH. Based on these results, we provide several recommendations to improve DNS performance, such as opportunistic partial responses and wire format caching.
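A small-scale version of one of these measurements can be reproduced by timing a DNS-over-HTTPS lookup against a public JSON resolver endpoint, as sketched below. The endpoint and response fields follow Google's public documentation (treated here as an assumption), and the sketch ignores connection reuse, caching, and the Do53/DoT baselines the paper also compares against.

```python
import time
import requests

def time_doh_query(name, endpoint="https://dns.google/resolve"):
    # Time a single DoH (JSON) lookup and return the latency plus the A records.
    start = time.perf_counter()
    resp = requests.get(endpoint, params={"name": name, "type": "A"}, timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000
    answers = [a["data"] for a in resp.json().get("Answer", [])]
    return elapsed_ms, answers

ms, answers = time_doh_query("example.com")
print(f"{ms:.1f} ms -> {answers}")
```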
Kevin Borgolte (Princeton University) and Nick Feamster (University of Chicago).
Abstract
Advertisements and behavioral tracking have become an invasive nuisance on the Internet in recent years. Privacy advocates and expert users consider the invasion significant enough to warrant the use of ad blockers and anti-tracking browser extensions. At the same time, one of the largest advertisement companies in the world, Google, is developing the most popular browser, Google Chrome. This conflict of interest, that is, developing a browser (a user agent) while being financially motivated to track users' online behavior, possibly violating their privacy expectations, all while claiming to be a "user agent," did not go unnoticed. As a matter of fact, Google recently sparked outrage when proposing changes to how Chrome extensions can inspect and modify requests in order to "improve extension performance and privacy," which would render existing privacy-focused extensions inoperable. In this paper, we analyze how eight popular privacy-focused browser extensions for Google Chrome and Mozilla Firefox, the two desktop browsers with the highest market share, affect browser performance. We measure browser performance through several metrics focused on user experience, such as page-load times, number of fetched resources, as well as response sizes. To address potential regional differences in advertisements or tracking, such as those influenced by the European General Data Protection Regulation (GDPR), we perform our study from two vantage points, the United States of America and Germany. Moreover, we also analyze how these extensions affect system performance, in particular CPU time, which serves as a proxy indicator for the battery runtime of mobile devices. Contrary to Google's claims that extensions which inspect and block requests negatively affect browser performance, we find that a browser with privacy-focused request-modifying extensions performs similarly or better on our metrics compared to a browser without extensions. In fact, even a combination of such extensions performs no worse than a browser without any extensions. Our results highlight that privacy-focused extensions not only improve users' privacy, but can also improve users' browsing experience.
Research Tracks (4)
Web Mining-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Shen Gao (Peking University), Xiuying Chen (Peking University), Chang Liu (Peking University), Li Liu (INCEPTION INSTITUTE OF ARTIFICIAL INTELLIGENCE), Dongyan Zhao (Peking University) and Rui Yan (Peking University).
Abstract
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting sticker responses by matching the text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all the stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. Another challenge is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolution-based sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results by a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance for all commonly-used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 350K multi-turn dialog and sticker pairs.
Jing Li (Inception Institute of Artificial Intelligence), Shuo Shang (Inception Institute of Artificial Intelligence) and Ling Shao (Inception Institute of Artificial Intelligence).
Abstract
Recent advances in named entity recognition (NER) using deep neural models have yielded state-of-the-art performance on single domain data such as newswires. However, they still suffer from (i) requiring massive amounts of training data to avoid overfitting; (ii) huge performance degradation when there is a domain shift in the data distribution between training and testing. To make an NER system more broadly useful, it is crucial to reduce its training data requirements and transfer knowledge to other domains. In this paper, we investigate the problem of domain adaptation for NER under homogeneous and heterogeneous settings. We propose MetaNER, a novel meta-learning approach for domain adaptation in NER. Specifically, MetaNER incorporates meta-learning and adversarial training strategies to encourage robust, general and transferable representations for sequence labeling. The key advantage of MetaNER is that it is capable of accurately and quickly adapting to new unseen domains with a small amount of annotated data from those domains. We extensively evaluate MetaNER on multiple datasets under homogeneous and heterogeneous settings. The experimental results show that MetaNER achieves state-of-the-art performance against eight baselines. Impressively, MetaNER surpasses the in-domain performance using only 16.17% and 34.76% of target domain data on average for homogeneous and heterogeneous settings, respectively. We conduct experiments to further analyze the parameter settings and architectural choices. We also present a study for qualitative analysis.Xiaotao Gu (University of Illinois at Urbana-Champaign), Yuning Mao (University of Illinois at Urbana-Champaign), Jiawei Han (University of Illinois at Urbana-Champaign), Jialu Liu (Google), You Wu (Google), Cong Yu (Google), Daniel Finnie (Google), Hongkun Yu (Google), Jiaqi Zhai (Google) and Nicholas Zukoski (Google).
Abstract
Millions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that are reporting the same event into news stories is a common way of assisting readers in their news consumption. However, it remains a challenging research problem to efficiently and effectively generate a representative headline for each story. Automatic summarization of a document set has been studied for decades, while few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture most information with least redundancy, headlines aim to capture information jointly shared by the story articles in short length, and exclude information that is too specific to each individual article. In this work, we study the problem of generating representative headlines for news stories. We develop a distant supervision approach to train large-scale generation models without any human annotation. This approach centers on two technical components. First, we propose a multi-level pre-training framework that incorporates massive unlabeled corpus with different quality-vs.-quantity balance at different levels. We show that models trained within this framework outperform those trained with pure human curated corpus. Second, we propose a novel self-voting-based article attention layer to extract salient information shared by multiple articles. We show that models that incorporate this layer are robust to potential noises in news stories and outperform existing baselines with or without noises. We can further enhance our model by incorporating human labels, and we show our distant supervision approach significantly reduces the demand on labeled data. Finally, to serve the research community, we publish the first manually curated benchmark dataset, NewSHead, which contains 367K stories (each with 3-5 articles), 6.5 times larger than the current largest multi-document summarization dataset.
Yanxiang Ling (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology), Fei Cai (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology), Honghui Chen (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology) and Maarten de Rijke (Informatics Institute, University of Amsterdam).
Abstract
Question generation in open-domain dialogue systems is a challenging but less-explored task. It is aimed at enhancing the interactiveness and persistence of human-machine interactions. Previous work mainly focuses on question generation in the setting of single-turn dialogues, or investigates it as a data augmentation method for machine comprehension. We propose a Context-augmented Neural Question Generation (CNQG) model that leverages the conversational context to generate questions for promoting interactiveness and persistence of multi-turn dialogues. More specifically, we formulate the task of question generation as a two-stage process. First, we employ an encoder-decoder framework to predict a question pattern, which denotes a set of representative interrogatives, and identify the potential topics from the conversational context by employing point-wise mutual information. Then, we generate the question by decoding the concatenation of the current dialogue utterance, the pattern, and the topics with an attention mechanism. To the best of our knowledge, ours is the first work on question generation in multi-turn open-domain dialogue systems. Our experimental results on two publicly available multi-turn conversation datasets show that CNQG outperforms the state-of-the-art baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. In addition, we find that CNQG allows one to efficiently distill useful features from long contexts, and maintains robust effectiveness even for short contexts.
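The topic-identification step mentioned above relies on point-wise mutual information (PMI) between words in the conversational context. The sketch below computes PMI over a tiny invented set of context utterances with whitespace tokenization, purely to illustrate the quantity being used; it is not the CNQG implementation.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(utterances):
    # PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ), estimated per-utterance.
    word_counts, pair_counts, n = Counter(), Counter(), 0
    for utt in utterances:
        words = set(utt.lower().split())
        n += 1
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, c in pair_counts.items():
        w1, w2 = tuple(pair)
        scores[pair] = math.log((c / n) / ((word_counts[w1] / n) * (word_counts[w2] / n)))
    return scores

context = ["we watched a great movie", "the movie soundtrack was great",
           "dinner afterwards was nice"]
for pair, score in sorted(pmi_scores(context).items(), key=lambda kv: -kv[1])[:3]:
    print(sorted(pair), round(score, 2))
```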
Subendhu Rongali (University of Massachusetts Amherst), Luca Soldaini (Amazon Alexa Search), Emilio Monti (Amazon Alexa) and Wael Hamza (Amazon Alexa AI).
Abstract
Virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic parsing component to understand which action(s) to execute for an utterance spoken by its users. Traditionally, rule-based or statistical slot-filling systems have been used to parse ''simple'' queries; that is, queries that contain a single action and can be decomposed into a set of non-overlapping entities. More recently, shift-reduce parsers have been proposed to process more complex utterances. These methods, while powerful, impose specific limitations on the type of queries that can be parsed; namely, they require a query to be representable as a parse tree. In this work, we propose a unified architecture based on Sequence to Sequence models and a Pointer Generator network to handle both simple and complex queries. Unlike other works, our approach does not impose any restriction on the semantic parse schema. Furthermore, experiments show that it achieves state-of-the-art performance on three publicly available datasets (ATIS, SNIPS, Facebook TOP), relatively improving between 3.4% and 13.2% in exact match accuracy over any previous systems. Finally, we show the effectiveness of our approach on two internal datasets.
Social Network-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Jia Li (The Chinese University of Hong Kong), Honglei Zhang (Georgia Institute of Technology), Zhichao Han (The Chinese University of Hong Kong), Yu Rong (Tencent AI Lab), Hong Cheng (The Chinese University of Hong Kong) and Junzhou Huang (Tencent AI Lab).
Abstract
It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations added, can cause deep graph models to fail on node/graph classification tasks. In this paper, we extend adversarial graphs to the community detection problem which is much more difficult. We focus on black-box attack and aim to hide targeted individuals from the detection of deep graph community detection models, which has many applications in real world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. We propose an iterative learning framework that takes turns to update two modules: one working at the constrained graph generation and the other at the surrogate community detection model. We also find that the adversarial graphs generated by our method can be transferred to other learning based community detection models.
Ekta Gujral (University of California, Riverside), Ravdeep Pasricha (University Of California Riverside) and Evangelos Papalexakis (University of California Riverside).
Abstract
How are communities in real multi-aspect or multi-view graphs structured? How can we effectively and concisely summarize and explore those communities in a high-dimensional, multi-aspect graph without losing important information? State-of-the-art studies focused on patterns in single graphs, identifying structures in a single snapshot of a large network or in time-evolving graphs and stitching them over time. However, to the best of our knowledge, there is no method that discovers and summarizes community structure from a multi-aspect graph by jointly leveraging information from all aspects. The state of the art in multi-aspect/tensor community extraction is limited to discovering clique structure in the extracted communities, or even worse, imposing a clique structure where it does not exist. In this paper, we bridge that gap by empowering tensor-based methods to extract rich community structure from multi-aspect graphs. In particular, we introduce cLL1, a novel constrained Block Term Tensor Decomposition, that is generally capable of extracting higher than rank-1 but still interpretable structure from a multi-aspect dataset. Subsequently, we propose RICHCOM, a community structure extraction and summarization algorithm that leverages cLL1 to identify rich community structure (e.g., cliques, stars, chains, etc.) while leveraging higher-order correlations between the different aspects of the graph. Our contributions are four-fold: (a) Novel algorithm: we develop cLL1, an efficient framework to extract rich and interpretable structure from general multi-aspect data; (b) Graph summarization and exploration: we provide cLL1B, a summarization and encoding scheme to discover and explore structures of communities identified by cLL1; (c) Multi-aspect graph generator: we provide a simple and effective synthetic multi-aspect graph generator; and (d) Real-world utility: we present empirical results on small and large real datasets that demonstrate performance on par with or superior to existing state-of-the-art.
Shweta Jain (University of California, Santa Cruz) and C. Seshadhri (University of California, Santa Cruz).
Abstract
Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, all the more so since the search space for near-cliques is orders of magnitude larger than that of cliques. We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges, in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x-100x speedup over existing algorithms for counting near-cliques. Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow.
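For readers who want a concrete reference for the object being counted, the brute-force sketch below counts k-vertex sets whose induced subgraph is missing exactly r edges. It is only usable on tiny graphs and bears no resemblance to the Turán-Shadow-based sampler in the paper; the example graph is invented.

```python
from itertools import combinations
from math import comb

def count_near_cliques(nodes, edges, k, missing):
    # A k-near-clique with `missing` absent edges is a k-vertex set whose induced
    # subgraph contains exactly C(k, 2) - missing edges.
    edge_set = {frozenset(e) for e in edges}
    target = comb(k, 2) - missing
    count = 0
    for subset in combinations(nodes, k):
        present = sum(frozenset(p) in edge_set for p in combinations(subset, 2))
        count += (present == target)
    return count

nodes = range(5)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
print(count_near_cliques(nodes, edges, k=3, missing=1))   # triangles with one edge absent
```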
Kasper Green Larsen (Aarhus University), Michael Mitzenmacher (Harvard University) and Charalampos Tsourakakis (Boston University).
Abstract
Clustering, i.e., finding groups in the data, is a problem that permeates multiple fields of science and engineering. Recently, the problem of clustering with a noisy oracle has drawn attention due to various applications including crowdsourced entity resolution [verroios2015entity], and predicting signs of interactions in large-scale online social networks [leskovec2010signed, leskovec2010predicting]. Here, we consider the following fundamental model for two clusters as proposed by Mitzenmacher and Tsourakakis [mitzenmacher2016predicting], and Mazumdar and Saha [mazumdar2017clustering]: there exist $n$ items, belonging to two unknown groups. We are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $0
Shaohua Fan (Beijing University of Posts and Telecommunications), Xiao Wang (Beijing University of Posts and Telecommunications), Chuan Shi (Beijing University of Posts and Telecommunications), Emiao Lu (Tencent), Ken Lin (Tencent) and Bai Wang (Beijing University of Posts and Telecommunications).
Abstract
Multi-view graph clustering, which seeks a partition of the graph with multiple views that often provide more comprehensive yet complex information, has received considerable attention in recent years. Although some efforts have been made for multi-view graph clustering and achieve decent performance, most of them employ shallow models to deal with the complex relations within multi-view graphs, which may seriously restrict the capacity for modeling multi-view graph information. In this paper, we make the first attempt to employ deep learning techniques for attributed multi-view graph clustering, and propose a novel task-guided One2Multi graph autoencoder clustering framework. The One2Multi graph autoencoder is able to learn node embeddings by employing one informative graph view and content data to reconstruct multiple graph views. Hence, the shared feature representation of multiple graphs can be well captured. Furthermore, a self-training clustering objective is proposed to iteratively improve the clustering results. By integrating the self-training and the autoencoder's reconstruction into a unified framework, our model can jointly optimize the cluster label assignments and embeddings suitable for graph clustering. Experiments on real-world attributed multi-view graph datasets well validate the effectiveness of our model.
User Modeling-A (4)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Qinyong Wang (The University of Queensland), Hongzhi Yin (The University of Queensland), Tong Chen (The University of Queensland), Zi Huang (The University of Queensland), Hao Wang (Alibaba AI Labs), Yanchang Zhao (CSIRO) and Quoc Viet Hung Nguyen (Griffith University).
Abstract
In the modern tourism industry, next point-of-interest (POI) recommendation is one of the most important mobile services as it effectively aids hesitating travelers to decide the next POI to visit. Currently, most next POI recommender systems are built upon a cloud-based paradigm, where the recommendation models (usually deep learning-based) are trained and deployed on the powerful cloud servers. When a recommendation request is made by a user via mobile devices, the current contextual information (e.g., location and time) will be uploaded to the cloud servers to help the well-trained models generate personalized recommendation results. However, in reality, this paradigm heavily relies on high-quality network connectivity, and is subject to high energy footprint in the operation and increasing privacy concerns among the public.To bypass the defects of cloud-based recommendation paradigm, in this paper, we propose a novel Light Location Recommender System (LLRec) to perform next POI recommendation locally on resource-constrained mobile devices. To make LLRec fully compatible with the limited computing resources and memory space, we leverage FastGRNN, a lightweight but effective gated Recurrent Neural Network (RNN) as its main building block, and significantly compress the model size by adopting the tensor-train composition in the embedding layer. As a compact model, LLRec maintains its robustness via an innovative teacher-student training framework, where a powerful teacher model is trained on the cloud to learn essential knowledge from available contextual data, and the simplified student model LLRec is trained under the guidance of the teacher model. The final LLRec is downloaded and deployed on users' mobile devices to generate accurate recommendations solely utilizing users' local data. As a result, LLRec significantly reduces the dependency on cloud servers, thus allowing for next POI recommendation in a stable, cost-effective and secure way. Extensive experiments on two large-scale recommendation datasets further demonstrate the superiority of our proposed solution.Xiang Wang (National University of Singapore), Yaokun Xu (Southeast University), Xiangnan He (University of Science and Technology of China), Yixin Cao (National University of Singapore), Meng Wang (HeFei University of Technology) and Tat-Seng Chua (National University of Singapore).
Abstract
Properly handling missing data is a fundamental challenge in recommendation. Most present work performs negative sampling from missing data to supply the training of recommender models with negative signals. Nevertheless, existing negative sampling strategies, either static or dynamic ones, are insufficient to yield high-quality negative samples — both informative to model training and reflective of user real tastes.In this work, we hypothesize that item knowledge graph (KG), which provides rich and unbiased relations among users, items, and KG entities, could be useful to infer informative and factual negative samples. We develop a new negative sampling model, Knowledge Graph Policy Network (KGPolicy), which works as a reinforcement learning agent to explore high-quality negatives. Specifically, by conducting our designed exploring operations, it navigates from the target positive interaction, adaptively receives attribute-based negative signals, and ultimately yields a potential negative item to train the recommender. Empirically, matrix factorization (MF) equipped with KGPolicy achieves significant improvements over both state-of-the-art sampling methods like DNS and IRGAN, and KG-enhanced recommender models like RippleNet and KGAT. Further analysis on how knowledge graph facilitates the recommender learning provides insights of knowledge-aware negative sampling. Code and parameter settings will be released upon acceptance.Fajie Yuan (Tencent), Xiangnan He (University of Science and Technology of China), Haochuan Jiang (Tencent), Guibing Guo (Northeastern University), Jian Xiong (tencent), Zhezhao Xu (tencent) and Yilin Xiong (Tencent).
Abstract
Session-based recommender systems have attracted much attention recently. To capture the sequential dependencies, existing methods resort either to data augmentation techniques or left-to-right style autoregressive training. Since these methods aim to model the sequential nature of user behaviors, they ignore the future data of a target interaction when constructing the prediction model for it. However, we argue that the future interactions after a target interaction, which are also available during training, provide a valuable signal on user preference and can be used to enhance the recommendation quality. Properly integrating future data into model training, however, is non-trivial to achieve, since it disobeys machine learning principles and can easily cause data leakage. To this end, we propose a new encoder-decoder framework named Gap-filling based Recommender (GRec), which trains the encoder and decoder by a gap-filling mechanism. Specifically, the encoder takes a partially-complete session sequence (where some items are masked on purpose) as input, and the decoder predicts these masked items conditioned on the encoded representation. We instantiate the general GRec framework using a convolutional neural network with sparse kernels, giving consideration to both accuracy and efficiency. We conduct experiments on two real-world datasets covering short-, medium-, and long-range user sessions, showing that GRec significantly outperforms the state-of-the-art sequential recommendation methods. Further empirical studies verify the high utility of modeling future contexts under our GRec framework.
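As a rough illustration of the gap-filling input construction described above, the sketch below masks a few items of a session so that an encoder sees a partially-complete sequence and a decoder is asked to reconstruct the masked items. The masking ratio, mask token and toy item ids are assumptions for illustration, not the authors' exact preprocessing.

```python
import random


def gap_fill(session, mask_token=0, mask_ratio=0.3, seed=42):
    """Randomly mask a fraction of items in a session.

    Returns the partially-complete encoder input and the list of
    (position, original item) pairs the decoder should reconstruct.
    """
    rng = random.Random(seed)
    n_masked = max(1, int(len(session) * mask_ratio))
    positions = sorted(rng.sample(range(len(session)), n_masked))
    encoder_input = list(session)
    targets = []
    for pos in positions:
        targets.append((pos, session[pos]))
        encoder_input[pos] = mask_token
    return encoder_input, targets


if __name__ == "__main__":
    session = [17, 42, 8, 99, 23, 5]       # toy item ids
    enc_in, targets = gap_fill(session)
    print("encoder input:  ", enc_in)       # some items replaced by the mask token
    print("decoder targets:", targets)      # masked positions and original items
```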
Yichao Zhou (University of California, Los Angeles), Shaunak Mishra (Yahoo Research), Manisha Verma (Yahoo Research), Narayan Bhamidipati (Yahoo Research) and Wei Wang (University of California, Los Angeles).
Abstract
There is a perennial need in the online advertising industry to refresh ad creatives, i.e., images and text used for enticing online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users, and to incorporate insights from other successful campaigns in related product categories. However, given a brand, to come up with themes for a new ad is a painstaking and time consuming process for creative strategists. Among other things, strategists typically draw inspiration from the images and text used for past ad campaigns, as well as world knowledge on the brands. To automatically infer ad themes via such multimodal sources of information in past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. In particular, the theme recommender is based on aggregating results from a visual question answering (VQA) task, which ingests the following: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. To harness the multimodal nature of the above inputs, we leverage transformer based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations for the VQA task along the lines of classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics. Specifically, cross-modal representations show better performance compared to separate image and text representations. In addition, the use of multimodal information shows a significant lift over using only textual or visual information. Finally, we share creative strategy insights on selected product categories in the public dataset using our approach.Jiayi Xie (Wuhan University), Yaochen Zhu (Wuhan University), Zhibin Zhang (Wuhan University), Jian Peng (Wuhan University), Jing Yi (Wuhan University), Yaosi Hu (Wuhan University), Hongyi Liu (Wuhan University) and Zhenzhong Chen (Wuhan University).
Abstract
Recently, popularity prediction for user-generated content (UGC) has received substantial attention among researchers. As a particular form of UGC, micro-videos in real-world applications are usually accompanied with several contents, such as title, tags, and background music. Unlike movies, which are published officially, micro-videos, made and uploaded arbitrarily by online users, are personalized, and thus their quality cannot be guaranteed. For example, the textual modality can be irrelevant to the visual modality for the purpose of catching eyes, or even missing. Besides, whether a certain video comes into fashion after its release is also affected by many external uncertainties. Thus, the mapping from feature space to popularity space is essentially non-deterministic, and such randomness poses a great challenge for the popularity prediction of micro-videos. In light of this, we propose a multimodal variational encoder-decoder framework that can explicitly capture the randomness. Specifically, features of different modalities are stochastically embedded into hidden representations, which are then fused together by Bayesian reasoning such that information from all modalities is well utilized. Then, the learned hidden representation is fed into a recurrent neural network as a warm start to predict the popularity sequence of a certain micro-video. Experiments conducted on the real-world dataset we collected demonstrate the effectiveness of our proposed model in the micro-video popularity prediction task.
Society (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Pantelis Pipergias Analytis (University of Southern Denmark), Daniel Barkoczi (University of Southern Denmark), Philipp Lorenz-Spreen (Max Planck Institute for Human Development) and Stefan Herzog (Max Planck Institute for Human Development).
Abstract
The ability of people to influence the opinion of others on matters of taste varies greatly—both in the offline world and in recommender systems. What are the mechanisms underlying this striking inequality? We use the weighted k-nearest-neighbor algorithm to represent an array of social learning strategies and show—using network theory—how this gives rise to networks of social influence in six real-world domains of taste. By doing so, we show three novel results that apply both to offline advice taking and online recommender settings. First, influential individuals have mainstream tastes and high dispersion in their taste similarity with others. Second, the fewer people an individual or algorithm consults (i.e., the lower k) and the more sensitive an individual or algorithm is to how similar other people are, the smaller the group of people with substantial influence. Third, the influence networks that emerge are hierarchically organized. Our results shed new light on classic empirical findings in communication and network science and can help improve our understanding of social influence in the offline and online world.
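A minimal sketch of the setup described above: each user's taste vector is compared with everyone else's, predictions would be drawn from the k most similar users, and an influence network can be read off from how often each user appears in someone else's k-neighborhood. The toy rating matrix, cosine similarity and value of k are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np


def knn_influence(ratings, k=2):
    """k-nearest-neighbor neighborhoods over user-user cosine similarity.

    Returns (similarity matrix, influence counts), where influence[j] is
    the number of users whose k-neighborhood contains user j.
    """
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    sim = (ratings @ ratings.T) / (norms * norms.T)
    np.fill_diagonal(sim, -np.inf)           # a user is not their own neighbor
    influence = np.zeros(len(ratings), dtype=int)
    for u in range(len(ratings)):
        neighbors = np.argsort(sim[u])[-k:]  # indices of the k most similar users
        influence[neighbors] += 1
    return sim, influence


if __name__ == "__main__":
    # toy user x item rating matrix
    ratings = np.array([
        [5.0, 3.0, 4.0, 1.0],
        [4.0, 3.0, 5.0, 1.0],
        [1.0, 5.0, 2.0, 4.0],
        [2.0, 5.0, 1.0, 5.0],
        [4.0, 4.0, 4.0, 2.0],
    ])
    _, influence = knn_influence(ratings, k=2)
    print("times each user appears in someone's k-neighborhood:", influence)
```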
Gourab K Patro (Indian Institute of Technology Kharagpur), Arpita Biswas (Indian Institute of Science Bangalore), Niloy Ganguly (Indian Institute of Technology Kharagpur), Krishna P. Gummadi (MPI-SWS) and Abhijnan Chakraborty (Max Planck Institute for Software Systems).
Abstract
Major online platforms today (such as Amazon, Netflix, Spotify, LinkedIn, AirBnB) can be thought of as two-sided markets with producers and customers of goods and services. Traditionally, search and recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reveals that such customer-centric design of these services may lead to unfair distribution of exposure to the producers and adversely impact their well-being. As more and more people are depending on such platforms to earn a living, it is important to ensure fairness to both producers and customers. In this work, by mapping the problem of personalized recommendation to the problem of fair allocation of indivisible goods, we propose to provide fairness guarantees for both sides. More formally, our proposed FairRec algorithm guarantees at least Maxi-Min Share (MMS) exposure for majority of the producers, and Envy-Free upto One Good (EF1) fairness for all the customers. Extensive evaluations over multiple real-world datasets show the effectiveness of FairRec in ensuring two-sided fairness while incurring little loss in overall recommendation quality.Minje Choi (University of Michigan), Luca Maria Aiello (Nokia Bell Labs), Varga Krisztian (Nokia Bell Labs) and Daniele Quercia (Nokia Bell Labs).
Abstract
Decades of social science research identified ten fundamental dimensions that provide the conceptual building blocks to describe the nature of human relationships. Yet, it is not clear to what extent these concepts are expressed in everyday language and what role they have in shaping observable dynamics of social interactions. After annotating conversational text through crowdsourcing, we train NLP tools to detect the presence of these types of interaction from conversations, and apply them to 160M messages written by geo-referenced Reddit users, 290k emails from the Enron corpus and 300k lines of dialogue from movie scripts. We show that social dimensions can be predicted purely from conversations with an AUC up to 0.98, and the combination of the predicted dimensions suggests both the types of relationships people entertain (conflict vs. support) and the types of real-world communities (wealthy vs. deprived) they shape.Tiziano Piccardi (Ecole Polytechnique Fédérale de Lausanne), Miriam Redi (Wikimedia Foundation), Giovanni Colavizza (University of Amsterdam) and Robert West (Ecole Polytechnique Fédérale de Lausanne).
Abstract
Wikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway summary of secondary sources: according to Wikipedia's guidelines, most facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the very heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all clicks on links leading from English Wikipedia articles to cited references during one month, and conducted the first ever analysis of readers interaction with citations on Wikipedia. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.3% overall; 0.6% on desktop; 0.1% on mobile). A causal analysis of the factors associated with reference clicking reveals that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that references about life events (births, deaths, marriages, etc.) are particularly popular. Taken together, our findings open the door to a deeper understanding of Wikipedia's role in a global information economy where reliability is ever less certain, and source attribution ever more vital.Martin Pawelczyk (University of Tuebingen), Klaus Broelemann (Schufa AG) and Gjergji Kasneci (University of Tuebingen).
Abstract
Counterfactual explanations can be obtained by identifying the smallest change made to a feature vector to qualitatively influence a prediction in a positive way from a user’s viewpoint; for example, from ’loan rejected’ to ’awarded’ or from ’high risk of cardiovascular disease’ to ’low risk’. Previous approaches would not ensure that the produced counterfactuals be proximate (i.e., not local outliers) and connected to regions with substantial data density (i.e., close to correctly classified observations), two requirements known as counterfactual faithfulness. These requirements are fundamental when making suggestions to individuals that are indeed attainable. Our contribution is twofold. First, drawing ideas from the manifold learning literature, we develop a framework, called C-CHVAE, that generates faithful counterfactuals. Second, we suggest to complement the catalog of counterfactual quality measures [13] using a criterion to quantify the degree of difficulty for a certain counterfactual suggestion. Our real world experiments suggest that faithful counterfactuals come at the cost of higher degrees of difficulty.
Security (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Iskander Sanchez-Rola (University of Deusto, NortonLifeLock Research Group), Davide Balzarotti (EURECOM), Christopher Kruegel (UC Santa Barbara), Giovanni Vigna (UC Santa Barbara) and Igor Santos (University of Deusto).
Abstract
Web pages have evolved into very complex dynamic applications, which are often very opaque and difficult for non-experts to understand. At the same time, security researchers push for more transparent web applications, which can help users in taking important security-related decisions about which information to disclose, which link to visit, and which online service to trust.In this paper, we look at one of the most simple but also most representative aspects that captures the struggle between these opposite demands: a mouse click. In particular, we present the first comprehensive study of the possible security and privacy implications that clicks can have from a user perspective, analyzing the disconnect that exists between what is shown to users and what actually happens after. We started by identifying and classifying possible problems. We then implemented a crawler that performed nearly 2.5M clicks looking for signs of misbehavior. We analyzed all the interactions created as a result of those clicks, and discovered that the vast majority of domains are putting users at risk by either obscuring the real target of links or by not providing sufficient information for users to make an informed decision. We conclude the paper by proposing a set of countermeasures.Tobias Urban (Institute for Internet Security), Martin Degeling (Ruhr-University Bochum), Thorsten Holz (Ruhr-Universität Bochum) and Norbert Pohlmann (Institute for Internet Security).
Abstract
In the modern Web, service providers often heavily rely on third parties to run their services. For example, they use ad networks to finance their services, externally hosted libraries to quickly develop them, and analytical services to gain insights into users' behavior. This can lead to a situation where service providers do not know which third parties will be embedded, for example when these third parties request additional content, as is common in real-time ad auctions. In this paper, we present a large-scale measurement study that analyzes the magnitude of these new challenges. To better reflect the connectedness of third parties, we measured their relations in a model we call third party trees, which reflects the loading dependencies of all third parties embedded into a given website. Using this notion, we show that including a single third party can lead to the subsequent loading of several further parties. Our data shows that embedding a third party can lead to dependency branches of depth up to eight. Furthermore, our findings indicate that the services that are embedded on a page load are not always deterministic, and 93% of the analyzed websites embedded third parties that are located in regions that might not be in line with the current legal framework. An important finding of our study is that previous work that mostly focused on landing pages of websites only measured a lower bound, as subsites show a significant increase in privacy-invasive techniques. For example, our results show a 36% increase in the number of cookies used.
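A small sketch of the third party tree notion, assuming a simplified crawler log of (initiator, loaded domain) pairs: build the loading-dependency tree rooted at the first party and report its depth. The record format and domains are made up for illustration.

```python
from collections import defaultdict


def tree_depth(requests, root):
    """Compute the depth of the third-party loading tree.

    `requests` is a list of (initiator_domain, loaded_domain) pairs;
    `root` is the first-party domain of the page.
    """
    children = defaultdict(list)
    for initiator, loaded in requests:
        children[initiator].append(loaded)

    def depth(node, seen):
        if node in seen:                 # guard against cycles in messy logs
            return 0
        seen = seen | {node}
        return 1 + max((depth(c, seen) for c in children[node]), default=0)

    return depth(root, frozenset()) - 1  # count edges below the first party


if __name__ == "__main__":
    log = [
        ("example.com", "ads.adnetwork.com"),
        ("ads.adnetwork.com", "bidder-a.com"),
        ("bidder-a.com", "tracker-b.net"),
        ("example.com", "cdn.example.com"),
    ]
    print("embedding depth:", tree_depth(log, "example.com"))  # -> 3
```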
Benjamin Eriksson (Chalmers University of Technology) and Andrei Sabelfeld (Chalmers University of Technology).
Abstract
Undesired navigation in browsers powers a significant class of attacks on web applications. In a move to mitigate risks associated with undesired navigation, the security community has proposed a standard that gives control to web pages to restrict navigation. The standard draft introduces a new navigate-to directive of the Content Security Policy (CSP). The directive is currently being implemented by mainstream browsers. This paper is a first evaluation of navigate-to, focusing on security, performance, and automatization of navigation policies. We present new vulnerabilities introduced by the directive into the web ecosystem, opening up for at- tacks such as probing to detect if users are logged in to other websites or have active shopping carts, bypassing third- party cookie blocking, exfiltrating secrets, as well as leaking browsing history. Unfortunately, the directive triggers vulnerabilities even in websites that do not use the directive in their policies. We identify both specification- and implementation- level vulnerabilities and propose countermeasures to mitigate both. To aid developers in configuring navigation policies, we develop and implement AutoNav, an automated black-box mechanism to infer navigation policies. AutoNav leverages the benefits of origin-wide policies in order to improve security without degrading performance. We evaluate the viability of navigate-to and AutoNav by an empirical study on Alexa’s top 10,000 websites.Sadegh Farhang (The Pennsylvania State University), Mehmet Bahadir Kirdan (Technical University of Munich), Aron Laszka (University of Houston) and Jens Grossklags (Technical University of Munich).
Abstract
Mobile devices encroach on almost every activity of our lives including work and leisure, and contain a wealth of personal and sensitive information. It is, therefore, imperative that these devices uphold high security standards. A key aspect is the security of the underlying operating system platform. In particular, Android, the most dominant platform in this ecosystem with more than one billion active devices and its openness, which allows different vendors to adopt it, plays a critical role. Like other platforms, Android maintains security via monthly security patches and announces them via the Android security bulletin. To absorb this information successfully across the Android ecosystem, impeccable coordination by many different vendors is required.In this paper, we perform a comprehensive study of 3,174 Android related vulnerabilities and study to which degree they are reflected in the Android security bulletin, as well as in the security bulletins of leading vendors: Samsung, LG, and Huawei. In our analysis, we focus on the metadata of these security bulletins (e.g., timing, affected layers, severity, and CWE data) to better understand commonalities and differences among vendors. Some of our findings are: (i) the studied vendors in the Android ecosystem have adopted different structures for vulnerability reporting, (ii) vendors are less likely to react with delay for CVEs with Android Git repository references, (iii) vendors handle Qualcomm-related CVEs different from the rest of external layer CVEs.Weikang Bian (The Chinese University of Hong Kong), Wei Meng (The Chinese University of Hong Kong) and Mingxue Zhang (The Chinese University of Hong Kong).
Abstract
In-browser cryptojacking is an urgent threat to web users, where attackers abuse users' local computing resources without obtaining their consent. Many in-browser mining programs are developed in WebAssembly (Wasm) for its great performance. Several prior works have measured cryptojacking in the wild and proposed detection methods using static and dynamic features. However, there exists no good defense mechanism within the user's browser to stop the malicious drive-by mining behavior. Users still primarily depend on ad-blocking software that relies on community-maintained blacklists, which can be easily bypassed. In this work, we propose MineThrottle, a browser-based defense mechanism against Wasm cryptojacking that leverages block-level semantic information of a program. We show that cryptocurrency-mining Wasm programs exhibit very different block-level semantic information from other Wasm programs (e.g., games). In particular, the majority of the computation workload is spent in a small number of basic blocks, and the instructions used are significantly different from those in the other basic blocks. MineThrottle instruments Wasm code on the fly to label mining-related code blocks and detect mining behavior using block-level program profiling. It then throttles drive-by mining behavior based on a user-configurable policy. Our evaluation of MineThrottle with the Alexa top 1M websites demonstrates that it can accurately detect and mitigate in-browser cryptojacking with both a low false positive rate and a low false negative rate.
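To illustrate what block-level profiling of this kind might look like, the sketch below flags basic blocks whose instruction mix is dominated by integer/bit operations (typical of hashing loops) and that account for most of the observed executions. The profile format, instruction names and thresholds are illustrative assumptions, not MineThrottle's actual detector.

```python
# Each profiled block: executions observed at run time and a static
# histogram of its Wasm-like instructions (names are illustrative).
PROFILE = {
    "block_0": {"execs": 120, "ops": {"call": 3, "i32.load": 4, "i32.add": 2}},
    "block_7": {"execs": 9_800_000, "ops": {"i32.xor": 14, "i32.rotl": 10,
                                            "i32.add": 12, "i32.load": 2}},
    "block_9": {"execs": 300, "ops": {"f64.mul": 5, "f64.add": 4}},
}

ARITHMETIC = {"i32.add", "i32.xor", "i32.rotl", "i32.and", "i32.shl"}


def mining_suspects(profile, op_ratio=0.7, exec_share=0.9):
    """Return blocks that look like mining hot loops.

    A block is flagged when (a) integer/bit ops dominate its instruction mix
    and (b) it accounts for a large share of all observed block executions.
    """
    total_execs = sum(b["execs"] for b in profile.values())
    suspects = []
    for name, block in profile.items():
        ops = block["ops"]
        n_ops = sum(ops.values())
        arith = sum(c for op, c in ops.items() if op in ARITHMETIC)
        if n_ops and arith / n_ops >= op_ratio and block["execs"] / total_execs >= exec_share:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    print("suspected mining blocks:", mining_suspects(PROFILE))  # -> ['block_7']
```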
Search (2)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Kaitao Zhang (Tsinghua University), Chenyan Xiong (Carnegie Mellon University; Microsoft), Zhenghao Liu (Tsinghua University) and Zhiyuan Liu (Tsinghua University).
Abstract
This paper democratizes neural information retrieval to scenarios where large scale relevance training signals are not available. We revisit the classic IR intuition that anchor-document relation approximates query-document relevance and propose a reinforcement weak supervision selection method, ReInfoSelect, which learns to select anchor-document pairs that best train neural ranking models, guided by only a handful of human relevance labels. ReInfoSelect uses the NDCG on the target relevance benchmark as the reward and learns to classify whether each anchor-document pair should be used as a training signal (action). It iterates through anchor-document pairs and converges when the neural ranker's performance peaks on target relevance benchmarks. Our experiments on ClueWeb09-B and Robust04 demonstrate the necessity and effectiveness of ReInfoSelect in leveraging anchor data as weak supervision. On these TREC benchmarks, the neural rankers trained with our ReInfoSelect significantly outperform feature-based learning to rank and match the training effectiveness of Bing User Clicks, while ReInfoSelect only uses publicly available anchor data. Our human evaluation confirms that ReInfoSelect effectively leverages the reward from neural rankers to select anchors that are more similar to search queries and linked documents that are more relevant to the anchor.Zhijing Wu (Tsinghua University), Jiaxin Mao (Tsinghua University), Yiqun Liu (Tsinghua University), Jingtao Zhan (Tsinghua University), Yukun Zheng (Tsinghua University), Min Zhang (Tsinghua University) and Shaoping Ma (Tsinghua University).
Abstract
Document ranking is one of the most studied but challenging problems in information retrieval (IR) research. A number of existing document ranking models capture relevance signals at the whole-document level. Recently, more and more studies have begun to address this problem through fine-grained document modeling. Several works leveraged fine-grained passage-level relevance signals in ranking models. However, most of these works focus on context-independent passage-level relevance signals and ignore the context information, which may lead to inaccurate estimation of passage-level relevance. In this paper, we investigate how information gain accumulates with passages when users sequentially read a document. We propose the context-aware Passage-level Cumulative Gain (PCG), which aggregates relevance scores of passages and avoids the need to formally split a document into independent passages. Next, we incorporate the patterns of PCG into a BERT-based sequential model called the Passage-level Cumulative Gain Model (PCGM) to predict the PCG sequence. Finally, we apply PCGM to the document ranking task. Experimental results on two public ad hoc retrieval benchmark datasets show that PCGM outperforms most existing ranking models and also indicate the effectiveness of PCG signals. We believe that this work contributes to improving ranking performance and providing more explainability for document ranking.
Jianghong Zhou (Emory University) and Eugene Agichtein (Emory University).
Abstract
To support complex search tasks, where the initial information requirements are complex or may change during the search, a search engine must adapt the information delivery as the user’s information requirements evolve. To support this dynamic ranking paradigm effectively, search result ranking must incorporate both the user feedback received and the information displayed so far. To address this problem, we introduce a novel reinforcement learning-based approach, RLIRank. We first build an adapted reinforcement learning framework to integrate the key components of the dynamic search. Then, we implement a new Learning to Rank (LTR) model for each iteration of the dynamic search, using a recurrent Long Short-Term Memory (LSTM) neural network, which estimates the gain for each next result, learning from each previously ranked document. To incorporate the user’s feedback, we develop a word-embedding variation of the classic Rocchio Algorithm, to help guide the ranking towards the high-value documents. These innovations enable RLIRank to outperform the previously reported methods from the 2017 TREC Dynamic Domain Track and to exceed all the methods in the 2016 TREC Dynamic Domain Track after multiple search iterations, advancing the state of the art for dynamic search.
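For reference, the feedback step builds on the classic Rocchio update; a minimal embedding-space version is sketched below with toy vectors and standard default weights. This is a generic illustration, not the exact variant used in RLIRank.

```python
import numpy as np


def rocchio_update(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio relevance feedback applied to dense (embedding) vectors.

    Moves the query representation toward the centroid of relevant documents
    and away from the centroid of non-relevant ones.
    """
    new_q = alpha * query_vec
    if len(relevant):
        new_q = new_q + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        new_q = new_q - gamma * np.mean(non_relevant, axis=0)
    return new_q


if __name__ == "__main__":
    q = np.array([0.1, 0.3, 0.0])                        # toy query embedding
    rel = np.array([[0.2, 0.4, 0.1], [0.3, 0.5, 0.0]])   # embeddings of relevant docs
    non = np.array([[0.9, 0.0, 0.8]])                    # embeddings of non-relevant docs
    print("updated query:", rocchio_update(q, rel, non))
```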
Ruilin Li (Georgia Institute of Technology), Zhen Qin (Google), Xuanhui Wang (Google), Suming J. Chen (Google) and Donald Metzler (Google).
Abstract
Neural search ranking models, which have been actively studied in the information retrieval community, have also been widely adopted in real-world industrial applications. However, due to the high non-convexity and stochastic nature of neural model formulations, the obtained models are unstable in the sense that model predictions can significantly vary across two models trained with the same configuration. In practice, new features are continuously introduced and new model architectures are explored to improve model effectiveness. In these cases, the instability of neural models leads to unnecessary document ranking changes for a large fraction of queries. Such changes lead to an inconsistent user experience and also add noise to online experiment results, thus slowing down the model development life-cycle. How to stabilize neural search ranking models during model updates is an important but largely unexplored problem. Motivated by trigger analysis, we suggest balancing the trade-off between performance improvements and the number of affected queries. We formulate this as an optimization problem where the objective is to maximize the average effect over the affected queries. We propose two heuristics and one theory-guided method to solve the optimization problem. Our proposed methods are evaluated on two of the world's largest personal search services: Gmail search and Google Drive search. Empirical results show that our proposed methods are highly effective in optimizing the proposed objective and are applicable to different model update scenarios.
Mobile (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Arvind Narayanan (University of Minnesota), Eman Ramadan (University of Minnesota), Jason Carpenter (University of Minnesota), Qingxu Liu (University of Minnesota), Yu Liu (University of Minnesota), Feng Qian (University of Minnesota) and Zhi-Li Zhang (University of Minnesota).
Abstract
We conduct to our knowledge a first measurement study of commercial mmWave 5G performance on smartphones by closely examining 5G networks of three carriers (two mmWave carriers, one mid-band 5G carrier) in three U.S. cities. We conduct extensive field tests on 5G performance in diverse urban environments. We systematically analyze the handoff mechanisms in 5G and their impact on network performance, and explore the feasibility of using location and possibly other environmental information to predict the network performance. We also study the app performance (web browsing, HTTP download, and volumetric video streaming) over 5G. Our study consumes more than 15 TB of data. Conducted when 5G just made its debut, it provides a "baseline" for studying how 5G performance evolves, and identifies key research directions on improving 5G users' experience in a cross-layer manner.Tianming Liu (Beijing University of Posts and Telecommunications), Haoyu Wang (Beijing University of Posts and Telecommunications), Li Li (Monash University), Xiapu Luo (The Hong Kong Polytechnic University), Feng Dong (Beijing University of Posts and Telecommunications), Yao Guo (Peking University), Liu Wang (Beijing University of Posts and Telecommunications), Tegawendé F. Bissyandé (SnT, University of Luxembourg) and Jacques Klein (University of Luxembourg).
Abstract
Advertisement drives the economy of the mobile app ecosystem. As a key component in the mobile ad business model, mobile ad content has been overlooked by the research community, which poses a number of threats, e.g., propagating malware and undesirable contents. To understand the practice of these devious ad behaviors, we perform a large-scale study on the app contents harvested through automated app testing. In this work, we first provide a comprehensive categorization of devious ad contents, including five kinds of behaviors belonging to two categories: ad loading content and ad clicking content. Then, we propose MadDroid, a framework for automated detection of devious ad contents. MadDroid leverages an automated app testing framework with a sophisticated ad view exploration strategy for effectively collecting ad-related network traffic and subsequently extracting ad contents. We then integrate dedicated approaches into the framework to identify devious ad contents. We have applied MadDroid to 40,000 Android apps and found that roughly 6\% of apps deliver devious ad contents, e.g., distributing malicious apps that cannot be downloaded via traditional app markets. Experiment results indicate that devious ad contents are prevalent, suggesting that our community should invest more effort into the detection and mitigation of devious ads towards building a trustworthy mobile advertising ecosystem.Yangyu Hu (BUPT), Haoyu Wang (Beijing University of Posts and Telecommunications), Ren He (Beijing University of Posts and Telecommunications), Li Li (Monash University), Gareth Tyson (Queen Mary University of London), Ignacio Castro (Queen Mary University of London), Yao Guo (Peking University), Lei Wu (Zhejiang University) and Guoai Xu (Beijing University of Posts and Telecommunications).
Abstract
Domain squatting, the adversarial tactic where attackers register domain names that mimic popular ones, has been observed for decades. However, there has been growing anecdotal evidence that this style of attack has spread to other domains. In this paper, we explore the presence of squatting attacks in the mobile app ecosystem. In "App Squatting", attackers release apps with identifiers (e.g., app, package or developer name) that are confusingly similar to those of popular apps or well-known Internet brands. This paper presents the first in-depth measurement study of app squatting to show its prevalence and implications. We first identify 11 common deformation approaches of app squatters and propose "AppCrazy", a tool for automatically generating variations of app identifiers. We have applied AppCrazy to the top 500 most popular apps in Google Play, generating 224,322 deformation keywords which we then use to test for app squatters on popular markets. Through this, we confirm the scale of the problem, identifying 10,553 squatting apps (an average of over 20 squatting apps for each legitimate one). Our investigation reveals that more than 51% of the squatting apps are malicious, with some being extremely popular (up to 10 million downloads). Meanwhile, we also find that app markets have not been successful in identifying and eliminating squatting apps. Our findings demonstrate the urgency of identifying and preventing app squatting abuses. To this end, we have publicly released all the identified squatting apps, as well as our tool AppCrazy.
Web Mining-B (3)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Yuanxing Liu (Harbin Institute of Technology), Zhaochun Ren (Shandong University), Wei-Nan Zhang (Harbin Institute of Technology), Wanxiang Che (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology) and Dawei Yin (JD.com).
Abstract
By exploring fine-grained user behaviors, session-based recommendation predicts a user's next action from short-term behavior sessions. Most previous work learns about a user's implicit behavior by merely taking the last click action as the supervision signal. However, in e-commerce scenarios, large-scale products with elusive click behaviors make this task challenging because of the low inclusiveness problem, i.e., many relevant products that satisfy the user's shopping intention are neglected by recommenders. Since similar products with different IDs may share the same intention, we argue that the textual information (e.g., keywords of product titles) from sessions can be used as an additional supervision signal to tackle the above problem by learning more of the shared intention within similar products. Therefore, to improve the performance of e-commerce session-based recommendation, we explicitly infer the user's intention by generating keywords entirely from the click sequence in the current session. In this paper, we propose the e-commerce session-based recommendation model with keywords generation (abbreviated as ESRM-KG) to integrate keywords generation into e-commerce session-based recommendation. Specifically, the ESRM-KG model first encodes an input action sequence into a high-dimensional representation; then it presents a bi-linear decoding scheme to predict the next action in the current session; synchronously, it uses the high-dimensional representation from its encoder to generate explainable keywords for the whole session. We carried out extensive experiments in the context of click prediction on a large-scale real-world e-commerce dataset. Our experimental results show that the ESRM-KG model outperforms state-of-the-art baselines with the help of keywords generation. We also discuss how keywords generation helps e-commerce session-based recommendation with case studies and error analysis.
Defu Lian (University of Science and Technology of China), Haoyu Wang (University at Buffalo), Zheng Liu (MSRA), Jianxun Lian (MSRA), Enhong Chen (University of Science and Technology of China) and Xing Xie (MSRA).
Abstract
Deep recommender systems have achieved remarkable improvements in recent years. Despite their superior ranking precision, running efficiency and memory consumption turn out to be severe bottlenecks in reality. To overcome both limitations, we propose LightRec, a lightweight recommender system which enjoys fast online inference and economical memory consumption. The backbone of LightRec is a total of $B$ codebooks, each of which is composed of $W$ latent vectors, known as codewords. On top of such a structure, LightRec represents an item as an additive composition of $B$ codewords, which are optimally selected from each of the codebooks. To effectively learn the codebooks from data, we devise an end-to-end learning workflow, where challenges on the inherent differentiability and diversity are conquered by the proposed techniques. In addition, to further improve the representation quality, several distillation strategies are employed, which better preserve user-item relevance scores and relative ranking orders. LightRec is extensively evaluated with four real-world datasets, which gives rise to two empirical findings: 1) compared with state-of-the-art lightweight baselines, LightRec achieves over 11% relative improvement in terms of recall performance; 2) compared to conventional recommendation algorithms, LightRec merely incurs negligible accuracy degradation while leading to a more than 27x speedup in top-k recommendation.
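The additive codebook idea can be sketched in a few lines: each item stores only $B$ discrete codes, and its embedding is recomposed as the sum of the selected codewords. The sizes below are toy values, and the code assignment is random here, whereas LightRec learns the selection end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

B, W, d = 4, 256, 64                                 # codebooks, codewords per book, dim (toy)
n_items = 100_000

codebooks = rng.normal(size=(B, W, d))               # shared parameters
item_codes = rng.integers(0, W, size=(n_items, B))   # per-item discrete codes


def item_embedding(item_id):
    """Recompose an item vector as the sum of its selected codewords."""
    codes = item_codes[item_id]
    return codebooks[np.arange(B), codes].sum(axis=0)


if __name__ == "__main__":
    dense = n_items * d                     # floats needed for a dense embedding table
    compact = B * W * d + n_items * B       # codebook floats + integer codes
    print("dense table entries:   ", dense)
    print("compact entries (~):   ", compact)
    print("one item vector shape: ", item_embedding(42).shape)
```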
Ye Yuan (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China), Xin Luo (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China), Mingsheng Shang (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China) and Di Wu (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China).
Abstract
Recommender systems (RSs) commonly describe their user-item preferences with a high-dimensional and sparse (HiDS) matrix filled with non-negative data. A non-negative latent factor (NLF) model relying on a single latent factor-dependent, non-negative and multiplicative update (SLF-NMU) algorithm is frequently adopted to process such an HiDS matrix. However, an NLF model mostly adopts Euclidean distance for its objective function, which is naturally a special case of α-β-divergence. Moreover, it frequently suffers from slow convergence. To address these issues, this study proposes a generalized and fast-converging non-negative latent factor (GFNLF) model. Its main idea is two-fold: a) adopting α-β-divergence for its objective function, thereby enhancing its representation ability for HiDS data; b) deducing its momentum-incorporated non-negative multiplicative update (MNMU) algorithm, thereby achieving fast convergence. Empirical studies on two HiDS matrices emerging from real RSs demonstrate that, with carefully-tuned hyperparameters, a GFNLF model outperforms state-of-the-art models in both computational efficiency and prediction accuracy for missing data of an HiDS matrix.
Chong Chen (Tsinghua University), Min Zhang (Tsinghua University), Weizhi Ma (Tsinghua University), Yiqun Liu (Tsinghua University) and Shaoping Ma (Tsinghua University).
Abstract
To provide more accurate recommendations, it is important to go beyond modeling user-item interactions and take context information into account. Factorization Machines (FM) with negative sampling are a popular solution for context-aware recommendation. However, this can be insufficient, as sampling is not robust and usually leads to non-optimal performance in practice. While several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions, they either focus on the rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for ranking tasks. In this paper, we propose to learn FM without sampling for ranking tasks, which is particularly intended for context-aware recommendation. Despite its soundness, such a non-sampling strategy poses a strong efficiency challenge in learning the model. To address this, we design a new ideal framework named Efficient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging efficiency issue via novel designs of memorization strategies. Through extensive experiments on three real-world public datasets, we show that 1) the proposed ENSFM consistently and significantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves significant advantages in training efficiency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for the Top-K recommendation task.
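For reference, the underlying FM prediction can be computed with the standard pairwise-interaction identity, shown below on a toy feature vector; this is the generic FM score, not the paper's non-sampling learning procedure.

```python
import numpy as np


def fm_score(x, w0, w, V):
    """Factorization Machine prediction.

    Uses the identity sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ],
    which costs O(d * k) instead of O(d^2 * k).
    """
    linear = w0 + w @ x
    xv = V.T @ x                      # shape (k,)
    x2v2 = (V ** 2).T @ (x ** 2)      # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, k = 10, 4                      # toy feature and factor dimensions
    x = rng.random(d)                 # e.g. user/item/context feature vector
    w0, w = 0.1, rng.normal(size=d)
    V = rng.normal(size=(d, k))
    print("FM score:", fm_score(x, w0, w, V))
```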
Semantics (2)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Ermei Cao (Nanjing University), Difeng Wang (Nanjing University), Jiacheng Huang (Nanjing University) and Wei Hu (Nanjing University).
Abstract
Knowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts of long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly on completing missing links or filling missing values. However, they only tackle a part of the enrichment problem and lack specific considerations about long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on the synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the proposed approach.Jiaoyan Chen (University of Oxford), Xi Chen (Jarvis Lab Tencent), Ian Horrocks (University of Oxford), Ernesto Jimenez-Ruiz (City, University of London; University of Oslo) and Erik B. Myklebust (Norwegian Institute for Water Research; University of Oslo).
Abstract
The usefulness and usability of knowledge bases (KBs) is often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB.Niklas Kolbe (University of Luxembourg), Pierre-Yves Vandenbussche (Elsevier), Sylvain Kubler (Université de Lorraine) and Yves Le Traon (University of Luxembourg).
Abstract
Ontology search and ranking are key building blocks to establish and reuse shared conceptualisations of domain knowledge on the Web. However, the effectiveness of proposed ontology ranking models is difficult to compare since these are often evaluated on diverse datasets that are limited by their static nature and scale. In this paper, we first introduce the LOVBench dataset as a benchmark for ontology term ranking. With inferred relevance judgments for more than 7000 queries, LOVBench is large enough to perform a comparison study using learning to rank (LTR) with complex ontology ranking models. Instead of relying on relevance judgments from a few experts, we consider implicit feedback from many actual users collected from the Linked Open Vocabularies (LOV) platform. Our approach further enables continuous updates of the benchmark, capturing the evolution of ontologies' relevance in an ever-changing data community. Second, we compare the performance of several feature configurations from the literature using LOVBench in LTR settings and discuss the results in the context of the observed real-world user behaviour. Our experimental results show that feature configurations which are (i) well-suited to the user behaviour, (ii) cover all features types, and (iii) consider decomposition of features can significantly improve the ranking performance.Tom Harting (Delft University of Technology), Sepideh Mesbah (Delft University of Technology) and Christoph Lofi (Delft University of Technology).
Abstract
We introduce a Language-consistent multi-lingual Open Relation Extraction Model (LOREM) for finding relation tuples of any type between entities in unstructured texts. LOREM does not rely on language-specific knowledge or external NLP tools such as translators or PoS-taggers, and exploits information and structures that are consistent over different languages. This allows our model to be easily extended with only limited training efforts to new languages, but also provides a boost to performance for a given single language. An extensive evaluation performed on 5 languages shows that LOREM outperforms state-of-the-art mono-lingual and cross-lingual open relation extractors. Moreover, experiments on languages with no or only little training data indicate that LOREM generalizes to other languages than the languages that it is trained on.
Social Network-B (1)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
David García-Soriano (ISI Foundation), Konstantin Kutzkov (IT University of Copenhagen), Francesco Bonchi (Fondazione ISI) and Charalampos Tsourakakis (Harvard University).
Abstract
Correlation clustering is arguably the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities. This is often infeasible to compute or even just to store. In this paper we study query-efficient algorithms for correlation clustering. Specifically, we devise a correlation clustering algorithm that, given a budget of $Q$ queries, attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O(n^3/Q)$, where $\mathrm{OPT}$ is the optimal cost of the instance. Its running time is $O(Q)$, and it can easily be made non-adaptive with the same guarantees. Up to constant factors, our algorithm yields a provably optimal trade-off between the number of queries $Q$ and the worst-case error attained, even for adaptive algorithms. Finally, we perform an experimental study of our proposed method on both synthetic and real data, showing the scalability and the accuracy of our algorithm.
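A rough sketch of a query-bounded, pivot-style clustering routine in the spirit of the above: pick a random pivot, query it against the remaining objects, cluster the positives, and leave everything else as singletons once the budget runs out. The oracle interface and budget handling are simplifications for illustration, not the paper's exact algorithm or its guarantee.

```python
import random


def budgeted_pivot_clustering(items, same_cluster, budget, seed=0):
    """Pivot-based clustering under a pairwise-query budget.

    `same_cluster(u, v)` is the (possibly expensive) similarity oracle.
    Remaining unclustered items become singletons when the budget is spent.
    """
    rng = random.Random(seed)
    remaining = list(items)
    clusters = []
    while remaining and budget >= len(remaining) - 1:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster, rest = [pivot], []
        for v in remaining:
            budget -= 1
            (cluster if same_cluster(pivot, v) else rest).append(v)
        clusters.append(cluster)
        remaining = rest
    clusters.extend([v] for v in remaining)   # budget exhausted: singletons
    return clusters


if __name__ == "__main__":
    truth = {x: x % 3 for x in range(12)}     # hidden ground-truth clusters

    def oracle(u, v):                         # noiseless toy oracle
        return truth[u] == truth[v]

    print(budgeted_pivot_clustering(range(12), oracle, budget=40))
```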
Deyu Bo (Beijing University of Posts and Telecommunications), Xiao Wang (Beijing University of Posts and Telecommunications), Chuan Shi (Beijing University of Posts and Telecommunications), Meiqi Zhu (Beijing University of Posts and Telecommunications), Emiao Lu (Tencent Ltd) and Peng Cui (Tsinghua University).
Abstract
Clustering is a fundamental task in data analysis. Recently, deep clustering, which derives inspiration primarily from deep learning approaches, has achieved state-of-the-art performance and attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoders, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract useful representations from the data itself, rather than from the structure of the data, which receives scarce attention in representation learning. Motivated by the great success of the Graph Convolutional Network (GCN) in encoding graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by the autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of the data, from low-order to high-order, are naturally combined with the multiple representations learned by the autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint, and the autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our proposed model performs consistently better than state-of-the-art techniques.
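The delivery operator can be pictured as mixing the autoencoder's layer-wise representation into the corresponding GCN layer before graph propagation. The sketch below shows one such layer on a toy graph; the mixing weight, normalization and dimensions are illustrative assumptions, not the exact SDCN formulation.

```python
import numpy as np


def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer_with_delivery(A_norm, Z_prev, H_ae, W, eps=0.5):
    """One GCN layer that receives the autoencoder representation H_ae.

    The current GCN input is mixed with the autoencoder's layer output
    before graph propagation (the 'delivery' of representations).
    """
    mixed = (1.0 - eps) * Z_prev + eps * H_ae
    return np.maximum(A_norm @ mixed @ W, 0.0)   # ReLU activation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)    # toy graph
    X = rng.random((4, 5))                        # node attributes (GCN input)
    H_ae = rng.random((4, 5))                     # autoencoder layer output (same width here)
    W = rng.normal(size=(5, 3))                   # layer weights
    Z = gcn_layer_with_delivery(normalize_adj(A), X, H_ae, W)
    print("layer output shape:", Z.shape)         # -> (4, 3)
```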
Yu Chen (University of Pennsylvania), Sampath Kannan (University of Pennsylvania) and Sanjeev Khanna (University of Pennsylvania).
Abstract
Suppose a graph $G$ is stochastically created by uniformly sampling vertices along a line segment and connecting each pair of vertices with a probability that is a known decreasing function of their distance. We ask if it is possible to reconstruct the actual positions of the vertices in $G$ by only observing the generated unlabeled graph. We study this question for two natural edge probability functions --- one where the probability of an edge decays exponentially with the distance and another where this probability decays only linearly. We initiate our study with the weaker goal of recovering only the order in which vertices appear on the line segment. For a segment of length $n$ and a precision parameter $\delta$, we show that for both exponential and linear decay edge probability functions, there is an efficient algorithm that correctly recovers (up to reflection symmetry) the order of all vertices that are at least $\delta$ apart, using only $O(n/\delta^2)$ samples (vertices). Building on this result, we then show that $O(n^2 \log n/\delta^2)$ vertices (samples) are sufficient to additionally recover the location of each vertex on the line to within a precision of $\delta$. We complement this result with an $\Omega(n^{1.5}/\delta)$ lower bound on samples needed for reconstructing positions (even by a computationally unbounded algorithm), showing that the task of recovering positions is information-theoretically harder than recovering the order. We give experimental results showing that our algorithm recovers the positions of almost all points with great accuracy.
User Modeling-B (1)
(UTC/GMT +8) 10:30-12:30, April, 23, Thursday
Meeting rooms are not available now
Tong Li (The Hong Kong University of Science and Technology; University of Helsinki), Mingyang Zhang (The Hong Kong University of Science and Technology), Hancheng Cao (Stanford University), Yong Li (Tsinghua University), Sasu Tarkoma (University of Helsinki) and Pan Hui (The Hong Kong University of Science and Technology; University of Helsinki).
Abstract
The prevalence of smartphones has promoted the popularity of mobile apps in recent years. Although lots of efforts have been made to understand mobile app usage, existing studies are based primarily on short-term datasets with limited time spans, e.g., a few months. As a result, many fundamental facts on the long-term evolution of mobile app usage are yet unknown. In this paper, we aim to gain insight into the way how mobile app usage evolves across a long-term period. We first introduce an app usage collection platform named Carat, from which we gathered detailed app usage records of 1,465 mobile users over six years from 2012 to 2017 around the globe. We then conduct the first study on the long-term evolution processes from both macro-level, i.e., app-category usage, and micro-level, i.e., exact app usage. We discover that, in both levels, there is a growth stage triggered by the development of technologies. Also, there is a plateau stage in both levels caused by high correlations across app categories and the Pareto effect of app usage, respectively. Additionally, the evolution of exact app usage undergoes an elimination stage since the fierce intra-competition. Nevertheless, the diversity of app-category usage and app usage exhibits opposite trends: the diversity of app-category usage declines, while app usage diversifies. Our study provides useful implications for app developers, market intermediaries, and service providers.Qiwei Zhong (Alibaba Group, Hangzhou China), Yang Liu (Institute of Computing Technology, Chinese Academy of Sciences), Xiang Ao (Institute of Computing Technology, Chinese Academy of Sciences), Binbin Hu (Ant Financial Services Group, Hangzhou China), Jinghua Feng (Alibaba Group, Hangzhou China), Jiayu Tang (Alibaba Group, Hangzhou China) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
Abstract
Default user detection is one of the backbones of credit risk forecasting and management. It aims at, given a set of corresponding features, e.g., patterns extracted from trading behaviors, predicting the polarity indicating whether a user will fail to make required payments in the future. Recent efforts attempted to incorporate attributed heterogeneous information networks (AHINs) for extracting complex interactive features of users and achieved remarkable success in discovering specific default users such as fraud or cash-out users. In this paper, we consider default users, a more general concept in credit risk, and propose a multi-view attributed heterogeneous information network based approach, coined MAHINDER, to address the associated challenges. First, multiple views of user behaviors are adopted to learn personal profiles due to the endogenous aspect of financial default. Second, local behavioral patterns are specifically modeled since financial default is adversarial and accumulated. On real datasets containing 1.38 million users on the Alibaba platform, we investigate the effectiveness of MAHINDER; the experimental results show that the proposed approach improves AUC by over 2.8% and Recall@Precision=0.1 by over 13.1% compared with state-of-the-art methods. Meanwhile, MAHINDER has as good interpretability as tree-based methods like GBDT, which facilitates its deployment on online platforms.
Ang Li (University of Pittsburgh), Alice Wang (Spotify), Zahra Nazari (Spotify), Praveen Chandar (Spotify) and Benjamin Carterette (Spotify).
Abstract
Over the past decade, podcasts have been one of the fastest growing online streaming media. Many online audio streaming platforms, such as Pandora and Spotify, that traditionally focused on music content have started to incorporate services related to podcasts. Although incorporating new media types such as podcasts has created tremendous opportunities for these streaming platforms to expand their content offering, it also introduces new challenges. Since the functional use of podcasts and music may largely overlap for many people, the two types of content may compete with one another for the finite amount of time that users allocate to audio streaming. As a result, incorporating podcast listening may influence and change the way users have originally consumed music. Adopting quasi-experimental techniques, the current study assesses the causal influence of adding a new class of content on user listening behavior, using large-scale observational data collected from a widely used audio streaming platform. Our results demonstrate that podcast and music consumption compete slightly but do not replace one another -- users open another time window to listen to podcasts. In addition, users who have added podcasts to their music listening demonstrate significantly different consumption habits for podcasts versus music in terms of streaming time, duration, and frequency. Taking all these differences as input features to a machine learning model, we demonstrate that a podcast listening session is predictable at the start of a new listening session. Our study provides a novel contribution for online audio streaming and consumption services to understand their potential consumers and to best support their current users with an improved recommendation system.
Ashton Anderson (University of Toronto), Lucas Maystre (Spotify, Inc.), Ian Anderson (Spotify, Inc.), Rishabh Mehrotra (Spotify, Inc.) and Mounia Lalmas (Spotify, Inc.).
Abstract
On many online platforms, users can engage with millions of pieces of content, which they discover either by searching on their own or through algorithmically generated recommendations. The user experience, in turn, is largely shaped by the content that users interact with. In this work, we study the user experience on Spotify, a popular music streaming service, through the lens of diversity, i.e., how coherent the set of items a user consumes is, and find that it is a fundamental attribute of the user experience. We construct high-fidelity embeddings of millions of songs based on listening behavior on Spotify and use these embeddings to quantify how musically diverse every user is. We find that musical diversity is strongly associated with important metrics, such as user conversion and retention. On the other hand, we find that algorithmically driven listening through recommendations pushes users towards being less musically diverse. Furthermore, we study users who become more diverse in their consumption over time and find that they do so by reducing their algorithmic consumption and increasing their organic consumption. Finally, we deploy a randomized experiment to further shed light on the relationship between recommendation and musical diversity. Our work illuminates a central tension in online platforms: how to recommend content that users are likely to enjoy in the short term while simultaneously ensuring they can remain diverse in their consumption in the long term.
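For readers who want a concrete sense of how such an embedding-based diversity measure can be computed, the sketch below scores a user's listening history as the average pairwise cosine distance between song embeddings. This is a minimal illustration, not the exact metric used in the paper; the listening_diversity function and the toy data are hypothetical.

```python
import numpy as np

def listening_diversity(song_embeddings: np.ndarray) -> float:
    """Average pairwise cosine distance over the songs a user streamed.

    song_embeddings: (n_songs, dim) array of embedding vectors.
    Higher values indicate a more musically diverse listening history.
    """
    # Normalize each embedding so dot products become cosine similarities.
    norms = np.linalg.norm(song_embeddings, axis=1, keepdims=True)
    unit = song_embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T                       # pairwise cosine similarities
    n = len(unit)
    off_diag = sims[~np.eye(n, dtype=bool)]    # drop self-similarities
    return float(np.mean(1.0 - off_diag))      # cosine distance = 1 - similarity

# Toy usage: a "specialist" history clustered around one style vs. a varied one.
rng = np.random.default_rng(0)
focused = rng.normal(size=(20, 16)) * 0.05 + rng.normal(size=(1, 16))
varied = rng.normal(size=(20, 16))
print(listening_diversity(focused), listening_diversity(varied))
```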
Research Tracks (5)
Web Mining-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Wenxuan Zhou (University of Southern California), Hongtao Lin (Pinterest Inc.), Bill Yuchen Lin (University of Southern California), Ziqi Wang (Tsinghua University), Junyi Du (University of Southern California), Leonardo Neves (Snapchat Inc.) and Xiang Ren (University of Southern California).
Abstract
Deep neural models for relation extraction tend to be less reliable when perfectly labeled data is limited, despite their success in label-sufficient scenarios. Instead of seeking more instance-level labels from human annotators, here we propose to annotate frequent surface patterns to form labeling rules. These rules can be automatically mined from large text corpora and generalized via a soft rule matching mechanism. Prior works use labeling rules in an exact matching fashion, which inherently limits the coverage of sentence matching and results in a low-recall issue. In this paper, we present a neural approach to ground rules for relation extraction, named NERO, which jointly learns a relation extraction module and a soft matching module. One can employ any neural relation extraction model as the instantiation of the RE module. The soft matching module learns to match rules with semantically similar sentences, such that raw corpora can be automatically labeled and leveraged by the RE module (with much better coverage) as augmented supervision, in addition to the exactly matched sentences. Extensive experiments and analysis on two public and widely used datasets demonstrate the effectiveness of the proposed NERO framework compared with both rule-based and semi-supervised methods. Through user studies, we find that the time efficiency for a human to annotate rules and sentences is similar (0.30 vs. 0.35 min per label). In particular, NERO's performance using 270 rules is comparable to models trained using 3,000 labeled sentences, yielding a 9.5x speedup. Moreover, NERO can predict for unseen relations at test time and provide interpretable predictions. We will release our code to the community for future research in this direction.
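The soft rule matching idea can be illustrated with a minimal sketch: score a labeling rule against an unlabeled sentence by the cosine similarity of their encodings, and treat high-scoring sentences as weakly labeled training data. The toy averaged-random-vector encoder, the example rule, and the 0.6 threshold below are assumptions for illustration, not the learned matcher described in the abstract.

```python
import numpy as np

def embed(text, dim=8):
    """Toy sentence encoder: average of deterministic per-token random vectors
    (a stand-in for the learned encoder inside the soft matching module)."""
    token_vecs = [np.random.default_rng(sum(map(ord, tok))).normal(size=dim)
                  for tok in text.lower().split()]
    return np.mean(token_vecs, axis=0)

def soft_match(rule, sentence):
    """Cosine similarity between a labeling rule and a candidate sentence."""
    r, s = embed(rule), embed(sentence)
    return float(r @ s / (np.linalg.norm(r) * np.linalg.norm(s) + 1e-12))

# Hypothetical rule mapping a surface pattern to a relation label.
rule, label = "SUBJ was born in OBJ", "per:place_of_birth"
corpus = [
    "SUBJ , who was born in OBJ , later moved abroad",
    "SUBJ acquired OBJ for two billion dollars",
]
THRESHOLD = 0.6  # assumed matching threshold
for sent in corpus:
    score = soft_match(rule, sent)
    if score > THRESHOLD:  # weakly label the sentence as augmented supervision
        print(f"weak label {label!r} (score {score:.2f}): {sent}")
```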
Yu Meng (University of Illinois at Urbana-Champaign), Jiaxin Huang (University of Illinois Urbana-Champaign), Guangyuan Wang (University of Illinois at Urbana-Champaign), Zihan Wang (University of Illinois at Urbana-Champaign), Chao Zhang (Georgia Institute of Technology), Yu Zhang (University of Illinois at Urbana-Champaign) and Jiawei Han (University of Illinois at Urbana-Champaign).
Abstract
Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Topic models, which discover latent topics via modeling the corpus generative process, have proven fruitful for this task. However, such purely unsupervised approaches often generate topics that do not fit the user's particular need and yield suboptimal performance on downstream tasks. To this end, we propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine distinctive topics from text corpora. This new task not only helps the user understand clearly and distinctively the topics he/she is most interested in, but also directly benefits keyword-driven classification tasks. We develop a novel category-name guided text embedding method, CatE, for discriminative topic mining. We conduct a comprehensive set of experiments to show that CatE mines a high-quality set of topics guided by category names only, and benefits a variety of downstream applications including weakly supervised classification and lexical entailment direction identification.
Jianxing Yu (Sun Yat-sen University), Xiaojun Quan (Sun Yat-sen University), Qinliang Su (Sun Yat-sen University) and Jian Yin (Sun Yat-sen University).
Abstract
This paper focuses on multi-hop question generation, which aims to generate questions that require reasoning over multiple sentences and relations to obtain answers. In particular, we first build an entity graph to integrate various entities scattered over the text by capturing their contextual relations. We then extract the sub-graph satisfying certain conditions on the relations and reasoning type, so as to obtain the reasoning chain for each question. Guided by the chain, we propose a holistic generator-evaluator network to form the questions, where such guidance helps ensure that the generated questions require multi-hop deduction to correspond to the answers. The generator is a sequence-to-sequence model, designed with several techniques to make the questions syntactically and semantically valid. The evaluator optimizes the generator network through a hybrid mechanism that combines supervised and reinforcement learning. Experimental results on the HotpotQA dataset demonstrate the effectiveness of our approach: the generated samples can be used as pseudo training data to alleviate the data shortage problem for neural networks and help achieve state-of-the-art results on multi-hop machine comprehension.
Yaowei Zheng (Beihang University), Richong Zhang (Beihang University), Suyuchen Wang (Beihang University), Samuel Mensah (Beihang University) and Yongyi Mao (University of Ottawa).
Abstract
Supervised learning relies heavily on readily available labeled data to infer an effective classification function. However, methods proposed under the supervised learning paradigm face a scarcity of labeled data within domains and do not generalize well to other tasks. Transfer learning has proved to be a worthy choice for addressing these issues, by allowing knowledge to be shared across domains and tasks. In this paper, we propose two transfer learning methods, Anchored Model Transfer (AMT) and Soft Instance Transfer (SIT), which are both based on multi-task learning, account for model transfer and instance transfer respectively, and can be combined into a common framework. We demonstrate the effectiveness of AMT and SIT for aspect-level sentiment classification, showing competitive performance against baseline models on benchmark datasets. Interestingly, we show that the integration of both methods, AMT+SIT, achieves state-of-the-art performance on the same task.
Haggai Roitman (IBM Research AI), Guy Feigenblat (IBM Research AI), Doron Cohen (IBM Research AI), Odellia Boni (IBM Research AI) and David Konopnicki (IBM Research AI).
Abstract
We propose Dual-CES -- a novel unsupervised, query-focused, multi-document extractive summarizer. Dual-CES builds on top of the Cross Entropy Summarizer (CES) and is designed to better handle the tradeoff between saliency and focus in summarization. To this end, Dual-CES employs a two-step dual-cascade optimization approach with saliency-based pseudo-feedback distillation. Overall, Dual-CES significantly outperforms all other state-of-the-art unsupervised alternatives. Dual-CES is even shown to be able to outperform strong supervised summarizers.
Social Network-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Yuxuan Shi (Nanjing University), Gong Cheng (Nanjing University) and Evgeny Kharlamov (Bosch Center for Artificial Intelligence).
Abstract
Keyword search is a prominent approach to querying Web data that has been extensively studied. For graph-structured data, a widely accepted semantics for keywords is based on group Steiner trees. For this NP-hard problem, existing algorithms with provable quality guarantees have prohibitive run time on large graphs. In this paper, we propose a series of practical approximation algorithms with a guaranteed quality of computed answers and very low run time. Our algorithms rely on Hub Labeling (HL), a structure that labels each vertex in a graph with a list of vertices reachable from it, which we use to compute distances and shortest paths. We devise two HLs: a conventional static HL that uses a new heuristic to improve the existing pruned landmark labeling, and a novel dynamic HL that inverts and aggregates query-relevant static labels to more efficiently process vertex sets. We show that our approach allows us to compute a reasonably good approximation of answers to keyword queries in milliseconds on knowledge graphs with millions of vertices.
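The hub-labeling primitive the algorithms build on can be sketched briefly: every vertex stores a sorted list of (hub, distance) pairs, and the distance between two vertices is the minimum of dist(u, h) + dist(h, v) over their common hubs. The function below shows that standard query; the paper's new pruning heuristic and dynamic labels are not reproduced here, and the toy labels are hypothetical.

```python
def hl_distance(label_u, label_v):
    """Hub-labeling distance query.

    label_u, label_v: lists of (hub_id, dist) pairs sorted by hub_id,
    satisfying the cover property (some shortest u-v path passes a common hub).
    Returns the minimum over common hubs of dist(u, h) + dist(h, v).
    """
    i = j = 0
    best = float("inf")
    while i < len(label_u) and j < len(label_v):
        hu, du = label_u[i]
        hv, dv = label_v[j]
        if hu == hv:                 # common hub: candidate distance
            best = min(best, du + dv)
            i += 1
            j += 1
        elif hu < hv:
            i += 1
        else:
            j += 1
    return best

# Toy labels for two vertices sharing hubs 2 and 5.
L_u = [(1, 3), (2, 1), (5, 4)]
L_v = [(2, 2), (4, 1), (5, 1)]
print(hl_distance(L_u, L_v))   # -> 3, via hub 2
```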
Hongming Zhang (The Hong Kong University of Science and Technology), Xin Liu (The Hong Kong University of Science and Technology), Haojie Pan (The Hong Kong University of Science and Technology), Yangqiu Song (The Hong Kong University of Science and Technology) and Cane Wing-Ki Leung (Wisers AI Lab).
Abstract
Understanding human language requires complex world knowledge. However, existing large-scale knowledge graphs mainly focus on knowledge about entities while ignoring knowledge about activities, states, or events, which describe how entities or things act in the real world. To fill this gap, we develop ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11 billion tokens of unstructured textual data. ASER contains 15 relation types belonging to five categories, 194 million unique eventualities, and 64 million unique edges among them. Both human and extrinsic evaluations demonstrate the quality and effectiveness of ASER.
Huajie Shao (University of Illinois at Urbana-Champaign), Dachun Sun (University of Illinois at Urbana-Champaign), Jiahao Wu (University of Illinois Urbana-Champaign), Zecheng Zhang (University of Illinois at Urbana-Champaign), Aston Zhang (Amazon), Shuochao Yao (University of Illinois at Urbana-Champaign), Shengzhong Liu (University of Illinois at Urbana-Champaign), Tianshi Wang (University of Illinois at Urbana-Champaign), Chao Zhang (Georgia Institute of Technology) and Tarek Abdelzaher (University of Illinois at Urbana-Champaign).
Abstract
GitHub has become a popular social application platform, where a large number of users post their open-source projects. In particular, an increasing number of researchers release repositories of source code related to their research papers in order to attract more people to follow their work. Motivated by this trend, we describe a novel item-item cross-platform recommender system, paper2repo, that recommends relevant repositories on GitHub that match a given paper in an academic search system such as Microsoft Academic. The key challenge is to identify the similarity between an input paper and its related repositories across the two platforms, without the benefit of human labeling. Towards that end, paper2repo integrates text encoding and constrained graph convolutional networks (GCN) to automatically learn and map the embeddings of papers and repositories into the same space, where proximity offers the basis for recommendation. To make our method more practical in real-life systems, labels used for model training are computed automatically from features of user actions on GitHub. In machine learning, such automatic labeling is often called distant supervision. To the authors' knowledge, this is the first distant-supervised cross-platform (paper to repository) matching system. We evaluate the performance of paper2repo on real-world data sets collected from GitHub and Microsoft Academic. Results demonstrate that it outperforms other state-of-the-art recommendation methods.
Hechan Tian (State Key Laboratory of Mathematical Engineering and Advanced Computing), Meng Zhang (State Key Laboratory of Mathematical Engineering and Advanced Computing), Xiangyang Luo (State Key Laboratory of Mathematical Engineering and Advanced Computing), Fenlin Liu (State Key Laboratory of Mathematical Engineering and Advanced Computing) and Yaqiong Qiao (State Key Laboratory of Mathematical Engineering and Advanced Computing).
Abstract
Social network user location prediction technology has been widely used in various geospatial applications like public health monitoring and local advertising recommendation. Due to insufficient consideration of the relationships between users and location-indicative words, most existing prediction methods estimate label propagation probabilities solely based on statistical features, such as mention frequency and the number of commonly followed users, resulting in large location prediction errors. In this paper, a Twitter user location prediction method based on representation learning and label propagation is proposed. First, a heterogeneous connection relation graph is constructed based on relationships between Twitter users and relationships between users and location-indicative words, and relationships unrelated to geographic attributes are filtered out. Then, vector representations of users are learnt from a series of user node sequences generated from the connection relation graph. Finally, label propagation probabilities between adjacent users are calculated based on the vector representations, and the locations of unknown users are predicted through iterative label propagation. Experiments on two representative Twitter datasets, GeoText and TwUs, show that the proposed method can accurately calculate label propagation probabilities based on vector representations and improve the accuracy of location prediction. Compared with typical existing Twitter user location prediction methods, GCN and MLP-TXT+NET, the median error distance of the proposed method is reduced by 18% and 16%, respectively.
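A compact sketch of the two steps the abstract describes, under simplifying assumptions: per-edge propagation probabilities are derived from the cosine similarity of node embeddings, and location labels are then propagated iteratively from known users to unknown ones. The toy graph, random embeddings, and majority-vote update below are illustrative, not the exact procedure of the paper.

```python
import numpy as np

def propagation_probs(edges, emb):
    """Turn cosine similarity of node embeddings into per-edge propagation weights."""
    w = {}
    for u, v in edges:
        sim = emb[u] @ emb[v] / (np.linalg.norm(emb[u]) * np.linalg.norm(emb[v]) + 1e-12)
        w[(u, v)] = w[(v, u)] = max(sim, 0.0)
    return w

def propagate(nodes, edges, seeds, emb, iters=10):
    """Iteratively assign each unlabeled node the weighted vote of its neighbours' locations."""
    w = propagation_probs(edges, emb)
    labels = dict(seeds)
    nbrs = {n: [] for n in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    for _ in range(iters):
        for n in nodes:
            if n in seeds:                    # users with known locations stay fixed
                continue
            votes = {}
            for m in nbrs[n]:
                if m in labels:
                    votes[labels[m]] = votes.get(labels[m], 0.0) + w[(n, m)]
            if votes:
                labels[n] = max(votes, key=votes.get)
    return labels

rng = np.random.default_rng(1)
nodes = ["a", "b", "c", "d"]
emb = {n: rng.normal(size=8) for n in nodes}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(propagate(nodes, edges, {"a": "Helsinki", "d": "Austin"}, emb))
```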
Hind Almerekhi (Qatar Foundation), Haewoon Kwak (Qatar Computing Research Institute), Joni Salminen (Qatar Computing Research Institute, HBKU; and Turku School of Economics) and Bernard Jansen (Qatar Computing Research Institute, Hamad Bin Khalifa University).
Abstract
Protecting online discussions from toxicity is a challenge that online communities struggle with. Therefore, identifying the causes or triggers of toxicity is essential for preventing toxic comments from manifesting in online discussions. In this research, we begin by defining toxicity triggers within discussion threads as non-toxic contributions that lead to other toxic comments. Then, we build an LSTM neural network for toxicity trigger detection using more than 221 thousand submissions containing more than 2.2 million comments from Reddit. The prediction model includes text-based features and derives features from past studies that pertain to shifts in sentiment, topic flow, and discussion context across comments in discussion threads. Our findings show that triggers of toxicity contain identifiable features, such as named entities, and that incorporating shift features with the discussion context improves the performance of the prediction model by 6%, achieving an overall AUC score of 0.87. Topic and sentiment shifts frequently occur in discussions that contain toxicity triggers, indicating that shift analyses combined with the discussion context are useful for toxicity trigger detection in online discussions. We discuss implications for online communities and also provide a rich dataset for further analysis of online toxicity and its root causes.
User Modeling-A (5)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Huafeng Liu (Beijing Jiaotong University), Jingxuan Wen (Beijing Jiaotong University), Zhicheng Wu (Beijing Jiaotong University), Jiaqi Wang (Beijing Jiaotong University), Liping Jing (Beijing Jiaotong University) and Jian Yu (Beijing Jiaotong University).
Abstract
Deep generative models, especially the variational auto-encoder (VAE), have been successfully employed by more and more recommendation systems. The reason is that they combine the flexibility of probabilistic generative models with the powerful non-linear feature representation ability of deep neural networks. The existing VAE-based recommendation models are usually proposed under a global assumption, incorporating simple priors, e.g., a single Gaussian, to regularize the latent variables. This strategy, however, is ineffective when a user is simultaneously interested in different kinds of items, i.e., when the user's preference is highly diverse. In this paper, we therefore propose a Deep Global and Local Generative Model for recommendation (DGLGM) that considers both the local and global structure among users under the Wasserstein auto-encoder framework. Besides keeping the global structure like the existing models, DGLGM adopts a non-parametric mixture Gaussian distribution with several components to capture the diversity of users' preferences. Each component corresponds to one local structure, and its optimal size can be determined via the automatic relevance determination technique. These two parts can be seamlessly integrated and enhance each other. The proposed DGLGM can be efficiently inferred by minimizing its penalized upper bound with the aid of a local variational optimization technique. Meanwhile, we theoretically analyze its generalization error bounds to guarantee its performance on sparse feedback data with diversity. Comparisons with state-of-the-art methods show that DGLGM consistently benefits the recommendation system in the top-N recommendation task.
Fuqiang Yu (Shandong University), Lizhen Cui (Shandong University), Wei Guo (Shandong University), Xudong Lu (Shandong University), Qingzhong Li (Shandong University) and Hua Lu (Aalborg University).
Abstract
In location-based social networks (LBSNs), considerable amounts of POI check-in data have been accumulated. As a result, successive point-of-interest (POI) recommendation is increasingly popular. Existing successive POI recommendation methods only predict where a user will go next, ignoring when this behavior will occur. In this work, we focus on predicting POIs that will be visited by users in the next 24 hours, a more meaningful and rational task. Moreover, as check-in data in LBSNs are very sparse, it is challenging to accurately capture user preferences in temporal patterns. In this paper, we propose a category-aware deep model, CatDM, that incorporates POI category and geographical influence to reduce the search space and overcome data sparsity. We design two deep encoders based on LSTM to model the time series data. The first encoder captures user preferences in POI categories, whereas the second encoder exploits user preferences in POIs. Considering clock influence in the second encoder, we divide each user's check-in history into several different time windows and develop a personalized attention mechanism for each window to facilitate CatDM in exploiting temporal patterns. Moreover, to sort the candidate set, we consider four specific dependencies: user-POI, user-category, POI-time and POI-user current preferences. Extensive experiments are conducted on two large real datasets. The experimental results demonstrate that CatDM outperforms the state-of-the-art models for successive POI recommendation on sparse data.
Gaole He (Renmin University of China), Junyi Li (School of Information, Renmin University of China), Xin Zhao (Renmin University of China, School of Information), Peiju Liu (Peking University) and Ji-Rong Wen (School of Information, Renmin University of China).
Abstract
The task of Knowledge Graph Completion (KGC) aims to automatically infer missing fact information in a Knowledge Graph (KG). In this paper, we take a new perspective that aims to leverage rich user-item interaction data (user interaction data for short) to improve the KGC task. Our work is inspired by the observation that many KG entities correspond to online items in application systems. However, the two kinds of data sources have very different intrinsic characteristics, and a simple fusion strategy is likely to hurt the original representation performance. To address this challenge, we propose a novel adversarial learning approach for leveraging user interaction data in the KGC task. Our generator is isolated from user interaction data and improves itself according to the feedback from the discriminator. The discriminator takes the useful information learned from user interaction data as input and gradually enhances its evaluation capacity in order to identify the fake samples generated by the generator. To discover implicit entity preferences of users, we design an elaborate collaborative learning algorithm based on graph neural networks, which is jointly optimized with the discriminator. Such an approach is effective in alleviating the issues of data heterogeneity and semantic complexity in the KGC task. Extensive experiments on three real-world datasets demonstrate the effectiveness of our approach on the KGC task.
Minghong Fang (Iowa State University), Neil Zhenqiang Gong (Duke University) and Jia Liu (Iowa State University).
Abstract
Recommender systems are an essential component of web services to engage users. Popular recommender systems model user preferences and item properties using a large amount of crowdsourced user-item interaction data, e.g., rating scores; then the top-N items that best match a user's preference are recommended to the user, where the matching is determined by the modeled user preferences and item properties. In this work, we show that an attacker can launch a data poisoning attack against a recommender system, i.e., an attacker can spoof a recommender system into making recommendations as the attacker desires by injecting fake users with carefully crafted user-item interaction data. Specifically, an attacker can spoof a recommender system into recommending a target item to as many normal users as possible. We focus on matrix factorization based recommender systems because they have been widely deployed in industry. Given the number of fake users the attacker can inject, we formulate the crafting of rating scores for the fake users as an optimization problem, whose objective is to maximize the number of normal users to whom the target item is recommended. However, this optimization problem is challenging to solve as it is a non-convex integer programming problem. To address the challenge, we develop several techniques to solve the optimization problem indirectly. For instance, we leverage influence functions to select a subset of normal users who are influential to the recommendations and solve our formulated optimization problem based on these influential users. We show the effectiveness of our attacks on two benchmark datasets. Moreover, we show that even if the recommender system detects fake users based on statistical analysis of their rating scores, our attacks remain effective, as the detector misses a large fraction of the fake users.
Ruirui Li (University of California, Los Angeles), Xian Wu (University of Notre Dame), Xiusi Chen (University of California, Los Angeles) and Wei Wang (University of California, Los Angeles).
Abstract
The proliferation of GPS-enabled devices, such as smartphones, has driven the prosperity of location-based social networks (LBSNs), resulting in a tremendous amount of user check-ins. These check-ins bring preeminent opportunities to understand users' preferences and facilitate matching between users and businesses. However, user check-ins are extremely sparse due to the huge user and business bases, which makes matching a daunting task. In this work, we investigate the recommendation problem in the context of identifying potential new customers for businesses in LBSNs. In particular, we focus on investigating the geographical influence, composed of geographical convenience and geographical dependency. In addition, we leverage metric-learning-based few-shot learning to fully utilize the user check-ins and facilitate the matching between users and businesses. To evaluate our proposed method, we conduct a series of experiments to extensively compare with 13 baselines using two real-world datasets. The results demonstrate that the proposed method outperforms all these baselines by a significant margin.
Society (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Victor Kristof (Ecole Polytechnique Fédérale de Lausanne), Matthias Grossglauser (Ecole Polytechnique Fédérale de Lausanne) and Patrick Thiran (Ecole Polytechnique Fédérale de Lausanne).
Abstract
A body of law is an example of a dynamic corpus of text documents that is jointly maintained by a group of editors who compete and collaborate in complex constellations. Our goal is to develop predictive models for this process, thereby shedding light on the competitive dynamics of parliamentarians making laws. For this purpose, we curated a dataset of 450,000 legislative edits introduced by European parliamentarians over the last ten years. An edit modifies the status quo of a law, and may be in competition with another edit if it modifies the same part of that law. We propose a model for predicting the success of such edits, in the face of both the inertia of the status quo and competition between overlapping edits. We include various features of the parliamentarians and of the edits to analyze the dynamics of the legislative process. The parameters of this model can be interpreted in terms of the influence of parliamentarians and the controversy of laws. We show that the intrinsic influence of parliamentarians helps them pass edits for laws of high controversy, but is of lesser importance for laws of low controversy. We finally show that incorporating additional latent features further boosts the predictive power by 14%, and that these features lend themselves to meaningful interpretation.
Panagiotis Papadopoulos (Brave Software Inc.), Peter Snyder (Brave Software Inc.), Dimitrios Athanasakis (Brave Software Inc.) and Benjamin Livshits (Brave Software Inc., Imperial College London).
Abstract
Funding the production of quality online content is a pressing problem for content producers. The most common current funding method, online advertising, is rife with well-known performance and privacy harms and an intractable subject-agent conflict; many users do not want to see advertisements, depriving sites of needed funding. Because of these negative aspects of advertisement-based funding, paywalls are an increasingly popular alternative for websites. This shift to an increasingly "pay-for-access" web has potentially huge implications for the web and society. Instead of a system where information (nominally) flows freely, paywalls create a web where high-quality information is available to fewer and fewer people, leaving other web users with less information, possibly of lower quality and accuracy. Despite the potential significance of a move from an "advertising-but-open" web to a "paywalled" web, this issue remains understudied. This work addresses this gap in our understanding by measuring how widely paywalls have been adopted, what kinds of sites use paywalls, and the distribution of policies enforced by paywalls. A partial list of our findings includes that (i) paywall use has increased, and at an increasing rate (2x more paywalls every 6 months), (ii) paywall adoption differs by country (e.g., 18.75% in the US, 12.69% in Australia), (iii) paywall deployment significantly changes how users interact with a site (e.g., higher bounce rates, fewer incoming links), (iv) the median cost of annual paywall access is 108 USD per site, and (v) paywalls are in general trivial to circumvent. Finally, we present the design of a novel, automated system for detecting whether a site uses a paywall, through the combination of runtime browser instrumentation and repeated programmatic interactions with the site. We intend this classifier to augment future, longitudinal measurements of paywall use and behavior.
Danaja Maldeniya (University of Michigan), Ceren Budak (University of Michigan), Lionel P. Robert Jr. (University of Michigan) and Daniel M. Romero (University of Michigan).
Abstract
Collaborative crowdsourcing is a well-established model of work in the information economy. Nowhere is this more apparent than in the case of open source software development. The structure and operating dynamics of these virtual and loosely knit teams differ from traditional organizations. As a result, little is known about how their behavior may change in response to an increase in external attention. To understand these changes, we analyze millions of actions of thousands of contributors in over 1,200 open source software projects that topped the GitHub Trending Projects page and thus experienced a large increase in attention, in comparison to a control group of projects identified through propensity score matching. In carrying out our research, we use the lens of organizational change management, which considers the challenges teams face during rapid growth and how they adapt their work routines, organizational structure, and management style. We show that, relative to the control group, trending results in an explosive growth in the effective team size. However, most newcomers make only shallow and transient contributions, such as reporting and fixing a specific bug, while a few show levels of commitment matching that of the original members. In response, the original team transitions towards administrative roles, responding to requests and reviewing and integrating work done by newcomers. In the aftermath, trending projects evolve towards a more distributed coordination model, with newcomers becoming more central, albeit in limited ways. Additionally, project teams become more modular, with subgroups specializing in different aspects of the project. We discuss broader implications for collaborative crowdsourcing teams that face attention shocks.
Jonathan P. Chang (Cornell University), Justin Cheng (Facebook) and Cristian Danescu-Niculescu-Mizil (Cornell University).
Abstract
Discourse involves two perspectives: a person's intention in making an utterance and others' perception of that utterance. Previous studies of online discussions have largely taken the latter, third-party perspective, e.g., relying on crowdsourced labels to quantify properties like sentiment and subjectivity. By contrast, in this work we present a computational framework for exploring both perspectives: the speaker's intentions and how they are perceived. Intention is, however, difficult to capture, as only the actual author of an utterance knows their intention with certainty. To address this, we combine logged data about public comments on Facebook with a survey of almost 20,000 people about their intentions in writing these comments or about their perceptions of comments that others had written. In particular, we focus on judgments of whether a comment was stating a fact or an opinion, since prior work has shown that these are often confused. We show that intentions and perceptions diverge in consequential ways. People are more likely to perceive opinions than to intend them, and linguistic cues that signal how an utterance is intended can differ from those that signal how it will be perceived. Furthermore, this misalignment between intentions and perceptions can be linked to the future health of a conversation: when a comment whose author intended to share a fact is misperceived as sharing an opinion, the subsequent conversation is more likely to derail into uncivil behavior than when the comment is perceived as intended. Altogether, these findings may inform the design of discussion platforms that better promote positive interactions.
Security (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Rolf van Wegberg (Delft University of Technology), Fieke Miedema (Delft University of Technology), Ugur Akyazi (Delft University of Technology), Arman Noroozian (Delft University of Technology), Bram Klievink (Leiden University) and Michel van Eeten (Delft University of Technology).
Abstract
Many cybercriminal entrepreneurs lack the skills and techniques to provision certain parts of their business model, leading them to outsource these parts to specialized criminal vendors. Online anonymous markets, from Silk Road to AlphaBay, have been used to search for these products and contract with their criminal vendors. While one listing of a product generates high sales numbers, another identical listing fails to sell. In this paper, we investigate which factors determine the performance of cybercrime products. Does success depend on the characteristics of the product or of the vendor? Or neither? To answer this question, we analyze scraped data on the business-to-business cybercrime segments of the AlphaBay market (2015-2017), consisting of 7,543 listings from 1,339 vendors that have been sold at least 126,934 times. We construct variables to capture price and product differentiators, like refund policies and customer support. We capture the influence of vendor characteristics by identifying five distinct vendor profiles based on a latent profile analysis of six properties, such as experience and reputation. We leverage these product and vendor characteristics to empirically predict the number of sales of cybercrime solutions, while controlling for the lifespan and the type of solution. We find that all vendor profiles - either positively or negatively - influence sales. Consistent with earlier insights into carding forums, we identify prevalent product differentiators that influence the relative success of a product. While all these product differentiators correlate significantly with product performance, their explanatory power is lower than that of vendor profiles. When outsourcing, the vendor seems to be of more importance to buyers than product differentiators or the price.
Alexander Sjösten (Chalmers University of Technology), Peter Snyder (Brave Software Inc.), Antonio Pastor (Universidad Carlos III de Madrid), Panagiotis Papadopoulos (Brave Software Inc.) and Benjamin Livshits (Brave Software Inc & Imperial College of London).
Abstract
Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g. ads, trackers, paywall libraries) so that they can be blocked by browsers and extensions. Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on "popular" websites, or resources that appear on a large number of "unpopular" websites. A crowd-sourcing strategy performs poorly, however, for parts of the web with small "crowds", such as regions of the web serving languages with (relatively) few speakers. This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages. We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 2,270 ad and ad-related resources (1,901 unique) when applied to 6,475 pages targeting these three regions. We hope that this work can be part of an increased effort at ensuring that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.
Simon Woo (SKKU), Hyoungshick Kim (Sungkyunkwan University), Hanbin Jang (SKKU) and Woojung Ji (SKKU).
Abstract
A package tracking number (PTN) is widely used to monitor and track a shipment. Usually, a package tracking number, which is a sequence of digits, is associated with information about a sender and a receiver, as well as the package delivery status. Through the lenses of security and privacy, however, a package tracking number can possibly reveal certain personal information, leading to privacy breaches. In this work, we examine the privacy issues associated with the online package tracking systems used by the top three most popular package delivery service providers in the world (FedEx, DHL, and UPS) and find that those websites expose users' personal data given a PTN. Moreover, we discovered that PTNs are highly structured and predictable via PTN enumeration attacks, so that such users' personal data from PTNs can be massively collected. We found that there is no security policy to limit the number of consecutive attempts in package tracking services. We experimented with and analyzed more than one million package tracking records obtained from FedEx, DHL, and UPS, and showed that within 5 attempts, an attacker can efficiently guess more than 90% of PTNs for FedEx and DHL, and close to 50% of PTNs for UPS, by exploiting consecutive PTN patterns. In addition, we present two practical concrete case studies: 1) inferring business transaction information and 2) uniquely identifying recipients. We demonstrate that some companies can intentionally obtain their competitors' business and customer information with massively collected PTNs. Also, we found that more than 109 recipients can be uniquely identified with fewer than 10 comparisons by linking PTN information with the online people search service Whitepages. Our research is the first to uncover how PTNs can be used to leak other personal information, and to reveal that the current PTN system can be misused to jeopardize user privacy.
Janith Weerasinghe (New York University), Bailey Flanigan (Drexel University), Aviel Stein (Drexel University), Damon McCoy (New York University) and Rachel Greenstadt (New York University).
Abstract
Online Social Network (OSN) users' demand to increase their account popularity has driven the creation of an underground ecosystem that provides services or techniques to help users manipulate content curation algorithms. One method of subversion that has recently emerged occurs when users form groups, called pods, to facilitate reciprocity abuse, where each member reciprocally interacts with content posted by other members of the group. We collect 1.8 million Instagram posts that were posted in pods hosted on Telegram. We first summarize the properties of these pods and how they are used, uncovering that they are easily discoverable by Google search and have a low barrier to entry. We then create two machine learning models for detecting Instagram posts that have gained interaction through two different kinds of pods, achieving 0.91 and 0.94 AUC, respectively. Finally, we find that pods are effective tools for increasing users' Instagram popularity: we estimate that pod utilization leads to a significantly increased level of likely organic comment interaction on users' subsequent posts.
Search (3)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Hamed Zamani (Microsoft), Susan Dumais (Microsoft), Nick Craswell (Microsoft), Paul Bennett (Microsoft) and Gord Lueck (Microsoft).
Abstract
Search queries are often short, and the underlying user intent may be ambiguous. This makes it challenging for search engines to predict possible intents, only one of which may pertain to the current user. To address this issue, search engines often diversify the result list and present documents relevant to multiple intents of the query. An alternative approach is to ask the user a question to clarify her information need. Asking clarifying questions is particularly important for scenarios with "limited bandwidth" interfaces, such as voice-only and small-screen devices. In addition, our user studies and large-scale online experiment show that asking clarifying questions is also useful in web search. Although some recent studies have pointed out the importance of asking clarifying questions, generating clarifying questions for open-domain search tasks remains unstudied and is the focus of this paper. A lack of training data for this task, even within the major search industry, makes it challenging. To mitigate this issue, we first identify a taxonomy of clarification for open-domain search queries by analyzing large-scale query reformulation data sampled from Bing search logs. This taxonomy leads us to a set of question templates and a simple yet effective slot filling algorithm. We further use this model as a source of weak supervision to automatically generate clarifying questions for training. Furthermore, we propose supervised and reinforcement learning models for generating clarifying questions learned from the weak supervision data. We also investigate methods for generating candidate answers for each clarifying question, so users can select from a set of pre-defined answers. Human evaluation of the clarifying questions and candidate answers for hundreds of search queries demonstrates the effectiveness of the proposed solutions.
Corbin Rosset (Microsoft), Chenyan Xiong (Microsoft), Xia Song (Microsoft), Daniel Campos (Microsoft), Nick Craswell (Microsoft), Saurabh Tiwary (Microsoft) and Paul Bennett (Microsoft).
Abstract
"People Also Ask" question suggestion is a popular feature in commercial search engines and a crucial gateway to lead users to more conversational search experiences. This paper fundamentally studies this question suggestion function, including offline metrics, suggestion models, weak supervision data, and online experiments. We first establish a novel offline evaluation metric, Usefulness, which reaches beyond mere relevance and requires leading the search session with more conversational "next-turn" suggestions. We construct the first public benchmark dataset for Useful question suggestion. Then we develop two suggestion systems, a BERT retrieval model and a GPT-2 generation model. To guide the suggestion models to provide more Useful questions, we introduce a new inductive training method that steers suggestion models toward next-turn suggestions using weak supervision from coherent and informative search sessions mined from the search log. Our offline experiments demonstrate the crucial role our "next-turn" inductive training plays in improving Usefulness over a strong online system. Our online A/B evaluation shows that our "next-turn" focused question suggestions receive 8% more user clicks than the previous system.
Zhen Qin (Google), Zhongliang Li (Google), Michael Bendersky (Google) and Donald Metzler (Google).
Abstract
Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning-to-rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-the-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between the matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach.
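A minimal numpy sketch of the two operations named in the abstract: an element-wise multiplicative match of query and document embeddings, followed by a latent-cross style gating of that match with side information. The shared dimensionality and the final linear scoring layer are simplifying assumptions, not the production architecture.

```python
import numpy as np

def mcn_score(query_emb, doc_emb, side_emb, w_out):
    """Matching cross network sketch.

    1) element-wise match of query and document embeddings;
    2) latent cross: gate the match with (1 + side-information embedding);
    3) linear layer producing a scalar ranking score.
    All inputs share the same dimensionality here for simplicity.
    """
    match = query_emb * doc_emb            # element-wise multiplication matching
    crossed = (1.0 + side_emb) * match     # latent cross with side information
    return float(crossed @ w_out)          # scalar relevance score

rng = np.random.default_rng(0)
d = 16
q, doc, side = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d) * 0.1
w = rng.normal(size=d)
print(mcn_score(q, doc, side, w))
```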
Jyun-Yu Jiang (University of California, Los Angeles), Tao Wu (Google), Georgios Roumpos (Google), Heng-Tze Cheng (Google), Xinyang Yi (Google), Ed Chi (Google), Harish Ganapathy (Google), Nitin Jindal (Google), Pei Cao (Google) and Wei Wang (University of California, Los Angeles).
Abstract
Modern online content-sharing platforms host billions of items like music, videos, and products uploaded by various providers for users to discover items of their interests. To satisfy these information needs, effective item retrieval (or item search ranking) given user search queries has become one of the most fundamental problems for online content-sharing platforms. Moreover, the same query can represent different search intents for different users, so personalization is also essential for providing more satisfactory search results. Different from other similar research tasks, such as ad-hoc retrieval and product retrieval with copious words and reviews, items in content-sharing platforms usually lack sufficient descriptive information and related meta-data as features. In this paper, we propose the end-to-end deep attentive model (EDAM) to deal with personalized item retrieval for online content-sharing platforms using only discrete personal item history and queries. Each discrete item in the personal item history of a user, along with its content provider, is first mapped to an embedding vector as a continuous representation. A query-aware attention mechanism is then applied to identify the relevant contexts in the user history and construct the overall personal representation for a given query. Finally, an extreme multi-class softmax classifier aggregates the representations of both the query and the personal item history to provide personalized search results. We conduct extensive experiments on a large-scale real-world dataset with hundreds of millions of users from a large-scale online content-sharing platform. The experimental results demonstrate that our proposed approach significantly outperforms several competitive baseline methods. It is also worth mentioning that this work utilizes a massive dataset from a real-world commercial content-sharing platform for personalized item retrieval, providing more insightful analysis from an industrial perspective.
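The query-aware attention step can be sketched in a few lines: each item in the user's history is scored by its dot product with the query embedding, and the softmax-weighted sum of the history becomes the personal representation fed to the classifier. The dimensions and toy data below are assumptions for illustration, not the model's actual configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def personal_representation(query_emb, history_embs):
    """Query-aware attention over a user's item history.

    query_emb: (d,) embedding of the search query.
    history_embs: (n_items, d) embeddings of previously consumed items.
    Returns the attention-weighted summary of the history.
    """
    scores = history_embs @ query_emb      # relevance of each past item to the query
    weights = softmax(scores)
    return weights @ history_embs          # (d,) personalized context vector

rng = np.random.default_rng(2)
q = rng.normal(size=8)
history = rng.normal(size=(5, 8))
rep = personal_representation(q, history)
print(rep.shape)                           # (8,), combined with q for the softmax classifier
```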
Mobile (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Christian Meurisch (TU Darmstadt), Bekir Bayrak (TU Darmstadt) and Max Mühlhäuser (TU Darmstadt).
Abstract
User services increasingly base their actions on AI models, e.g., to offer personalized and proactive support. However, the underlying AI algorithms require a continuous stream of personal data, leading to privacy issues: these algorithms typically run in the provider's cloud, so users have to share data outside of their sovereign territory. Current privacy-preserving concepts are either not applicable to such AI-based services or work to the disadvantage of one of the parties. This paper presents PrivAI, a new decentralized and privacy-by-design platform that overcomes the need to share user data in order to benefit from personalized AI services. In short, PrivAI complements existing approaches to personal data stores, but strictly enforces the confinement of raw user data. PrivAI further addresses the resulting challenges by (1) dividing AI algorithms into cloud-based general model training and a subsequent local personalization step, and (2) loading confidential AI models into a trusted execution environment, thus protecting the provider's intellectual property. Our experiments show the feasibility and effectiveness of PrivAI, with performance comparable to currently practiced approaches.
Zhihan Fang (Rutgers University), Guang Wang (Rutgers University), Shuai Wang (Southeast University), Chaoji Zuo (Rutgers University), Fan Zhang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) and Desheng Zhang (Rutgers University).
Abstract
Understanding the representativeness of cellular web logs at city scale is essential for web applications. Most existing work on cellular web analyses or applications is built upon data from a single network in a city, which may not be representative of overall usage patterns, since multiple cellular networks coexist in most cities in the world. In this paper, we conduct the first comprehensive investigation of multiple cellular networks in a city with a 100% user penetration rate. We study the correlation and difference of web usage patterns (e.g., internet access services) between diverse cellular networks along spatial and temporal dimensions, in order to quantify how representative the web usage of a single network is of the usage patterns of all users in the same city. Moreover, relying on three external datasets, we study the correlation between representativeness and contextual factors (e.g., Points-of-Interest, population, and mobility) to explain the potential causes of differences in representativeness. We find that contextual diversity is a key reason for these differences, and that representativeness has a significant impact on the performance of real-world applications. Based on the analysis results, we further design a correction model that addresses the bias of single cellular networks and improves representativeness by 45.8%.
Tianfu He (Harbin Institute of Technology), Jie Bao (JD Intelligent City Research), Ruiyuan Li (JD Intelligent City Research), Sijie Ruan (JD Intelligent City Research), Yanhua Li (Worcester Polytechnic Institute (WPI)), Li Song (Meituan-Dianping), Hui He (Harbin Institute of Technology, Harbin, China) and Yu Zheng (JD Intelligent City Research).
Abstract
Human mobility, e.g., GPS trajectories of vehicles, shared bikes, and mobile devices, reflects people's travel patterns and preferences, which are especially crucial for urban applications such as urban planning and business location selection. However, collecting a large set of human mobility data is not easy because of privacy and commercial concerns, as well as the high cost of deploying sensors and the long time needed to collect the data, especially in newly developed cities. Realizing this, and based on the intuition that human mobility is driven by mobility intentions reflected by origin and destination (OD) features, as well as the preference for selecting the path between them, we investigate the problem of generating mobility data for a new target city by transferring knowledge from mobility data and multi-source data of source cities. Our framework contains three main stages: 1) mobility intention transfer, which learns a latent unified mobility intention distribution across the source cities and transfers the model of this distribution to the target city; 2) OD generation, which generates the OD pairs in the target city based on the transferred mobility intention model; and 3) path generation, which generates the paths for each OD pair based on a utility model learned from the real trajectory data in the source cities. A demo of our trajectory generator is publicly available online for two city regions. Extensive experimental results over four regions in China validate the effectiveness of the proposed solution. Besides, an on-field case study is presented for a newly developed region, Xiongan, China. With the generated trajectories in the new city, many trajectory mining techniques can be applied.
Web Mining-B (4)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Wenbo Zheng (School of Software Engineering, Xi'an Jiaotong University), Shaocong Mo (Zhejiang University) and Yang Zhao (Xi'an Jiaotong University).
Abstract
As the size and number of sources of network traffic increase, so does the challenge of monitoring and analyzing network traffic. The challenging problems in classifying encrypted traffic are the imbalanced nature of network data and over-dependence on data size. In this paper, we propose a meta-learning approach, named RBRN, to address these problems. RBRN is an end-to-end classification model that learns representative features from raw flows and then classifies them in a unified framework. Moreover, we design a hallucinator to produce additional training samples for the imbalanced classification, and then focus on meta-learning to classify unseen categories from few labeled samples. We validate the effectiveness of RBRN on a real-world network traffic dataset, and the experimental results demonstrate that RBRN achieves excellent classification performance and outperforms other methods on encrypted traffic classification.
Shuangyin Li (Department of Computer Science, South China Normal University), Yu Zhang (Southern University of Science and Technology), Rong Pan (Sun Yat-sen University) and Kaixiang Mo (The Hong Kong University of Science and Technology).
Abstract
Word embeddings have been widely used and proven to be effective in many natural language processing and text modeling tasks. It is obvious that one ambiguous word could have very different semantics in various contexts, which is called polysemy. Most existing works aim at generating only a single embedding for each word, while a few works build a limited number of embeddings to represent different meanings for each word. However, it is hard to determine the exact number of senses for each word as the word meaning is dependent on contexts. To address this problem, we propose a novel Adaptive Probabilistic Word Embedding (APWE) model, where the word polysemy is defined over a latent interpretable semantic space. Specifically, each word is first represented by an embedding in the latent semantic space; then, based on the proposed APWE model, the word embedding can be adaptively adjusted and updated according to different contexts to obtain a tailored word embedding. Empirical comparisons with state-of-the-art models demonstrate the superiority of the proposed APWE model.
Binny Mathew (IIT Kharagpur), Sandipan Sikdar (RWTH Aachen University), Florian Lemmerich (RWTH Aachen University) and Markus Strohmaier (RWTH Aachen University & GESIS).
Abstract
We introduce ‘POLAR’ - a framework that adds interpretability to pre-trained word embeddings via the adoption of semantic differentials. Semantic differentials are a psychometric construct for measuring the semantics of a word by analysing its position on a scale between two polar opposites (e.g., cold - hot, soft - hard). The core idea of our approach is to transform existing, pre-trained word embeddings via semantic differentials to a new “polar” space where dimensions are interpretable. The framework allows for selecting discriminative dimensions from a set of polar dimensions provided by an oracle. We show that the interpretable dimensions selected by our framework align with human judgement. We also demonstrate the effectiveness of our framework by deploying it to various downstream tasks, where our interpretable word embeddings achieve a performance that is comparable to the original word embeddings. These results together demonstrate that interpretability can be added to word embeddings without compromising on performance. Our work is relevant for researchers or engineers interested in interpreting trained word embeddings.
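To make the idea of polar dimensions concrete, the following minimal sketch (an illustration only, not the authors' released code) builds interpretable dimensions from difference vectors of opposite words and projects an arbitrary embedding onto them. The embedding dictionary `emb`, the specific word pairs, and the least-squares projection are assumptions made for this toy example.

```python
import numpy as np

# Hypothetical pre-trained embeddings: word -> vector (random stand-ins here;
# in practice these would be loaded from GloVe, word2vec, etc.).
emb = {w: np.random.randn(300) for w in ["cold", "hot", "soft", "hard", "coffee"]}

# Polar opposites defining the interpretable dimensions.
polar_pairs = [("cold", "hot"), ("soft", "hard")]

# Each polar dimension is the difference vector between the two opposites.
D = np.stack([emb[b] - emb[a] for a, b in polar_pairs])        # shape (num_dims, 300)

def to_polar(word):
    """Express a word in the polar space via a least-squares change of basis."""
    return np.linalg.lstsq(D.T, emb[word], rcond=None)[0]       # shape (num_dims,)

print(dict(zip(["cold-hot", "soft-hard"], to_polar("coffee"))))
```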
Dongbo Xi (Institute of Computing Technology, Chinese Academy of Sciences), Fuzhen Zhuang (Institute of Computing Technology, Chinese Academy of Sciences), Ganbin Zhou (Tencent), Xiaohu Cheng (Tencent), Fen Lin (Tencent) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
Abstract
Domain adaptation tasks such as cross-domain sentiment classification aim to utilize existing labeled data in the source domain and unlabeled or few labeled data in the target domain to improve the performance in the target domain via reducing the shift between the data distributions. Existing cross-domain sentiment classification methods need to distinguish pivots, i.e., the domain-shared sentiment words, and non-pivots, i.e., the domain-specific sentiment words, for excellent adaptation performance. In this paper, we first design a Category Attention Network (CAN), and then propose a model named CAN-CNN to integrate CAN and a Convolutional Neural Network (CNN). On the one hand, the model regards pivots and non-pivots as unified category attribute words and can automatically capture them to improve the domain adaptation performance; on the other hand, the model makes an attempt at interpretability to learn the transferred category attribute words. Specifically, the optimization objective of our model has three different components: 1) the supervised classification loss; 2) the distribution loss of category feature weights; 3) the domain invariance loss. Finally, the proposed model is evaluated on three public sentiment analysis datasets and the results demonstrate that CAN-CNN can outperform various other baseline methods.
Wenxuan Zhang (The Chinese University of Hong Kong), Wai Lam (The Chinese University of Hong Kong), Yang Deng (The Chinese University of Hong Kong) and Jing Ma (The Chinese University of Hong Kong).
Abstract
Product-specific question answering platforms can greatly help to address the concerns of potential customers. However, the user-provided answers on such platforms often vary a lot in their qualities. Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing. Accurately predicting the helpfulness of an answer to a given question and thus identifying helpful answers is becoming a pressing need. Since the helpfulness of an answer depends on multiple perspectives instead of only the topical relevance investigated in typical QA tasks, common answer selection algorithms are insufficient for tackling this task. In this paper, we propose the Review-guided Answer Helpfulness Prediction (RAHP) model that not only considers the interactions between QA pairs but also investigates the opinion coherence between the answer and crowds' opinions reflected in the reviews, which is another important factor in identifying helpful answers. Moreover, we tackle the task of determining opinion coherence as a language inference problem and explore the utilization of a pre-training strategy to transfer the textual inference knowledge obtained from a specifically designed trained network. Extensive experiments conducted on real-world data across seven product categories show that our proposed model achieves superior performance on the prediction task.
Semantics (3)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Jiaming Shen (University of Illinois at Urbana-Champaign), Zhihong Shen (Microsoft), Chenyan Xiong (Microsoft), Chi Wang (Microsoft), Kuansan Wang (Microsoft) and Jiawei Han (University of Illinois at Urbana-Champaign).
Abstract
Taxonomy consists of machine-interpretable semantics and provides valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of
Shuang Peng (Ant Financial Services Group), Hengbin Cui (Ant Financial Services Group), Niantao Xie (MOE Key Laboratory of Computational Linguistics, Peking University), Sujian Li (MOE Key Laboratory of Computational Linguistics, Peking University), Jiaxing Zhang (Ant Financial Services Group) and Xiaolong Li (Ant Financial Services Group).
Abstract
Learning sentence similarity is a fundamental research topic and has been explored using various deep learning methods recently. In this paper, we further propose an enhanced recurrent convolutional neural network (Enhanced-RCNN) model for learning sentence similarity. Compared to the state-of-the-art BERT model, the architecture of our proposed model is far less complex. Experimental results show that our similarity learning method outperforms the baselines and achieves competitive performance on two real-world paraphrase identification datasets.
Yu Liu (Tsinghua University), Quanming Yao (4Paradigm) and Yong Li (Tsinghua University).
Abstract
With the rapid development of knowledge bases (KBs), the link prediction task, which completes KBs with missing facts, has been broadly studied, especially in binary relational KBs (a.k.a. knowledge graphs), with powerful tensor decomposition related methods. However, the ubiquitous n-ary relational KBs with higher-arity relational facts have received less attention, and for them existing translation-based and neural-network-based approaches have weak expressiveness and high complexity in modeling various relations. Tensor decomposition has not been considered for n-ary relational KBs, while directly extending tensor decomposition related methods of binary relational KBs to the n-ary case does not yield satisfactory results due to exponential model complexity and their strong assumptions on binary relations. To generalize tensor decomposition for n-ary relational KBs, in this work, we propose GETD, a generalized model based on Tucker decomposition and Tensor Ring decomposition. The existing negative sampling technique is also generalized to the n-ary case for GETD. In addition, we theoretically prove that GETD is fully expressive and can completely represent any KB. Extensive evaluations on two representative n-ary relational KB datasets demonstrate the superior performance of GETD, significantly improving on the state-of-the-art methods by over 15%. Moreover, GETD further obtains state-of-the-art results on the benchmark binary relational KB datasets.
Junshuang Wu (Beihang University), Richong Zhang (Beihang University), Yongyi Mao (University of Ottawa), Hongyu Guo (National Research Council Canada, Ottawa, Canada), Masoumeh Soflaei Shahrbabak (University of Ottawa) and Jinpeng Huai (Beihang University).
Abstract
Entity linking, which maps named entity mentions in a document into the proper entities in a given knowledge base, has been shown to significantly benefit from modeling the entity relatedness through Graph Convolutional Networks (GCN). Nevertheless, existing GCN entity linking models fail to take into account the fact that the structured graph for a set of entities not only depends on the contextual information of the given document but also adaptively changes on different aggregation layers of the GCN network, resulting in insufficiency in terms of capturing the relatedness between entities. In this paper, we propose a dynamic GCN architecture to effectively cope with this challenge. The graph structure in our model is dynamically computed and modified during training. Through aggregating information from dynamically linked nodes, our GCN model can collectively identify the entity mappings between the document and the knowledge base and efficiently capture the topical coherence among various entity mentions in the entire document. Empirical studies on benchmark entity linking data sets confirm the superior performance of our proposed strategy and the benefits of the dynamic graph structure.
Social Network-B (2)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Nikhil Goyal (IIT Delhi), Harsh Jain (IIT Delhi) and Sayan Ranu (IIT Delhi).
Abstract
Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic, scalable technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being 40 times better in quality across a comprehensive set of 10 different metrics.
Rachid Guerraoui (Ecole Polytechnique Fédérale de Lausanne), Anne-Marie Kermarrec (EPFL, Mediego), Olivier Ruas (Peking University) and Francois Taiani (Univ Rennes, Inria, CNRS, IRISA).
Abstract
We propose GoldFinger, a new compact and fast-to-compute binary representation of datasets to approximate Jaccard's index. We illustrate the effectiveness of GoldFinger on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that GoldFinger can drastically accelerate a large range of existing KNN algorithms with little to no overhead. As a side effect, we also show that the compact representation of the data protects users' privacy for free by providing k-anonymity and l-diversity. Our extensive evaluation of the resulting approach on several realistic datasets shows that our approach delivers speedups of up to 78.9% compared to the use of raw data while only incurring a negligible to moderate loss in terms of KNN quality. To convey the practical value of such a scheme, we apply it to item recommendation, and show that the loss in recommendation quality is negligible.
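As a rough illustration of how a compact binary sketch can approximate Jaccard's index, the snippet below hashes each user's items into a fixed-width bit vector and estimates similarity from bitwise popcounts. The width, hash function, and encoding are assumptions of this sketch and may differ from the actual GoldFinger design.

```python
import hashlib

B = 1024  # fingerprint width in bits (illustrative choice)

def fingerprint(items):
    """Hash each item of a profile into a fixed-width bit vector."""
    bits = 0
    for it in items:
        h = int(hashlib.md5(str(it).encode()).hexdigest(), 16) % B
        bits |= 1 << h
    return bits

def approx_jaccard(fp_a, fp_b):
    """Estimate Jaccard similarity from the two fingerprints alone."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 0.0

fa = fingerprint({"item1", "item2", "item3"})
fb = fingerprint({"item2", "item3", "item4"})
print(approx_jaccard(fa, fb))  # approximates the true Jaccard of 2/4; collisions matter only for large sets
```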
Samed Atouati (Télécom Paristech), Xiao Lu (BNP Paribas Asset Management) and Mauro Sozio (Télécom ParisTech).
Abstract
Social network users often express their discontent with a product or a service from a company. Such a reaction is more pronounced in the aftermath of a corporate scandal such as a corruption scandal or food poisoning in a chain restaurant. In our work, we focus on identifying negative purchase intent in a tweet, i.e. the intent of a user not to purchase any product or consume any service from a company. We develop a binary classifier for such a task, which consists of a generalization of logistic regression leveraging the locality of purchase intent in posts from Twitter. We conduct an extensive experimental evaluation against state-of-the-art approaches on a large collection of tweets, showing the effectiveness of our approach in terms of F1 score. We also provide some preliminary results on which kinds of corporate scandals might affect the purchase intent of customers the most.
Huda Nassar (Stanford University), David Gleich (Purdue University), Austin Benson (Cornell University), Shweta Jain (University of California Santa Cruz) and Caitlin Kennedy (Purdue University).
Abstract
In the simplest setting, graph visualization is the problem of producing a set of two-dimensional coordinates for each node that meaningfully shows connections and communities in a graph. Among other uses, having a meaningful layout is often useful to help interpret the results from network science tasks such as community detection and link prediction. There are several existing graph visualization techniques in the literature that are based on spectral methods, graph embeddings, or optimizing graph distances. Despite the large number of methods, it is still often challenging or extremely time consuming to produce meaningful layouts of graphs with hundreds of thousands of vertices. Existing methods often either fail to produce a visualization in a meaningful time window, or produce a layout colorfully called a "hairball", which looks like a filled ellipse with small hairs emerging that does not illustrate any internal structure in the graph. Here, we show that adding higher-order information based on cliques to a classic eigenvector-based graph visualization technique enables it to produce meaningful plots of large graphs. We further evaluate these visualizations along a number of graph visualization metrics and find that our method outperforms existing techniques on a metric that uses random walks to measure the local structure. Finally, we show many examples of how our algorithm successfully produces layouts of large networks.
User Modeling-B (2)
(UTC/GMT +8) 13:30-15:30, April, 23, Thursday
Meeting rooms are not available now
Jing Yao (Renmin University of China), Zhicheng Dou (Renmin University of China), Jun Xu (Renmin University of China) and Ji-Rong Wen (Renmin University of China).
Abstract
Personalized search improves generic ranking models by taking the user interests into consideration and returning more accurate search results to individual users. In recent years, machine learning and deep learning techniques have been successfully applied in personalized search. Most existing personalization models simply regard the search history as a static set of user behaviours and learn fixed ranking strategies based on the recorded data. Though improvements have been observed, it is obvious that these methods ignore the dynamic nature of the search process: search is a sequence of interactions between the search engine and the user. During the search process, the user interests may dynamically change. It would be more helpful if a personalized search model could track the whole interaction process and update its ranking strategy continuously. In this paper, we propose a reinforcement learning based personalization model, referred to as RLPer, to track the sequential interactions between the users and the search engine with a hierarchical Markov Decision Process (MDP). In RLPer, the model (agent) interacts with the user (environment) by returning a document list in the high-level MDP, while it samples document pairs under each query as training data to update the ranking model in the low-level MDP. Experimental results on query logs from a commercial search engine verified that our proposed model can significantly outperform state-of-the-art personalized search models.
Defu Lian (University of Science and Technology of China), Qi Liu (University of Science and Technology of China) and Enhong Chen (University of Science and Technology of China).
Abstract
As the task of predicting a personalized ranking on a set of items, item recommendation has become an important way to address information overload. Optimizing ranking loss aligns better with the ultimate goal of item recommendation, so many ranking-based methods were proposed for item recommendation, such as collaborative filtering with Bayesian Personalized Ranking (BPR) loss, and Weighted Approximate-Rank Pairwise (WARP) loss. However, the ranking-based methods cannot consistently beat regression-based models with the gravity regularizer. The key challenge in ranking-based optimization is that it is difficult to fully use the limited number of negative samples, particularly when they are not so informative. To this end, we propose a new ranking loss based on importance sampling so that more informative negative samples can be better used. We then design a series of negative samplers from simple to complex, whose negative samples range from less to more informative. With these samplers, the loss function is very easy to use and can be optimized by popular solvers. The proposed algorithms are evaluated with five real-world datasets of varying size and difficulty. The results show that they consistently outperform the state-of-the-art item recommendation algorithms, and the relative improvements with respect to NDCG@50 are more than 19.2% on average. Moreover, the loss function is verified to make better use of negative samples and to require fewer negative samples when they are more informative.
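The following toy function illustrates the general idea behind importance-sampled ranking losses: negatives drawn from a proposal distribution are re-weighted by their proposal probabilities so that informative (high-scoring) negatives contribute more to the softmax denominator. It is a generic sketch, not the paper's exact loss or samplers.

```python
import numpy as np

def importance_weighted_rank_loss(pos_score, neg_scores, proposal_probs):
    """Softmax-style ranking loss where the denominator over all items is estimated
    from sampled negatives via importance weights w = exp(score) / q(item)."""
    w = np.exp(neg_scores) / proposal_probs          # importance weights of sampled negatives
    denom = np.exp(pos_score) + np.sum(w) / len(w)   # Monte Carlo estimate of the full denominator
    return -np.log(np.exp(pos_score) / denom)

print(importance_weighted_rank_loss(
    pos_score=2.0,
    neg_scores=np.array([0.5, 1.5, -1.0]),
    proposal_probs=np.array([0.2, 0.1, 0.7])))
```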
Feiyang Pan (Institute of Computing Technology, Chinese Academy of Sciences), Xiang Ao (Institute of Computing Technology, Chinese Academy of Sciences), Pingzhong Tang (Tsinghua University), Min Lu (Tencent), Dapeng Liu (Tencent), Lei Xiao (Tencent) and Qing He (Institute of Computing Technology, CAS).
Abstract
It is often observed that the probabilistic predictions given by a machine learning system can disagree with averaged actual outcomes on specific subsets of data, which is also known as the issue of miscalibration. It is responsible for the unreliability of practical machine learning systems. For example, in an online advertising system, an ad can receive a click-through rate prediction of 0.1 over some population of users where its actual click rate is 0.15. In such cases, the probabilistic predictions have to be fixed before deployment. In this paper, we first introduce an evaluation metric for calibration, coined field-level calibration error, that measures bias in predictions over the input fields that concern the decision-maker. We show that existing post-hoc calibration methods have limited improvements in the new field-level metric and completely fail to improve other non-calibration metrics such as the AUC score. To this end, we propose Neural Calibration, a simple yet powerful post-hoc calibration method that learns to calibrate by making full use of the field-aware information over the validation set. We present extensive experiments on five large-scale datasets, including a default prediction dataset, an insurance dataset, and three user response prediction tasks for advertising. The results showed that Neural Calibration significantly improves against uncalibrated predictions in common metrics, including the negative log-likelihood, Brier score, AUC, as well as the field-level calibration error, and consistently outperforms existing methods.
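A minimal sketch of a field-level calibration metric in the spirit described above: group predictions by a chosen input field and average the per-group gap between mean prediction and mean outcome, weighted by group size. The exact definition and weighting in the paper may differ.

```python
import numpy as np
from collections import defaultdict

def field_level_calibration_error(preds, labels, field_values):
    """Size-weighted average |mean(prediction) - mean(label)| per field value."""
    groups = defaultdict(list)
    for p, y, f in zip(preds, labels, field_values):
        groups[f].append((p, y))
    n = len(preds)
    err = 0.0
    for rows in groups.values():
        p_mean = np.mean([p for p, _ in rows])
        y_mean = np.mean([y for _, y in rows])
        err += len(rows) / n * abs(p_mean - y_mean)
    return err

# Example: predictions are well ranked but biased for the second advertiser.
print(field_level_calibration_error(
    preds=[0.10, 0.12, 0.30, 0.28],
    labels=[0, 0, 1, 0],
    field_values=["ad_A", "ad_A", "ad_B", "ad_B"]))
```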
Rohail Syed (University of Michigan), Kevyn Collins-Thompson (University of Michigan), Paul Bennett (Microsoft Research AI), Mengqiu Teng (University of Michigan), Shane Williams (Microsoft Research AI), Wendy Tay (Independent) and Shamsi Iqbal (Microsoft Research AI).
Abstract
As AI technology advances, the opportunity to improve educational outcomes by integrating AI technology with an overall learning experience offers promise. We investigate forward-looking interactive reading experiences that leverage both automatic question generation and attention signals, such as gaze tracking, to improve short- and long-term learning outcomes. We aim to expand the known pedagogical benefits of adjunct questions to more general reading scenarios, by investigating the benefits of adjunct questions generated only after, and based on, the participant's gaze attention behavior when reading an article. We compare manually written questions and Automatic Question Generation (AQG) as potential question sources. We further investigate gaze and reading patterns indicative of low vs. high learning in both short- and long-term scenarios (one-week follow-up). We show AQG-generated adjunct questions have promise as a way to scale to a wide variety of reading material where the cost of manually curating questions may be prohibitive.
Research Tracks (6)
Web Mining-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
André Greiner-Petter (University of Wuppertal), Moritz Schubotz (University of Wuppertal), Fabian Müller (FIZ Karlsruhe), Corinna Breitinger (University of Konstanz), Howard Cohl (National Institute of Standards and Technology), Akiko Aizawa (National Institute of Informatics) and Bela Gipp (University of Wuppertal).
Abstract
Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open-access arXiv (2.5B mathematical objects) and the mathematical reviewing service zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use cases, for example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (i.e., linking $P_{n}^{(\alpha, \beta)}\!\left(x\right)$ with 'Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we make our source code and the data available.
Ramnath Kumar (BITS Pilani Hyderabad Campus), Shweta Yadav (Knoesis), Raminta Daniulaityte (Wright State University), Francois Lamy (Mahidol University), Krishnaprasad Thirunarayan (Knoesis), Usha Lokala (Knoesis) and Amit Sheth (Knoesis).
Abstract
Darknet crypto markets are online marketplaces that use cryptocurrencies (e.g., Bitcoin, Monero) and advanced encryption techniques to offer anonymity to vendors and consumers trading in illegal goods or services. The exact volume of substances advertised and sold through these crypto markets is difficult to assess, at least partially, because vendors tend to maintain multiple accounts (or Sybil accounts) within and across different crypto markets. Linking these different accounts will allow us to accurately evaluate the volume of substances advertised across the different crypto markets by each vendor. In this paper, we present a multi-view unsupervised framework (eDarkFind) that helps model vendor characteristics and facilitates Sybil account detection. We employ a multi-view learning paradigm to generalize and improve the performance by exploiting the diverse views from multiple rich sources such as BERT, stylometric, and location representation. Our model is further tailored to take advantage of domain-specific knowledge such as the Drug Abuse Ontology to take into consideration the substance information. We performed extensive experiments and demonstrated that the multiple views obtained from diverse sources can be effective in linking Sybil accounts. Our proposed eDarkFind model achieves an accuracy of 98% on three real-world datasets, which shows the generality of the approach.
Udit Paul (University of California, Santa Barbara), Alexander Ermakov (University of California, Santa Barbara), Michael Nekrasov (University of California, Santa Barbara), Vivek Adarsh (University of California, Santa Barbara) and Elizabeth Belding (University of California, Santa Barbara).
Abstract
Natural disasters are increasing worldwide at an alarming rate. To aid relief operations during and post disaster, humanitarian organizations rely on various types of situational information such as missing, trapped or injured people and damaged infrastructure in an area. Crucial and timely identification of infrastructure and utility damage is critical to properly plan and execute search and rescue operations. However, in the wake of natural disasters, real-time identification of this information becomes challenging. This research investigates the use of tweets posted on the Twitter social media platform to detect power and communication outages during natural disasters. We first curate a data set of 18,097 tweets based on domain-specific keywords obtained using Latent Dirichlet Allocation. We annotate the gathered data set to separate the tweets into different types of outage-related events: power outage, communication outage, and combined power and communication outage. We analyze the tweets to identify information such as popular words, length of words and hashtags, as well as sentiments that are associated with tweets in these outage-related categories. Furthermore, we apply machine learning algorithms to classify these tweets into their respective categories. Our results show that simple classifiers such as the boosting algorithm are able to separate outage-related tweets from unrelated tweets with close to 100% F1-score. Additionally, we observe that the transfer learning model, BERT, is able to classify different categories of outage-related tweets with close to 90% accuracy in less than 90 seconds of training and testing time, demonstrating that tweets can be mined in real time to assist first responders during natural disasters.
Simone Filice (Amazon), Nachshon Cohen (Amazon) and David Carmel (Amazon).
Abstract
Community Question Answering (CQA) websites, such as Stack Exchange or Quora, allow users to freely ask questions and obtain answers from other users, i.e., the community. Personal assistants, such as Amazon Alexa or Google Home, can also exploit CQA data to answer a broader range of questions and increase customers' engagement. However, the voice-based interaction poses new challenges to the Question Answering scenario. Even assuming that we are able to retrieve a previously asked question that perfectly matches the user's query, we cannot simply read its answer to the user. A major limitation is the answer length. Reading these answers to the user is cumbersome and boring. Furthermore, many answers contain non-voice-friendly parts, such as images or URLs. In this paper, we define the Answer Reformulation task and propose a novel solution to automatically reformulate a community-provided answer, making it suitable for a voice interaction. Results on a manually annotated dataset extracted from Stack Exchange show that our models improve strong baselines.
Mingxi Cheng (University of Southern California), Shahin Nazarian (University of Southern California) and Paul Bogdan (University of Southern California).
Abstract
Social media have evolved to be popular and applicable to almost every aspect of our lives. The convenience of posting online not only benefits individual users, but also fosters various fast-spreading rumors. The rapid and wide expansion of rumors may cause persistent adverse impacts. Researchers therefore put great effort into reducing the negative impacts of rumors. A rumor classification system is designed to detect, track, and verify rumors on social media. It typically includes four components, namely a rumor detector, a rumor tracker, a stance classifier and a veracity classifier. Prior works tackled some of the components either individually or jointly. An efficient, high-performance framework that can realize all four functions is greatly needed. To address this, we propose VRoC, a tweet-level variational autoencoder based rumor classification system. VRoC includes a co-train engine that trains variational autoencoders (VAEs) and rumor classification components. This helps the VAEs to tune their latent representations to be classifier-friendly. We also show that VRoC is able to classify unseen rumors with high levels of accuracy. On the PHEME dataset, VRoC consistently outperforms several state-of-the-art techniques, on both observed and unobserved rumors, by up to 26.9% in terms of macro-F1 scores.
Social Network-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Zi Chen (ECNU), Long Yuan (Nanjing University of Science and Technology), Xuemin Lin (UNSW), Lu Qin (UTS) and Jianye Yang (Hunan University).
Abstract
Clique is one of the most fundamental models for cohesive subgraph mining in network analysis. The existing clique model mainly focuses on unsigned networks. In the real world, however, many applications are modeled as signed networks with positive and negative edges. As signed networks hold their own properties different from unsigned networks, the existing clique model is inapplicable to signed networks. Motivated by this, we propose the balanced clique model that considers the most fundamental and dominant theory for signed networks, structural balance theory, and study the maximal balanced clique enumeration problem, which computes all the maximal balanced cliques in a given signed network. We show that the maximal balanced clique enumeration problem is NP-hard. A straightforward solution for the maximal balanced clique enumeration problem is to treat the signed network as two unsigned networks and leverage the off-the-shelf techniques for unsigned networks. However, such a solution is inefficient for large signed networks. To address this problem, in this paper, we first propose a new maximal balanced clique enumeration algorithm by exploiting the unique properties of signed networks. Based on the new proposed algorithm, we devise two optimization strategies to further improve the efficiency of the enumeration. We conduct extensive experiments on large real and synthetic datasets. The experimental results demonstrate the efficiency, effectiveness and scalability of our proposed algorithms.
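For readers unfamiliar with structural balance, the naive checker below tests whether a given vertex set forms a balanced clique in a signed graph: every pair must be connected, and the vertices must split into two camps with positive edges inside a camp and negative edges across camps. This is for intuition only; the paper's contribution is efficiently enumerating all maximal balanced cliques, which this sketch does not attempt.

```python
from itertools import combinations

def is_balanced_clique(nodes, sign):
    """sign[(u, v)] is +1 or -1 for each undirected signed edge."""
    def edge_sign(u, v):
        return sign.get((u, v), sign.get((v, u)))

    nodes = list(nodes)
    # 1) Every pair must be connected, otherwise it is not even a clique.
    for u, v in combinations(nodes, 2):
        if edge_sign(u, v) is None:
            return False
    # 2) Force a two-camp assignment from the first node and check consistency:
    #    positive edges keep the camp, negative edges flip it.
    camp = {nodes[0]: 0}
    for u in nodes:
        if u not in camp:
            continue
        for v in nodes:
            if u == v:
                continue
            expected = camp[u] if edge_sign(u, v) > 0 else 1 - camp[u]
            if v in camp and camp[v] != expected:
                return False
            camp[v] = expected
    return True

sign = {("a", "b"): 1, ("a", "c"): -1, ("b", "c"): -1}
print(is_balanced_clique({"a", "b", "c"}, sign))   # True: camps {a, b} vs {c}
```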
Ilya Amburg (Cornell University), Nate Veldt (Cornell University) and Austin Benson (Cornell University).
Abstract
Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called "higher-order interactions" that involve more than two nodes at a time. This brings a need for methods that can meaningfully analyze data with such rich structure. Here, we develop a scalable computational framework for the fundamental problem of clustering graphs with categorical edge labels, targeting the setting where clusters correspond to groups of nodes that tend to participate in the same type or category of interaction. Our approach seamlessly generalizes to hypergraphs, enabling analysis of higher-order interactions with categorical hyperedges, and our objective functions can be optimized in polynomial time in the special case of two categorical labels. Although minimizing our objective becomes NP-hard in the multi-label case, we develop effective approximation algorithms based on linear programming and multiway-cut techniques. We show that our algorithms readily outperform competitive baselines on both synthetic and real-world data.
Digvijay Boob (Georgia Institute of Technology), Yu Gao (Georgia Institute of Technology), Richard Peng (Georgia Institute of Technology), Saurabh Sawlani (Georgia Institute of Technology), Charalampos Tsourakakis (Boston University), Di Wang (Georgia Institute of Technology) and Junxing Wang (CMU).
Abstract
The problem of finding dense components of a graph is a major primitive in graph mining and data analysis. The densest subgraph problem (DSP), which asks to find a subgraph with maximum average degree, forms a basic primitive in dense subgraph discovery with applications ranging from community detection to unsupervised discovery of biological network modules [gionis2015dense]. The DSP is exactly solvable in polynomial time using maximum flows [goldberg1984finding, gallo1989fast, khuller2009finding]. Due to the high computational cost of maximum flows, Charikar's greedy approximation algorithm is usually preferred in practice due to its linear time and linear space complexity [asahiro2000greedily, charikar2000greedy]. It constitutes a key algorithmic idea in scalable solutions for large-scale dynamic graphs [bahmani2012densest, bhattacharya2015space]. However, its output density can be a factor 2 off the optimal solution. In this paper we design Greedy++, an iterative peeling algorithm that improves upon the performance of Charikar's greedy algorithm significantly. Our iterative greedy algorithm is able to output near-optimal and optimal solutions fast by adding a few more passes to Charikar's greedy algorithm. Furthermore, Greedy++ is more robust against the structural heterogeneities (e.g., skewed degree distributions) in real-world datasets. An additional property of our algorithm is that it is able to assess quickly, without computing maximum flows, whether Charikar's approximation quality on a given graph instance is closer to the worst-case theoretical guarantee of 1/2 or to optimality. We also demonstrate that our method has a significant efficiency advantage over the maximum-flow-based exact optimal algorithm. For example, our algorithm achieves a roughly 145x speedup on average across a variety of real-world graphs while finding subgraphs that are at least 90% as dense as the densest subgraph.
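To give a sense of the peeling primitive that Greedy++ builds on, here is a sketch of Charikar-style greedy peeling: it tracks the best average degree seen while repeatedly removing a minimum-degree vertex. Greedy++ itself, per the abstract, repeats such passes (with accumulated per-vertex loads), which is omitted in this sketch.

```python
def peel_densest(adj):
    """Greedy peeling for the densest subgraph (average-degree density).
    adj: dict mapping each vertex to the set of its neighbors (undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}          # local mutable copy
    edges = sum(len(ns) for ns in adj.values()) // 2
    best_density, best_set = 0.0, set(adj)
    while adj:
        density = edges / len(adj)                       # current average degree / 2... density = |E|/|V|
        if density > best_density:
            best_density, best_set = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))          # peel a minimum-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        edges -= len(adj[v])
        del adj[v]
    return best_set, best_density

# K4 on {1,2,3,4} plus a pendant vertex 5: peeling recovers the K4 with density 1.5.
graph = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(peel_densest(graph))
```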
Se-eun Yoon (Korea Advanced Institute of Science and Technology), Hyungseok Song (Korea Advanced Institute of Science and Technology), Kijung Shin (Korea Advanced Institute of Science and Technology) and Yung Yi (Korea Advanced Institute of Science and Technology).
Abstract
Hypergraphs provide a natural way of representing interactions that occur in groups. Different downstream tasks and computational convenience motivate an extensive array of prior work to adopt some form of abstraction and simplification of complex higher-order group interactions in hypergraphs, showing the value of using hypergraphs in many graph tasks. However, the following question has yet to be addressed: How much abstraction of group interactions is sufficient in solving a hypergraph task, and how different are such results across datasets? This question, if properly answered, provides a useful engineering guideline on how to appropriately trade off between complexity in representation of higher-order group interactions and accuracy of solving a task involving hypergraphs. To this end, we propose a method of incrementally representing group interactions using a notion of n-projected graph whose accumulation contains the information on up to n-way interactions, and quantify the accuracy of solving a given task as n grows for various datasets. As a downstream task, we consider hyperedge prediction, an extension of link prediction, which, we believe, is a canonical task for evaluating graph models. Through extensive experiments on 15 real-world datasets, we draw the following messages: (a) Diminishing returns: small n is enough to achieve accuracy comparable with near-perfect approximations, (b) Troubleshooter: as the task becomes more challenging, higher n brings more benefit, and (c) Irreducibility: datasets whose pairwise interactions do not tell much about higher-order interactions lose much accuracy when reduced to pairwise abstractions.
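As a simple illustration of projecting group interactions down to fixed-size ones, the snippet below counts how often each n-subset of nodes co-occurs in a hyperedge. For n = 2 this is the familiar clique expansion; the paper's n-projected graphs can be thought of as accumulating such counts for sizes up to n, though the exact construction there may differ.

```python
from itertools import combinations
from collections import Counter

def projected_counts(hyperedges, n=2):
    """Count co-occurrences of every n-subset of nodes across all hyperedges."""
    counts = Counter()
    for he in hyperedges:
        for subset in combinations(sorted(he), n):
            counts[subset] += 1
    return counts

hyperedges = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b"}]
print(projected_counts(hyperedges, n=2))
# e.g. ('b', 'c') and ('a', 'b') each appear in two hyperedges.
```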
Lijun Chang (The University of Sydney) and Miao Qiao (The University of Auckland).
Abstract
In this paper, we aim to understand the distribution of the densest subgraphs of a given graph under the density notion of average-degree. We show that the structures, the relationships and the distributions of all the densest subgraphs of a graph G can be encoded in O(L) space in an index called the ds-Index. Here L denotes the maximum output size of a densest subgraph of G. More importantly, ds-Index can report all the minimal densest subgraphs of G collectively in O(L) time and can enumerate all the densest subgraphs of G with an O(L) delay. Besides, the construction of ds-Index costs no more than finding a single densest subgraph using the state-of-the-art approach. Our empirical study shows that for web-scale graphs with billions of edges, the ds-Index can be constructed in several minutes on an ordinary commercial machine.
User Modeling-A (6)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Katherine Van Koevering (Cornell University), Austin Benson (Cornell University) and Jon Kleinberg (Cornell University).
Abstract
There is inherent information captured in the order in which we write words in a list. The orderings of binomials - lists of two words separated by "and" or "or" - have been studied for more than a century. These binomials are common across many areas of speech, in both formal and informal text. In the last century, numerous explanations have been given to describe what order people use for these binomials, from differences in semantics to differences in phonology. These rules describe primarily "frozen" binomials that exist in exactly one ordering, and have lacked large-scale trials to determine efficacy. Text in online social media such as Reddit provides a unique opportunity to study these lists in the context of informal text at a very large scale. In this work, we expand the view of binomials to include a large-scale, quantitative analysis of both frozen and non-frozen binomials. Using this data we then demonstrate that most previously proposed rules are ineffective at predicting binomial ordering. By tracking the order of these binomials across time and communities we are able to establish additional, unexplored dimensions central to these predictions and demonstrate the global structure of the binomials across communities. Expanding beyond the question of individual binomials, we then explore the global structure of binomials in various communities, establishing a new model for these lists and analyzing this structure for non-frozen and frozen binomials. Additionally, a novel analysis of trinomials - lists of length three - suggests that none of the analysis of binomials applies in these cases. Finally, we demonstrate how large data sets gleaned from the web can be used in conjunction with older theories and work to expand and improve on old questions.
Xinyan Zhao (University of Science and Technology of China), Feng Xiao (University of Science and Technology of China), Haoming Zhong (WeBank.com), Jun Yao (WeBank.com) and Huanhuan Chen (University of Science and Technology of China).
Abstract
The study of question answering has received increasing attention in recent years. This work focuses on providing an answer that is compatible with both the user intent and the conditioning information corresponding to the question, such as delivery status and stock information in e-commerce. However, these conditions may be wrong or incomplete in real-world applications. Although existing question answering systems have considered external information, such as categorical attributes and triples in a knowledge base, they all assume that the external information is correct and complete. To alleviate the effect of defective condition values, this paper proposes the condition-aware and revise Transformer (CAR-Transformer). CAR-Transformer (1) revises each condition value based on the whole conversation and the original condition values, and (2) encodes the revised conditions and utilizes the condition embeddings to select an answer. Experimental results on a real-world customer service dataset demonstrate that the CAR-Transformer can still select an appropriate reply when conditions corresponding to the question have wrong or missing values, and substantially outperforms baseline models on automatic and human evaluations. The proposed CAR-Transformer can be extended to other NLP tasks which need to consider conditioning information.
Xiaoying Zhang (The Chinese University of Hong Kong), Hong Xie (College of Computer Science, Chongqing University), Hang Li (Bytedance Inc.) and John C.S. Lui (The Chinese University of Hong Kong).
Abstract
Contextual bandit algorithms provide principled online learning solutions to balance the exploration-exploitation trade-off in various applications such as recommender systems. However, the learning speed of the traditional contextual bandit algorithms is often slow due to the need for extensive exploration. This poses a critical issue in applications like recommender systems, since users may need to provide feedback on a lot of uninteresting items. To accelerate the learning speed, we generalize the contextual bandit to the conversational contextual bandit. The conversational contextual bandit leverages not only behavioral feedback on arms (e.g., articles in news recommendation), but also occasional conversational feedback on key-terms from the user. Here, a key-term can relate to a subset of arms, for example, a category of articles in news recommendation. We design a new bandit algorithm, which we call the Conversational UCB algorithm (ConUCB), to address two challenges in the conversational contextual bandit: (1) which key-terms to select to conduct conversation; (2) how to leverage conversational feedback to accelerate the speed of bandit learning. We theoretically prove that ConUCB can achieve a smaller regret upper bound than the traditional contextual bandit algorithm LinUCB, which implies a faster learning speed. Experiments on synthetic data, as well as real datasets from Yelp and Toutiao, demonstrate the efficacy of the ConUCB algorithm.
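For context, the sketch below implements the standard LinUCB baseline that the abstract compares against (not ConUCB itself, which additionally selects key-terms and incorporates conversational feedback). The feature dimension and exploration parameter are illustrative.

```python
import numpy as np

class LinUCB:
    """Linear contextual bandit with an upper-confidence exploration bonus."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)       # ridge-regularized design matrix
        self.b = np.zeros(dim)     # accumulated reward-weighted features
        self.alpha = alpha         # exploration strength

    def select(self, arm_features):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

bandit = LinUCB(dim=5)
arms = [np.random.rand(5) for _ in range(10)]   # one feature vector per candidate arm
chosen = bandit.select(arms)
bandit.update(arms[chosen], reward=1.0)          # observed behavioral feedback
```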
Liu Yang (Google; University of Massachusetts Amherst), Minghui Qiu (Alibaba), Chen Qu (University of Massachusetts Amherst), Cen Chen (Ant Financial Services Group), Jiafeng Guo (Institute of Computing Technology), Yongfeng Zhang (Rutgers University), Bruce Croft (University of Massachusetts Amherst) and Haiqing Chen (Alibaba Group).
Abstract
Personal assistant systems, such as Apple Siri, Google Now, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used. Understanding user intent, such as clarification questions, potential answers and user feedback in information-seeking conversations, is critical for retrieving good responses. In this paper, we analyze user intent patterns in information-seeking conversations and propose an intent-aware neural response ranking model, IART, which stands for "Intent-Aware Ranking with Transformers". IART is built on top of the integration of user intent modeling and the recent breakthroughs in language representation learning with the Transformer architecture, which relies entirely on a self-attention mechanism instead of recurrent nets. It incorporates intent-aware utterance attention to derive an importance weighting scheme over utterances in the conversation context, with the aim of better conversation history understanding. We conduct extensive experiments with three information-seeking conversation data sets, including both standard benchmarks and commercial data. Our proposed model outperforms all baseline methods with respect to a variety of metrics. We also perform case studies and analysis of learned user intent and its impact on response ranking in information-seeking conversations to provide interpretation of results. We will open source the code of our model.
Huiqiang Mao (Alibaba Group), Yanzhi Li (City University of Hong Kong), Chenliang Li (Wuhan University), Di Chen (Alibaba Group), Xiaoqing Wang (Alibaba Group) and Yuming Deng (Alibaba Group).
Abstract
The presence or absence of one item in a recommendation list will affect the demand for other items, because customers are often willing to switch to other items if their most preferred items are not available. This cross-item influence, called “peer effects”, has been largely ignored in the literature. In this paper, we develop a peer-aware recommender system, named PARS. We apply a ranking-based choice model to capture the cross-item influence and solve the resultant MaxMin problem with a decomposition algorithm. The MaxMin model solves for the recommendation decision while estimating users’ preferences towards the items, which yields high-quality recommendations that are robust to input data variation, as our theoretical analysis shows. Experimental results illustrate that PARS outperforms a few frequently used methods in practice. An online evaluation with a flash sales scenario at Taobao also shows that PARS delivers significant improvements in terms of both conversion rates and user value.
Crowdsourcing (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Yao Zhou (University of Illinois at Urbana-Champaign), Arun Reddy Nelakurthi (Samsung Electronics), Ross Maciejewski (Arizona State University), Wei Fan (Tencent America) and Jingrui He (University of Illinois at Urbana-Champaign).
Abstract
The need for annotated labels to train machine learning models led to a surge in crowdsourcing. Given a noisy labeled set, how can we leverage the label information obtained from amateur crowd workers to denoise the data? Also, is it possible to teach the crowd workers using the noisy labeled set and improve their performance? In this paper, we answer both questions via a novel adaptive and interactive teaching framework, which uses visual explanations to simultaneously teach and gauge the confidence level of the crowd workers. In particular, the teacher performs teaching using an empirical risk minimizer learned from a noisy labeled set; the workers are assumed to have a forgetting behavior and their learning rate depends on the interpretation difficulty of the teaching item. Furthermore, we also show that the empirical risk minimizer used by the teacher is a reliable and realistic substitute for the unknown target concept by utilizing the unbiased surrogate loss. Finally, the performance of the proposed framework is demonstrated through experiments on multiple real-world image and text data sets.
Alexander Braylan (The University of Texas at Austin) and Matthew Lease (The University of Texas at Austin).
Abstract
Modeling annotators and their labels is useful for ensuring data quality. However, while many models have been proposed to handle binary or categorical labels, prior methods do not generalize to complex annotation tasks (e.g., open-ended text, multivariate, or structured responses) without devising new models for each specific task. To obviate the need for task-specific modeling, we propose to model distances between labels, rather than the labels themselves. Our models are agnostic to the distance function; we leave it to the requesters to specify an appropriate distance function for their given annotation task. We propose three models, including a Bayesian hierarchical extension of multidimensional scaling. Results show the generality and effectiveness of our models across four diverse, complex annotation tasks: sequence labeling, translation, syntactic parsing, and element ranking.
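A deliberately simple illustration of working with label distances rather than labels themselves: given a requester-supplied distance function, pick for each item the worker label that is closest on average to the others. This is only a stand-in for intuition; the paper's models (e.g., the Bayesian multidimensional-scaling extension) additionally model annotator reliability rather than treating all workers equally.

```python
def consensus_by_distance(labels, dist):
    """labels: {item: {worker: label}}; dist: requester-supplied distance on labels.
    Returns the label per item with the smallest average distance to the others."""
    best = {}
    for item, worker_labels in labels.items():
        vals = list(worker_labels.values())
        best[item] = min(vals, key=lambda a: sum(dist(a, b) for b in vals) / len(vals))
    return best

# Toy example with open-ended text labels and a token-overlap (Jaccard) distance.
def token_distance(a, b):
    ta, tb = set(a.split()), set(b.split())
    return 1 - len(ta & tb) / len(ta | tb)

labels = {"q1": {"w1": "red car", "w2": "a red car", "w3": "blue bus"}}
print(consensus_by_distance(labels, token_distance))   # {'q1': 'red car'}
```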
Mingzhou Yang (Xi'an Jiaotong University), Yanhua Li (Worcester Polytechnic Institute), Xun Zhou (The University of Iowa), Hui Lu (Guangzhou University), Zhihong Tian (Guangzhou University) and Jun Luo (Lenovo Research, Hong Kong).
Abstract
Public transports, such as subway lines and buses, offer affordable ride-sharing services and reduce road network traffic. Extracting people's preferences from their public transit choices is non-trivial. When people travel by public transit, they make sequences of transit choices, and their rewards are usually influenced by other people's choices, so this process can be seen as a Markov Game (MG). In this paper, we make the first effort to model travelers' preferences in making transit choices using MGs. Based on the observation that passengers usually do not change their policies, we propose novel algorithms to extract the reward functions from the observed, deterministic equilibrium joint policy of all agents in a general-sum MG to infer travelers' preferences. First, we assume we have access to the entire joint policy. We characterize the set of all reward functions for which the given joint policy is a Nash equilibrium policy. In order to remove the degeneracy of the solution, we then attempt to pick reward functions so as to maximize the deviation from the observed policy to the sub-optimal policy of each agent. This results in an efficiently solvable linear programming algorithm for the multi-agent inverse reinforcement learning (MA-IRL) problem. Then, we deal with the case where we have access to the equilibrium joint policy through an actual trajectory, and propose an iterative algorithm inspired by single-agent apprenticeship learning algorithms and the cyclic coordinate descent approach. We validate our algorithms using a simple discrete problem. Finally, under the assumption that the actual joint policy is a Nash equilibrium and the passengers' reward functions are linear in the decision-making features, we use the proposed algorithms on a unique real-world dataset (from Shenzhen, China) to extract passengers' preferences.
Health (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Rhys Biddle (University of Technology, Sydney), Aditya Joshi (CSIRO), Shaowu Liu (University of Technology, Sydney), Cecile Paris (CSIRO) and Guandong Xu (University of Technology, Sydney).
Abstract
Harnessing data from social media to monitor health events is a promising avenue for public health surveillance. A key step is the detection of reports of a disease (referred to as 'health mention classification') amongst tweets that mention disease words. Prior work shows that figurative usage of disease words may prove to be challenging for health mention classification. Since the experience of a disease is associated with a negative sentiment, we present a method that utilises sentiment information to improve health mention classification. Specifically, our classifier for health mention classification combines pre-trained contextual word representations with sentiment distributions of words in the tweet. For our experiments, we extend a benchmark dataset of tweets for health mention classification by adding over 14k manually annotated tweets across existing and new diseases. We also additionally annotate each tweet with a label that indicates if the disease words are used in a figurative sense. Our classifier outperforms current SOTA approaches in detecting both health-related and figurative tweets that mention disease words. We also show that disease words in tweets are used figuratively more often than in a health-related context, proving to be challenging for classifiers targeting health-related tweets.
Payam Karisani (Emory University), Eugene Agichtein (Emory University) and Joyce Ho (Emory University).
Abstract
Mining social media content for tasks such as detecting personal experiences or events suffers from lexical sparsity, insufficient training data, and inventive lexicons. To reduce the burden of creating extensive labeled data and improve classification performance, we propose to perform these tasks in two steps: 1. Decomposing the task into domain-specific sub-tasks by identifying key concepts, thus utilizing human domain understanding; and 2. Combining the results of learners for each key concept using co-training to reduce the requirements for labeled training data. We empirically show the effectiveness and generality of our approach, Co-Decomp, using three representative social media mining tasks, namely Personal Health Mention detection, Crisis Report detection, and Adverse Drug Reaction monitoring. The experiments show that our model is able to outperform the state-of-the-art text classification models, including those using the recently introduced BERT model, when small amounts of training data are available.
Taisa Kushner (University of Colorado - Boulder) and Amit Sharma (Microsoft).
Abstract
Recent years have seen a rise in technology-based platforms for mental health, in particular social media platforms which seek to provide peer-to-peer support to individuals suffering from mental distress. Studies on the impact of these platforms have historically tracked interactions on a single-post thread, or longitudinally over months or years of usage; however, it is often not clear how an individual's mental health changes across this time. We show a unique characteristic of activity on one such mental health platform, Talklife: people engage on this platform in "bursts" and "breaks" of activity, similar to online search behavior for health. We formalize the notion of bursts based on the median activity of each user and propose bursts as a natural unit of analysis for tracking and understanding change in psychosocial well-being in an online mental health community. We then study the characteristics of a burst which lead to positive outcomes for an individual, based on a definition of positive cognitive change. We find that users who undergo a positive cognitive change over a burst of activity are more likely to engage with others at a higher rate through posting replies on others' posts, participate in increased complex support and lower simple support when replying to others, and have increased post diversity while maintaining similarity between the categories they post replies and original posts in. We also study how a user's behavior changes before and after they experience a moment of change. Lastly, features which correlate to users experiencing moments of cognitive change are robustly tested against self-reported changes in mood to determine two actionable suggestions for improving user experience: persistence within a burst, and giving complex emotional support to others. This work has implications for how we think about user interactions with online mental health platforms, user churn, and retention.
Wen Wang (Carnegie Mellon University), Han Zhao (Carnegie Mellon University), Honglei Zhuang (University of Illinois at Urbana-Champaign), Rema Padman (Carnegie Mellon University) and Nirav Shah (NorthShore University HealthSystem and University of Chicago Pritzker School of Medicine).
Abstract
Early identification of patients at risk for postoperative complications can facilitate timely workups and treatments and improve health outcomes. Currently, a widely-used surgical risk calculator online web system developed by the American College of Surgeons (ACS) uses patients' static features, e.g. gender, age, to assess the risk of postoperative complications. However, the most crucial signals that reflect the actual postoperative physical conditions of patients are usually real-time dynamic signals, including the vital signs of patients (e.g., heart rate, blood pressure) collected from postoperative monitoring. In this paper, we develop a dynamic postoperative complication risk scoring framework (DyCRS) to detect "at-risk" patients in real time based on postoperative sequential vital signs and static features. DyCRS is based on adaptations of the Hidden Markov Model (HMM) that capture hidden states as well as observable states to generate a real-time, probabilistic complication risk score. Evaluating our model using electronic health records (EHR) for elective colectomy surgery from a major health system, we show that DyCRS significantly outperforms the state-of-the-art ACS calculator and real-time predictors, with a 50.16% gain in area under the precision-recall curve (AUCPRC) on average in terms of detection effectiveness. In terms of earliness, DyCRS can predict complications 15 hours and 55 minutes earlier on average than clinicians' diagnoses, with a recall of 60% and a precision of 55%. Furthermore, DyCRS can extract interpretable patient stages, which are consistent with previous medical postoperative complication studies. We believe that our contributions demonstrate significant promise for developing a more accurate, robust and interpretable postoperative complication risk scoring system, which can benefit more than 50 million annual surgeries in the US by substantially lowering adverse events and healthcare costs.
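DyCRS itself is not released with this abstract; the fragment below is only a minimal sketch of the underlying building block, an HMM-style forward filter that turns a stream of vital-sign readings into a running probability of being in an "at-risk" hidden state. The two-state transition matrix, Gaussian emission parameters, and synthetic heart-rate series are all placeholders, not values from the paper.

# Toy two-state (stable / at-risk) forward filter over a stream of vital signs.
# All parameters below are illustrative placeholders, not values estimated by DyCRS.
import numpy as np
from scipy.stats import norm

A = np.array([[0.95, 0.05],        # transition probabilities: stable  -> {stable, at-risk}
              [0.10, 0.90]])       #                            at-risk -> {stable, at-risk}
pi = np.array([0.99, 0.01])        # initial state distribution
means = np.array([75.0, 110.0])    # Gaussian emission means (e.g., heart rate per state)
stds = np.array([8.0, 15.0])       # Gaussian emission standard deviations

def risk_scores(observations):
    """Return P(at-risk | observations so far) after each new measurement."""
    belief = pi.copy()
    scores = []
    for obs in observations:
        likelihood = norm.pdf(obs, means, stds)   # emission likelihood under each hidden state
        belief = (belief @ A) * likelihood        # predict one step, then update with the observation
        belief /= belief.sum()                    # renormalize to a probability distribution
        scores.append(belief[1])
    return scores

heart_rate = [78, 80, 85, 95, 108, 115, 120]      # synthetic postoperative readings
print([round(s, 3) for s in risk_scores(heart_rate)])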
Economics (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Natã M. Barbosa (University of Illinois at Urbana-Champaign), Emily Sun (Airbnb Inc.), Judd Antin (Airbnb Inc.) and Paolo Parigi (Airbnb Inc.).
Abstract
Trust is a fundamental prerequisite in the growth and sustainability of sharing economy platforms. Many of such platforms rely on transactions that require trust actions to take place, such as entering a stranger's car or sleeping at a stranger's place. For this reason, understanding, measuring, and tracking trust can be of great benefit to such platforms, enabling them to identify trust behaviors, both online and offline, and identify groups which may benefit from trust-building interventions. In this work, we present the design and evaluation of a behavioral framework to measure a user's propensity to trust others on a sharing economy platform. We conducted an online experiment with 4,499 Airbnb users in the form of an investment game in order to capture users' propensity to trust other users on Airbnb. Then, we used the experimental data to generate both explanatory and predictive models of trust propensity. Our contribution is a framework that can be used to measure trust propensity in sharing economy platforms like Airbnb via online and offline signals. We discuss which affordances need to be in place so that sharing economy platforms can get signals of trust, in addition to how such a framework can be used to inform design around trust in the short and long term.Linyi Yang (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland), Riuhai Dong (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland), Tin Lok James Ng (School of Mathematics and Applied Statistics, University of Wollongong, Australia) and Barry Smyth (Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland).
Abstract
The volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction one – using historical pricing as a guide to future risk forecasting – recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. This includes a comprehensive comparison to a variety of baselines, which demonstrates very significant improvements in prediction accuracy, in the range 17% - 49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy.Christof Naumzik (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
Abstract
In e-commerce, product presentations, and particularly images, are known to provide important information for user decision-making, and yet the relationship between images and prices has not been studied. To close this research gap, we suggest a tailored web mining framework, since one must quantify the relative contribution of image content in describing prices ceteris paribus. That is, one must account for the fact that such images inherently depict heterogeneous products. In order to isolate the pricing power of image content, we suggest a three-stage framework involving deep learning and statistical inference. Our empirical evaluation draws upon a comprehensive dataset of more than 20,000 real estate listings from Craigslist. We find that the image content describes a large portion of the variance in prices, even when controlling for location and common characteristics of apartments. A one-standard-deviation increase in the image variable is associated with a 14.45% increase in price. By utilizing a carefully designed instrumental variables estimation, we further set out to obtain causal estimates. Our empirical findings contribute to theory by quantifying the hedonic value of images and thus establishing a link between visual appearance and product pricing. Even though a positive relationship seems intuitive, we provide for the first time an empirical confirmation. Based on our large-scale computational study, we further yield evidence of a picture superiority effect: simply put, a beneficial image corresponds to the same price change as 2856.03 additional words in the textual description. In sum, images capture valuable information for users that goes beyond narrative explanations. As a direct implication, we aid online platforms and their users in assessing and improving the multi-modal presentation of their product offerings. Finally, we contribute to web mining by highlighting the importance of visual information.
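The listing data and learned image features are not available here, so the snippet below only sketches the final inference step on synthetic data: regress log price on a standardized image score while controlling for listing covariates, and read off the price change associated with a one-standard-deviation shift in the image variable. Variable names and the plain OLS setup are assumptions; the paper's actual framework adds a deep-learning stage and instrumental-variable estimation on top of this.

# Hedonic-style regression sketch on synthetic data (not the paper's three-stage framework).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
bedrooms = rng.integers(1, 5, n).astype(float)         # assumed listing covariates
sqft = rng.normal(900, 200, n)
image_score = rng.normal(0, 1, n)                      # stand-in for a learned image variable
log_price = 6.5 + 0.10 * bedrooms + 0.0008 * sqft + 0.14 * image_score + rng.normal(0, 0.2, n)

img_std = (image_score - image_score.mean()) / image_score.std()
X = sm.add_constant(np.column_stack([bedrooms, sqft, img_std]))
res = sm.OLS(log_price, X).fit()
beta_img = res.params[-1]                              # effect of a one-SD shift in the image variable
print(f"one-SD change in image score is associated with a {100 * (np.exp(beta_img) - 1):.1f}% price change")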
Seungbae Kim (University of California, Los Angeles), Jyun-Yu Jiang (University of California, Los Angeles), Masaki Nakada (University of California, Los Angeles), Jinyoung Han (Sungkyunkwan University) and Wei Wang (University of California, Los Angeles).
Abstract
Influencer marketing has become a key marketing method for brands in recent years. Hence, brands have been increasingly utilizing influencers' social networks to reach niche markets, and researchers have been studying various aspects of influencer marketing. However, brands often struggle to find and hire the right influencers with specific interests/topics for their marketing, due to a lack of available influencer data and/or the limited capacity of marketing agencies. This paper proposes a multimodal deep learning model that uses text and image information from social media posts (i) to classify influencers into specific interests/topics (e.g., fashion, beauty) and (ii) to classify their posts into certain categories. We use the attention mechanism to select posts that are more relevant to influencers' topics, thereby generating useful representations of influencers. We conduct experiments on data from Instagram, which is the most popular social media platform for influencer marketing. The experimental results show that our proposed model achieves 98% and 96% accuracy in classifying influencers and their posts, respectively. Our model significantly outperforms existing user profiling methods. By applying our proposed model to our dataset, which was collected over 92 days from October 1st, 2018 to January 1st, 2019, we analyze the behavior characteristics of influencers in terms of their topics, size of potential customers, and their posting behaviors. We plan to release our influencer dataset, which contains 33,935 influencers (labeled with specific topics) and their 10,180,500 posts, and can be used in future research.
Systems (2)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Adithya Kumar (The Pennsylvania State University), Iyswarya Narayanan (The Pennsylvania State University), Timothy Zhu (The Pennsylvania State University) and Anand Sivasubramaniam (The Pennsylvania State University).
Abstract
Small and medium sized enterprises use the cloud for running online, user-facing, tail latency sensitive applications with well-defined fixed monthly budgets. For these applications, adequate system capacity must be provisioned to extract maximal performance despite the challenges of uncertainties in load and request-sizes. In this paper, we address the problem of capacity provisioning under fixed budget constraints with the goal of minimizing tail latency. To tackle this problem, we propose building systems using a heterogeneous mix of low latency expensive resources and cheap resources that provide high throughput per dollar. As load changes through the day, we use more of the faster resources to reduce tail latency during low-load periods and more of the cheaper resources to handle the high-load periods. To achieve these tail latency benefits, we introduce novel heterogeneity-aware scheduling and autoscaling algorithms that are designed for minimizing tail latency. Using software prototypes and by running experiments on the public cloud, we show that our approach can outperform existing capacity provisioning systems by reducing the tail latency by as much as 45% under fixed-budget settings.
Marc Warrior (Northwestern University), Yunming Xiao (Northwestern University), Matteo Varvello (Brave Software) and Aleksandar Kuzmanovic (Northwestern University).
Abstract
Free and open source media centers are currently experiencing a boom in popularity for the convenience and flexibility they offer users seeking to remotely consume digital content. This newfound fame is matched by increasing notoriety, for their potential to serve as hubs for illegal content, and a presumably ever-increasing network footprint. It is fair to say that a complex ecosystem has developed around Kodi, composed of millions of users, thousands of "add-ons" (Kodi extensions from 3rd-party developers), and content providers. Motivated by these observations, this paper aims at conducting the first analysis of the Kodi ecosystem. Our rationale is to build some "crawling" software around Kodi which can automatically install an addon, explore its menu, and locate (video) content. This is challenging for many reasons. First, Kodi largely relies on visual information and user input, which intrinsically complicates automation. Second, no central aggregators for Kodi addons exist. Third, the potential sheer size of this ecosystem requires a highly scalable crawling solution. We address these challenges with de-Kodi, a full-fledged crawling system capable of discovering and crawling large cross-sections of Kodi's decentralized ecosystem at tunable levels of depth and breadth. With de-Kodi, we discovered and tested over 9,000 distinct Kodi addons. Our results demonstrate de-Kodi, which we make available to the general public, to be an essential asset in studying one of the largest multimedia platforms in the world. Our work further serves as the first ever transparent and repeatable analysis of the Kodi ecosystem at large.
Porter Jenkins (The Pennsylvania State University), Jennifer Zhao (Pinterest, Inc.), Heath Vinicombe (Pinterest, Inc.) and Anant Subramanian (Pinterest, Inc.).
Abstract
Understanding content at scale is a difficult but important problem for many platforms. Many previous studies focus on content understanding to optimize engagement with existing users. However, little work studies how to leverage better content understanding to attract new users. In this work, we build a framework for generating natural language content annotations and show how they can be used for search engine optimization. The proposed framework relies on an XGBoost model that labels "pins" with high-probability phrases, and a logistic regression layer that learns to rank aggregated annotations for groups of content. The pipeline identifies keywords that are descriptive and contextually meaningful. We perform a large-scale production experiment deployed on the Pinterest platform and show that natural language annotations cause a 1-2% increase in traffic from leading search engines. This increase is statistically significant. Finally, we explore and interpret the characteristics of our annotation framework.
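The production features are not described in enough detail to reproduce, so the following is only a toy sketch of the two-stage shape on synthetic data: an XGBoost classifier scores candidate (pin, phrase) pairs, and a logistic-regression layer ranks annotations aggregated over a group of content. All features, labels, and group definitions below are invented for illustration.

# Toy two-stage annotation sketch on synthetic data (feature and label choices are invented,
# not the production pipeline described in the paper).
import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1: score candidate (pin, phrase) pairs with hypothetical features
# such as phrase frequency, embedding similarity, and position in the description.
X_pairs = rng.normal(size=(5000, 6))
y_pairs = (X_pairs[:, 0] + 0.5 * X_pairs[:, 1] + rng.normal(0, 1, 5000) > 0).astype(int)
stage1 = XGBClassifier(n_estimators=100, max_depth=4)
stage1.fit(X_pairs, y_pairs)
pair_score = stage1.predict_proba(X_pairs)[:, 1]

# Stage 2: rank aggregated annotations for groups of content (e.g., boards) with a
# logistic-regression layer over simple aggregates of the stage-1 scores.
group_of_pair = rng.integers(0, 200, size=5000)
feats, targets = [], []
for g in range(200):
    idx = np.where(group_of_pair == g)[0]
    if len(idx) == 0:
        continue
    feats.append([pair_score[idx].mean(), pair_score[idx].max(), float(len(idx))])
    targets.append(int(y_pairs[idx].mean() > 0.5))
ranker = LogisticRegression().fit(np.array(feats), np.array(targets))
print("example group-level scores:", ranker.predict_proba(np.array(feats))[:3, 1].round(3))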
Xin Wang (Baidu Research), Xu Li (Baidu Research), Jinxing Yu (Baidu Research), Mingming Sun (Baidu Research) and Ping Li (Baidu Research).
Abstract
Recent years have witnessed the continuing growth of people's dependence on touchscreen devices. As a result, input speed with the onscreen keyboard has become crucial to communication efficiency and user experience. In this work, we formally discuss the general problem of input expectation prediction with a touch-screen input method editor. Taking input efficiency as the optimization target, we propose a neural end-to-end candidate generation solution that handles automatic correction, reordering, insertion, deletion as well as completion. Evaluation metrics are also discussed based on real use scenarios. For a more thorough comparison, we also provide a statistical strategy for mapping touch coordinate sequences to text input candidates. The proposed model and baselines are evaluated on a real-world dataset. The experiments show that the proposed model outperforms all the baselines.
Semantics (4)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Zheng Fang (Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences), Yanan Cao (Institute of Information Engineering, Chinese Academy of Sciences), Ren Li (Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences), Zhenyu Zhang (Institute of Information Engineering, Chinese Academy of Sciences), Yanbing Liu (Institute of Information Engineering, Chinese Academy of Sciences) and Shi Wang (Institute of Computing Technology, Chinese Academy of Sciences).
Abstract
Entity Linking (EL) is a task for mapping mentions in text to corresponding entities in a knowledge base (KB). This task usually includes candidate generation (CG) and entity disambiguation (ED) stages. Recent EL systems based on neural network models have achieved good performance, but they still face two challenges: (i) Previous studies evaluate their models without considering the differences between candidate entities. In fact, the quality (gold recall in particular) of candidate sets has an effect on the EL results. So, how to promote the quality of candidates needs more attention. (ii) In order to utilize the topical coherence among the referred entities, many graph and sequence models are proposed for collective ED. However, graph-based models treat all candidate entities equally, which may introduce much noise information. On the contrary, sequence models can only observe previous referred entities, ignoring the relevance between the current mention and its subsequent entities. To address the first problem, we propose a multi-strategy based CG method to generate high-recall candidate sets. For the second problem, we design a sequential Graph Attention Network (SeqGAT) which combines the advantages of graph and sequence methods. In our model, mentions are dealt with in a sequential manner. Given the current mention, SeqGAT dynamically encodes both its previous referred entities and subsequent ones, and assigns different importance to these entities. In this way, it not only makes full use of the topical consistency, but also reduces noise interference. We conduct experiments on different types of datasets and compare our method with previous EL systems on the open evaluation platform. The comparison results show that our model achieves significant improvements over the state-of-the-art methods.
Dongxiang Zhang (Zhejiang University), Yuyang Nie (University of Science and Technology of China), Sai Wu (Zhejiang University), Yanyan Shen (Shanghai Jiao Tong University) and Kian-Lee Tan (National University of Singapore).
Abstract
Entity matching (EM) is a classic research problem that identifies data instances referring to the same real-world entity. Recent technical trend in this area is to take advantage of deep learning (DL) to automatically extract discriminative features. DeepER and DeepMatcher have emerged as two pioneering DL models for EM. However, these two state-of-the-art solutions simply incorporate vanilla RNNs and straightforward attention mechanisms. In this paper, we fully exploit the semantic context of embedding vectors for the pair of entity text descriptions. In particular, we propose an integrated multi-context attention framework that takes into account self-attention, pair-attention and global-attention from three types of context. The idea is further extended to incorporate attribute attention in order to support structured datasets. We conduct extensive experiments with 7 benchmark datasets that are publicly accessible. The experimental results clearly establish our superiority over DeepER and DeepMatcher in all the datasets.Paolo Rosso (University of Fribourg), Dingqi Yang (eXascale Infolab, University of Fribourg,) and Philippe Cudre-Mauroux (eXascale Infolab, University of Fribourg,).
Abstract
Knowledge Graph (KG) embeddings are a powerful tool for predicting missing links in KGs. Existing embedding techniques typically represent a KG as a set of triplets, where each triplet (h, r, t) links two entities h and t through a relation r, and learn entity/relation embeddings from such triplets while preserving such a structure. However, this triplet representation oversimplifies the complex nature of the data stored in the KG, in particular for hyper-relational facts, where each fact contains not only a base triplet (h, r, t), but also the associated key-value pairs (k, v). Even though a few recent techniques tried to learn from such data by transforming a hyper-relational fact into an n-ary representation (i.e., a set of key-value pairs only without triplets), they result in suboptimal models as they are unaware of the triplet structure, which serves as the fundamental data structure in modern KGs and indeed preserves the essential information for link prediction. To address this issue, we propose HINGE, a hyper-relational KG embedding model, which directly learns from hyper-relational facts in a KG. HINGE captures not only the primary structural information of the KG encoded in the triplets, but also the correlation between each triplet and its associated key-value pairs. Our extensive evaluation shows the superiority of HINGE on various link prediction tasks over KGs. In particular, HINGE consistently outperforms not only the KG embedding methods learning from triplets only (by 0.81-41.45% depending on the link prediction tasks and settings), but also the methods learning from hyper-relational facts using the n-ary representation (by 13.2-84.1%).Emaad Manzoor (Carnegie Mellon University), Dhananjay Shrouty (Pinterest), Rui Li (Pinterest) and Jure Leskovec (Stanford).
Abstract
Curated taxonomies enhance the performance of machine-learning systems via high-quality structured knowledge. However, manually curating a large and rapidly-evolving taxonomy is infeasible. In this work, we propose Arborist, an approach to automatically expand textual taxonomies by predicting the parents of new taxonomy nodes. Unlike previous work, Arborist handles the more challenging scenario of taxonomies with heterogeneous edge semantics that are unobserved. Arborist learns latent representations of the edge semantics along with embeddings of the taxonomy nodes to measure taxonomic relatedness between node pairs. Arborist is then trained by optimizing a large-margin ranking loss with a dynamic margin function. We propose a principled formulation of the margin function, which theoretically guarantees that Arborist minimizes an upper bound on the shortest-path distance between the predicted parents and actual parents in the taxonomy. Via extensive evaluation on a curated taxonomy at Pinterest and several public datasets, we demonstrate that Arborist outperforms the state-of-the-art, achieving up to 59% in mean reciprocal rank and 83% in recall at 15. We also explore the ability of Arborist to infer nodes' taxonomic roles without explicit supervision on this task.
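The paper's exact margin function is not reproduced here; the PyTorch fragment below only sketches the general shape of a large-margin ranking loss whose margin grows with an assumed shortest-path distance between each negative candidate parent and the true parent. The bilinear scoring, margin scale, and random data are placeholders.

# Large-margin ranking loss with a distance-dependent margin (sketch; not Arborist's exact formulation).
import torch

def taxonomy_ranking_loss(score_true, score_neg, tree_dist, margin_scale=0.1):
    # score_true: (B,) scores of the true parent for each query node
    # score_neg:  (B, K) scores of K negative candidate parents
    # tree_dist:  (B, K) assumed shortest-path distance from each negative to the true parent
    margin = margin_scale * tree_dist                              # dynamic margin per negative
    hinge = torch.clamp(margin + score_neg - score_true.unsqueeze(1), min=0.0)
    return hinge.mean()

torch.manual_seed(0)
B, K, d = 32, 5, 16
query = torch.randn(B, d, requires_grad=True)         # embedding of the new node
true_parent = torch.randn(B, d)
neg_parents = torch.randn(B, K, d)
s_true = (query * true_parent).sum(-1)                # dot-product scoring (placeholder)
s_neg = torch.einsum("bd,bkd->bk", query, neg_parents)
dist = torch.randint(1, 6, (B, K)).float()
loss = taxonomy_ranking_loss(s_true, s_neg, dist)
loss.backward()
print("loss:", float(loss))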
Social Network-B (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Dawei Zhou (University of Illinois at Urbana-Champaign), Lecheng Zheng (University of Illinois at Urbana-Champaign), Jianbo Li (Three Bridges Capital), Yada Zhu (IBM) and Jingrui He (University of Illinois at Urbana-Champaign).
Abstract
Financial time series analysis plays a central role in optimizing investment decisions and hedging market risks. This is a challenging task as the problems are always accompanied by dual-level (i.e.,data-level and task-level) heterogeneity. For instance, in stock price forecasting, a successful portfolio with bounded risks usually consists of a large number of stocks from diverse domains (e.g., utility, information technology, healthcare, etc.), and forecasting stocks in each domain can be treated as one task; within a portfolio, each stock is characterized by temporal data collected from multiple modalities (e.g., finance, weather, and news), which corresponds to the data-level heterogeneity. Furthermore, the finance industry follows highly regulated processes, which require prediction models to be interpretable, and the output results to meet compliance. Therefore, a natural research question is how to build a model that can achieve satisfactory performance on such multi-modality multi-task learning problems, while being able to provide comprehensive explanations for the end-users.To answer this question, in this paper, we propose a generic time series forecasting framework named Dandelion, which leverages the consistency of multiple modalities and explores the relatedness of multiple tasks using a deep neural network. In addition, to ensure the interpretability of the framework, we integrate a novel trinity attention mechanism, which allows the end-users to investigate the variable importance over three dimensions (i.e., tasks, modality and time). Extensive empirical results demonstrate that Dandelion achieves superior performance for financial market prediction across 396 stocks from 4 different domains over the past 15 years. In particular, two interesting case studies show the efficacy of Dandelion in terms of its profitability performance, and the interpretability of output results to end-users.Lichen Jin (Peking University), Yizhou Zhang (USC, contributed mainly in PKU), Guojie Song (Peking University) and Yilun Jin (The Hong Kong University of Science and Technology).
Abstract
Recent works show that end-to-end, (semi-) supervised network embedding models can generate satisfactory vectors to represent network topology, and are even applicable to unseen graphs by inductive learning. However, domain mismatch between training and test networks for inductive learning, as well as a lack of labeled data, often compromises the outcome of such methods. To make matters worse, while transfer learning and active learning techniques, which can address these problems respectively, have been well studied on regular i.i.d. data, relatively little attention has been paid to networks. Consequently, we propose in this paper a method for active domain transfer on networks, termed active-transfer network embedding (ATNE). In ATNE we jointly consider the influence of each node on the network from the perspectives of transfer and active learning, and hence design novel and effective influence scores combining both aspects in the training process to facilitate node selection. We demonstrate that ATNE is efficient and decoupled from the actual model used. Further extensive experiments show that ATNE outperforms state-of-the-art active node selection methods and shows versatility in different situations.
Alvis Logins (Aarhus University), Yuchen Li (Singapore Management University) and Panagiotis Karras (Aarhus University).
Abstract
How can we assess the ability of a network defined in probabilistic terms to maintain its functionality under failures? Network robustness has been studied extensively in the case of deterministic networks under threats to their connectivity. However, applications such as the online diffusion of information and the behavior of the networked public raise the question of robustness in a probabilistic network. In this paper, we propose three novel robustness measures for networks hosting a stochastic diffusion process under the Independent Cascade (IC) model, which is susceptible to node failures. The outcome of such a process depends on the selection of its initiators, or seeds, by the seeder, as well as on two parameters not at the seeder's discretion: the attack strategy and the probabilistic diffusion outcome. In an abstraction, we consider three levels of seeder awareness regarding these two uncontrolled parameters, and evaluate the network's viability aggregated over all possible extents of node failures. We introduce novel algorithms from building blocks found in previous works to evaluate the proposed measures. A thorough experimental study with synthetic and real, scale-free and homogeneous networks establishes that the proposed algorithms are effective and efficient, while the proposed measures highlight differences among networks in terms of their robustness and the surprise they can furnish under attack. Last, we devise a new measure of diffusion entropy that can inform the design of probabilistically robust networks.
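The three proposed measures are defined in the paper; the snippet below only illustrates their basic building block, a Monte Carlo estimate of expected Independent Cascade spread when a random subset of nodes fails before diffusion. The graph, influence probabilities, seeds, and failure rate are arbitrary placeholders.

# Monte Carlo estimate of expected IC spread under random node failures (building-block sketch).
import random
import networkx as nx

def ic_spread(G, seeds, alive):
    """One Independent Cascade run restricted to nodes in the 'alive' set."""
    active = {s for s in seeds if s in alive}
    frontier = list(active)
    while frontier:
        new = []
        for u in frontier:
            for v in G.successors(u):
                if v in alive and v not in active and random.random() < G[u][v]["p"]:
                    active.add(v)
                    new.append(v)
        frontier = new
    return len(active)

random.seed(0)
G = nx.gnp_random_graph(200, 0.05, seed=0, directed=True)
nx.set_edge_attributes(G, 0.1, "p")        # uniform influence probabilities (assumed)
seeds, failure_rate, runs = [0, 1, 2], 0.2, 500

total = 0
for _ in range(runs):
    alive = {v for v in G.nodes if random.random() > failure_rate}   # sample node failures per run
    total += ic_spread(G, seeds, alive)
print("expected spread under node failures:", round(total / runs, 2))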
Jinyuan Jia (Duke University), Binghui Wang (Duke University), Xiaoyu Cao (Duke University) and Neil Zhenqiang Gong (Duke University).
Abstract
Community detection plays a key role in understanding graph structure. However, several recent studies showed that community detection is vulnerable to adversarial structural perturbation. In particular, via adding or removing a small number of carefully selected edges in a graph, an attacker can manipulate the detected communities. However, to the best of our knowledge, there are no studies on certifying robustness of community detection against such adversarial structural perturbation. In this work, we aim to bridge this gap. Specifically, we develop the first certified robustness guarantee of community detection against adversarial structural perturbation. Given an arbitrary community detection method, we build a new smoothed community detection method via randomly perturbing the graph structure. We theoretically show that the smoothed community detection method provably groups a given arbitrary set of nodes into the same community (or different communities) when the number of edges added/removed by an attacker is bounded. Moreover, we show that our certified robustness is tight. We also empirically evaluate our method on multiple real-world graphs with ground truth communities.User Modeling-B (3)
(UTC/GMT +8) 16:00-18:00, April, 23, Thursday
Meeting rooms are not available now
Zohreh Ovaisi (University of Illinois at Chicago), Ragib Ahsan (University of Illinois at Chicago), Yifan Zhang (Sun Yat-sen University), Kathryn Vasilaky (California Polytechnic State University) and Elena Zheleva (University of Illinois at Chicago).
Abstract
Click data collected by modern recommendation systems are an important source of observational data that can be utilized to train learning-to-rank (LTR) systems. However, these data suffer from a number of biases that can result in poor performance for LTR systems. Recent methods for bias correction in such systems mostly focus on position bias, the fact that higher ranked results (e.g., top search engine results) are more likely to be clicked even if they are not the most relevant results given a user's query. Less attention has been paid to correcting for selection bias, which occurs because clicked documents are reflective of what documents have been shown to the user in the first place. Here, we propose new counterfactual approaches which adapt Heckman's two-stage method and account for selection and position bias in LTR systems. Our empirical evaluation shows that our proposed methods have better accuracy compared to existing unbiased LTR algorithms under moderate position bias assumptions and are more robust to noise overall.
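The paper's LTR adaptation is not reproduced here; the snippet below only sketches the classical Heckman two-stage correction it builds on, using synthetic data: a probit model of whether an item is shown, an inverse Mills ratio from that model, and an outcome regression on the selected sample augmented with that ratio. All variables and coefficients are placeholders.

# Textbook Heckman two-stage correction on synthetic data
# (the paper adapts this idea to LTR; this is only the classical building block).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=(n, 2))          # covariates driving selection (e.g., rank position)
x = rng.normal(size=(n, 2))          # covariates driving the outcome (e.g., relevance signals)
u = rng.normal(size=n)               # shared noise inducing selection bias

shown = (0.8 * z[:, 0] - 0.4 * z[:, 1] + u + rng.normal(size=n)) > 0    # selection equation
outcome = 1.0 * x[:, 0] + 0.5 * x[:, 1] + u + rng.normal(size=n)        # outcome, correlated with selection

# Stage 1: probit model of selection, then the inverse Mills ratio.
Z = sm.add_constant(z)
probit = sm.Probit(shown.astype(int), Z).fit(disp=0)
imr = norm.pdf(Z @ probit.params) / norm.cdf(Z @ probit.params)

# Stage 2: outcome regression on the selected sample, augmented with the inverse Mills ratio.
X2 = sm.add_constant(np.column_stack([x[shown], imr[shown]]))
print(sm.OLS(outcome[shown], X2).fit().params.round(3))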
Tobias Hatt (ETH Zurich) and Stefan Feuerriegel (ETH Zurich).
Abstract
Most users leave e-commerce websites with no purchase. Hence, it is important for website owners to detect users at risk of exiting and intervene early (e.g., via price promotions). Prior approaches make widespread use of clickstream data; however, state-of-the-art algorithms only model the sequence of web pages visited and not the time spent on them. In this paper, we develop a novel Markov modulated marked point process (M3PP) model for predicting user exits from clickstream data. It accommodates clickstream data in a holistic manner: our proposed M3PP models both the sequence of pages visited and the temporal dynamics between them (i.e., the time spent on pages). This is achieved by a continuous-time marked point process. Different from previous Markovian clickstream models, our M3PP is the first model in which the continuous nature of time is considered. The marked point process is modulated by a continuous-time Markov process in order to account for different latent shopping phases. As a secondary contribution, we suggest a risk assessment framework. Rather than predicting future page visits, we compute a user's risk of exiting with no purchase. For this purpose, we build upon sequential hypothesis testing in order to suggest a risk score for user exits. Our computational experiments draw upon real-world clickstream data provided by a large online retailer. Based on it, we find that state-of-the-art algorithms are consistently outperformed by our M3PP model in terms of both AUROC (+6.24 percentage points) and so-called time of early warning (+12.93%). Accordingly, our M3PP model allows for timely detection of user exits and thus provides sufficient time for website owners to trigger dynamic online interventions (e.g., adapting website content or price promotions).
Chang Wang (Huazhong University of Science and Technology (HUST), Wuhan, China) and Bang Wang (Huazhong University of Science and Technology (HUST), Wuhan, China).
Abstract
Social emotion classification aims to predict the distribution of different emotions evoked by an article among its readers. Prior studies have shown that document semantic and topical features can help improve classification performance. However, how to effectively extract and jointly exploit such features has not been well researched. In this paper, we propose an end-to-end topic-enhanced self-attention network (TESAN) that jointly encodes document semantics and extracts document topics. In particular, TESAN first constructs a neural topic model to learn topical information and generates a topic embedding for a document. We then propose a topic-enhanced self-attention mechanism to encode semantic and topical information into a document vector. Finally, a fusion gate is used to compose the document representation for emotion classification by integrating the document vector and the topic embedding. The entire TESAN is trained in an end-to-end manner. Experimental results on three public datasets reveal that TESAN outperforms the state-of-the-art schemes in terms of higher classification accuracy and higher average Pearson correlation coefficient. Furthermore, TESAN is computationally efficient and can generate more coherent topics.
Wenhao Zhang (Dept. of Computer Science; University of California, Los Angeles), Wentian Bao (Alibaba Group), Keping Yang (Alibaba Group), Quan Lin (Alibaba Group), Xiao-Yang Liu (Columbia University), Hong Wen (Alibaba Group) and Ramin Ramezani (Dept. of Computer Science; University of California, Los Angeles).
Abstract
Post-click conversion rate (CVR) estimation is a critical task in e-commerce recommender systems. This task is deemed quite challenging under the industrial setting with two major issues: 1) selection bias caused by user self-selection, and 2) data sparsity due to the limited click events. A successful conversion typically has the following sequential events: "exposure→click→conversion". Conventional CVR estimators are trained in the click space, but inference is done in the entire exposure space. The unclicked data is excluded intentionally in the training phase as we have no explicit conversion feedback for the items that are not clicked by customers. This information is typically missing not at random due to the user self-selection. Conventional CVR estimators fail to account for the causes of the missing data and treat them as missing at random. Hence, their estimations are highly likely to deviate from the real values by a large margin. In addition, the data sparsity issue can also handicap many industrial CVR estimators, which usually have a large parameter space. In this paper, we propose two principled, efficient and highly effective CVR estimators for industrial CVR prediction tasks, namely Multi-IPW and Multi-DR. The proposed models approach the CVR estimation task from a causal perspective and account for the cause of the data missing not at random. In addition, our methods are based on the multi-task learning framework and mitigate the data sparsity issue. Extensive experiments on industrial-level datasets demonstrate that the proposed methods outperform other state-of-the-art CVR models.
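Multi-IPW and Multi-DR are not reproduced here; the PyTorch fragment below only sketches the inverse-propensity idea they build on: a CTR tower estimates the click propensity of every impression, and the conversion loss on clicked impressions is reweighted by the inverse of that propensity so the CVR tower is effectively trained over the exposure space. The linear towers and synthetic data are placeholders.

# Inverse-propensity-weighted CVR training over the exposure space (sketch of the general idea).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 4096, 8
x = torch.randn(n, d)                              # impression (exposure) features
click = torch.randint(0, 2, (n,)).float()          # observed click labels
conv = torch.randint(0, 2, (n,)).float() * click   # conversions are only observed on clicks

ctr_tower = torch.nn.Linear(d, 1)                  # click-propensity tower
cvr_tower = torch.nn.Linear(d, 1)                  # conversion-rate tower
opt = torch.optim.Adam(list(ctr_tower.parameters()) + list(cvr_tower.parameters()), lr=1e-2)

for _ in range(200):
    p_click = torch.sigmoid(ctr_tower(x)).squeeze(1).clamp(1e-3, 1.0)
    p_conv = torch.sigmoid(cvr_tower(x)).squeeze(1)
    ctr_loss = F.binary_cross_entropy(p_click, click)
    # IPW: reweight the conversion loss of clicked impressions by 1 / p(click),
    # so the CVR tower is trained as if over the whole exposure space.
    weights = click / p_click.detach()
    cvr_loss = (weights * F.binary_cross_entropy(p_conv, conv, reduction="none")).mean()
    opt.zero_grad()
    (ctr_loss + cvr_loss).backward()
    opt.step()

print("final CTR / CVR losses:", float(ctr_loss), float(cvr_loss))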
Research Tracks (7)
Web Mining-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Le Zhang (University of Science and Technology of China), Tong Xu (University of Science and Technology of China), Hengshu Zhu (Baidu Inc.), Chuan Qin (University of Science and Technology of China), Qingxin Meng (Rutgers-the State University of New Jersey), Hui Xiong (Rutgers-the State University of New Jersey) and Enhong Chen (University of Science and Technology of China).
Abstract
Recent years have witnessed the growing interests in investigating the competition among companies. Existing studies for company competitive analysis generally rely on subjective survey data and inferential analysis. Instead, in this paper, we aim to develop a new paradigm for studying the competition among companies through the analysis of talent flows. The rationale behind this is that the competition among companies usually leads to talent movement. Along this line, we first build a Talent Flow Network based on the large-scale job transition records of talents, and formulate the concept of ``competitiveness'' for companies with consideration of their bi-directional talent flows in the network. Then, we propose a Talent Flow Embedding (TFE) model to learn the bi-directional talent attractions of each company, which can be leveraged for measuring the pairwise competitive relationships between companies. Specifically, we employ the random-walk based model in original and transpose networks respectively to learn representations of companies by preserving their competitiveness as well as the in/out-degree distribution of the network. Furthermore, we design a multi-task strategy to refine the learning results from a fine-grained perspective, which can jointly embed multiple talent flow networks by assuming the features of company keep stable but take different roles in networks of different job positions. Finally, extensive experiments on a large-scale real-world dataset clearly validate the effectiveness of our TFE model in terms of company competitive analysis and reveal some interesting rules of competition based on the derived insights on talent flows.Yusan Lin (Visa Research), Maryam Moosaei (Visa Research) and Hao Yang (Visa Research).
Abstract
Recommending fashion outfits to users presents several challenges. First of all, an outfit consists of multiple fashion items, and each user emphasizes different parts of an outfit when considering whether they like it or not. Secondly, a user's liking for a fashion outfit considers not only the aesthetics of each item but also the compatibility among them. Lastly, fashion outfit data is often sparse in terms of the relationship between users and fashion outfits. Not to mention, we can only obtain what the users like, but not what they dislike.To address the above challenges, in this paper, we formulate the fashion outfit recommendation problem as a multiple-instance-learning (MIL) problem. We propose OutfitNet, a fashion outfit recommendation framework that includes two stages. The first stage is a Fashion Item Relevancy network (FIR), which learns the compatibility between fashion items and further generates relevancy embedding of fashion items. In the second stage, an Outfit Preference network (OP) learns the users' tastes for fashion outfits using visual information. OutfitNet takes in multiple fashion items in a fashion outfit as input, learns the compatibility among fashion items, the users' tastes toward each item, as well as the users' attention on different items in the outfit with the attention mechanism.Quantitatively, our experiments show that OutfitNet outperforms state-of-the-art models in two tasks: fill-in-the-blank (FITB) and personalized outfit recommendation. Qualitatively, we demonstrate that the learned personalized item scores and attention scores capture well the users' fashion tastes, and the learned fashion item embeddings capture well the compatibility relationships among fashion items. We also leverage the learned fashion item embedding and propose a simple fashion outfit generation framework, which is shown to produce high-quality fashion outfit combinations.Xi Tong Lee (Nanyang Technological University), Arijit Khan (Nanyang Technological University), Sourav Sen Gupta (Nanyang Technological University), Yu Hann Ong (Nanyang Technological University) and Xuan Liu (Nanyang Technological University).
Abstract
Blockchains are increasingly becoming popular due to the prevalence of cryptocurrencies and decentralized applications. Ethereum is a distributed public blockchain network that focuses on running code (smart contracts) for decentralized applications. More simply, it is a platform for sharing information in a global state that cannot be manipulated or changed. The Ethereum blockchain introduces a novel ecosystem of human users and autonomous agents (smart contracts). In this network, we are interested in all possible interactions: user-to-user, user-to-contract, contract-to-user, and contract-to-contract. This requires us to construct interaction networks from the entire Ethereum blockchain data, where vertices are accounts (users, contracts) and arcs denote interactions. Each interaction network provides us with a different perspective on the Ethereum blockchain, and our analyses on the networks reveal new insights by combining information from the four networks. We perform an in-depth study of these networks based on several graph properties consisting of both local and global properties, discuss their similarities and differences with social networks and the Web, draw interesting conclusions, and highlight important future research directions.
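The full-chain data pipeline is out of scope here; the snippet below is a minimal sketch of how the four interaction networks could be assembled with networkx from hypothetical transaction records in which each account is flagged as a user or a contract. The record format and toy addresses are assumptions for illustration.

# Building the four directed interaction networks from (hypothetical) transaction records.
import networkx as nx

# Each record: (sender, receiver, sender_is_contract, receiver_is_contract) -- toy data.
transactions = [
    ("0xUserA", "0xUserB", False, False),
    ("0xUserA", "0xTokenC", False, True),
    ("0xTokenC", "0xUserB", True, False),
    ("0xTokenC", "0xTokenD", True, True),
]

kinds = ("user-to-user", "user-to-contract", "contract-to-user", "contract-to-contract")
networks = {kind: nx.MultiDiGraph() for kind in kinds}

for sender, receiver, s_is_contract, r_is_contract in transactions:
    kind = f"{'contract' if s_is_contract else 'user'}-to-{'contract' if r_is_contract else 'user'}"
    networks[kind].add_edge(sender, receiver)        # vertices are accounts, arcs are interactions

for kind, g in networks.items():
    print(kind, "->", g.number_of_nodes(), "accounts,", g.number_of_edges(), "interactions")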
Siddarth R (Adobe Inc), Nupur Kumari (Adobe Inc), Akash Rupela (Adobe Inc), Piyush Gupta (Adobe Inc) and Balaji Krishnamurthy (Adobe Inc).
Abstract
We present ShapeVis, a visualization technique for point cloud data inspired from topological data analysis. Our method captures the underlying geometric and topological structure of the data in a compressed graphical representation. Much success has been reported by the graph-based data compression technique Mapper, that discretely approximates the Reeb graph of a filter function on the data. However, when using standard dimensionality reduction algorithms as the filter function, Mapper suffers from considerable computational cost. This makes it difficult to scale to high-dimensional data. Our proposed technique relies on finding a subset of points called landmarks along the data manifold to construct a weighted witness-graph over it. This graph captures the structural characteristics of the point cloud and its weights are determined using a Finite Markov Chain. We further compress this graph by applying induced maps from standard community detection algorithms. Using techniques borrowed from manifold tearing, we prune and reinstate edges in the induced graph based on their modularity to summarize the shape of data. We empirically demonstrate how our technique captures the structural characteristics of real and synthetic data sets. Further, we compare our approach with Mapper using various filter functions like t-Sne, UMAP, LargeVis, and show that our algorithm scales to millions of data points while preserving the quality of data visualization.Aleksandr Artemenkov (Skoltech) and Maxim Panov (Skoltech).
Abstract
Modern methods for data visualization, such as t-SNE, usually have performance issues that prohibit their application to large amounts of high-dimensional data. In this work, we propose NCVis -- a high-performance visualization method built on a sound statistical basis of noise contrastive estimation. We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods. In particular, the proposed approach successfully processes a large dataset of more than 1 million news headlines in several minutes and presents the underlying structure in a human-readable way. Moreover, it provides results consistent with classical methods like t-SNE on more straightforward datasets like images of hand-written digits. We believe that the broader usage of such software can significantly simplify web data analysis and lower the entry barrier for large-scale applications.
Social Network-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Ana-Andreea Stoica (Columbia University), Jessy Xinyi Han (Columbia University) and Augustin Chaintreau (Columbia University).
Abstract
The problem of social influence maximization is widely applicable in designing viral campaigns, news dissemination, or medical aid. State-of-the-art algorithms often select "early adopters" that are most central in a network, unfortunately mirroring or exacerbating embedded historical biases in human networks and leaving under-represented communities out of the loop. In this paper, we aim at a rigorous foundation for fair influence maximization. Through a theoretical model of biased networks, we characterize the intricate relationship between diversity and efficiency, which sometimes may be at odds but may also reinforce each other. Most importantly, we prove an analytical condition under which more equitable choices of early adopters lead simultaneously to fairer outcomes and larger outreach. Analysis of data on DBLP confirms our condition is often met. We design and test a set of algorithms leveraging networks to optimize the diffusion of a message while avoiding to create disparate impact among participants based on gender or race.Hui-Ju Hung (The Pennsylvania State University), Wang-Chien Lee (The Pennsylvania State University), De-Nian Yang (Academia Sinica), Chih-Ya Shen (National Tsing Hua University), Zhen Lei (The Pennsylvania State University) and Sy-Miin Chow (The Pennsylvania State University).
Abstract
Research suggests that social relationships have substantial impacts on individuals' health outcomes. Network intervention, through careful planning, can assist a network of users to build healthy relationships. However, most previous work is not designed to assist such planning by carefully examining and improving multiple network characteristics. In this paper, we propose and evaluate algorithms that facilitate network intervention planning through simultaneous optimization of network degree, closeness, betweenness, and local clustering coefficient, under scenarios involving Network Intervention with Limited Degradation - for Single target (NILD-S) and Network Intervention with Limited Degradation - for Multiple targets (NILD-M). We prove that NILD-S and NILD-M are NP-hard and cannot be approximated within any ratio in polynomial time unless P=NP. We propose the Candidate Re-selection with Preserved Dependency (CRPD) algorithm for NILD-S, and the Objective-aware Intervention edge Selection and Adjustment (OISA) algorithm for NILD-M. Various pruning strategies are designed to boost the efficiency of the proposed algorithms. Extensive experiments on various real social network datasets collected from public primary schools and the Web and an empirical study are conducted to show that CRPD and OISA outperform the baselines in both efficiency and effectiveness.Bruno Ordozgoiti (Aalto University), Antonis Matakos (Aalto University) and Aristides Gionis (Aalto University).
Abstract
Signed networks are graphs whose edges are labelled with either a positive or a negative sign, and are able to capture nuances in interactions that are missed by their unsigned counterparts. The concept of balance in signed graph theory determines whether or not a network can be partitioned into two perfectly opposing subsets, and is therefore useful for modelling phenomena such as the existence of polarized communities in social networks. While determining whether a graph is balanced is easy, finding a large balanced subgraph is hard. The few heuristics available in the literature for this purpose are either ineffective or non-scalable. In this paper we propose an efficient algorithm for finding balanced subgraphs in signed networks. The algorithm relies on signed spectral theory and a novel bound for perturbations of the graph Laplacian. In a wide variety of experiments on real data we show that our algorithm can find balanced subgraphs much larger than those detected by existing methods, and is in addition faster. We test its scalability on graphs of up to 18 million edges.Cyrus Rashtchian (UCSD), Aneesh Sharma (Google) and David Woodruff (Carnegie Mellon University).
Abstract
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work has highlighted the importance of discovering pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, it is also common that some dimensions are highly-skewed, because they are very popular. Together, these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. In particular, our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.Dhivya Eswaran (Carnegie Mellon University), Srijan Kumar (Georgia Institute of Technology) and Christos Faloutsos (Carnegie Mellon University).
Abstract
Do higher-order network structures aid graph semi-supervised learning? Given a graph and a few labeled vertices, labeling the remaining vertices is a high-impact problem with applications in several tasks, such as recommender systems, fraud detection and protein identification. However, traditional methods rely on edges for spreading labels, which is limited by the fact that not all edges are equal. Vertices with stronger connections participate in higher-order structures in graphs, which calls for methods that can leverage these structures in semi-supervised learning tasks. Our contributions are three-fold. First, we create an information-theoretic metric to quantify the homogeneity of labels in higher-order structures in graphs. We show that across four diverse real-world networks, higher-order structures exhibit more homogeneity of labels compared to edges. Second, we create an algorithm, HOLS, for label spreading using higher-order structures. HOLS has strong theoretical guarantees and reduces to standard label spreading in the base case. Third, we conduct extensive experiments to compare HOLS to several traditional and recent state-of-the-art methods. We show that higher-order label spreading using triangles in addition to edges is up to 4.7% better than label spreading using edges alone. Compared to the baselines, HOLS leads to statistically significantly higher accuracy in all-but-one cases. HOLS is also fast and scalable to large graphs, running in under 2 minutes on graphs with over 21 million edges. All the code and datasets are available at http://bit.ly/www2020hols for reproducibility.
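HOLS itself is specified in the paper and released at the link above; the snippet below only illustrates the flavor of higher-order label spreading: standard iterative label spreading in which each edge's weight is boosted by the number of triangles it participates in. The mixing weight, damping factor, and seed choice are arbitrary placeholders.

# Label spreading with triangle-boosted edge weights (illustrative flavor only, not the exact HOLS algorithm).
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()
A = nx.to_numpy_array(G)

# Boost each edge by the number of triangles it closes (a simple higher-order signal).
tri = (A @ A) * A                    # tri[i, j] = number of common neighbors of i and j, on existing edges
W = A + 0.5 * tri                    # 0.5 is an arbitrary mixing weight
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
S = D_inv_sqrt @ W @ D_inv_sqrt      # symmetrically normalized propagation matrix

Y = np.zeros((n, 2))                 # two labeled seed vertices
Y[0, 0] = 1.0                        # node 0  -> class 0
Y[33, 1] = 1.0                       # node 33 -> class 1

F = Y.copy()
alpha = 0.9
for _ in range(50):                  # standard label-spreading iteration
    F = alpha * (S @ F) + (1 - alpha) * Y

pred = F.argmax(axis=1)
truth = np.array([0 if G.nodes[v]["club"] == "Mr. Hi" else 1 for v in G.nodes])
print("agreement with the known club split:", round(float((pred == truth).mean()), 3))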
User Modeling-A (7)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Nikhita Vedula (The Ohio State University), Nedim Lipka (Adobe), Pranav Maneriker (The Ohio State University) and Srinivasan Parthasarathy (The Ohio State University).
Abstract
Accurately discovering user intents from their written or spoken language plays a critical role in natural language understanding and automated dialog response. Most existing research models this as a classification task with a single intent label per utterance, by grouping user utterances into a single intent type from a set of categories known beforehand. Going beyond this formulation, we define and investigate a new problem of open intent discovery. It involves discovering one or more generic intent types from text utterances, that may not have been encountered during training. We propose a novel domain-agnostic approach, OPINE, which formulates the problem as a sequence tagging task under an open-world setting. It employs a CRF on top of a bidirectional LSTM to extract intents in a consistent format, subject to constraints among intent tag labels. We apply a multi-head self-attention mechanism to effectively learn dependencies between distant words. We further use adversarial training to improve performance and robustly adapt our model across varying domains. Finally, we curate and plan to release an open intent annotated dataset of 25K real-life utterances spanning diverse domains. Extensive experiments show that our approach outperforms state-of-the-art baselines by 5-15% F1 score points. We also demonstrate the efficacy of OPINE in recognizing multiple, diverse domain intents with limited (or zero) training examples per unique domain.Idan Szpektor (Google), Deborah Cohen (Google), Gal Elidan (Google), Michael Fink (Google), Avinatan Hassidim (Google), Orgad Keller (Google), Sayali Kulkarni (Google), Eran Ofek (Google), Sagie Pudinsky (Google), Asaf Revach (Google), Shimi Salant (Google) and Yossi Matias (Google).
Abstract
We study conversational exploration and discovery (CoED), where the user's goal is to enrich her knowledge of a given domain by conversing with an informative bot. Such conversations should be well grounded in high-quality domain knowledge as well as engaging and open-ended. A CoED bot should be proactive and introduce relevant information even if not directly asked by the user. The bot should also appropriately pivot the conversation to undiscovered regions of the domain. To address these dialogue characteristics, we introduce a novel approach termed dynamic composition. This approach decouples candidate content generation from the flexible composition of bot responses. This allows the bot to control the source, correctness and quality of the offered content, while achieving flexibility via a dialogue manager that selects the most appropriate contents in a compositional manner. We implemented a CoED bot based on dynamic composition and integrated it into the Google Assistant. As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with the CoED bot. Experimental results are positive and offer insights into what makes a good conversation. To the best of our knowledge, this is the first real user experiment of open-ended dialogues as part of a commercial assistant system.
Roland Aydin (Institute of Materials Research, Helmholtz-Zentrum Geesthacht), Lars Klein (Ecole Polytechnique Fédérale de Lausanne), Arnaud Miribel (Ecole Polytechnique Fédérale de Lausanne) and Robert West (Ecole Polytechnique Fédérale de Lausanne).
Abstract
The learning of a new language remains to this date a cognitive task that requires considerable diligence and willpower, recent advances and tools notwithstanding. In this paper, we propose Broccoli, a new paradigm aimed at significantly reducing the required effort by seamlessly embedding vocabulary learning into users' everyday information diets. This is achieved by inconspicuously switching chosen words encountered by the user for their translation in the target language. Thus, by seeing words in context, the user can assimilate new vocabulary without much conscious effort. We validate our approach in a careful user study, finding that the efficacy of the lightweight Broccoli approach is competitive with traditional, memorization-based vocabulary learning. The low cognitive overhead is manifested in a pronounced decrease in learners' usage of mnemonics and other learning strategies, as compared to traditional learning. Finally, we establish that language patterns in typical information diets are compatible with spaced-repetition strategies, enabling an efficient use of the Broccoli paradigm. Overall, our work establishes the feasibility of a novel and powerful "install-and-forget" approach for embedded language acquisition.Mi Luo (Huawei Noah's Ark Lab), Fei Chen (Huawei Noah’s Ark Lab), Pengxiang Cheng (Huawei Noah’s Ark Lab), Zhenhua Dong (Huawei Noah’s Ark Lab), Xiuqiang He (Huawei Noah’s Ark Lab), Jiashi Feng (National University of Singapore) and Zhenguo Li (Huawei Noah's Ark Lab).
Abstract
Recommender systems often face heterogeneous datasets containing highly personalized historical data of users, where no single model could give the best recommendation for every user. We observe this ubiquitous phenomenon on both public and production datasets and address the issue of model selection in pursuit of optimizing the quality of recommendation for each user. We propose a meta-learning framework to facilitate user-level adaptive model selection in a hybrid recommender system. In this framework, a collection of recommenders is trained with data from all users, on top of which the meta-learning module trains a model selector that aims to select the best model for each user using the user-specific historical data. We conduct extensive experiments on two public datasets and a real-world production dataset, demonstrating that our proposed framework achieves significant improvements over single-model baselines and a sample-level model selector in terms of AUC and LogLoss. In particular, the improvement over the production dataset may lead to a huge profit gain when deployed in online recommender systems.
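The meta-learned selector is the paper's contribution; the fragment below only shows the simplest per-user selection idea it generalizes: for every user, pick whichever trained recommender scores best on that user's held-out interactions. The synthetic labels and the three noisy score matrices stand in for real recommenders.

# Per-user model selection baseline on placeholder data (not the paper's meta-learned selector).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_users, n_items, n_models = 50, 40, 3

# Held-out relevance labels and score matrices from three hypothetical recommenders.
labels = rng.integers(0, 2, size=(n_users, n_items))
scores = [labels + rng.normal(0, sigma, size=(n_users, n_items)) for sigma in (0.8, 1.0, 1.5)]

best_model = np.zeros(n_users, dtype=int)
for u in range(n_users):
    if labels[u].min() == labels[u].max():            # AUC is undefined for single-class users
        continue
    aucs = [roc_auc_score(labels[u], s[u]) for s in scores]
    best_model[u] = int(np.argmax(aucs))

print("users assigned to each model:", np.bincount(best_model, minlength=n_models))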
Abstract
Critiquing is a method for conversational recommendation that iteratively adapts recommendations in response to user preference feedback. In this setting, a user is iteratively provided with an item recommendation and attribute description for that item; a user may either accept the recommendation, or critique the attributes in the item description to generate a new recommendation. Historical critiquing methods were largely based on explicit constraint- and utility-based methods for modifying recommendations w.r.t. critiqued item attributes. In this paper, we revisit the critiquing approach in the era of recommendation methods based on latent embeddings with subjective item descriptions (i.e., keyphrases from user reviews). Two critical research problems arise: (1) how to co-embed keyphrase critiques with user preference embeddings to update recommendations, and (2) how to modulate the strength of multi-step critiquing feedback, where critiques are not necessarily independent, nor of equal importance. To address (1), we build on an existing state-of-the-art linear embedding recommendation algorithm to align review-based keyphrase attributes with user preference embeddings. To address (2), we exploit the linear structure of the embeddings and recommendation prediction to formulate a linear program (LP) based optimization problem to determine optimal weights for incorporating critique feedback. We evaluate the proposed framework on two recommendation datasets containing user reviews. Empirical results compared to a standard approach of averaging critique feedback show that our approach reduces the number of interactions required to find a satisfactory item and increases the overall success rate.
Society (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Nir Rosenfeld (Harvard University), David Parkes (Harvard University) and Aron Szanto (Harvard University).
Abstract
Recent work in the domain of misinformation identification has leveraged rich signals in the text and user identities associated with content on social media to discriminate between true and false information. But text can be strategically manipulated and accounts reopened under different aliases, suggesting that this approach is inherently brittle. In this work, we explore an alternative signal that is naturally robust: the pattern in which information propagates. Our goal is to answer the following question: can the veracity of an unverified rumor spreading through social media be predicted solely on the basis of its pattern of diffusion through the social network? Using graph kernels to extract topological information from Twitter cascade structures, we train models that are surprisingly accurate given that they are blind to language, user identities, and time, demonstrating that "sanitized" diffusion patterns can be highly informative of content. Our results suggest that, with the proper form of aggregation, the collective sharing pattern of the crowd can reveal powerful signals of rumor veracity, even in the early stages of propagation.
Pushkal Agarwal (King’s College London), Sagar Joglekar (King’s College London), Panagiotis Papadopoulos (Brave Software Inc.), Nishanth Sastry (King’s College London) and Nicolas Kourtellis (Telefonica Research).
Abstract
Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age. In this paper, we take a first step to shed light on and measure such potential differences in the tracking imposed on users when visiting specific party-line websites. For this, we design and deploy a methodology to systematically probe such websites and measure differences in user tracking. This methodology allows us to create user personas with specific attributes like gender and age and automate their browsing behavior in a consistent and repeatable manner. Thus, we systematically study how personas are being tracked by these websites and their third parties, especially if they exhibit particular demographic properties. Overall, we test 9 personas on 556 hyper-partisan websites and find that right-leaning sites tend to track users more intensely than left-leaning ones, always depending on user demographics, and using both cookies and cookie synchronization methods, leading to more costly delivered ads.
Alexander Spangher (University of Southern California), Adam Fourney (Microsoft), Besmira Nushi (Microsoft), Gireeja Ranade (University of California, Berkeley) and Eric Horvitz (Microsoft).
Abstract
The Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. In this paper, we focus on IRA activities that received exposure through search engines, by joining data from Facebook and Twitter with logs from a major internet company’s web browsers and search engine. We find that a substantial volume of Russian content was apolitical and emotionally-neutral in nature. Our observations demonstrate that such content gave IRA web-properties considerable exposure through search-engines and brought readers to websites hosting inflammatory content and engagement hooks. Our findings show that, like social media, web search also directed traffic to IRA generated web content, and the resultant traffic patterns are distinct from those of social media.
Security (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Vikas Mishra (INRIA), Pierre Laperdrix (CNRS / Univ. Lille / Inria), Antoine Vastel (Univ. Lille / Inria), Walter Rudametkin (Univ. Lille / Inria), Romain Rouvoy (Univ. Lille / Inria / IUF) and Martin Lopatka (Mozilla).
Abstract
Targeted online advertising has become an inextricable part of the way Web content and applications are monetized. Historical online advertising consisted of simple ad-banners broadly shown to website visitors; the current evolution is a complex ecosystem responsible for tracking users to learn their habits, and show them targeted, personalized ads. To protect users against tracking, several countermeasures have been proposed, ranging from browser extensions that leverage filter lists, to features natively integrated into popular browsers like Firefox and Brave. Nevertheless, few browsers offer protections against IP address-based tracking techniques. Notably, the most popular browsers, Chrome, Firefox, Safari and Edge, do not protect users against IP address tracking. Indeed, while IP addresses assigned to mobile devices tend to be reassigned more frequently, residential IP addresses remain stable for long periods of time, despite being dynamic. In this paper, we study the stability of the public IP addresses a user’s device uses to communicate with our server. The public IP addresses we obtain could be those that are directly assigned to the users’ devices or, more commonly, the users’ devices are behind a gateway, such as a residential router, in which case, our server obtains the IP addresses of the routers. Over time, the same device communicates with our server using a set of distinct IP addresses, but we find that devices reuse some of their previous IP addresses for long periods of time. We call this IP address retention, and the duration for which an IP address is retained by a device we call the IP address retention period. We use a dataset collected over a period of 111 days with 5,443 users and 41,566 unique IP addresses to study the retention period of IP addresses and show that 87% of users have at least one IP address that was retained for more than a month. We also present variations in the retention period based on the country, and we show that, even in cases where long-lived IP addresses do change, more often than not only the least significant octet changes. Apart from being stable, we also show that 93% of users can be uniquely identified based on a set of long-lived IP addresses, thus showing both uniqueness and stability over time. Furthermore, we also detect the presence of cycles of IP addresses showing their potential to be used in inferring traits of the user’s behaviour, as well as mobility traces. Finally, we discuss different defence solutions that users could take advantage of to protect their privacy.
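To make the notion of a retention period concrete, here is a minimal Python sketch (not the authors' code; the field names and the toy log are hypothetical) that computes, for each device, how long each public IP address kept reappearing in a connection log, i.e., the span between the first and last observation of that device with that address.

from collections import defaultdict
from datetime import datetime

# Hypothetical toy log of (device_id, public_ip, date) observations.
log = [
    ("device-1", "203.0.113.7", "2020-01-01"),
    ("device-1", "198.51.100.9", "2020-01-20"),
    ("device-1", "203.0.113.7", "2020-02-15"),
    ("device-2", "203.0.113.7", "2020-01-05"),
]

first_seen = defaultdict(dict)
last_seen = defaultdict(dict)
for device, ip, date in log:
    t = datetime.fromisoformat(date)
    first_seen[device][ip] = min(first_seen[device].get(ip, t), t)
    last_seen[device][ip] = max(last_seen[device].get(ip, t), t)

# Retention period: how long a device kept reusing a given IP address.
for device, ips in first_seen.items():
    for ip in ips:
        days = (last_seen[device][ip] - first_seen[device][ip]).days
        print(device, ip, f"retained for {days} days")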
Simon Woo (SKKU).
Abstract
Although pronounceability can improve password memorability, most existing password generation approaches have not properly integrated the pronounceability of passwords in their designs. In this work, we demonstrate several shortfalls of current pronounceable password generation approaches, and then propose ProSemPass, a new method of generating passwords that are pronounceable and semantically meaningful. In our approach, users supply initial input words and our system improves the pronounceability and meaning of the user-provided words by automatically creating a portmanteau. To measure the strength of our approach, we use attacker models, where attackers have complete knowledge of our password generation algorithms. We measure strength in guess numbers and compare those with other existing password generation approaches. Using a large-scale IRB-approved user study with 1,563 Amazon MTurkers over 9 different conditions, our approach achieves a 30% higher recall than those from current pronounceable password approaches, and is stronger than the offline guessing attack limit.
Vinayshekhar Bannihatti Kumar (Carnegie Mellon University), Roger Iyengar (Carnegie Mellon University), Namita Nisal (University of Michigan), Yuanyuan Feng (Carnegie Mellon University), Hana Habib (Carnegie Mellon University), Peter Story (Carnegie Mellon University), Sushain Cherivirala (Carnegie Mellon University), Margaret Hagan (Stanford University), Lorrie Cranor (Carnegie Mellon University), Shomir Wilson (The Pennsylvania State University), Florian Schaub (University of Michigan) and Norman Sadeh (Carnegie Mellon University).
Abstract
Website privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many policies bury these instructions deep in their text, and few users of the web have the time or skill necessary to discover them. We describe an effort to automate the detection of opt-out choices in privacy policy text and to present them to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable training classifiers to flexibly identify opt-outs in privacy policies. Our overall approach to extracting and classifying opt-out choices combines simple heuristics to identify a small set of commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our overall approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We discuss results of a user study to evaluate the usability of our browser extension. The paper also presents results of a large-scale analysis of opt-outs found in the text of several thousand of the most popular websites.
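As a rough illustration of the "simple heuristics" step mentioned above, the toy sketch below (our own guess at what such a heuristic could look like, not the authors' classifier or extension) flags hyperlinks whose anchor text or URL suggests an opt-out choice; the supervised model described in the abstract would then handle the less conspicuous cases.

import re

# Hypothetical anchor texts and URLs extracted from a privacy policy page.
links = [
    ("Opt out of interest-based ads", "https://example.com/ads/optout"),
    ("Contact us", "https://example.com/contact"),
    ("Do Not Sell My Personal Information", "https://example.com/privacy/choices"),
]

OPT_OUT_PATTERN = re.compile(r"opt[\s-]?out|do not sell|unsubscribe|ad\s*choices", re.I)

def looks_like_opt_out(anchor_text, url):
    # Flag a link if either its visible text or its URL matches an opt-out cue.
    return bool(OPT_OUT_PATTERN.search(anchor_text) or OPT_OUT_PATTERN.search(url))

for text, url in links:
    print(text, "->", looks_like_opt_out(text, url))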
Brown Farinholt (University of California San Diego), Mohammad Rezaeirad (George Mason University), Damon McCoy (NYU) and Kirill Levchenko (University of Illinois at Urbana-Champaign).
Abstract
Remote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real-time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day, many of whose keystrokes have been captured, actions recorded, and webcams monitored during this time. Our methodologies for more precisely identifying campaigns and victims could potentially be useful for improving the efficiency and efficacy of victim cleanup efforts and prioritization of law enforcement investigations.
Search (4)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Wei Ye (Peking University), Rui Xie (Peking University), Jinglei Zhang (Peking University), Tianxiang Hu (Peking University), Xiaoyin Wang (University of Texas at San Antonio) and Shikun Zhang (Peking University).
Abstract
Code summarization generates a brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have not yet been able to effectively leverage the intrinsic connections between the two tasks, as they train these tasks in a separate or pipeline manner, which means their performance cannot be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.
Furong Xu (UCAS), Wei Zhang (Ant Financial Services Group), Yuan Cheng (Ant Financial Services Group) and Wei Chu (Ant Financial Services Group).
Abstract
Product image search in E-commerce systems is a challenging task, because of a huge number of product classes, low intra-class similarity and high inter-class similarity. Deep metric learning, based on paired distances independent of the number of classes, aims to minimize intra-class variances and inter-class similarity in feature embedding space. Most existing approaches strictly restrict the distance between samples with fixed values to distinguish different classes of samples. However, the distance of paired samples has various magnitudes during different training stages. Therefore, it is difficult to directly restrict absolute distances with fixed values. In this paper, we propose a novel Equidistant and Equidistributed Triplet-based (EET) loss function to adjust the distance between samples with relative distance constraints. By optimizing the loss function, the algorithm progressively maximizes intra-class similarity and inter-class variances. Specifically, 1) the equidistant loss pulls the matched samples closer by adaptively constraining two samples of the same class to be equally distant from another one of a different class in each triplet, 2) the equidistributed loss pushes the mismatched samples farther away by guiding different classes to be uniformly distributed while keeping intra-class structure compact in the embedding space. Extensive experimental results on product search benchmarks verify the improved performance of our method. We also achieve improvements on other retrieval datasets, which show superior generalization capacity of our method in image search.
Xiang Li (Alibaba Group), Chao Wang (Alibaba Group), Jiwei Tan (Alibaba Group), Xiaoyi Zeng (Alibaba Group), Dan Ou (Alibaba Group) and Bo Zheng (Alibaba Group).
Abstract
For better user experience and business effectiveness, Click-Through Rate (CTR) prediction has been one of the most important tasks in E-commerce. Although extensive CTR prediction models have been proposed, learning good representation of items from multimodal features is still less investigated, considering an item in E-commerce usually contains multiple heterogeneous modalities. Previous works either concatenate the multiple modality features, which is equivalent to giving a fixed importance weight to each modality, or learn dynamic weights of different modalities for different items through techniques like the attention mechanism. However, a problem is that there usually exists common redundant information across multiple modalities. The dynamic weights of different modalities computed by using the redundant information may not correctly reflect the different importance of each modality. To address this, we explore the complementarity and redundancy of modalities by considering modality-specific and modality-invariant features differently. We propose a novel Multimodal Adversarial Representation Network (MARN) for the CTR prediction task. A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features. Then a multimodal adversarial network learns modality-invariant representations where a double-discriminators strategy is introduced. Finally, we achieve the multimodal item representations by combining both modality-specific and modality-invariant representations. We conduct extensive experiments on both public and industrial datasets, and the proposed method consistently achieves remarkable improvements over the state-of-the-art methods. Moreover, the approach has been deployed in an operational E-commerce system and online A/B testing further demonstrates the effectiveness.
David Carmel (Amazon), Elad Haramaty (Amazon), Arnon Lazerson (Amazon) and Liane Lewin-Eytan (Amazon).
Abstract
Learning a ranking model in product search involves satisfying many requirements such as maximizing the relevance of retrieved products with respect to the user query, as well as maximizing the purchase likelihood of these products. Multi-Objective Ranking Optimization (MORO) is the task of learning a ranking model from training examples while optimizing multiple objectives simultaneously. Label aggregation is a popular solution approach for multi-objective optimization, which reduces the problem into a single objective optimization problem, by aggregating the multiple labels of the training examples, related to the different objectives, to a single label. In this work we explore several label aggregation methods for MORO in product search. We show that a ranking model that is optimized for the reduced single objective problem, using a deterministic label aggregation approach, does not necessarily reach an optimal solution for the original multi-objective problem. We propose a novel stochastic label aggregation method which randomly selects a label per training example according to a given distribution over the labels. We provide a theoretical proof showing that stochastic label aggregation is superior to alternative aggregation approaches, in the sense that any optimal solution of the MORO problem can be generated by a proper parameter setting of the stochastic aggregation process. We experiment on three different datasets: two from the voice product search domain, and one publicly available dataset from the Web product search domain. We demonstrate empirically over these three datasets that MORO with stochastic label aggregation provides a family of ranking models that fully dominates the set of MORO models built using deterministic label aggregation.
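The sampling step behind stochastic label aggregation lends itself to a very small sketch. Assuming each training example carries one label per objective (the objectives, weights and labels below are hypothetical), one could draw a single training label per example as follows; this only illustrates the sampling idea, not the authors' full MORO training pipeline.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training examples: one label per objective (e.g., relevance, purchase).
labels = np.array([
    [1.0, 0.0],
    [0.5, 1.0],
    [0.0, 0.0],
])
objective_weights = np.array([0.7, 0.3])  # distribution over the two objectives

# Stochastic label aggregation: sample which objective supplies the label for each example.
chosen = rng.choice(len(objective_weights), size=len(labels), p=objective_weights)
aggregated = labels[np.arange(len(labels)), chosen]
print(aggregated)  # one label per example, usable by a single-objective ranker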
Mobile (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Kijin An (Virginia Tech) and Eli Tilevich (Virginia Tech).
Abstract
Modern web applications are distributed across a browser-based client and a cloud-based server. The distributed nature of web applications complicates their inspection and evolution. Also, mature program analysis and transformation tools work only with centralized software. Inspired by business process re-engineering, in which remote operations can be insourced back in house to restructure and outsource anew, we bring an analogous approach to the re-engineering of full-stack JavaScript applications. We designed and implemented the Client Insourcing automatic refactoring to create a distributed application’s centralized variant to inspect, modify, and redistribute to meet new requirements. We demonstrate the utility and value of Client Insourcing to address changes in privacy, reliability, and performance requirements. By streamlining the required non-trivial program inspections and modifications, our approach can become a helpful aid in the re-engineering of web applications.
Zhe Xu (DiDi Chuxing), Chang Men (DiDi Chuxing), Peng Li (Didi Chuxing), Bicheng Jin (Didi Chuxing), Ge Li (Didi Chuxing), Yue Yang (Didi Chuxing), Chunyang Liu (Didi Chuxing), Ben Wang (Didi Chuxing) and Xiaohu Qie (Didi Chuxing).
Abstract
E-hailing platforms have become an important component of public transportation in recent years. The supply (online drivers) and demand (passenger requests) are intrinsically imbalanced because of the pattern of human behavior, especially in time and locations such as peak hours and train stations. Hence, how to balance supply and demand is one of the key problems to satisfy passengers and drivers and increase social welfare. As an intuitive and effective approach to address this problem, driver repositioning has been employed by some real-world e-hailing platforms. In this paper, we describe a novel framework of driver repositioning system, which meets various requirements in practical situations, including robust driver experience satisfaction and multi-driver collaboration. We introduce an effective and user-friendly driver interaction design called "driver repositioning task". A novel modularized algorithm is developed to generate the repositioning tasks in real time. To our knowledge, this is the first industry-level application of driver repositioning. We evaluate the proposed method in real-world experiments, achieving a 2% improvement of driver income. Our framework has been fully deployed online and repositions millions of drivers every day.
Chao Huang (JD Digits), Xian Wu (University of Notre Dame), Chuxu Zhang (University of Notre Dame) and Nitesh Chawla (University of Notre Dame).
Abstract
Spatial event forecasting is challenging and crucial for urban sensing scenarios, and is beneficial for a wide spectrum of spatial-temporal mining applications, ranging from traffic management and public safety to environmental policy making. Although significant progress has been made on the spatial-temporal prediction problem, most existing deep learning based methods rely on a coarse-grained spatial setting, and the success of such methods largely relies on data sufficiency. In many real applications, predicting events at a fine-grained spatial resolution does play a critical role in providing high discernibility of spatial-temporal data distributions. However, in such cases, applying existing methods results in weak performance, since they may not capture quality spatial-temporal representations when training triple instances are highly imbalanced across locations and time. To tackle this challenge, we develop a hierarchically structured Spatial-Temporal Transformer network (STtrans) which leverages a main embedding space to capture the inter-dependencies across time and space for alleviating the data imbalance issue. In our STtrans framework, the first-stage transformer module discriminates the types of region and time-wise relations. To make the latent spatial-temporal representations reflective of the relational structure between categories, we further develop a cross-category fusion transformer network to endow STtrans with the capability to preserve the semantic signals in a fully dynamic manner. Finally, an adversarial training strategy is introduced to yield robust spatial-temporal learning under data imbalance. Extensive experiments on two real-world imbalanced spatial-temporal datasets from NYC and Chicago demonstrate the superiority of our method over various state-of-the-art baselines.
Benjamin Coleman (Rice University) and Anshumali Shrivastava (Rice University).
Abstract
Kernel density estimates are important for many machine learning applications in the streaming setting. Unfortunately, they have a prohibitive memory and computation cost for large, high-dimensional datasets. Recent sampling algorithms for high-dimensional densities can reduce the computation cost but cannot operate online, while streaming algorithms currently suffer from the curse of dimensionality. Even though the problem is well-studied, all existing methods suffer from a high memory storage cost which is prohibitive for many internet of things (IoT) and mobile applications. We propose an online sketching algorithm to compress a set of N high dimensional vectors into a small array of integer counters. This sketch is sufficient to estimate the kernel density for a large class of useful kernels. Our method is practical to implement and comes with strong theoretical guarantees. Our sketches are mergeable, parallel and ideal for distributed computation settings. We evaluate our method on datasets with hundreds to thousands of dimensions and show that our sketch provides a 10x compression improvement over competing methods at a similar computational cost. We expect that our dataset compression method will enable a variety of applications in resource-constrained settings such as mobile and IoT.
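To convey the flavor of compressing points into an array of integer counters, here is a toy counter sketch built on signed-random-projection hashing (a minimal sketch of the general idea only; the authors' construction, kernels, parameters and guarantees may differ). Each repetition hashes a point into one of a few buckets and increments a counter; a query's density estimate is the average collision frequency across repetitions.

import numpy as np

class ToyCounterSketch:
    def __init__(self, dim, reps=50, bits=4, seed=0):
        rng = np.random.default_rng(seed)
        # One set of random hyperplanes per repetition (signed random projections).
        self.planes = rng.standard_normal((reps, bits, dim))
        self.counts = np.zeros((reps, 2 ** bits), dtype=np.int64)
        self.n = 0

    def _bucket(self, r, x):
        b = (self.planes[r] @ x >= 0).astype(int)
        return int(b @ (2 ** np.arange(b.size)))

    def add(self, x):
        self.n += 1
        for r in range(self.planes.shape[0]):
            self.counts[r, self._bucket(r, x)] += 1

    def density(self, q):
        # Average fraction of points colliding with q: a rough LSH-kernel density proxy.
        hits = [self.counts[r, self._bucket(r, q)] for r in range(self.planes.shape[0])]
        return float(np.mean(hits)) / self.n

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 8))
sketch = ToyCounterSketch(dim=8)
for x in data:
    sketch.add(x)
print(sketch.density(np.zeros(8)))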
Web Mining-B (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Shaogang Ren (Baidu Research, USA), Dingcheng Li (Baidu Research, USA), Zhixin Zhou (Baidu Research, USA) and Ping Li (Baidu Research, USA).
Abstract
The thriving of deep models and generative models provides approaches to model high dimensional distributions. The fact that GANs can generate amazing realistic images implies that they can learn the data manifolds well. In this manuscript, we propose an approach to estimate the implicit likelihoods of GAN models. A stable regularized inverse function of the generator can be learned with the help of a variance network of the generator. The local variance of the sample distribution can be approximated by the normalized distance in the latent space. Simulation studies, anomaly detection, and likelihood testing on real-world datasets validate the proposed method, which outperforms some baseline methods in these tasks.
Gromit Yeuk-Yin Chan (New York University), Fan Du (Adobe), Ryan Rossi (Adobe), Anup Rao (Adobe), Eunyee Koh (Adobe), Claudio Silva (New York University) and Juliana Freire (NYU Poly).
Abstract
Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behavior. To explore clusters in different hierarchies and discover useful customer segments, marketers often need to explore different splits of the data. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which the current clustering techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. The algorithm first pre-processes the input points to compute annotations for cluster assignment. While the cluster assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for any moderately sized data. To address this challenge, we leverage a weighted Jaccard similarity metric and propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation of this approach on Spark, which effectively deals with data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory. Our scalable clustering algorithm enables data scientists and marketers to interactively explore and discover customer segments from millions of online visitor records in real-time.
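For readers unfamiliar with the similarity measure used here: the exact weighted Jaccard similarity between two sparse non-negative vectors is the sum of element-wise minima divided by the sum of element-wise maxima; the MinHash/LSH machinery in the paper is a way of estimating this quantity without computing it for all pairs. A tiny sketch of the exact computation (illustrative only, with made-up feature names):

def weighted_jaccard(u, v):
    # u, v: dicts mapping feature -> non-negative weight (sparse vectors).
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0

visitor_a = {"page:home": 3.0, "page:pricing": 1.0}
visitor_b = {"page:home": 1.0, "page:blog": 2.0}
print(weighted_jaccard(visitor_a, visitor_b))  # 1 / 6, roughly 0.167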
Shahab Raji (Rutgers University) and Gerard de Melo (Rutgers University).
Abstract
Affective analysis of textual data is instrumental in understanding human communication in the modern era of social media. A number of semantic resources have been proposed as attempts to capture the emotional associations of words. In this work, we show that we can obtain a resource that goes beyond the common binary association scores for emotion classification by using elegant techniques that draw on lexical associations as well as existing emotion lexicons. In a series of statistical and machine learning experiments, we show that these simple techniques outperform previous state-of-the-art approaches by substantial margins.
Masayasu Muraoka (IBM Research - Tokyo), Tetsuya Nasukawa (IBM Research - Tokyo), Rudy Raymond (IBM Research - Tokyo) and Bishwaranjan Bhattacharjee (IBM Thomas J. Watson Research Center).
Abstract
Significant advances in deep learning for Computer Vision have enabled object recognition systems to recognize a large number of visual concepts in images at almost the same level as humans. Given an image, such recognition systems output labels for representing visual concepts expressed in text. However, a visual concept may be represented by various expressions, not only by the typical expressions used for labels (e.g., car, automobile, or vehicle) but also by other expressions including casual expressions such as auto and specific expressions (e.g., BMW, Jeep, or Suzuki) in real-world textual data, such as SNS data. The expressions can also be expressed in various languages (e.g., Wagen, macchina, and mobil meaning car in German, Italian, and Indonesian, respectively). Yet, an object recognition system does not deal with this association because the system does not consider textual data. Associating textual expressions with the corresponding visual objects is essential for bridging the gap between vision and language because they are tightly linked. To this end, we propose a task called Visual Concept Naming to associate diverse textual expressions written by humans who have different background knowledge in different languages. The goal of the task is to extract textual expressions, i.e., names of visual concepts, from real-world multimodal data consisting of textual data combined with visual data. To tackle the task, we create a dataset consisting of 3.4 million tweets in total in three languages. We also propose a method for extracting candidate names of visual concepts and validating them by exploiting Web-based knowledge obtained through image search. To demonstrate the capability of our method, we conduct an experiment with the dataset we create and evaluate names obtained by our method through crowdsourcing, where we establish an evaluation method to verify the names. The experimental results indicate that the proposed method can identify a wide variety of names of visual concepts. The names we obtained also show interesting insights regarding languages and cultures.
Semantics (5)
(UTC/GMT +8) 10:30-12:30, April, 24, Friday
Meeting rooms are not available now
Xiangyu Zhao (College of Intelligence and Computing, Tianjin University), Longbiao Wang (College of Intelligence and Computing, Tianjin University), Ruifang He (College of Intelligence and Computing, Tianjin University), Ting Yang (College of Intelligence and Computing, Tianjin University), Jinxin Chang (College of Intelligence and Computing, Tianjin University) and Ruifang Wang (College of Intelligence and Computing, Tianjin University).
Abstract
Knowledge is essential for intelligent conversation systems to generate informative responses. This knowledge comprises a wide range of diverse modalities such as knowledge graphs (KGs), grounding documents and conversation topics. However, limited abilities in understanding language and utilizing different types of knowledge still challenge existing approaches. Some researchers try to enhance models' language comprehension ability by employing pre-trained language models, but they neglect the importance of external knowledge in specific tasks. In this paper, we propose a novel universal transformer-based architecture for dialogue systems, the Multiple Knowledge Syncretic Transformer (MKST), which fuses multi-knowledge in open-domain conversation. Firstly, the model is pre-trained on a large-scale corpus to learn commonsense knowledge. Then during fine-tuning, we divide the type of knowledge into two specific categories that are handled in different ways by our model. While the encoder is responsible for encoding dialogue contexts with multifarious knowledge together, the decoder with a knowledge-aware mechanism attentively reads the fusion of multi-knowledge to promote better generation. This is the first attempt that fuses multi-knowledge in one conversation model. The experimental results demonstrate that our model achieves significant improvements on knowledge-driven dialogue generation tasks over state-of-the-art baselines. Meanwhile, our new benchmark could facilitate further study in this research area.
Shuo Zhang (University of Stavanger), Edgar Meij (Bloomberg L.P.), Krisztian Balog (University of Stavanger) and Ridho Reinanda (Bloomberg LP).
Abstract
When working with any sort of knowledge base (KB) one has to make sure it is as complete and also as up-to-date as possible. Both tasks are non-trivial as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such they require a significant amount of labor. Tables on the Web on the other hand are abundant and have the distinct potential to assist with these tasks. In particular, we can leverage the rich content in such tables to discover new entities, properties, and relationships. This paper addresses two main tasks in this context: table-to-KB matching and novel entity discovery. The first task aims to infer table semantics by linking table cells and heading columns to elements of a KB. We propose a novel, feature-based method for this task and on two public test collections, we demonstrate substantial improvements over the state-of-the-art in terms of precision whilst also improving recall. We further apply our method to annotate a corpus of 3M tables, which will be released as a public resource. The second task is novel and targets the discovery of new entities and relationships, where we differentiate different types including in-KB ("known") and out-of-KB ("novel") information. When evaluated using three purpose-built test collections, we find that our proposed approaches obtain a marked improvement on precision whilst keeping recall stable.
Medina Andresel (Vienna University of Technology), Julien Corman (Free University of Bozen-Bolzano), Magdalena Ortiz (Vienna University of Technology), Juan L. Reutter (Pontificia Universidad Católica), Ognjen Savkovic (Free University of Bolzano) and Mantas Simkus (Vienna University of Technology).
Abstract
SHACL (Shapes Constraint Language) is a W3C recommendation for validating graph-based data against a set of conditions. Among the interesting features of SHACL is the ability to define recursive shapes, to state, for example, that children of persons must be persons. Although the recommendation left open the semantics of recursive shapes, there have already been proposals to extend the official semantics for the case of recursion. However, they are based on the idea of possibility: a graph will be valid against a schema as long as one can find a way to assign shapes to nodes in such a way that all constraints are satisfied. However, this definition is not constructive, as it does not give any guidelines on how one is to obtain such an assignment, and it may lead to unfounded assignments, where the only reason to state that a node has a certain shape is because it serves to validate the graph. In this paper we propose a stricter semantics for SHACL that is based on the idea of stable models in logic programming: instead of allowing any possible assignment, we only allow those where each shape assignment is justified by a given constraint. We further exploit the relation between our semantics and logic programming, and show that the validation problem for a graph and a SHACL schema can be encoded into an ASP program. This also gives us a constructive semantics for a special type of SHACL schemas that are based on the idea of stratified negation. Finally, we also extend our semantics in the context of partial assignments, which have been used to define a more relaxed notion of validation that is tolerant to certain faults in the schema. In this case, we show that the stable semantics with partial assignments can be captured by the same ASP translation, this time working with well-founded ASP models.
Jingbo Shang (University of Illinois at Urbana-Champaign), Xinyang Zhang (University of Illinois at Urbana-Champaign), Liyuan Liu (University of Illinois at Urbana-Champaign), Sha Li (University of Illinois at Urbana-Champaign) and Jiawei Han (University of Illinois at Urbana-Champaign).
Abstract
The automated construction of topic taxonomies can benefit numerous applications, including web search, recommendation, and knowledge discovery. One of the major advantages of automatic taxonomy construction is the ability to capture corpus-specific information and adapt to different scenarios. To better reflect the characteristics of a corpus, we take the meta-data of documents into consideration and view the corpus as a text-rich network. In this paper, we propose NetTaxo, a novel automatic topic taxonomy construction framework, which goes beyond the existing paradigm and allows text data to collaborate with network structure. Specifically, we learn term embeddings from both text and network as contexts. Network motifs are adopted to capture appropriate network contexts. We conduct an instance-level selection for motifs, which further refines term embedding according to the granularity and semantics of each taxonomy node. Clustering is then applied to obtain sub-topics under a taxonomy node. Extensive experiments on two real-world datasets demonstrate the superiority of our method over the state-of-the-art, and further verify the effectiveness and importance of instance-level motif selection.
Research Tracks (8)
Web Mining-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Chen Zhao (University of Maryland), Chenyan Xiong (Microsoft), Xin Qian (University of Maryland) and Jordan Boyd-Graber (University of Maryland).
Abstract
We introduce DELFT, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. DELFT builds a free-text knowledge graph from Wikipedia, with entities as nodes, and sentences in which entities co-occur as edges. For each question, DELFT finds the subgraph linking question entity nodes to candidates using text sentences as edges, yielding a dense and high coverage semantic graph. A novel graph neural network reasons over the free-text graph, combining evidence on the nodes via information along edge sentences, to select a final answer. Experiments on three question answering datasets show DELFT can answer entity-rich questions better than machine reading based models, BERT-based answer ranking and memory networks, by large margins. DELFT's strong advantage comes from both the high coverage of its free-text knowledge graph (more than double that of DBpedia relations) and the novel graph neural network model which conducts accurate structural reasoning on the rich but also noisy free-text evidence.
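The free-text graph construction described above can be illustrated with a small sketch: given sentences already annotated with the entities they mention (the sentences and entity links below are made up for illustration), every pair of co-occurring entities becomes an edge carrying the sentence as evidence. This shows only the data structure, not DELFT's retrieval or graph neural network reasoning.

import itertools
from collections import defaultdict

# Hypothetical sentences with pre-linked entity mentions.
annotated_sentences = [
    ("Marie Curie won the Nobel Prize in Physics in 1903.",
     ["Marie Curie", "Nobel Prize in Physics"]),
    ("Marie Curie was born in Warsaw.",
     ["Marie Curie", "Warsaw"]),
]

# Nodes are entities; each edge stores the sentences in which both entities co-occur.
edges = defaultdict(list)
for sentence, entities in annotated_sentences:
    for a, b in itertools.combinations(sorted(set(entities)), 2):
        edges[(a, b)].append(sentence)

for (a, b), evidence in edges.items():
    print(a, "--", b, "|", len(evidence), "evidence sentence(s)")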
Wei-Fan Chen (Paderborn University), Shahbaz Syed (Leipzig University), Benno Stein (Bauhaus-Universität Weimar), Matthias Hagen (Martin-Luther-Universität Halle-Wittenberg) and Martin Potthast (Leipzig University).
Abstract
An abstractive snippet is an originally created piece of text to summarize a web page on a search engine results page. Compared to the conventional extractive snippets, which are generated by extracting literal phrases and sentences from a web page, abstractive snippets circumvent copyright issues; even more interesting is the fact that they open the door for personalization. Abstractive snippets have been evaluated as equally powerful in terms of user acceptance and expressiveness, but the key question remains: can abstractive snippets be automatically generated with sufficient quality? This paper introduces a competitive approach to abstractive snippet generation, supported by a thorough evaluation. We identify new sources that can be exploited via distant supervision to serve as ground truth data for this kind of summarization task: web directories (a hierarchical list of websites with descriptions organized by subject) and anchor contexts (the sentences around hyperlinks). Regarding the former, we utilize the DMOZ Open Directory Project, which is one of the largest human-edited directories on the web. Regarding the latter, we mine the entire ClueWeb09 and ClueWeb12 corpora. Altogether, we compile more than 3 million triples of the form
Bang Liu (University of Alberta), Haojie Wei (Tencent), Di Niu (University of Alberta), Haolan Chen (Tencent) and Yancheng He (Tencent).
Abstract
Learning to ask questions is critical to both human and machine intelligence. It helps knowledge acquisition, improves machine reading comprehension and question-answering tasks, and helps to continue a conversation in chatbots. Existing answer-aware question generation models are ineffective at generating a large number of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping problem. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), a novel system aimed at automatically generating diverse and high-quality question-answer pairs from unlabeled text corpus at scale by mimicking the way a human asks questions. Our system consists of: i) an information extractor, which samples multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions about a passage, utilizing the extracted assistive information as an input; and iii) a neural quality controller, which filters out low-quality generated data based on text entailment. We compare our question generation models with existing approaches and perform pilot user studies to evaluate the quality of the generated question-answer pairs. The evaluation results show that our system dramatically outperforms state-of-the-art neural question generation models in terms of the generation quality, while being scalable in the meantime. With models trained on a relatively smaller amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences in Wikipedia.
Yuqing Xie (University of Waterloo), Wei Yang (RSVP.ai), Luchen Tan (RSVP.ai), Kun Xiong (RSVP.ai), Nicholas Jing Yuan (HUAWEI Cloud & AI), Baoxing Huai (HUAWEI Cloud & AI), Ming Li (University of Waterloo) and Jimmy Lin (University of Waterloo).
Abstract
We tackle the problem of question answering directly on a large document collection, combining simple "bag of words" passage retrieval with a BERT-based reader for extracting answer spans. In the context of this architecture, we present a data augmentation technique using distant supervision to automatically annotate paragraphs as either positive or negative examples to supplement existing training data, which are then used together to fine-tune BERT. We explore a number of details that are critical to achieving high accuracy: the proper sequencing of different datasets during fine-tuning, the balance between "difficult" vs. "easy" examples, and different approaches to gathering negative examples. Experimental results show that, with the appropriate settings, we can achieve large gains in effectiveness on two English and two Chinese QA datasets. We are able to achieve state-of-the-art results without any modeling advances, which once again affirms the cliché "there's no data like more data".
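The distant-supervision step described above can be approximated in a few lines: given a question, its known answer string, and retrieved paragraphs, a paragraph is treated as a positive example if it contains the answer span and as a negative example otherwise. The snippet below is a simplified sketch of that labeling heuristic (the paper's actual annotation and filtering procedure may be more involved; the example data is made up).

def label_paragraphs(question, answer, paragraphs):
    # Distant supervision: a paragraph is "positive" if the answer string appears in it.
    examples = []
    for p in paragraphs:
        label = 1 if answer.lower() in p.lower() else 0
        examples.append({"question": question, "paragraph": p, "label": label})
    return examples

paragraphs = [
    "Ottawa is the capital city of Canada.",          # contains the answer -> positive
    "Toronto is the most populous city in Canada.",   # does not -> negative
]
for ex in label_paragraphs("What is the capital of Canada?", "Ottawa", paragraphs):
    print(ex["label"], ex["paragraph"])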
Zilong Wang (Peking University), Zhaohong Wan (Peking University) and Xiaojun Wan (Peking University).
Abstract
Multimodal sentiment analysis is an important research area that predicts a speaker's sentiment tendency through features extracted from textual, visual and acoustic modalities. The central challenge is the fusion method of the multimodal information. A variety of fusion methods have been proposed, but few of them adopt end-to-end translation models to mine the subtle correlation between modalities. Enlightened by the recent success of Transformer in the area of machine translation, we propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis. We assume that translation between modalities contributes to a better joint representation of the speaker's utterance. With Transformer, the learned features embody the information both from the source modality and the target modality. We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP. The experiments show that our proposed method achieves state-of-the-art performance.
Social Network-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Chun Lo (LinkedIn), Emilie de Longueau (LinkedIn), Ankan Saha (LinkedIn), Shaunak Chatterjee (LinkedIn) and Ye Tu (LinkedIn).
Abstract
Social networks act as major content marketplaces where creators and consumers come together to share and consume various kinds of content. Popular content ranking applications (e.g., newsfeed, moments, notifications, ads) and edge recommendations (e.g., connect to members, follow celebrities or groups or hashtags) on such platforms aim at improving the consumer experience. In this work, we focus on the creator experience and specifically on improving edge recommendations to better serve creators in such ecosystems. The audience and reach of creators (individuals, celebrities, publishers and companies) are critically shaped by these edge recommendation products. Hence, incorporating creator utility in such recommendations can have a material impact on their success, and in turn, on the marketplace. Our proposed solution involves edge-level creator utility estimation (for currently unformed edges) and an experiment design that accounts for the network effect. We also discuss the implementation of our proposal at scale on LinkedIn, a professional network with 645M+ members, and report our findings.
Adam Breuer (Harvard University), Roee Eilat (Facebook) and Udi Weinsberg (Facebook).
Abstract
In this paper, we study the problem of early detection of fake user accounts on social networks based solely on their network connectivity with other users. Removing such accounts is a core task for maintaining the integrity of social networks, and early detection helps to reduce the harm that such accounts inflict. However, new fake accounts are notoriously difficult to detect via graph-based algorithms, as their small number of connections are unlikely to reflect a significant structural difference from those of new real accounts. We present the SybilEdge algorithm, which determines whether a new user is a fake ('sybil') account by aggregating over (I) her choices of friend request targets and (II) these targets' respective responses. SybilEdge performs this aggregation giving more weight to a user's choices of targets to the extent these targets are preferred by other fake versus real users, and also to the extent these targets respond differently to fake versus real users. We show that this algorithm rapidly detects new fake accounts at scale on the Facebook network, and also that it performs well compared to state-of-the-art alternatives on simulated networks designed to capture a variety of sybil attack strategies. To our knowledge, this is the first time a graph-based algorithm has been shown to achieve high accuracy (AUC > 0.9) on new users who have only sent a small number of friend requests.
Han Xiao (Aalto University), Bruno Ordozgoiti (Aalto University) and Aristides Gionis (Aalto University).
Abstract
Signed graphs have been used to model interactions in social networks, which can be either positive (friendly) or negative (antagonistic). The model has been used to study polarization and other related phenomena in social networks, which can be harmful to the process of democratic deliberation in our society. An interesting and challenging task in this application domain is to detect polarized communities in signed graphs. A number of different methods have been proposed for this task; however, existing methods aim at finding globally optimal solutions. Instead, in this paper we are interested in finding polarized communities that are related to a small set of seed nodes provided as input. Seed nodes may consist of two sets, which constitute the two sides of a polarized structure. In this paper we formulate the problem of finding local polarized communities in signed graphs as a locally-biased eigen-problem. By viewing the eigenvector associated with the smallest eigenvalue of the Laplacian matrix as the solution of a constrained optimization problem, we are able to incorporate the local information as an additional constraint. In addition, we show that the locally-biased vector can be used to find communities with an approximation guarantee with respect to a local analogue of the Cheeger constant on signed graphs. By exploiting the sparsity in the input graph, an indicator-vector for the polarized communities can be found in time linear in the graph size. Our experiments on real-world networks validate the proposed algorithm and demonstrate its usefulness at finding local structures in this semi-supervised manner.
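As background for the eigen-problem mentioned above: on a small signed graph, the eigenvector associated with the smallest eigenvalue of the signed Laplacian already separates two antagonistic groups by sign. The global sketch below (illustration only, on a made-up four-node graph, and without the locality constraint towards seed nodes that the paper introduces) computes that vector with numpy.

import numpy as np

# Toy signed adjacency matrix: +1 friendly edge, -1 antagonistic edge, 0 no edge.
A = np.array([[ 0,  1, -1, -1],
              [ 1,  0, -1, -1],
              [-1, -1,  0,  1],
              [-1, -1,  1,  0]], dtype=float)

D = np.diag(np.abs(A).sum(axis=1))   # degrees use absolute edge weights
L = D - A                            # signed Laplacian
eigenvalues, eigenvectors = np.linalg.eigh(L)
v = eigenvectors[:, 0]               # eigenvector of the smallest eigenvalue

# The sign pattern of v suggests the two opposing sides: {0, 1} versus {2, 3}.
print(np.round(eigenvalues[0], 3), np.sign(v))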
Anton Tsitsulin (Google), Marina Munkhoeva (Google) and Bryan Perozzi (Google AI).
Abstract
Graph comparison is a fundamental operation of many tasks in data mining and information retrieval. Because of the combinatorial nature of graphs, it is hard to balance the expressiveness of the similarity measure and its scalability. Spectral graph analysis provides quintessential tools for mining information from networks, as the spectrum of a graph reflects its multi-scale structure and, thus, is a well-suited foundation for reasoning about differences between graphs. However, computing the full spectrum of a graph is computationally prohibitive, and spectral methods for graph comparison therefore must rely on rough approximation techniques with few error guarantees. In addition to approximation error, scalability is a bottleneck for most graph comparison methods. Few distance measures between unaligned graphs can handle graphs with more than ten thousand nodes, and those which can, sacrifice approximation guarantees and accuracy for the sake of scalability. In this work, we propose SLaQ, an efficient and effective approximation technique for computing two distances between graphs with millions of nodes and billions of edges. We derive the corresponding error bounds and demonstrate that accurate computation is possible in time linear in the number of graph edges. In a thorough experimental evaluation we show that SLaQ outperforms existing approximation methods, sometimes by several orders of magnitude in accuracy, while maintaining comparable performance, allowing accurate comparison of million-scale graphs in a matter of minutes on a single machine.
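For small graphs, spectral quantities of the kind SLaQ approximates can be computed exactly. The sketch below compares two toy graphs via Laplacian heat-trace signatures (a common spectral descriptor), using a full eigendecomposition; this brute-force version is precisely what becomes infeasible at millions of nodes, and it is not the SLaQ algorithm itself. The graphs and time points are arbitrary choices for illustration.

import numpy as np

def laplacian(adjacency):
    return np.diag(adjacency.sum(axis=1)) - adjacency

def heat_trace_signature(adjacency, times=(0.1, 1.0, 10.0)):
    # h(t) = sum_i exp(-t * lambda_i), computed from the full Laplacian spectrum.
    eigenvalues = np.linalg.eigvalsh(laplacian(adjacency))
    return np.array([np.exp(-t * eigenvalues).sum() for t in times])

# Two toy graphs: a 4-cycle and a 4-node path (adjacency matrices).
cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)

distance = np.linalg.norm(heat_trace_signature(cycle) - heat_trace_signature(path))
print(distance)  # a simple spectral distance between the two graphs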
Prasenjit Dey (Microsoft), Kunal Goel (Microsoft) and Rahul Agrawal (Microsoft).
Abstract
The measure of similarity between nodes in a graph is a useful tool in many areas of computer science. SimRank, proposed by Jeh and Widom, is a classic measure of the similarity of nodes in a graph that has both theoretical and intuitive properties and has been extensively studied and used in many applications such as query rewriting, link prediction, collaborative filtering and so on. Existing works based on SimRank primarily focus on preserving the microscopic structure, such as the second and third order proximity of the vertices, while the macroscopic scale-free property is largely ignored. The scale-free property is a critical property of real-world web graphs, where the vertex degrees follow a heavy-tailed distribution. In this paper, we introduce P-Simrank, which extends the idea of SimRank to scale-free bipartite networks. To study the efficacy of the proposed solution on a real-world problem, we tested it on the well-known query-rewriting problem in bipartite click graphs, similar to Simrank++, which acts as our baseline. We show that Simrank++ produces sub-optimal similarity scores in the case of bipartite graphs where the degree distribution of vertices follows a power law. We also show how P-Simrank can be optimized for real-world large graphs. Finally, we experimentally evaluate the P-Simrank algorithm against Simrank++, using actual click graphs obtained from Bing, and show that P-Simrank outperforms Simrank++ in a variety of metrics.
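For reference, the classic SimRank recurrence that this paper builds on states that two distinct nodes are similar if their in-neighbors are similar: s(a, b) = C / (|I(a)| |I(b)|) * sum over all pairs (x, y) with x in I(a), y in I(b) of s(x, y), with s(a, a) = 1 and s(a, b) = 0 when either node has no in-neighbors. Below is a compact iterative sketch of the original SimRank (not the paper's P-Simrank variant, which adapts the measure to power-law bipartite click graphs); the toy graph is hypothetical.

from itertools import product

def simrank(in_neighbors, C=0.8, iterations=10):
    # in_neighbors: dict node -> list of in-neighbors.
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                new_sim[(a, b)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in Ia for y in Ib)
            new_sim[(a, b)] = C * total / (len(Ia) * len(Ib))
        sim = new_sim
    return sim

# Toy directed graph given by in-neighbor lists (hypothetical queries and ads).
graph = {"q1": [], "q2": [], "ad1": ["q1", "q2"], "ad2": ["q1"]}
print(round(simrank(graph)[("ad1", "ad2")], 3))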
User Modeling-A (8)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Yuxin Xiao (University of Illinois at Urbana-Champaign), Adit Krishnan (University of Illinois at Urbana-Champaign) and Hari Sundaram (University of Illinois at Urbana-Champaign).
Abstract
Some online social networks provide an explicit mechanism to allocate rewards based on users' actions, while the mechanism is more opaque in other types of social networks. Nonetheless, there are always individuals who are able to obtain higher reputations than their peers in those networks. An intuitive yet important question to ask is whether they employ strategic behaviors to become influential. It might appear that the influencers in those networks "have gamed the system" and the rest have not figured out the mechanism. However, it remains difficult to draw conclusions on the rationality of those winning individuals due to factors like the combinatorial strategy space, the inability to determine the payoffs and the resource limitation faced by individuals. The challenging nature of this question draws long-term attention from both the theory and data mining communities. Therefore, in this paper, we are motivated to investigate whether resource-limited individuals are able to discover strategic behaviors associated with high payoffs when producing contents in social networks. To properly tackle this question, we propose a novel framework of Dynamic Dual Attention Networks (DDAN) which models individuals' content production strategies under the influence of social interactions involved in the process. Extensive experimental results illustrate the model's effectiveness in user behavior modeling. Furthermore, we make three strong empirical findings: first, different strategies give rise to different payoffs; second, the best performing individuals exhibit stability in their preferential orders over strategies, which indicates the emergence of strategic behaviors; third, the stability of preference is correlated with high payoffs. To the best of our knowledge, this is the first attempt to formally identify strategic behaviors from empirical data.
Xian Wu (University of Notre Dame), Suleyman Cetintas (Yahoo Research), Deguang Kong (Google), Miao Lu (Yahoo Research), Jian Yang (Yahoo Research) and Nitesh Chawla (University of Notre Dame).
Abstract
Contextual multi-armed bandit algorithms have received significant attention for modeling users' preferences in online personalized recommender systems in a timely manner. While significant progress has been made along this direction, a few major challenges have not been well addressed yet: (i) a vast majority of the literature is based on linear models that cannot capture complex non-linear inter-dependencies of user-item interactions; (ii) existing literature mainly ignores the latent relations between users and non-recommended items, and hence may not properly reflect users' preferences in the real world; (iii) current solutions are mainly based on historical data and are prone to cold-start problems for new users who have no interaction history. To address the above challenges, we develop a Graph Regularized Cross-modal (GRC) learning model, a general framework to exploit transferable knowledge learned from multi-modal user-item interactions as well as the external features of users and items in online personalized recommendations. In particular, the GRC framework seamlessly combines the linearity of the contextual bandit framework and the non-linearity of neural networks in modeling the complex inherent structure of user-item interactions. We further augment GRC with a metric learning technique and a graph-constrained embedding module to map the units from different dimensions (temporal, social and semantic) into the same latent space. An extensive set of experiments conducted on two benchmark datasets as well as a large-scale proprietary dataset from a major search engine demonstrates the power of the proposed GRC model in effectively capturing users' dynamic preferences under different settings, outperforming all baselines by a large margin.
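As a point of reference, the sketch below shows the generic linear contextual bandit (LinUCB-style) component that frameworks of this kind build on; it is not the GRC model itself, and all names and parameters are illustrative.

```python
# A minimal LinUCB-style contextual bandit sketch: the linear component that
# models such as GRC combine with non-linear (neural) feature extractors.
# Generic illustration only, not the paper's GRC framework.
import numpy as np

class LinUCBArm:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)        # ridge-regression Gram matrix
        self.b = np.zeros(dim)      # reward-weighted feature sum
        self.alpha = alpha          # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        # point estimate plus an optimism bonus for uncertain directions
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def recommend(arms, contexts):
    # pick the item whose context features give the highest upper bound
    return max(range(len(arms)), key=lambda i: arms[i].ucb(contexts[i]))
```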
Yongchun Zhu (Institute of Computing Technology, Chinese Academy of Sciences), Dongbo Xi (Institute of Computing Technology, Chinese Academy of Sciences), Bowen Song (Ant Financial Services Group), Fuzhen Zhuang (Institute of Computing Technology, Chinese Academy of Sciences), Shuai Chen (Ant Financial Services Group), Xi Gu (Ant Financial Services Group) and Qing He (Institute of Computing Technology, Chinese Academy of Sciences).
Abstract
With the explosive growth of the e-commerce industry, detecting online transaction fraud in real-world applications has become increasingly important to the development of e-commerce platforms. The sequential behavior history of users provides useful information in differentiating fraudulent payments from regular ones. Recently, some approaches have been proposed to solve this sequence-based fraud detection problem. However, these methods usually suffer from two problems: the prediction results are difficult to explain and the exploitation of the internal information of behaviors is insufficient. To tackle the above two problems, we propose a Hierarchical Explainable Network (HEN) to model users' behavior sequences, which could not only improve the performance of fraud detection but also make the inference process interpretable. Meanwhile, as e-commerce business expands to new domains, e.g., new countries or new markets, one major problem for modeling user behavior in fraud detection systems is the limitation of data collection, e.g., very few data/labels being available. Thus, in this paper, we further propose a transfer framework to tackle the cross-domain fraud detection problem, which aims to transfer knowledge from existing domains (source domains) with sufficient, mature data to improve the performance in the new domain (target domain). Our proposed method is a general transfer framework that can be applied not only to HEN but also to various existing models in the Embedding & MLP paradigm. By utilizing data from a world-leading cross-border e-commerce platform, we conduct extensive experiments in detecting card-stolen transaction fraud in different countries to demonstrate the superior performance of HEN. In addition, based on 90 transfer-task experiments, we also demonstrate that our transfer framework could not only contribute to the cross-domain fraud detection task with HEN, but also be universal and expandable for various existing models. Moreover, HEN and the transfer framework together form three levels of attention, which greatly increases the explainability of the detection results.
Wen Wang (East China Normal University), Wei Zhang (East China Normal University), Shukai Liu (Search Product Center, WeChat Search Application Department, Tencent), Qi Liu (Search Product Center, WeChat Search Application Department, Tencent), Bo Zhang (Search Product Center, WeChat Search Application Department, Tencent), Leyu Lin (Search Product Center, WeChat Search Application Department, Tencent) and Hongyuan Zha (Georgia Institute of Technology).
Abstract
Session-based target behavior prediction is the task of predicting the next item to be interacted with in the current anonymous behavior sequence under a specific type of user behavior (e.g., clicking an item). Although existing methods for session-based behavior prediction leverage powerful representation learning approaches to encode items' sequential relevance in a low-dimensional space, they suffer from several limitations. Firstly, they focus on using only the same type of user behavior as input for prediction, and ignore the potential of leveraging other types of behavior as auxiliary information, which is particularly crucial when the target behavior is sparse but important (e.g., buying or sharing an item). Secondly, item-to-item relations in different sequences are modeled separately and locally, and they lack a principled way to globally encode these relations more effectively. To overcome these limitations, we propose a novel Multi-relational Graph Neural Network model for Session-based target behavior Prediction, namely MGNN-SPred for short. Specifically, we build a Multi-Relational Item Graph (MRIG) based on all behavior sequences from all sessions, involving target and auxiliary behavior types. MGNN-SPred learns global item-to-item relations based on MRIG and further obtains local representations for the current target and auxiliary behavior sequences, respectively. In the end, MGNN-SPred leverages a gating mechanism to adaptively fuse the different types of local representations for predicting the next item to be interacted with under the target behavior. The extensive experiments on two real-world datasets demonstrate the superiority of our proposed model by comparing with state-of-the-art session-based prediction methods, validating the benefits of leveraging auxiliary behavior and learning item-to-item relations over MRIG.
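The sketch below shows a common form of the gating mechanism described above, i.e., a learned sigmoid gate that mixes the target-behavior and auxiliary-behavior representations; the exact formulation in MGNN-SPred may differ, and all dimensions are illustrative.

```python
# A minimal sketch of a learned gate that adaptively fuses a target-behavior
# representation with an auxiliary-behavior representation. This is the common
# sigmoid-gated convex combination; MGNN-SPred's exact formulation may differ.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_target, h_aux):
        g = torch.sigmoid(self.gate(torch.cat([h_target, h_aux], dim=-1)))
        return g * h_target + (1 - g) * h_aux  # element-wise mixture

fusion = GatedFusion(dim=64)
h = fusion(torch.randn(32, 64), torch.randn(32, 64))  # (batch, dim)
```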
Shuai Zhao (New Jersey Institute of Technology), Achir Kalra (Forbes Media), Cristian Borcea (New Jersey Institute of Technology) and Yi Chen (New Jersey Institute of Technology).
Abstract
The fast-growing usage of ad-blockers results in a large revenue decrease for ad-supported online websites. Facing this problem, many online publishers choose either to cooperate with ad-blocker software companies to show acceptable ads or to build a wall that requires users to whitelist the site for content access. However, there is a lack of studies on the impact of these two counter-ad-blocking strategies on user behavior. To address this issue, we conduct a randomized field experiment on the website of Forbes Media, a major US media publisher. The ad-blocker users are divided into a treatment group, which receives the wall strategy, and a control group, which receives the acceptable ads strategy. We utilize the difference-in-differences method to estimate the causal effects. Our study shows that the wall strategy has an overall negative impact on user engagement. It has no statistically significant effect on highly-engaged users, as they would view the pages no matter what strategy is used. It has a big impact on low-engaged users, who have no loyalty to the site. Our study also shows that revisiting behavior decreases over time, but the ratio of session whitelisting increases over time as the remaining users have relatively high loyalty and high engagement. The paper concludes with discussions of managerial insights for publishers when determining counter-ad-blocking strategies.
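For concreteness, the standard two-by-two difference-in-differences estimate that this kind of analysis relies on is sketched below; the column names are hypothetical and the study's actual regression specification may be richer.

```python
# The standard two-by-two difference-in-differences estimate:
# (treated post - treated pre) - (control post - control pre).
# Column names below are hypothetical, for illustration only.
import pandas as pd

def did_estimate(df):
    """df columns: 'treated' (0/1), 'post' (0/1), 'engagement' (outcome)."""
    means = df.groupby(["treated", "post"])["engagement"].mean()
    return (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])
```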
Crowdsourcing (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Saiph Savage (Universidad Nacional Autonoma de Mexico (UNAM)), Chun Wei Chiang (West Virginia University), Susumu Saito (Waseda University), Carlos Toxtli (West Virginia University) and Jeffrey Bigham (Carnegie Mellon University).
Abstract
Crowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most workers use these tools, they still earn less than the minimum wage. We argue that the missing element is guidance on how to use transparency information. In this paper, we explore how novice workers can improve their earnings by following the transparency criteria of Super Turkers, i.e., crowd workers who earn higher salaries on Amazon Mechanical Turk (MTurk). We believe that Super Turkers have developed effective processes for using transparency information. Therefore, by having novices follow Super Turker criteria (ones that are simple and popular among Super Turkers), we can help novices increase their wages. For this purpose, we: (i) conducted a survey and data analysis to computationally identify a simple yet common set of criteria that Super Turkers use for handling transparency tools; (ii) deployed a two-week field experiment with novices who followed these Super Turker criteria to find better work on MTurk. Novices in our study viewed over 25,000 tasks by 1,394 requesters. We found that novices who utilized these Super Turker criteria earned better wages than other novices. Our results highlight that tool development to support crowd workers should be paired with educational opportunities that teach workers how to effectively use the tools and their related metrics (e.g., transparency values). We finish with design recommendations for empowering crowd workers to earn higher salaries.
Anthony Liu (University of Michigan), Santiago Guerra (Universidad de Monterrey), Isaac Fung (University of Michigan), Gabriel Matute (University of Michigan), Ece Kamar (Microsoft) and Walter Lasecki (University of Michigan, Computer Science & Engineering).
Abstract
Predictive models are susceptible to errors called unknown unknowns, in which the model assigns incorrect labels to instances with high confidence. These commonly arise when training data does not represent variations of a class encountered at model deployment. Prior work showed that crowd workers can identify instances of unknown unknowns, but asking the crowd to identify a sufficient number of individual instances can be costly. Instead, this paper presents an approach that leverages people's ability to find patterns that can be used to retrain classifiers more effectively with fewer examples. Our approach asks crowd workers to suggest and verify patterns in unknown unknowns. We then use these patterns to train a secondary classifier that is used to identify additional examples from existing data that the primary classifier has encountered (and potentially misclassified) in the past. Our experiments show that using this approach outperforms existing unknown unknown detection methods for improving classifier performance. This work is the first to leverage crowds to identify error patterns in large datasets to improve the training of machine learning classifiers.
Yifang Yin (National University of Singapore), Jagannadan Varadarajan (Grab), Guanfeng Wang (GrabTaxi Research and Development Centre), Xueou Wang (National University of Singapore), Dhruva Sahrawat (iiitd), Roger Zimmermann (National University of Singapore) and See-Kiong Ng (National University of Singapore).
Abstract
The quality of a digital map is of utmost importance for geo-aware services. However, maintaining an accurate and up-to-date map is a highly challenging task that usually involves a substantial amount of manual work. To reduce the manual effort, methods have been proposed to automatically derive road attributes by mining GPS traces. However, previous methods always modeled each road attribute separately based on intuitive hand-crafted features extracted from GPS traces. This observation motivates us to propose a machine-learning-based method to learn joint features not only from GPS traces but also from map data. To model the relations among the target road attributes, we extract low-level shared feature embeddings via multi-task learning, while still being able to generate task-specific fused representations by applying attention-based feature fusion. To model the relations between the target road attributes and other contextual information that is available from a digital map, we propose to leverage map tiles at road centers as visual features that capture information about the geographic objects surrounding the roads. We perform extensive experiments on OpenStreetMap data, where state-of-the-art classification accuracy is obtained compared to existing road-attribute detection approaches.
Xiao Hu (Purdue University), Haobo Wang (Purdue University), Anirudh Vegesana (Purdue University), Somesh Dube (Purdue University), Kaiwen Yu (Purdue University), Gore Kao (Purdue University), Shuo-Han Chen (Purdue University), Yung-Hsiang Lu (Purdue University), George Thiruvathukal (Loyola University) and Ming Yin (Purdue University).
Abstract
Despite many exciting innovations in computer vision, recent studies reveal a number of risks in existing computer vision systems, suggesting that the results of such systems may be unfair and untrustworthy. Many of these risks can be partly attributed to the use of a training image dataset that exhibits sampling biases and thus does not accurately reflect the real visual world. Being able to detect potential sampling biases in the visual dataset prior to model development is thus essential for mitigating fairness and trustworthiness concerns in computer vision. In this paper, we propose a three-step crowdsourcing workflow to get humans into the loop for facilitating bias discovery in image datasets. Through two sets of evaluation studies, we find that the proposed workflow can effectively organize the crowd to detect sampling biases both in datasets that are artificially created with designed biases and in real-world image datasets that are widely used in computer vision research and system development.
Health (4)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Siddharth Biswal (Georgia Institute of Technology), Cao Xiao (IQVIA), Lucas Glass (IQVIA), Brandon Westover (MGH) and Jimeng Sun (Georgia Institute of Technology).
Abstract
Generating clinical reports from raw recordings such as X-rays and electroencephalograms (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate the whole report from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) it does not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors' preferences. We propose CLinicAl Report Auto-completion (CLARA), an interactive method that generates reports sentence by sentence based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 on EEG reports for sentence-level generation, which is up to a 35% improvement over the best baseline. In our qualitative evaluation, CLARA is also shown to produce reports with a higher level of approval by doctors.
Brit Youngmann (Microsoft Research), Elad Yom-Tov (Microsoft Research), Ran Gilad-Bachrach (Microsoft Research) and Danny Karmon (Microsoft Healthcare NExT).
Abstract
Search advertising is one of the most commonly used methods of advertising. Past work has shown that search advertising can be employed to improve health by eliciting positive behavioral change. However, writing effective advertisements requires expertise and (possibly expensive) experimentation, both of which may not be available to public health authorities wishing to elicit such behavioral changes, especially when dealing with a public health crisis such as an epidemic outbreak. Here we develop an algorithm which builds on past advertising data to train a sequence-to-sequence Deep Neural Network which “translates” advertisements into optimized ads that are more likely to be clicked. The network is trained using more than 114 thousand ads shown on Microsoft Advertising. We apply this translator to two health-related domains: Medical Symptoms (MS) and Preventative Healthcare (PH) and measure the improvements in click-through rates (CTR). Our experiments show that the generated ads are predicted to have a higher CTR for 81% of MS ads and 76% of PH ads. To understand the differences between the generated ads and the original ones, we develop estimators for the affective attributes of the ads. We show that the generated ads contain more calls-to-action and that they reflect higher valence (36% increase) and higher arousal (87%) on a sample of 1000 ads. Finally, we run an advertising campaign where 10 random ads and their rephrased versions from each of the domains are run in parallel. We show an average improvement in CTR of 68% for the generated ads compared to the original ads. Our results demonstrate the ability to automatically optimize advertisements for the health domain. We believe that our work offers health authorities an improved ability to help nudge people towards healthier behaviors while saving the time and cost needed to optimize advertising campaigns.
Wenchao Yu (University of California, Los Angeles), Lu Wang (Georgia Institute of Technology), Wei Cheng (NEC Labs), Martin Renqiang Ren (NEC Labs), Bo Zong (NEC Labs), Xiaofeng He (East China Normal University), Hongyuan Zha (Georgia Institute of Technology), Wei Wang (University of California, Los Angeles) and Haifeng Chen (NEC Labs).
Abstract
Recent developments in discovering dynamic treatment regimes (DTRs) have heightened the importance of deep reinforcement learning (DRL), which is used to recover doctors' treatment policies. However, existing DRL-based methods exhibit the following limitations: 1) supervised methods based on behavior cloning suffer from compounding errors; 2) the self-defined reward signals in reinforcement learning models are either too sparse or need clinical guidance; 3) only positive trajectories (e.g., patients who survived) are considered in current imitation learning models, while negative trajectories (patient samples with negative outcomes, e.g., deceased patients) are largely ignored, even though they are examples of what not to do and could help the learned policy avoid repeating mistakes. To address these limitations, in this paper we propose the adversarial cooperative imitation learning model, ACIL, to deduce optimal dynamic treatment regimes that mimic the positive trajectories while differing from the negative trajectories. Specifically, two discriminators are used to help achieve this goal: an adversarial discriminator is designed to minimize the discrepancies between the trajectories generated from the policy and the positive trajectories, and a cooperative discriminator is used to distinguish the negative trajectories from the positive and generated trajectories. The reward signals from the discriminators are utilized to refine the policy for dynamic treatment regimes. Experiments on publicly available real-world medical data demonstrate that ACIL improves the likelihood of patient survival and provides better dynamic treatment regimes with the exploitation of information from both positive and negative trajectories.
Rediet Abebe (Harvard University), Salvatore Giorgi (University of Pennsylvania), Anna Tedijanto (Cornell University), Anneke Buffone (University of Pennsylvania) and H. Andrew Schwartz (Stony Brook University).
Abstract
The United States has the highest rate of maternal mortality of any developed nation. Mortality rates have more than doubled in the past 25 years and nearly 60,000 women face near-fatal complications every year. The experiences of Black and Latina mothers are notably worse: mortality rates for these groups can be 3 to 4 times higher than the mortality rates for white women. Despite extensive public health research, much remains to be understood about the factors contributing to pregnancy-related deaths and what characterizes communities with relatively high or low maternal mortality rates; indeed, standard socio-demographic and risk-factor variables do not adequately capture maternal experiences and disparities by race. Here, we explore the role that social media language can play in providing insights into community characteristics of maternal mortality. First, by analyzing pregnancy-related tweets generated in US counties, we reveal a diverse set of topics discussed on the platform including Morning Sickness, Celebrity Pregnancies, and Abortion Rights. We find that these topics predict maternal mortality rates with higher accuracy than standard socioeconomic and risk-related variables such as income, employment rates, access to healthcare, and race. We then select six topics -- Maternal Studies, Teen Pregnancy, and Congratulatory Remarks, in addition to the above three -- chosen for their interpretability and connections to known health and maternal risk factors. We show that these six topics have nearly as much predictive power as all the topics combined. We also investigate psychological aspects of communities to find that the use of less trustful, more stressed, and more negative language is significantly associated with higher mortality rates; even more notably, Trust and Affect explained a significant portion of the racial disparities in maternal mortality. We believe these findings provide further insights related to the intricate and urgent issues surrounding maternal health and can help inform actionable items at the community level.
Economics (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Safwan Hossain (University of Toronto, Vector Institute), Andjela Mladenovic (Independent Scientist) and Nisarg Shah (University of Toronto).
Abstract
The past decade has witnessed a rapid growth of research on fairness in machine learning. In contrast, fairness has been formally studied for almost a century in microeconomics in the context of resource allocation, during which many general-purpose notions of fairness have been proposed. This paper explores the applicability of two such notions --- envy-freeness and equitability --- in machine learning. We propose novel relaxations of these fairness notions that apply to groups rather than individuals and are compelling in a broad range of settings. Our approach provides a unifying framework by incorporating several recently proposed fairness definitions as special cases. We provide generalization bounds for our approach, and theoretically and experimentally evaluate the tradeoff between loss minimization and our fairness guarantees.
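For reference, the classical fair-division definitions that the paper relaxes are stated below in their standard individual form; the paper's group-level relaxations differ from these.

```latex
% Standard fair-division definitions (individual form).
% An allocation A = (A_1, \dots, A_n), with u_i the utility of agent i, is:
\[
\text{envy-free:} \quad u_i(A_i) \ge u_i(A_j) \;\; \forall i, j
\qquad
\text{equitable:} \quad u_i(A_i) = u_j(A_j) \;\; \forall i, j
\]
```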
Weili Chen (Sun Yat-sen University), Tuo Zhang (Sun Yat-sen University), Zhiguang Chen (Sun Yat-sen University), Zibin Zheng (Sun Yat-sen University) and Yutong Lu (Sun Yat-sen University).
Abstract
The birth of Bitcoin ushered in the era of cryptocurrency, which has now become a financial market that attracts extensive attention worldwide. The phenomenon of startups launching Initial Coin Offerings (ICOs) to raise capital has led to thousands of tokens being distributed on blockchains. Many studies have analyzed this phenomenon from an economic perspective. However, little is known about the characteristics of participants in the ecosystem. To fill this gap, and considering that over 80% of ICOs are launched as ERC20 tokens on Ethereum, in this paper we conduct a systematic investigation of the whole Ethereum ERC20 token ecosystem to characterize token creators, holders, and transfer activity. By downloading the whole blockchain and parsing the transaction records and event logs, we construct three graphs, namely the token creator graph, the token holder graph, and the token transfer graph. We obtain many observations and findings by analyzing these graphs. In addition, we propose an algorithm to discover potential relationships between tokens and other accounts. The reported case shows that our algorithm can effectively reveal entities and the complex relationships among various accounts in the token ecosystem.
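A minimal sketch of how a token transfer graph can be assembled from already-decoded ERC20 Transfer(from, to, value) event logs is shown below; the record format is assumed for illustration, and the paper's actual graph construction and log decoding are not reproduced here.

```python
# A minimal sketch of building a token transfer graph from decoded ERC20
# Transfer(from, to, value) event logs. The record format is an assumption
# for illustration; the paper's graph construction may differ.
import networkx as nx

def build_transfer_graph(transfer_events):
    """transfer_events: iterable of dicts with keys
    'token', 'from', 'to', 'value' (one per decoded Transfer log)."""
    g = nx.MultiDiGraph()
    for ev in transfer_events:
        g.add_edge(ev["from"], ev["to"],
                   token=ev["token"], value=ev["value"])
    return g

g = build_transfer_graph([
    {"token": "0xToken", "from": "0xAlice", "to": "0xBob", "value": 10},
])
print(g.number_of_edges())
```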
Shota Yasui (CyberAgent.inc.), Gota Morishita (CyberAgent.inc.), Fujita Komei (CyberAgent.inc.) and Masashi Shibata (CyberAgent.inc.).
Abstract
In display advertising, predicting the conversion rate, that is, the probability that a user takes a predefined action on an advertiser's website such as purchasing goods, is fundamental in estimating the value of showing a user an advertisement. However, there is a relatively long time delay between a click of a display advertisement and its resultant conversion. Because of this delayed feedback, some positive instances are labeled as negative when training data is gathered, because conversions that will occur in the future have not yet been observed. As a result, the conditional label distribution of the training data is different from that of the test data in the production environment, because the test data are tracked for a sufficiently long period to be correctly labeled. This situation is referred to as a feedback shift. We address this problem by using an importance weight approach typically used for covariate shift correction. We prove its consistency for the feedback shift. Moreover, the importance weight approach can be applied to a wide variety of models and learning algorithms. Finally, two different experiments were conducted. The first experiment was conducted to prove the effectiveness of our proposed method from two different perspectives: performance and time efficiency. The results show that our proposed approach outperforms the existing method in terms of both. During the second experiment, we implemented a Field-aware Factorization Machine (FFM) with importance weights (FFMIW) to incorporate our proposed method into our production environment. The normal FFM and FFMIW were evaluated on an offline dataset. In addition, we conducted an online A/B test in the production system. In both settings, it was shown that FFMIW is superior to the normal FFM.
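To make the importance-weight idea concrete, the sketch below shows a generic importance-weighted logistic loss; the weights passed in are placeholders, and how the weights are estimated for the feedback shift is the paper's contribution and is not shown here.

```python
# A generic importance-weighted logistic loss: each training example is
# re-weighted by w(x) so that expectations under the (shifted) training
# distribution better match the test distribution. Estimating the weights
# for the feedback shift is the paper's contribution and is not shown.
import numpy as np

def weighted_logloss(y_true, p_pred, weights):
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)
    loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.average(loss, weights=weights)

print(weighted_logloss(np.array([1, 0, 1]),
                       np.array([0.9, 0.2, 0.6]),
                       weights=np.array([1.5, 1.0, 0.8])))
```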
Sam Miller (University of Warwick & Alan Turing Institute), Abeer El-Bahrawy (City University London), Martin Dittus (Oxford Internet Institute, University of Oxford & Alan Turing Institute), Mark Graham (University of Oxford) and Joss Wright (University of Oxford).
Abstract
Rapid changes in illicit drug demand, such as the Fentanyl epidemic, are a major public health issue. Policymakers currently rely on annual surveys to monitor public consumption, which are arguably too infrequent to detect rapid shifts in drug use. We present a novel method to predict drug use based on high-frequency sales data from darknet markets. We show that models based on historic trades alone cannot accurately predict drug demand. However, augmenting these models with data on Wikipedia page views for each drug greatly improves predictive accuracy. These results hold out-of-sample at high time frequency, across a range of drugs and countries. We find that Wikipedia page views most improve predictive accuracy for less popular drugs, suggesting our model may be particularly useful for detecting newly emerging substances. Therefore, Wikipedia data may enable us to build a high-frequency measure of drug demand, which could help policymakers respond more quickly to future drug crises.
Systems (3)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Hamed Rezaei (University of Illinois at Chicago) and Balajee Vamanan (University of Illinois at Chicago).
Abstract
Datacenters host a mix of applications: foreground applications that perform distributed lookups in order to service user queries, and background applications that perform tasks such as data reorganization, data backup, and data replication. While background flows produce the most load, foreground applications produce the largest number of flows. Because flows (packets) from both types of applications compete at switches for network bandwidth, datacenter networks' performance is highly dependent on the underlying flow scheduling mechanisms. Existing flow schedulers use flow size to distinguish critical flows from non-critical flows. However, recent studies on important datacenter workloads reveal that most flows are quite small (e.g., most flows consist of only a handful of packets). In light of these findings, we make the key observation that because flow size is not sufficient to distinguish critical flows from non-critical flows, existing flow schedulers do not achieve the desired prioritization. In this paper, we introduce ResQueue, which uses a combination of flow size and packet history to calculate the priority of each flow. Our analysis shows that ResQueue improves tail flow completion times of short flows by up to 60% over state-of-the-art flow scheduling mechanisms.
Moumena Chaqfeh (NYUAD), Yasir Zaki (NYUAD), Jacinta Hu (NYUAD) and Lakshmi Subramanian (NYU).
Abstract
A significant fraction of webpages suffer from the excessive usage of JavaScript. Based on analyzing popular webpages, we observe that a reasonable fraction of the JavaScript utilized by these pages is not truly essential for many of the functional and visual features of the page. In this paper, we propose JSCleaner, a JavaScript de-cluttering engine that aims at simplifying webpages without compromising the page content or functionality. JSCleaner uses a classification algorithm that classifies JavaScript into three main categories: non-critical, replaceable, and critical JavaScript. JSCleaner removes the non-critical scripts from a webpage, translates the replaceable scripts into their HTML outcomes, and preserves the critical scripts. Our quantitative evaluation of 500 popular webpages shows that JSCleaner achieves around a 30% reduction in page load times coupled with a 50% reduction in requested objects and page size. In addition, our qualitative user study of 103 evaluators shows that JSCleaner preserves 95% of the page content similarity, while maintaining about 88% of the page functionality.
Jiangwei Zhang (National University of Singapore) and Y.C. Tay (National University of Singapore).
Abstract
Stack distance characterizes the temporal locality of workloads and has played a vital role in cache analysis since the 1970s. However, the most efficient implementations of exact stack distance calculation are too costly and impractical for online use. Hence, much work has been done to optimize the exact computation or to approximate it through sampling or modeling. This paper introduces a new approximation technique, PG2S, that is based on reference popularity and gap distance. This approximation is exact under the Independent Reference Model (IRM). The technique is further extended, using machine learning, to PG2S+ for non-IRM reference patterns. Extensive experiments show that PG2S+ is much more accurate and robust than other state-of-the-art algorithms for determining stack distance. PG2S+ is the first technique to exploit the strong correlation among reference popularity, gap distance and stack distance.
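To make the quantity concrete, a naive exact stack-distance computation is sketched below; this is the costly baseline that approximations like PG2S/PG2S+ avoid, not the paper's method.

```python
# A naive exact stack-distance computation, shown only to fix the definition:
# the stack distance of an access is the number of distinct items referenced
# since the previous access to the same item (infinite on a first access).
# This O(N*M) scan is exactly the kind of cost PG2S/PG2S+ avoid.
def stack_distances(trace):
    last_seen = {}
    distances = []
    for i, item in enumerate(trace):
        if item in last_seen:
            distances.append(len(set(trace[last_seen[item] + 1:i])))
        else:
            distances.append(float("inf"))
        last_seen[item] = i
    return distances

print(stack_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]
```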
Stergios Stergiou (Google).
Abstract
Distributed graph frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by transmitting messages over the graph edges. PageRank's communication pattern is identical across supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.
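For reference, the standard PageRank update that each superstep computes is sketched below; the paper's contribution is the communication scheme, which this single-machine sketch does not model.

```python
# The standard PageRank update computed each iteration/superstep, shown only
# to fix notation. The paper's contribution is the distributed communication
# scheme, which this single-machine sketch does not model.
def pagerank(out_edges, n, d=0.85, iters=20):
    """out_edges: dict vertex -> list of successor vertices (0..n-1)."""
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for u, succs in out_edges.items():
            if succs:
                share = d * pr[u] / len(succs)
                for v in succs:          # the per-edge "message"
                    nxt[v] += share
        pr = nxt
    return pr

print(pagerank({0: [1, 2], 1: [2], 2: [0]}, n=3))
```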
Semantics (6)
(UTC/GMT +8) 13:30-15:30, April, 24, Friday
Meeting rooms are not available now
Amr Azzam (Vienna University of Economics and Business), Javier D. Fernández (Vienna University of Economics and Business), Maribel Acosta (Karlsruhe Institute of Technology), Martin Beno (Vienna University of Economics and Business) and Axel Polleres (Vienna University of Economics and Business - WU Wien).
Abstract
While Linked Data (LD) provides standards for publishing (RDF) and querying (SPARQL) Knowledge Graphs (KGs) on the Web, serving, accessing and processing such open, decentralized KGs is often practically impossible, as query timeouts on publicly available SPARQL endpoints show. Alternative solutions such as Triple Pattern Fragments (TPF) attempt to tackle the problem of availability by pushing query processing workload to the client side, but suffer from unnecessary transfer of irrelevant data on complex queries with large intermediate results. In this paper we present smart-KG, a novel approach to share the load between servers and clients, while significantly reducing data transfer volume, by combining TPF with shipping compressed KG partitions. Our evaluations show that smart-KG outperforms state-of-the-art client-side solutions and increases server-side availability towards more cost-effective and balanced hosting of open and decentralized KGs.
Jacopo Urbani (Vrije Universiteit Amsterdam) and Ceriel Jacobs (Vrije Universiteit Amsterdam).
Abstract
The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structure. We propose KGSYS, a novel storage architecture for very large KGs on centralized systems. KGSYS uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to architectures designed for single tasks, our approach offers an interface with a few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that KGSYS can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple
Daniele Dell'Aglio (University of Zurich) and Abraham Bernstein (University of Zurich).
Abstract
Data often contains sensitive information, which poses a major obstacle to publishing it. Some suggest obfuscating the data or releasing only some data statistics. These approaches have, however, been shown to provide insufficient safeguards against de-anonymisation. Recently, differential privacy (DP) - an approach that injects noise into the query answers to provide statistical privacy guarantees - has emerged as a solution to release sensitive data. This study investigates how to continuously release privacy-preserving histograms (or distributions) from a continuous stream of sensitive data by combining DP and semantic web technologies. We focus on distributions, as they are the basis for many analytic applications. Specifically, we propose SihlQL, a query language that processes RDF streams in a privacy-preserving fashion. SihlQL builds on top of SPARQL and the w-event DP framework. We show how some peculiarities of w-event privacy constrain the expressiveness of SihlQL queries. Addressing these constraints, we propose an extension of w-event privacy that provides answers to more general queries while preserving their privacy. To evaluate SihlQL, we implemented a prototype engine that compiles queries to Apache Flink topologies and studied its privacy properties using real-world data from an IPTV provider and an online e-commerce website.
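For background, the basic Laplace mechanism for releasing a differentially private histogram is sketched below; w-event privacy additionally budgets the privacy loss across sliding windows of a stream, which this one-shot sketch does not model.

```python
# The basic Laplace mechanism for releasing a differentially private
# histogram: add Laplace(sensitivity / epsilon) noise to each bin count.
# w-event privacy further allocates epsilon across windows of a stream,
# which this one-shot sketch does not model.
import numpy as np

def private_histogram(counts, epsilon, sensitivity=1.0):
    counts = np.asarray(counts, dtype=float)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon,
                              size=counts.shape)
    return counts + noise

print(private_histogram([120, 45, 60], epsilon=0.5))
```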
Jiaxin Huang (University of Illinois Urbana-Champaign), Yiqing Xie (The Hong Kong University of Science and Technology), Yu Meng (University of Illinois Urbana-Champaign), Jiaming Shen (University of Illinois Urbana-Champaign), Yunyi Zhang (University of Illinois Urbana-Champaign) and Jiawei Han (University of Illinois Urbana-Champaign).
Abstract
Given a small set of seed entities (e.g., “USA”, “Russia”), corpus-based set expansion aims to induce an extensive set of entities that share the same semantic class (Country in this example) from a given corpus. Set expansion benefits a wide range of downstream applications in knowledge discovery, such as web search, taxonomy construction, and query suggestion. Existing corpus-based set expansion algorithms typically bootstrap the given seeds by incorporating lexical patterns and distributional similarity. However, because no negative sets are provided explicitly, these methods suffer from semantic drift caused by expanding the seed set freely without guidance. We propose a new framework, Set-CoExpan, that automatically generates auxiliary sets as negative sets that are closely related to the target set of the user's interest, and then performs co-expansion of multiple sets, extracting discriminative features by comparing the target set and the auxiliary sets to form multiple cohesive sets that are distinctive from one another, thus resolving the semantic drift issue. In this paper we demonstrate that by generating auxiliary sets, we can guide the expansion process of the target set away from ambiguous areas around the border with the auxiliary sets, and we show that Set-CoExpan outperforms strong baseline methods significantly.
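To illustrate the kind of unguided bootstrap that Set-CoExpan improves on, the sketch below ranks candidate entities by their average embedding similarity to the seed set; the embedding source is assumed, and the paper's actual feature extraction and co-expansion procedure differ.

```python
# A bare-bones set-expansion scoring step: rank candidates by average cosine
# similarity to the seed set in some embedding space. This unguided bootstrap
# is prone to semantic drift; Set-CoExpan adds auxiliary (negative) sets to
# keep the expansion discriminative. `embeddings` is an assumed input
# (e.g., pre-trained entity vectors).
import numpy as np

def expand(seeds, candidates, embeddings, k=5):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    def score(entity):
        return np.mean([cos(embeddings[entity], embeddings[s]) for s in seeds])
    ranked = sorted((c for c in candidates if c not in seeds),
                    key=score, reverse=True)
    return ranked[:k]
```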