Michael J. Paul
University of Colorado Boulder
Internet privacy, Machine learning, World Wide Web, Public health, Artificial intelligence, Psychology, Translation (geometry), Topic model, Machine translation, Natural language processing, Example-based machine translation, Data science, Speech translation, Task (project management), Document classification, Speech recognition, Computer science, Probabilistic logic, Social media, Rule-based machine translation, Word (computer architecture), Medicine
121 Publications
31 H-index
4,154 Citations
Publications 113
#1 Ashlynn R. Daughton (LANL: Los Alamos National Laboratory), H-Index: 10
#2 Michael J. Paul (CU: University of Colorado Boulder), H-Index: 31
#1 Xiaolei Huang, H-Index: 10
#2 Michael J. Paul, H-Index: 31
Last. Mark Dredze, H-Index: 64
view all 5 authors...
Language varies across users and their interested fields in social media data: words authored by a user across his/her interests may have different meanings (e.g., cool) or sentiments (e.g., fast). However, most of the existing methods to train user embeddings ignore the variations across user interests, such as product and movie categories (e.g., drama vs. action). In this study, we treat the user interest as domains and empirically examine how the user language can vary across the user factor ...
#1 Hande Batan, H-Index: 1
#2 Dianna Radpour, H-Index: 1
Last. Michael J. Paul, H-Index: 31
view all 5 authors...
May 1, 2020 in ACL (Meeting of the Association for Computational Linguistics)
#1 Mozhi Zhang (UMD: University of Maryland, College Park), H-Index: 8
#2 Yoshinari Fujinuma (CU: University of Colorado Boulder), H-Index: 4
Last. Jordan Boyd-Graber (UMD: University of Maryland, College Park), H-Index: 38
view all 4 authors...
Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary, which pulls training translation pairs closer in the embedding space and overfits the training dicti...
#1 Ashlynn R. Daughton (LANL: Los Alamos National Laboratory), H-Index: 10
#2 Rumi Chunara (NYU: New York University), H-Index: 23
Last. Michael J. Paul (CU: University of Colorado Boulder), H-Index: 31
view all 3 authors...
BACKGROUND: Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. OBJECTIVE: This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. METHODS: This study leveraged a unique dataset of self-reported surveys, microbiological lab...
#1 Shudong Hao (Bard College at Simon's Rock), H-Index: 5
#2 Michael J. Paul (CU: University of Colorado Boulder), H-Index: 31
Probabilistic topic modeling is a common first step in crosslingual tasks to enable knowledge transfer and extract multilingual features. While many multilingual topic models have been developed, t...
#3 Franck Dernoncourt (SNU: Seoul National University), H-Index: 10
Existing research on fairness evaluation of document classification models mainly uses synthetic monolingual data without ground truth for author demographic attributes. In this work, we assemble and publish a multilingual Twitter corpus for the task of hate speech detection with inferred four author demographic factors: age, country, gender and race/ethnicity. The corpus covers five languages: English, Italian, Polish, Portuguese and Spanish. We evaluate the inferred demographic labels with a c...
#1 Kai R. Larsen, H-Index: 20
#2 Eric B. Hekler, H-Index: 37
Last. Bryan Gibson, H-Index: 8
view all 4 authors...
Nov 1, 2019 in EMNLP (Empirical Methods in Natural Language Processing)
#1 Linzi Xing, H-Index: 5
#2 Michael J. Paul (CU: University of Colorado Boulder), H-Index: 31
Last. Giuseppe Carenini (UBC: University of British Columbia), H-Index: 41
view all 3 authors...
Probabilistic topic models such as latent Dirichlet allocation (LDA) are popularly used with Bayesian inference methods such as Gibbs sampling to learn posterior distributions over topic model parameters. We derive a novel measure of LDA topic quality using the variability of the posterior distributions. Compared to several existing baselines for automatic topic evaluation, the proposed metric achieves state-of-the-art correlations with human judgments of topic quality in experiments on three co...