Data Science Practicum

The Data Science Practicum (DATA-793) is the capstone experience for the MS in Data Science and provides assistance to faculty and staff across the university. 

Students entering the practicum have completed coursework in statistics, regression, and R for data science, along with completed or ongoing coursework in statistical machine learning. Practicum students are ready to put their visualization, analytics, and data modelling skills to work on live projects.

Current Projects Now Inviting Student Participation

Please browse descriptions further below for details on each project:

Call for Faculty & Staff Projects

Let our advanced students help with your data:

  1. Provide your project title, description, required skills, and email on the Faculty Project form. The only requirement is that the projects use the students' data science skills. 
  2. Under the guidance of our faculty, students in the Data Science Practicum (DATA-793) or other advanced research courses review available projects for best fits and contact you.
  3. You and the student(s) agree on a plan for work on the project. Work can begin as early as January 2020.

Comparative analysis related to Wildfires in Conservation areas in Angola and Mozambique

Africa is often referred to as “The Fire Continent” because it endures catastrophic wildfires annually. The US Forest Service is interested in assessing two countries in southern Africa, Angola and Mozambique, to analyze factors related to the causes, patterns, and aftermath of wildfires in these two countries. In developing this comparative analysis, the student will assess regions in both countries with similar landscapes, ecosystems, size, and other factors that can provide information on wildfires. The results will inform future program implementation, specifically a regional fire program currently in development. The student may also have the opportunity to work with Forest Service colleagues.
Prerequisites: Knowledge of remote sensing, GIS, regression, data analysis (in R or related programs), and visualization.
Contact: Michelle Zweede

A Comparative Study of Litigation Data related to India’s Agricultural Land Laws

Agricultural land in India is heavily regulated, and it is a state subject. State laws regulate who can be a farmer, what a farmer can do with the land, to whom a farmer can sell the land, how much land a farmer can own, and under what conditions the State can compel acquisition of farmland. Details vary from state to state. Students working on this project will scrape judgment and case-filing data related to agricultural land laws from the central case law repository and undertake a comparative frequency analysis across Indian states. The goal of this project is to explore how litigation frequency has varied both over time and with the kind and degree of restrictions on agricultural land, and to identify the most contested and litigated issues with respect to agricultural land. Students will apply a range of statistical and data science techniques, including web scraping, data tidying, visualization, and data exploration, to obtain and analyze legal case data.
Prerequisites: Knowledge of web scraping techniques is essential. In addition, knowledge of R (or Python) for data tidying, visualization, and exploration; statistics.
Contact: Prashant Narang (CCS, India) & Dr. Nimai Mehta (Math & Stat Department, AU)

The Adopt a Pixel Project

The Adopt a Pixel Project needs data scientists to build tools that will extract NASA satellite image areas and corresponding citizen science ground photos, and then associate them in a digital platform for analysis by citizen scientists. The project employs the high profile Zooniverse citizen science platform.

Students will implement strategies and methods to automate the extraction, processing, and publishing of the data on the Zooniverse platform. The analyzed data will then feed data dashboards, websites, and programmatic notebooks for further analysis and visualization. The resulting data stream will support scientific tasks such as graphing, mapping, data reduction, and AI.

Contact: Peder Nelson, Oregon State University, College of Earth, Ocean, and Atmospheric Sciences. Peder is an instructor of geography and geospatial sciences, and the NASA GLOBE Observer Land Cover Science Lead.

Online Antisemitism Detection Using Multimodal Learning

Increasing cases of online antisemitism have become a major concern due to their socio-political consequences. Detecting online antisemitism poses multiple challenges, including the extraction of joint representations from multi-modal data (e.g. text and images). Students working on this project will work with different data fusion approaches based on latent variable analysis and deep neural networks in order to extract joint representations from multi-modal data so that online detection of antisemitism and knowledge discovery are achieved jointly.
Prerequisites: Machine Learning, good knowledge of Python
Contact: Nathalie Japkowicz and Zois

Exploring multi-task motor learning

How does our brain allow us to learn countless complex motor skills, from tying our shoes to hitting a tennis serve? How and why do previously learned skills affect how well we learn a new skill, and are such interactions reflected on the neuronal and memory level? Answers to these questions are not only important for basic neuroscience, but also for education, rehabilitation and the improvement of artificial neural networks for multi-task learning. We train rats in a high-throughput manner on multiple motor skills and track their performance and fine-grained movement trajectories over tens of thousands of trials. The goal of this project is to explore how learning of multiple motor skills differs from learning individual skills and how skills interact on the behavioral level under various training conditions. Students will explore our large performance and movement datasets to determine the relationship between learned skills and how it develops over the course of training. 
Prerequisites: Matlab and/or Python
Contact: Steffen Wolff

Detecting Storytelling Tactics in Text Using Machine Learning

Research in entertainment theory has shown that people are more effectively mobilized through engagement and entertainment than through overt and explicit persuasive arguments. Utilizing entertainment tactics - or storytelling tactics - has been shown to “transport” readers into the story world, ultimately leading to more effective communication campaigns. For this effort, Protagonist is interested in building a machine learning classifier to better identify the storytelling potential of media articles and social media campaigns. Protagonist will provide a list of storytelling features known to contribute to a “state of transportation” and practicum participants will create a model to classify text data as “transportive” based on the presence of those features in multilingual text-based data. Practicum students may also employ feature selection techniques to identify the strongest predictors of transportation in text data.
Prerequisites: Knowledge of machine learning, text mining, and Python
Contact: Becky Owens, Maria Barouti

Statistical Consulting class: Stat 798

In this class, you will work with a statistics professor to learn new techniques, such as:
1) Determination of sample sizes and randomization of subjects to include in an experiment or survey
2) Design of survey sampling and optimal experiments
3) Choice of the proper statistical methods for studies and experiments
4) Transformation and import of data into desired statistical software packages
5) Interpretation of statistical analysis results for researchers
6) Assessment of the power of statistical tests (power analysis)
7) Software support for major statistical software packages: R, SPSS, SAS, STATA, etc.
8) Assistance with the statistics section of a research manuscript before submission to a journal
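
As a rough illustration of items 1 and 6 above, the sketch below estimates the per-group sample size needed to compare two group means with a two-sided test, using the standard normal approximation. The function name and the example inputs (standardized effect size 0.5, 5% significance level, 80% power) are illustrative assumptions, not course materials. A t-based calculation in R or another statistical package would typically give a slightly larger number.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference in means (Cohen's d).
    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return math.ceil(n)  # round up to a whole number of subjects


if __name__ == "__main__":
    # Hypothetical planning scenario: medium effect, conventional alpha and power.
    print(sample_size_per_group(0.5))
```

Smaller expected effects drive the required sample size up quickly, which is why the effect size assumption deserves as much scrutiny as the choice of test.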

Permission of the Director of Data Science and the Director of the Statistical Consulting Center is required.
Prerequisites: If 1) you have a strong statistical background and 2) you would like to work on a variety of short and long projects, then you can join STAT 798!

Contact: Aleka

LH Dynamics in African and Asian Elephants

Both Asian and African elephants exhibit a unique hormone pattern during the follicular phase of the estrous cycle with two luteinizing hormone surges that occur approximately 3 weeks apart (double LH surge); the surges are indistinguishable, but only the second one induces ovulation. The goal of this project is to summarize decades of hormonal data and identify intra- and inter-elephant differences in various aspects of the double LH surge: time from the decline in progestagens during the luteal phase to the first LH surge; time between LH1 and LH2; and time from LH2 to the rise in progestagens after LH2-induced ovulation. Some elephants have even demonstrated three surges during the follicular phase, which also need to be documented and compared to the normal double LH surge.
Prerequisites: Data manipulation and visualization, knowledge of R.
Contact: Natalia Prado

Giant Panda Reproductive Physiology and Pregnancy Diagnosis

Giant panda reproductive physiology is poorly understood, and the species has several unique characteristics that make it difficult to study. For example, its reproductive strategy includes strict seasonality, with fertilization occurring only 1-3 days out of the year, and obligatory delayed implantation, resulting in unpredictable gestation lengths ranging from 90-180 days. Most challenging for ex situ management is that hormonal patterns are indistinguishable between pregnancy and pseudopregnancy. For this study we gathered endocrine data over the last six breeding seasons, some successful (i.e. resulting in a cub) and some unsuccessful. The biomarkers include urinary progesterone, estradiol, oxytocin, relaxin, ceruloplasmin and prostaglandin FM. The goal of this study is to determine if there are patterns among these reproductive biomarkers that might be indicative of pregnancy.
Prerequisites: Data manipulation and visualization, knowledge of R
Contact: Natalia Prado

Characterizing Temporal Patterns in Longitudinal Prolactin Secretion in African Elephants

Normal cycling African elephant females have temporal patterns in prolactin secretion, while acyclic females with abnormal prolactin levels (too high or low) appear to lose this temporal pattern. The goals of this project are to describe normal temporal prolactin patterns in elephants and determine how acyclic females change or lose their prolactin temporal patterns. In doing so, we aim to better understand underlying causes for hyperprolactinemia, a reproductive disorder that is associated with ovarian acyclicity in African elephant females.
Prerequisites: Data manipulation and visualization, knowledge of R.
Contact: Natalia Prado

Multivariate Analysis of Blueberry Flavor

Breeding programs historically focused on producer-favored traits such as crop yield, which inadvertently resulted in worse tasting fruit. More recently, these programs have started to focus on consumer-favored traits such as flavor and texture. We have a dataset with consumer panel ratings for different blueberry varieties, along with measurements of these blueberries' chemical compositions. The goal of this project is to characterize the relationship between flavor perception and chemical composition using multivariate analysis approaches.
Prerequisites: Regression, Linear Algebra, knowledge of R.
Contact: David Gerard

Understanding the effect of Deep Architectures on the Class Imbalance Problem

The purpose of this project is to study the behavior of deep learning systems in settings that have previously been deemed challenging to classical machine learning systems to find out whether the depth of the systems is an asset in such settings.
Prerequisites: CSC-680, Python, Scikit-Learn, TensorFlow Keras, Colab
Contact: Nathalie Japkowicz
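
For readers new to the class imbalance problem, one reason it is challenging is that plain accuracy can look high for a model that ignores the minority class entirely. The sketch below is a minimal illustration with made-up labels and hypothetical helper functions; it is not the project's evaluation code and says nothing about how deep models behave.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels (made up): 95 majority-class examples and 5 minority-class examples.
y_true = [0] * 95 + [1] * 5
# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-majority predictor scores 95% accuracy but only 50% balanced accuracy, which is why imbalance studies usually report per-class or balanced metrics.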

Automating Survey Data Processes

This project is working with PRRI, a nonprofit, nonpartisan policy research organization, to develop automated processes for creating our topline and banner documents using open-source software. The automation process will include reading in data, performing recodes, weighting data, and producing formatted tables and crosstabs that are ready to be published. The deliverable for this project is a code file to produce high-quality survey topline and banner documents that can be easily customized to any dataset. Success in working with the PRRI research director on this project will result in practical resumé experience; this is a particularly good opportunity for someone interested in the survey research industry.
Prerequisites: Strong R skills, R Markdown, knowledge of working with survey data and survey weights.
Contact: Natalie Jackson
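
To give a sense of what the weighting step in a topline involves, here is a minimal sketch that turns one survey question and a set of respondent weights into weighted percentages. The project itself calls for R; Python is used here only for illustration, and the function name, response categories, and weights are all hypothetical.

```python
from collections import defaultdict


def weighted_topline(responses, weights):
    """Return the weighted percentage of each response category."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(responses, weights):
        totals[answer] += weight  # sum of weights per category
    grand_total = sum(totals.values())
    return {answer: 100 * total / grand_total for answer, total in totals.items()}


# Hypothetical data: one survey question and made-up respondent weights.
responses = ["Favor", "Oppose", "Favor", "Unsure", "Oppose"]
weights = [1.0, 2.0, 1.0, 0.5, 0.5]

for answer, pct in weighted_topline(responses, weights).items():
    print(f"{answer:<8}{pct:5.1f}%")
```

A banner table repeats the same calculation within subgroups (for example, by age or region), so a function like this would be applied once per column of the crosstab.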

Tracing Policy through Congress

Legislative studies is often hampered by the necessity of observing policy changes and the preferences and behavior of members indirectly through coarse and heavily constrained measures like how members vote and what bills become law, both of which mask much of the negotiation around how policy is made. To get a better understanding of how certain policies become law, we are developing an approach to determining how individual policy provisions move through the legislative process in the US Congress by estimating the similarity between the text of sections of bills considered in Congress. Students will help 1) algorithmically and manually split the text of bills into sections, 2) code bill sections by policy topic, 3) compute the text similarity between sections, and 4) analyze the results.
Prerequisites: Data analysis (data manipulation, regression, working with text strings) in R or Python
Contact: Andrew Ballard
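
The description above does not say which similarity measure the project uses. As one common baseline, the sketch below computes cosine similarity between the word-count vectors of two bill sections. The function names and the section texts are made up for illustration; a real pipeline would likely add steps such as stop-word handling or TF-IDF weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up section texts, for illustration only.
section_1 = "The Secretary shall award grants to eligible rural hospitals."
section_2 = "The Secretary shall award grants to eligible urban hospitals."
section_3 = "This Act may be cited as the Clean Water Act."

print(round(cosine_similarity(section_1, section_2), 3))
print(round(cosine_similarity(section_1, section_3), 3))
```

Sections that share most of their wording score close to 1, while unrelated sections score close to 0, which makes the measure a simple way to flag provisions that may have been carried from one bill into another.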

Searching for Signatures of Elusive Stellar Coronal Mass Ejections from Young Suns Using X-ray Data

NASA’s Kepler and TESS missions have revealed frequent explosive events called superflares from many planet hosting dwarf stars, providing a mechanism by which host stars may have profound effects on the physical and chemical evolution of exoplanetary atmospheres. Solar studies suggest that large flares are accompanied by fast ejection of coronal magnetized materials referred to as coronal mass ejections or CMEs. However, astronomers do not have reliable methods to detect and reveal these elusive stellar magnetic ejections from available data. My team and scientists from Penn State have collected large sets of data based on Chandra and XMM-Newton X-ray observations on hundreds of explosive events from very young stars resembling our Sun in its infancy. In this project, students will explore how statistical and machine learning techniques can be used to search for signatures of elusive CMEs. Students do not need to have any background in astronomy.
Prerequisites: Statistics, Machine Learning, knowledge of Python or IDL/Matlab.
Contact: Vladimir Airapetian

Prediction of molecular properties using machine learning techniques

Due to its high computational speed and accuracy compared to ab-initio quantum chemistry and forcefield modeling, the prediction of molecular properties using machine learning has received great attention in the fields of materials design and drug discovery. In this project, students will use a data fusion framework that is based on Independent Vector Analysis to exploit underlying complementary information contained in different molecular featurization methods. This information will then be used to enhance the prediction ability of a regression model as well as to discover relationships between different molecular structures and properties. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas

Data collection, pre-processing, and visualization for understanding the spread of misinformation in social media

Due to the wide use of online media, false information can spread rapidly, affecting decision making, cooperation, communications, and markets. Modern social technologies can move massive amounts of information quickly, which enables the spread of misinformation (inaccurate or misleading information). A crucial question is therefore how true and false information diffuse and how they correlate with each other. In this project, students are expected to collect and pre-process data from social media, news sites, and RSS feeds and apply different data visualization techniques in order to identify how false information diffuses and how it correlates with true information.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas

Extracting chemical insights from energetic materials using Natural Language Processing (NLP) techniques

The number of scientific journal articles and reports published about energetic materials every year is growing exponentially, so extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this project, students will explore how techniques from natural language processing and machine learning can be used to automatically extract chemical insights from large collections of documents. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas

Knowledge discovery and detection of misinformation on social media during high impact events

With the evolution of various social media technologies, there has been a fundamental change in how information propagates and is shared on the Internet and microblogs. During a high impact event, e.g. a hurricane, terror attack, or stock market crash, social media users can be thought of as generative functions that output network posts. These posts then propagate through the social network, enabling the rapid spread of misinformation, which can affect decision making, communications, and markets. Students working on this project will work with a data-driven approach based on latent variable analysis in order to extract information from data so that early detection of misinformation and knowledge discovery during a high impact event are achieved jointly.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas

Seafood appearances on historical menus

The New York Public Library has compiled a database of menu items dating back to the 1840s. The goal of this project is to develop an automated approach for identifying and categorizing seafood dishes. The resulting categorization will be used to understand the change in seafood diversity and sourcing over time within this menu collection. Students working on this project will build off of an initial training dataset to apply machine learning techniques for identifying and categorizing menu items.
Prerequisites: R; machine learning; text mining.
Contact: Jessica Gephart

Estimating physical properties from videos with neural networks

Accurately estimating physical properties of objects from visual and multimedia inputs (e.g. stiffness, roughness, softness) is important for automatic scene understanding in everyday tasks in an AI system. This project aims to leverage human knowledge and physics to learn to estimate physical properties of objects in image/video using deep learning models, with a special emphasis on learning from limited data and with built-in uncertainty in the model.
Prerequisites: Numpy/Scipy/Python, Web programming, Basic machine learning, Linear Algebra; CSC476 Computer Vision is a plus.
Contact: Bei Xiao

Examining the impact of a multicomponent nutrition education intervention program

The goal of the Healthy Schoolhouse 2.0 intervention is to prevent childhood obesity in a high-needs community in Washington DC. This 5-year prospective study follows a pretest-posttest design and includes data collected from teachers, students, and schools in Wards 7 and 8. The intervention engages teachers as agents of change by implementing a structured professional development program to support the integration of nutrition concepts in the classroom. The outcomes examined are change in pre-post survey assessments of students’ nutrition literacy, attitudes, and intent; change in teachers’ self-efficacy toward teaching nutrition; and fruit and vegetable consumption data collected 6 times per year in the cafeteria. Cluster design effects arising from school assignment are accounted for using multilevel mixed modeling (MLM).
Prerequisites: Knowledge of R and SPSS, Regression
Contact: Melissa Hawkins

Inclusion by Design

Minority inclusion is at the center of not only creating more equal societies but also democratic stability and preventing ethnic conflict. Yet scholars have found no solution to the knotty problem of measuring inclusion across countries. This problem limits our ability to learn how to design political institutions, such as the electoral system or federalism, to enhance minority inclusion more effectively. My solution to this problem centers on estimating minority electoral support for governing parties in legislatures (and for winning presidential candidates where the president serves more than a symbolic role). Using this information, one can also estimate the minority share of the government’s (and the president’s) electoral supporters. Towards that end, I am taking a multipronged approach to estimating voting behavior by different groups, relying on both ecological inference and polling data.
Prerequisites: I need students who are very comfortable (1) locating polling data and (2) getting properly weighted key descriptive stats out of it. Appropriate skills in statistical packages are helpful. I have a grant to pay students over the summer who are interested.
Contact: David Lublin

Application of Machine Learning on the Survival Analysis of Breast Cancer Patients

This project is an attempt to study the applications of machine learning techniques in Weka (clustering, classification, association rules, regression) for the survival analysis of breast cancer patients. The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Public-Use Data (years 1977-2017) breast cancer data set will be used for all experiments, with R used for visualization. The aim of this project is to investigate the significance of prognostic factors such as age, ethnicity/race, tumor grade, and tumor size on the survival of breast cancer patients.
Prerequisites: Knowledge of R, Machine Learning Algorithms, Weka
Contact: Mehdi Owrang

The Accountability Project

The Accountability Project was built as a tool to help researchers and newsrooms search across otherwise siloed data. We acquire, process and standardize hundreds of data sets, accounting for nearly 900 million records. We are seeking student researchers who are interested in either contributing data to the project or analyzing data within the project.
Prerequisites: We are software agnostic, but the work requires skills in basic data review, cleaning, and analysis. Statistics are not required, but rigorous data standards are. We're especially interested in folks who want to tell stories with data. Most of the team uses either R or SQL.
Contact: Jennifer LaFleur

Methods for video-based behavior tracking

Methods for video-based behavior tracking have been of major interest in fields such as neuroscience over the past several years, and have widespread utility in many fields. These methods depend on recent developments in machine learning (e.g. deep learning and algorithms for classification and regression using high dimensional data) and computer hardware (e.g. GPUs, single board computers). Projects are available to explore the parameter spaces of currently popular methods (e.g. DeepLabCut, SimBA, B-SOiD), to develop a database of recordings across a range of behaviors that are commonly used in the field of neuroscience, and to develop training materials for teaching novice users to carry out video analysis. Projects would be done in collaboration with members of the Laubach Laboratory in the Department of Neuroscience at AU and an international team of researchers through the OpenBehavior project.
Prerequisites: Basic coding skills in R and scientific Python; familiarity with Jupyter and Colab notebooks.
Contact:  Mark

Transportation and Air Quality using Machine Learning techniques

This project is an attempt to help the public understand the effects of their transportation emissions on air quality and health, as well as actions that can be taken to reduce these effects. The aim of this project is to develop a digital tool for monitoring and analyzing traffic patterns, air pollution, and weather in the D.C. area. Students will use large datasets to understand the significance of transportation emissions, such as SO2, CO2, and NOx, and their impact on air quality. Moreover, they will identify underlying associations between air pollutant concentration and traffic. Through the Census Bureau Opportunity Project (TOP), students will meet the problem statement leaders from EPA and learn more about the transportation and air quality problem statement, any relevant data, and how this challenge affects communities.
Prerequisites: Knowledge of R, Regression.
Contact: Maria Barouti

Trustworthy Machine Learning

The deployment of machine learning in real-world systems is growing faster than many had predicted. Today, organizations across different industries are increasingly using machine learning to augment human decision making, reduce costs, and enhance productivity. Recent research has shown that malicious actors can use modified input data to make a machine learning algorithm behave in unexpected ways. For example, researchers have shown that they can trick the machine learning based computer vision algorithms designed for self-driving cars into mistaking stop signs for speed limit signs. This calls for technologies that ensure machine learning is trustworthy, so that we can rely on it to produce reliable outputs. Students working on this project will study and investigate the trustworthiness of machine learning models.
Prerequisites: A basic understanding of machine learning. Python required. Computer vision is a plus.
Contact: Leah Ding

Data-driven material estimation from images 

We are interested in estimating 3D shape, material attributes, and classes from photographs of translucent materials. The project involves assisting the PI with a large image dataset of translucent objects and using unsupervised learning and representation learning techniques to learn image features that are useful for teasing apart the causal factors (geometry, lighting, and optical properties) that influence material properties in images. In addition, the project requires building crowd-sourcing experiments to collect human annotations from online platforms such as Amazon Mechanical Turk and comparing the data with outputs from the machine learning models.
Prerequisites: Python (Numpy, PyTorch), Basic Machine Learning, Deep Learning, Statistics.
Contact: Bei Xiao

Cardiovascular Risk Factor Prevention among Formerly Incarcerated African Americans

Mixed methodological study.
Prerequisites: SPSS, Qualtrics knowledge
Contact: Ebony Russ