At work I am an experienced data scientist currently doing consulting work. I have significant
experience in product-oriented work as well as marketing optimization. As a product-oriented data scientist,
I have built models that scale lines of business and drive efficiency. As a marketing data
scientist, I have done extensive research in brand optimization, test design and execution, and
top of funnel strategies. My experience has touched on all parts of the data science process -
here is a sample of the steps I typically follow in developing and deploying a solution:
brainstorming and ideation, data acquisition, exploratory data analysis, model development
and optimization, deployment in CI/CD environments, test design and execution, dashboarding and
reporting, automation of ETL pipelines, automation of model retraining pipelines, automation of
model redeployment, and model monitoring. I am results driven and believe in constant communication
with stakeholders to align on outcomes and maximize investment.
I am open to both consulting and full time roles!
At home I am passionate about the environment and love the outdoors. On weekends you can
find me skiing, swimming, or hiking depending on the season. When I'm not outside you'll find me
watering my plants and watching soccer. I am a guitar player and enjoy supporting the
local music scene. Lastly, I studied the French language during my school years and love
the French culture! I'm always in search of a good croissant. In my spare time, data science
allows me to combine my analytical mindset and machine learning skills with hobbies and projects at home.
Take a look at my github to see some of my work!
Weather & Commodities
While in college, I took a weather and climate course. I've always carried an interest in nature and the environment, and that course allowed me to appreciate them from a new perspective. Since then I've been curious about the link between commodities and the weather, and now I've finally had the opportunity to examine it in depth! The goal of this study is to examine the effect of weather on agricultural commodity prices. Specifically, can we use recent weather to predict prices in the near future? Corn, cotton, and wheat are three of the US's biggest crops. The hypothesis is that weather will affect the supply of these commodities when they hit the market in the near future. For example, a particularly dry spell, or wildfires, could reduce the supply of corn that is currently being harvested and will hit the market in a couple of weeks. This reduced supply would then drive prices up. I will use US spot prices and US weather data. This analysis assumes that the US market is the driver of US commodity prices; as three of the US's largest crops, they certainly have a large effect on domestic prices. However, it is certainly possible that imports from other countries affect prices too.
Data Two main datasets were used.
— First is historical price data for corn, cotton, and wheat. This data is a daily price for every business day dating back to the 1960s. There are data points for every trading day (no weekends or holidays), which on average is 252 days per year. Very little cleaning was required for this data. Data was retrieved from http://www.macrotrends.net/charts/commodities
— Second is weather data from the National Oceanic and Atmospheric Administration (NOAA). The data set is the Global Summary of the Day (GSOD). This provides a number of daily metrics, such as temperature, humidity, and wind, for numerous locations across the globe. Both acquiring and cleaning this data were lengthy processes. NOAA has an FTP server for the GSOD data, provided you are using it for non-business purposes (that's me!). I used a bash script to access this server and download data from the 1960s to present (the range of the commodities data). For each year, the data could be unzipped into a separate .op file for each station on record that year (around 12,000 stations at most). Each of these files contains at most 365 days of data for that particular station. I am still unsure what .op stands for, but it is a fixed-width text file. My next bash script unzipped these files, removed their headers, and concatenated them into one large file for the given year (repeated for each year). At this point, I had one large text file for each year. I created a schema for these files and read them into pandas. My first step was to filter to just US stations. The weather data didn't have any info on station location, so I retrieved another fixed-width file with station locations from NOAA, made a schema for it, and read it into pandas to build a list of US stations. Next, I filtered to the US stations that were present in every year. For each commodity, I filtered to stations in states where that crop is grown. There were some quirks to the data: for example, depending on the column, n/a values could be represented by 99.9, 999.99, 9999.99, and more, without rhyme or reason (the same sentinel could have been used for every variable to avoid confusion). After all cleaning was done, I averaged each variable across all stations for each day.
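To make that cleaning step concrete, here is a minimal sketch of reading one of the concatenated yearly files with pandas and mapping the sentinel n/a codes to NaN. The column positions, file name, and station IDs are hypothetical stand-ins for the actual GSOD schema.

```python
import numpy as np
import pandas as pd

# Hypothetical column positions for a concatenated GSOD year file --
# the real .op layout has many more fields (dew point, wind, pressure, ...).
colspecs = [(0, 6), (7, 12), (14, 22), (24, 30), (78, 83)]
names = ["station", "wban", "date", "temp", "precip"]

df = pd.read_fwf("gsod_1990.txt", colspecs=colspecs, names=names,
                 dtype={"station": str})
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

# GSOD marks missing values with column-specific sentinels (99.9, 999.9,
# 9999.9, ...); map them all to NaN before aggregating.
sentinels = [99.9, 99.99, 999.9, 999.99, 9999.9, 9999.99]
df[["temp", "precip"]] = df[["temp", "precip"]].replace(sentinels, np.nan)

# Keep only US stations (us_station_ids would come from NOAA's station
# history file), then average each variable across stations per day.
us_station_ids = {"722860", "724940"}  # placeholder IDs
daily = (df[df["station"].isin(us_station_ids)]
         .groupby("date")[["temp", "precip"]]
         .mean())
```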
Key Insights This project was an excellent exercise in time series decomposition and modeling. I used a SARIMAX model to predict prices three weeks into the future. To test the practical use of my model, I built a simple trading strategy: if you are not in the market, buy if the predicted price in three weeks is higher than the present price, and do nothing if it is lower. If you are already in the market, sell if the predicted price in three weeks is lower than the present price, and do nothing if it is higher. I applied this strategy from 2010 onward. Ultimately, my model and strategy proved better than holding the index for all three commodities. Corn performed the best, with a 61% relative outperformance. There were many learning opportunities throughout this project. The process of acquiring the data and moving it into a usable format proved much more complicated than I originally predicted, and I learned some bash scripting along the way. Working with historical weather data can be very challenging, as collection methods have varied over the years. Frequency of reporting was certainly an issue too. While the majority of weather stations had over 340 days of observations per year, there were a number with only ~20 days of observations. Additionally, this study only used spot prices without consideration for futures prices, which should be incorporated for more accurate price analysis. Finally, trading costs should be incorporated into the cost-benefit analysis in the future.
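For reference, here is a rough sketch of the modeling and trading-rule logic using statsmodels. The file names, weekly frequency, ARIMA orders, and three-step horizon are illustrative assumptions rather than the exact configuration used in this project.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Weekly corn spot prices and weekly-averaged weather on the same index
# (both file names are placeholders).
prices = pd.read_csv("corn_weekly.csv", index_col="date",
                     parse_dates=True)["price"]
weather = pd.read_csv("weather_weekly.csv", index_col="date",
                      parse_dates=True)

# Illustrative orders; the real model would be chosen via diagnostics.
model = SARIMAX(prices, exog=weather, order=(1, 1, 1),
                seasonal_order=(1, 0, 1, 52))
fit = model.fit(disp=False)

# Predict three weeks ahead; forecasting needs future exogenous values,
# so the last observed weather rows stand in for a real weather forecast.
forecast = fit.forecast(steps=3, exog=weather.iloc[-3:])
predicted = forecast.iloc[-1]
current = prices.iloc[-1]

# The trading rule from the write-up: buy (or stay in) when the 3-week
# prediction is above today's price, otherwise sell or stay out.
signal = "buy/hold" if predicted > current else "sell/stay out"
```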
Concepts and Skills Used Bash Python Pandas SKLearn Feature engineering Time Series Decomposition Time Series Modeling
Neural Networks Exploration
Overview This project is an exercise in neural network implementation using TensorFlow and Keras. I used the MNIST data set to train feed forward and convolutional neural networks in each framework. The goal was to classify each image as a digit 0-9. This was an excellent primer on the differences in implementation between TensorFlow and Keras.
Data The MNIST data set was retrieved from Kaggle (https://www.kaggle.com/c/digit-recognizer) as part of an image recognition competition. It consists of a training and test set with pixel data mapped into csv format. I created a validation set from the training data to tune my models and passed the testing data through the final product.
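A minimal sketch of that preparation step, assuming Kaggle's standard train.csv layout and an 80/20 validation split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Kaggle's train.csv has a 'label' column plus 784 pixel columns (28x28).
train = pd.read_csv("train.csv")
X = train.drop(columns="label").to_numpy().reshape(-1, 28, 28, 1) / 255.0
y = train["label"].to_numpy()

# Hold out a validation set for tuning; test.csv (no labels) is only
# scored with the final model.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```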
Key Insights Both network structures performed well; unsurprisingly the convolutional network was the best thanks to its enhanced edge detection capacity. The most valuable outcome from this exercise was learning how Keras and TensorFlow compare in practice - my Keras models took almost twice as long to train! Early stopping will be a necessity with Keras moving forward. Keras is certainly more user friendly when setting up networks, but as of now this comes at the cost of functionality. TensorFlow networks are definitely more modular - one small example: when training my model I can print a note every X epochs just to make sure everything is working. With Keras it's all or nothing - I can print a line for every epoch or none at all. This obviously becomes an issue when you want to run a large number of epochs.
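Continuing from the data-preparation sketch above, here is a compact Keras example of the kind of convolutional network described here, with early stopping attached. The layer sizes, patience, and batch size are illustrative choices, not the exact architecture used.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small convolutional classifier for the 28x28x1 MNIST images.
model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop once validation loss stops improving instead of running a fixed,
# possibly excessive, number of epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=128, callbacks=[early_stop])
```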
Concepts and Skills Used Pandas SKLearn TensorFlow Keras Feed Forward Neural Networks Convolutional Neural Networks
West Nile Virus Analysis
Overview West Nile virus was first spotted in North America in 1999. Since then, it has spread across the United States and arrived in Chicago in 2002. The goal of this project was to correctly predict the occurrence of West Nile virus in the Chicago area given data on weather and mosquitos caught in numerous traps throughout the city. This project was inspired by a Kaggle competition (www.kaggle.com/c/predict-west-nile-virus). The metric used in this competition is the Area Under the Receiver Operating Characteristic curve (AUROC or AUC), which measures the tradeoff between sensitivity (true positive rate) and specificity (true negative rate).
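For concreteness, the metric can be computed from predicted probabilities with scikit-learn; the values below are toy numbers, not project results.

```python
from sklearn.metrics import roc_auc_score

# y_true: whether West Nile virus was present; y_prob: the model's
# predicted probability of presence (toy values for illustration).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.3, 0.7, 0.2, 0.6, 0.9, 0.4, 0.5]
print(roc_auc_score(y_true, y_prob))  # 1.0 = perfect ranking, 0.5 = chance
```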
Data The data consisted of four main csv files. First, a ‘train’ file giving locations of mosquito traps throughout Chicago, with entries for each time a different species of mosquito was caught; it also contained a column indicating whether West Nile virus was present among the mosquitos captured. Next, a ‘test’ file with the same format as the train file, but without the West Nile virus indicator column. Then, a ‘weather’ file with daily weather data for two locations in Chicago. Last, a ‘spray’ file with data on the times and locations where mosquito spray had been deployed.
Key Insights There were a number of moving parts in this project, largely due to the segmented data sets. The majority of the cleaning was in the weather data. First, I used a new package called geopy to calculate the distance from each trap to each of the two weather stations (using Vincenty distance, which calculates the distance between two points on an ellipsoid). Then I made a function to map each trap to the weather from whichever station was closer. There were a number of missing values and repetitive variables (such as heating degree days and cooling degree days, which measure the number of degrees the average temperature is below or above 65°F, respectively). Lastly, it was necessary to bootstrap the training data on West Nile virus cases, as the classes were understandably very unbalanced.
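Here is a sketch of the trap-to-station mapping step. Newer geopy releases expose the ellipsoidal calculation as geodesic (Vincenty has since been deprecated), and the coordinates below are approximate placeholders.

```python
from geopy.distance import geodesic

# Approximate coordinates for the two Chicago stations (placeholders).
stations = {"OHARE": (41.974, -87.907), "MIDWAY": (41.786, -87.752)}

def nearest_station(trap_lat, trap_lon):
    """Return the name of the station closest to a trap location."""
    return min(stations,
               key=lambda name: geodesic((trap_lat, trap_lon),
                                         stations[name]).miles)

print(nearest_station(41.954, -87.800))  # maps one trap to its station
```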
I fit a random forest and a neural network for this project, and the best model was ultimately the neural network. This was a great exercise in tuning neural networks, as the model was initially not learning at all. After much regularization and experimenting with the learning rate, I found that greatly increasing the batch size was the key to getting an effective model. The training AUC score was 0.82 - obviously not perfect, but not a bad result for a tricky data set like this one. The competition asked for the best AUC score. In the end, I think it may be more effective to focus just on sensitivity, so that the focus is on correctly predicting outbreaks of the virus instead of also optimizing predictions of the common case where no virus is found.
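A minimal sketch of the bootstrapping step using scikit-learn's resample to upsample the virus-present rows; the file name, target column name, and the even class balance are assumptions, since the exact ratio used isn't specified here.

```python
import pandas as pd
from sklearn.utils import resample

# Merged training data; 'WnvPresent' is the binary target column.
train = pd.read_csv("train_merged.csv")
positive = train[train["WnvPresent"] == 1]
negative = train[train["WnvPresent"] == 0]

# Sample the minority (virus-present) class with replacement until the
# classes are even, then shuffle the combined frame.
positive_boot = resample(positive, replace=True,
                         n_samples=len(negative), random_state=42)
balanced = (pd.concat([negative, positive_boot])
            .sample(frac=1, random_state=42))
```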
Concepts and Skills Used Pandas SKLearn TensorFlow Keras Bootstrapping AUC ROC Random Forest Feed Forward Neural Networks
Reddit User Engagement
Overview This project was an exercise in web scraping and natural language processing, as well as comparing random forests with other classifiers. The goal was to use data scraped from reddit to analyze drivers of user engagement, measured by number of comments. For this study, I classified the target into a binary variable (above or below the median number of comments), with accuracy as the scoring metric.
Data HTML data was scraped from reddit (http://www.reddit.com). The web scraper retrieved data from the 'front page' of reddit and the 20 pages that follow it. The HTML was then processed into key variables through feature selection and put into a pandas dataframe. Lastly, feature engineering was performed on the titles of the posts.
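A rough sketch of the scraping step with requests and BeautifulSoup; the CSS classes reflect reddit's old markup and may have changed, so treat the selectors as assumptions.

```python
import requests
from bs4 import BeautifulSoup

# Reddit asks scripted clients to identify themselves with a User-Agent.
headers = {"User-Agent": "reddit-engagement-study"}
resp = requests.get("https://old.reddit.com/", headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

posts = []
for thing in soup.select("div.thing"):          # one div per post
    title = thing.select_one("a.title")
    comments = thing.select_one("a.comments")   # text like "123 comments"
    if title and comments:
        count = comments.text.split()[0]
        posts.append({"title": title.text,
                      "comments": int(count) if count.isdigit() else 0})
```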
Key Insights Both the random forest and the logistic regression achieved accuracy scores in the low 70s, a clear improvement over the 50% baseline. Part of the model tuning was the decision to include a CountVectorizer. In this case, I do not believe the dataset has enough word data for the CountVectorizer to be effective. For effective use of the CountVectorizer, more data should be gathered, or the analysis should be performed on posts within a single subreddit (topic), where certain buzzwords and trends are more likely to exist.
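For illustration, a minimal sketch of wiring the post titles into a classifier through a CountVectorizer, assuming a hypothetical DataFrame with 'title' text and a binary 'above_median' target.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Scraped posts with a 'title' column and a binary 'above_median' target.
df = pd.read_csv("reddit_posts.csv")

pipe = make_pipeline(
    CountVectorizer(stop_words="english", min_df=2),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, df["title"], df["above_median"],
                         cv=5, scoring="accuracy")
print(scores.mean())
```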
Skills and Concepts Used Pandas SKLearn BeautifulSoup Regex CountVectorizer Feature Engineering Random Forest Logistic Regression
Iowa Liquor Sales
Overview Iowa liquor sales! This report provides an analysis of transactional and demographic data to predict total sales by liquor stores in Iowa. The goal was to use this information to make location recommendations for building new liquor stores in Iowa. I combined liquor data with demographic data to test whether demographic data could predict total store sales. For this study, I used linear regression to model sales because I wanted to make inferences about the relationships between my variables and my target: the aim was to find the most predictive demographic measures and then locate the areas with the best combinations of those features.
Data The data was sourced from the state of Iowa and grouped to the store level in order to make inferences about sales by store (a sketch of this aggregation and the regression follows the dataset descriptions below). Two datasets were used: Iowa liquor transactions and Iowa demographic data.
Iowa liquor transactions — Provided by the state of Iowa, consists of every class E liquor transaction in Iowa from January 2015 to March 2016. Data includes store info and address, liquor type and quantity, and cost to store and buyer.
Iowa demographic data — Pulled from the Iowa State Data Center, a combination of demographic data organized by county. Data was pulled from the 'American Community Survey' section of the ISDC's website.
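As a sketch of the store-level aggregation and the inferential regression, the snippet below uses statsmodels OLS; the file names and demographic columns are placeholders, since the exact feature set isn't listed here.

```python
import pandas as pd
import statsmodels.api as sm

# Aggregate transactions to one row per store, then attach county-level
# demographics (file and column names are placeholders).
transactions = pd.read_csv("iowa_liquor_transactions.csv")
demographics = pd.read_csv("iowa_county_demographics.csv")

store_sales = (transactions
               .groupby(["store_id", "county"], as_index=False)["sale_dollars"]
               .sum())
data = store_sales.merge(demographics, on="county", how="left")

# Ordinary least squares with an intercept; the summary reports the
# coefficients and p-values used to judge each demographic feature.
X = sm.add_constant(data[["median_income", "population", "pct_over_21"]])
ols = sm.OLS(data["sale_dollars"], X, missing="drop").fit()
print(ols.summary())
```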
Key Insights While the study initially examines the sales-oriented data, its main focus is the demographic data. Location alone is likely to be correlated with sales without causing them, and for that reason the report analyzes the demographic data (quantifiable population statistics that can be categorized by location) to infer whether a location can predict sales. The hypothesis was that demographic data would help explain the variability of store sales. The results of the study unfortunately tell a different tale: models using the demographic data had poor explanatory power on the sales of a store. As such, I concluded that demographic data at the county level cannot reliably be used to predict store sales. Polk, Linn, and Scott counties were selected as target regions for new stores based on their relative outperformance in sales and undersaturation on a stores-per-county basis. Further assessment should examine intra-county data to find the least saturated points within these counties.