US Unemployment Rates

Initial Data

The historical data collected for this project comes from two sources: the Bureau of Labor Statistics (BLS) and the Federal Reserve Bank of St. Louis' Federal Reserve Economic Data (FRED) database. These two sources provided the foundation for the predictions. The categories used in the predictions include the historical unemployment rates from 2000 to 2021, recorded by race, gender, and education. The data was retrieved from each site's API using API keys, cleaned, and then placed into a PostgreSQL database so that it could be used properly in the machine learning models that follow.
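The retrieval and cleaning step might look roughly like the sketch below. It builds a request URL for the FRED series observations endpoint and cleans a JSON payload without making a network call. The series ID `UNRATE`, the helper names, and the sample payload are assumptions for illustration (the payload values are placeholders, not real figures), and the team's actual scripts are not shown in this write-up.

```python
import json
from urllib.parse import urlencode

# FRED endpoint for the observations of a single series.
FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(series_id, api_key, start="2000-01-01", end="2021-12-31"):
    """Build the request URL for one series (hypothetical helper)."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
        "observation_end": end,
    }
    return FRED_OBSERVATIONS_URL + "?" + urlencode(params)


def clean_observations(payload):
    """Keep (date, rate) pairs, skipping rows where the value is missing ('.')."""
    rows = []
    for obs in payload.get("observations", []):
        if obs["value"] == ".":
            continue
        rows.append((obs["date"], float(obs["value"])))
    return rows


# Placeholder payload in the same shape as an API response; values are made up.
sample_payload = json.loads(
    '{"observations": ['
    '{"date": "2000-01-01", "value": "4.0"},'
    '{"date": "2000-02-01", "value": "."},'
    '{"date": "2000-03-01", "value": "4.2"}]}'
)

url = build_fred_url("UNRATE", "YOUR_API_KEY")  # assumed series ID, dummy key
rows = clean_observations(sample_payload)
```

The cleaned rows could then be written to CSV or inserted into the database.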

The Database

Before the data could be used in the models, it needed to be stored in a location that all team members could access. The team chose Amazon Web Services (AWS), using an RDS cloud database together with an S3 bucket that housed the cleaned CSV files. From there, the team connected the S3 bucket to a local PostgreSQL database. Queries were used to combine tables into data sets that could be run through the machine learning models the team tested. The files stored in the S3 bucket and the tables managed through pgAdmin were easily accessed from Google Colab with PySpark to review the cleaned and sorted data prior to testing.
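A query that combines two tables into one modeling data set can be sketched as follows. The example uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL so that it runs anywhere. The table names, column names, and values are hypothetical placeholders, not the project's schema or data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the project's PostgreSQL database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: one row per month in each.
cur.execute("CREATE TABLE rates_by_gender (month TEXT PRIMARY KEY, men REAL, women REAL)")
cur.execute("CREATE TABLE rates_by_education (month TEXT PRIMARY KEY, high_school REAL, bachelors REAL)")

# Placeholder values, not real unemployment figures.
cur.executemany(
    "INSERT INTO rates_by_gender VALUES (?, ?, ?)",
    [("2000-01", 4.0, 4.1), ("2000-02", 4.1, 4.2)],
)
cur.executemany(
    "INSERT INTO rates_by_education VALUES (?, ?, ?)",
    [("2000-01", 3.5, 1.8), ("2000-02", 3.6, 1.7)],
)

# Join the two tables on month to get one combined row per month.
combined = cur.execute(
    """
    SELECT g.month, g.men, g.women, e.high_school, e.bachelors
    FROM rates_by_gender AS g
    INNER JOIN rates_by_education AS e ON g.month = e.month
    ORDER BY g.month
    """
).fetchall()

conn.close()
```

The same SELECT statement with an INNER JOIN would work unchanged in PostgreSQL.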

The First Machine Learning Model - KNN

Because the unemployment data is continuous, it was important to use machine learning methods that produce numerical results. The first model implemented was the K-Nearest Neighbors (KNN) algorithm. KNN can be used for either regression or classification; for this prediction project the focus was on regression. The data collected from the API calls was split into training and testing sets to measure the accuracy of the model. The first test used data from 2001 to attempt to predict the rates for 2002. It was discovered that the model was overfitting and could not predict future dates with the accuracy the project needed.
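The overfitting problem can be illustrated with a minimal KNN regressor written from scratch. This is a sketch under stated assumptions: it uses a month index as the only feature, a synthetic upward trend as the rate, and k = 1. The features and settings the team actually used are not given in this write-up.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_abs_error(xs, ys, train_x, train_y, k):
    """Mean absolute error of KNN predictions for the points (xs, ys)."""
    errors = [abs(knn_predict(train_x, train_y, x, k) - y) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Synthetic placeholder data: 24 months with a steady upward trend.
months = list(range(24))
rates = [5.0 + 0.1 * m for m in months]

# Chronological split: the first year trains, the second year tests.
train_x, train_y = months[:12], rates[:12]
test_x, test_y = months[12:], rates[12:]

# With k=1 every training point is its own nearest neighbor, so the
# training error is zero even though the model has learned no trend.
train_error = mean_abs_error(train_x, train_y, train_x, train_y, k=1)
test_error = mean_abs_error(test_x, test_y, train_x, train_y, k=1)

# Any query beyond the training range gets the last observed value back.
far_future = knn_predict(train_x, train_y, 100, k=1)
```

In this setup the training error is perfect while every future month receives the last value seen in training, which is one way a KNN model can score well and still fail to predict future dates.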

The Second Machine Learning Model - SVR

Another approach the team decided to take was the machine learning model known as Support Vector Regression (SVR). This method produces continuous numeric predictions, which are the most useful kind of output for predicting unemployment rates within the scope of the project. The initial dataset used to train and test the SVR model was sourced from an API pull containing multiple features, in the hope of gaining the best outcome. The first run seemed to work and produced promising results. However, those results could not be replicated in any later run. The team decided it was best to move on from this method and try the third machine learning model.
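The write-up does not say why the SVR results could not be replicated. One common cause in workflows like this is a random train/test split that changes on every run. The sketch below shows a seeded split that returns the same partition each time. The function name and parameters are hypothetical, and this is an assumption about a possible cause, not a diagnosis of the team's model.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=None):
    """Shuffle rows with a dedicated random generator and split them.

    Passing the same seed reproduces the same train/test partition.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # placeholder row identifiers

# Two runs with the same seed give identical splits.
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
```

Recording the seed alongside the results makes it possible to rerun a promising experiment exactly.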

The Third Machine Learning Model - ARIMA

Once it was decided that the first model, KNN, would not create an accurate prediction and the SVR results could not be reproduced, a new method was selected. The next model explored was the AutoRegressive Integrated Moving Average (ARIMA) model, a time series forecasting model. Forecasting is a popular approach for predicting the future values a series will take. A time series can have a yearly, quarterly, monthly, weekly, daily, hourly, per-minute, or per-second frequency. Because most reporting agencies record unemployment rates monthly and yearly, this model best fits the structure of the available data sets. ARIMA also requires that the data be stationary for best results during training and testing.

The overall unemployment rates CSV file was used as the sample data for three main tests of the ARIMA method. In the initial test, the unemployment rates predicted for the next twelve months were suspiciously high, ranging from six to seventeen percent. It is suspected that the team did not yet understand the method well enough to make appropriate adjustments. The second attempt followed methods similar to those used for predicting temperature; however, the model returned a consistent result of 6.7% for every month. When adjustments were made, the predicted unemployment rate stayed constant at 3.9%. This rate was in line with the historical data. However, the lack of variation in the results of the second test leads the team to believe that the model was not successful, despite being within the range of acceptable results and producing accurate training and testing scores. The final attempt could not be moved past the training and testing phase, as it did not produce future predictions like the previous two attempts.

Overall, the model seemed extremely promising based on the research performed, but did not meet the team's standards and expectations.
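The role of differencing, and one way a forecast can flatten out, can be sketched with a very small ARIMA-style model written from scratch: first-difference the series (the "integrated" step), fit a single autoregressive coefficient on the differences by least squares, and roll the forecast forward. This corresponds roughly to ARIMA(1,1,0) without a constant. The function names and the sample series are placeholders, and this is not the configuration the team used.

```python
def difference(series):
    """First differences, a common step toward making a series stationary."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares AR(1) coefficient (no intercept) for a list of values."""
    prev, curr = values[:-1], values[1:]
    denom = sum(p * p for p in prev)
    if denom == 0:
        return 0.0
    return sum(p * c for p, c in zip(prev, curr)) / denom


def forecast(series, steps):
    """Forecast by predicting future differences and adding them back up."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff  # predicted next change
        level += last_diff           # undo the differencing
        predictions.append(level)
    return predictions


# Placeholder series, not real unemployment rates.
series = [4.0, 4.2, 4.3, 4.35, 4.375]
phi = fit_ar1(difference(series))
next_months = forecast(series, 3)

# A series with no month-to-month change yields a flat forecast.
flat = forecast([4.0, 4.0, 4.0], 3)
```

When the fitted coefficient is close to zero, every forecast month stays at about the last observed value. That behavior resembles the constant predictions described above, although whether it explains them is not established here.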

The Results

After exploring the KNN and ARIMA machine learning models, the team found that both had their pros and cons. The ARIMA model best fit the goals of the project, since it is built to forecast values recorded at regular time frequencies. Yet when the model was put through training and testing, its predictions were either too high or the same percentage for every month. The KNN model, on the other hand, did not have the same issues when predicting future unemployment rates, and testing showed high accuracy. That high accuracy raised concerns about overfitting, because only a single CSV file was used as sample data. In both cases, more time and further research would provide a clearer picture of which model could accurately predict the future unemployment rate in the United States based on a multitude of factors.