Rampant misinformation—often called “fake news”—is one of the defining features of contemporary democratic life.

An alien man brandishing a data stick and screaming It's a FAAAAKE!
Senator Vreenak of the Romulan Federation finding out that the article he read about Jewish Space Lasers wasn't completely accurate.

In this Blog Post, you will develop and assess a fake news classifier using Tensorflow.

Note: Working on this Blog Post in Google Colab is highly recommended.

Data Source

Our data for this assignment comes from the article

  • Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

I accessed it from Kaggle. I have done a small amount of data cleaning for you already, and performed a train-test split.

§1. Acquire Training Data

I have hosted a training data set at the below URL. You can either read it into Python directly (via pd.read_csv()) or download it to your computer and read it from disk.

train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"

Each row of the data corresponds to an article. The title column gives the title of the article, while the text column gives the full article text. The final column, called fake, is 0 if the article is true and 1 if the article contains fake news, as determined by the authors of the paper above.

§2. Make a Dataset

Write a function called make_dataset. This function should do two things:

  1. Remove stopwords from the article text and title. A stopword is a word that is usually considered to be uninformative, such as “the,” “and,” or “but.” You may find this StackOverFlow thread to be helpful.
  2. Construct and return a tf.data.Dataset with two inputs and one output. The input should be of the form (title, text), and the output should consist only of the fake column. You may find it helpful to consult these lecture notes or this tutorial for reference on how to construct and use Datasets with multiple inputs.

Call the function make_dataset on your training dataframe to produce a Dataset. You may wish to batch your Dataset prior to returning it, which can be done like this: my_data_set.batch(100). Batching causes your model to train on chunks of data rather than individual rows. This can sometimes reduce accuracy, but can also greatly increase the speed of training. Finding a balance is key. I found batches of 100 rows to work well.

Validation Data

After you’ve constructed your primary Dataset, split of 20% of it to use for validation.

Base Rate

Recall that the base rate refers to the accuracy of a model that always makes the same guess (for example, such a model might always say “fake news!”). Determine the base rate for this data set by examining the labels on the training set.

§3. Create Models

Please use TensorFlow models to offer a perspective on the following question:

When detecting fake news, is it most effective to focus on only the title of the article, the full text of the article, or both?

To address this question, create three (3) TensorFlow models.

  1. In the first model, you should use only the article title as an input.
  2. In the second model, you should use only the article text as an input.
  3. In the third model, you should use both the article title and the article text as input.

Train your models on the training data until they appear to be “fully” trained. Assess and compare their performance. Make sure to include a visualization of the training histories.

Notes

  • For the first two models, you don’t have to create new Datasets. Instead, just specify the inputs to the keras.Model appropriately, and TensorFlow will automatically ignore the unused inputs in the Dataset.
  • The lecture notes and tutorials linked above are likely to be helpful as you are creating your models as well.
  • You will need to use the Functional API, rather than the Sequential API, for this modeling task.
  • When using the Functional API, it is possible to use the same layer in multiple parts of your model; see this tutorial for examples. I recommended that you share an embedding layer for both the article title and text inputs.
  • You may encounter overfitting, in which case Dropout layers can help.

You’re free to be creative when designing your models. If you’re feeling very stuck, start with some of the pipelines for processing text that we’ve seen in lecture, and iterate from there. Please include in your discussion some of the things that you tried and how you determined the models you used.

What Accuracy Should You Aim For?

Your three different models might have noticeably different performance. Your best model should be able to consistently score at least 97% validation accuracy.

After comparing the performance of each model on validation data, make a recommendation regarding the question at the beginning of this section. Should algorithms use the title, the text, or both when seeking to detect fake news?

§4. Model Evaluation

Now we’ll test your model performance on unseen test data. For this part, you can focus on your best model, and ignore the other two.

Once you’re satisfied with your best model’s performance on validation data, download the test data here:

test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"

You’ll need to convert this data using the make_dataset function you defined in Part §2. Then, evaluate your model on the data. If we used your model as a fake news detector, how often would we be right?

§5. Embedding Visualization

Visualize and comment on the embedding that your model learned (you did use an embedding, right?). Are you able to find any interesting patterns or associations in the words that the model found useful when distinguishing real news from fake news? You are welcome to use either 2-dimensional or 3-dimensional embedding. Comment on at least 5 words whose location in the embedding you find interpretable.

I’d suggest that you create an embedding in a relatively large number of dimensions (say, 10) and then use PCA to reduce the dimension down to a visualizable number. This procedure was demonstrated in lecture.

Specifications

Please remember that you must meet all specifications in order to receive credit on the first submission!

Coding Problem

Data Prep

  1. Stopwords are removed during the construction of the data set.
  2. make_dataset is implemented as a function, and used to create both the training/validation and testing data sets.
  3. The constructed Dataset has multiple inputs.
  4. 20% of the training data is split off for validation.
  5. There is a comment on the base rate for the data set.

Models

  1. Model 1 uses only the article title.
  2. Model 2 uses only the article text.
  3. Model 3 uses both the article title and text.
  4. The training history is shown for each of the three models, including the training and validation performance.
  5. The most performant model is evaluated on the test data set.
  6. The best model consistently obtains at least 97% accuracy on the validation set.
  7. The best model’s performance on the test set is shown.

Embedding Visualization

  1. A visualization of the learned word embedding is shown.
  2. The written text discusses at least 5 words whose location is interpretable within the embedding.

Style and Documentation

  1. Code throughout is written using minimal repetition and clean style.
  2. Docstrings are not required in this Blog Post, but please make sure to include useful comments and detailed explanations for each of your code blocks.
  3. Any repeated operations should be enclosed in functions, regardless of whether they are explicitly required in the instructions.

Writing

  1. The blog post is written in tutorial format, in engaging and clear English. Grammar and spelling errors are acceptable within reason.