Saturday 3 August 2019

Using Python to Analyse a Dream Diary

A few years ago I started keeping a dream diary. The volume of writing has now grown to the point where I thought it would be useful to analyse it systematically. This post describes how I've used the Python programming language to do this. The quantitative analysis of writing, and of dreams in particular, already has a fair bit of academic literature and tooling dedicated to it. I took a couple of approaches developed by other people (referenced below) and combined them into a Python script for personal use.

Here is a quick summary of what the script does:

  • Takes a text file containing a collection of dreams and quantifies their contents. This gives the frequency of words related to categories like cognitive processes, positive/negative emotions, social interactions, sensory experiences etc.
  • Compares these scores to those derived from dreams of the general population. This makes the scores meaningful by showing which aspects of an individual's dream life are statistically different from the baselines found by researchers.
  • Plots changes in the content of dreams over time. This can be used to explore the temporal development of specific themes (e.g. changes in emotional valence over several months/years).
  • Plots a network of names (people, places), where nodes represent names and links represent the occurrence of these entities together in the same dream.
Below I describe each of these functions and present examples of insights from an individual's dream diary. If you want to try it for yourself, the script can be found on GitHub, along with the example diary. This is my first time sharing code, so all feedback is welcome. The script can also be tweaked to work with a normal diary or expressive writing, and baselines from the general population can be added for comparison. I suggest having a corpus of at least 100 diary entries to work with.

Dream Content

To analyse the content of dreams, the script uses the Linguistic Inquiry and Word Count (LIWC) system. This was developed by psychologist James Pennebaker and colleagues to automatically analyse text samples and calculate the frequencies of words used in different domains. The intention was to identify sets of words that reflect basic emotional and cognitive dimensions. The latest version of LIWC (2015) has over 70 categories including emotions, cognition, personal concerns, work and leisure activities, as well as grammar and vocabulary dimensions like the use of pronouns and verbs. Each category is triggered when the input text features a relevant word. For instance, the words “dish”, “eat” or “pizza” will contribute to the score for the “ingestion” category, and the words “worried” or “fearful” will count towards the “anxiety” category. Some words belong to more than one category. For example, the word “cried” will increment the scores for categories such as “sadness”, “negative emotion” and “past focus”. These scores are expressed as the percentage of words in the text sample that are related to each category.
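To make the scoring idea concrete, here is a minimal sketch of how category percentages could be computed. It is not the code from my script or from LIWC itself: the tiny `TOY_DICTIONARY` and the function name `category_percentages` are made up for illustration, and the real LIWC dictionary is far larger.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary (word -> categories). NOT the real LIWC lexicon.
TOY_DICTIONARY = {
    "eat": ["ingestion"],
    "pizza": ["ingestion"],
    "worried": ["anxiety", "negative emotion"],
    "cried": ["sadness", "negative emotion", "past focus"],
}


def category_percentages(text, dictionary):
    """Return {category: percentage of all words that belong to it}."""
    # Simple tokenisation: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter()
    for word in words:
        # A word may belong to several categories; each one is incremented.
        for category in dictionary.get(word, []):
            counts[category] += 1
    return {cat: 100.0 * n / len(words) for cat, n in counts.items()}


scores = category_percentages(
    "I cried because I was worried about the pizza", TOY_DICTIONARY
)
for category, pct in sorted(scores.items()):
    print(f"{category}: {pct:.1f}%")
```

The example sentence has nine words, two of which ("cried" and "worried") count towards "negative emotion", so that category scores 2/9 of the words, or about 22%.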

The LIWC scoring system is a proprietary piece of software. The script presented here uses only the LIWC dictionary, which is a text file with a long list of words and the categories to which they belong. This file is fed into Python and category scores are calculated using LIWC dictionary reader functions developed by Sean Rife. I tweaked his script to work with the 2015 version of the LIWC dictionary.
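As a rough illustration of what reading such a dictionary file involves, the sketch below parses a small made-up sample. The layout is an assumption on my part (category ids and names between two "%" lines, then one word per line followed by tab-separated category ids), and `parse_dic` is a hypothetical name, not one of Sean Rife's functions. Check your own copy of the dictionary before relying on it.

```python
def parse_dic(lines):
    """Parse a LIWC-style dictionary into {word: [category names]}.

    ASSUMED layout: a header of "id<TAB>name" lines enclosed by two "%"
    lines, followed by "word<TAB>id<TAB>id..." lines.
    """
    categories = {}
    lexicon = {}
    percent_lines_seen = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            percent_lines_seen += 1
            continue
        parts = line.split("\t")
        if percent_lines_seen == 1 and len(parts) >= 2:
            # Header section: map category id -> category name.
            categories[parts[0]] = parts[1]
        elif percent_lines_seen >= 2:
            # Word section: map word -> names of its categories.
            word, ids = parts[0], parts[1:]
            lexicon[word] = [categories[i] for i in ids if i in categories]
    return lexicon


# Made-up sample in the assumed layout (category names are placeholders).
SAMPLE = "%\n1\tsad\n2\tnegemo\n3\tfocuspast\n%\ncried\t1\t2\t3\nworr*\t2\n"
lexicon = parse_dic(SAMPLE.splitlines())
print(lexicon)
```

Entries ending in "*" are word stems meant to match any word with that prefix. This sketch stores them as they are, so a scoring function would need to handle the prefix matching separately.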

Comparison to Baselines

Researchers have applied the LIWC system of analysis to study different kinds of text samples (e.g. blogs, novels, expressive writing). Most relevant here are the recent results of psychologists Kelly Bulkeley and Mark Graves (2018), who used LIWC to study the linguistic properties of dreams. Using over five thousand dream reports from a diverse collection of people, the authors identified baseline rates for the usage of each LIWC category. These baselines enable us to compare the linguistic contents of our personal dreams with those of a “normal” person, to highlight distinguishing aspects that may be unique to us as individuals.

In order to quantify the difference between a person's scores and population baselines, the script uses measurements known as Z-scores. A Z-score expresses how far an individual's value lies from the population average, in units of the population's standard deviation, so it takes the variability of scores in the population into account. If the absolute value of a Z-score exceeds 1.96, the difference is statistically significant at the 5% level (two-tailed), assuming the scores are roughly normally distributed.
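The calculation itself is short. Here is a minimal sketch, assuming each baseline is available as a mean and a standard deviation. The numbers below are placeholders that I made up for illustration; they are not the values reported by Bulkeley and Graves, and the variable names are hypothetical.

```python
def z_score(value, baseline_mean, baseline_sd):
    """Distance of a value from the baseline mean, in standard deviations."""
    return (value - baseline_mean) / baseline_sd


# Placeholder data: category -> personal score (% of words).
personal = {"affect": 2.3, "you": 1.9}
# Placeholder data: category -> (baseline mean, baseline standard deviation).
baselines = {"affect": (4.0, 1.5), "you": (0.5, 0.6)}

z = {cat: z_score(personal[cat], *baselines[cat]) for cat in personal}

# Flag categories whose |Z| exceeds the two-tailed 5% threshold.
flagged = [cat for cat, score in z.items() if abs(score) > 1.96]
print(z)
print(flagged)
```

With these made-up numbers only "you" is flagged, because it lies more than 1.96 standard deviations above its baseline, while "affect" is lower than its baseline but stays within the threshold.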

Below is an example output produced using a publicly available diary containing 315 dreams from a woman named Merri. I scraped her diary from the DreamBank repository managed by Adam Schneider & William Domhoff.




The plot displays only the 10 LIWC categories that show the greatest difference between the individual's dream diary and the population baselines. The height of each bar corresponds to the degree to which the category score differs from the population average (either higher or lower). The red dashed line marks the 95% significance threshold, which a bar exceeds when the difference is statistically significant.

From the above plot we can see that Merri's use of words generally lies within the expected thresholds. Still, some interesting differences can be observed. For example, she uses fewer function words, more 2nd person pronouns ("you") and fewer 1st person singulars ("i") compared to the general population. For more insight into what this means, check out this great TED talk on the psychology of pronouns.


Temporal Analysis

To examine changes in dream content over time, the script splits the diary into quarterly chunks (3 months each) and calculates the LIWC scores for every period. It then identifies and plots the categories that show the greatest change over time.
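The quarterly split can be sketched as follows. This assumes each diary entry is available as a (date, text) pair; the function names and the example entries are hypothetical and not taken from my script. Each group of texts would then be scored with LIWC as one sample.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d):
    """Return a label like '1999-Q3' for a date."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def group_by_quarter(entries):
    """Group (date, text) pairs into {quarter label: [texts]}, in time order."""
    groups = defaultdict(list)
    for entry_date, text in entries:
        groups[quarter_label(entry_date)].append(text)
    # Labels sort chronologically because the four-digit year comes first.
    return dict(sorted(groups.items()))


# Made-up entries for illustration.
entries = [
    (date(1999, 11, 20), "third dream text"),
    (date(1999, 7, 14), "first dream text"),
    (date(1999, 9, 2), "second dream text"),
]
quarters = group_by_quarter(entries)
for label, texts in quarters.items():
    print(label, len(texts))
```

Quarters with very few dreams will give noisy percentages, which is one reason a larger corpus helps.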


Merri's dream diary shows a general decline in 2nd person pronouns ("you"), a decline in conjunctions ("conj") and a gradual increase in filler words over time.

If there is a specific category that interests you that isn't captured in the automated plot, it can be plotted separately like so:


The "affect" category covers words related to affective processes, including positive and negative emotions. From the above plot, we see that Merri's references to affect peaked in the third quarter of 1999 and have declined since then. Note that these changes are small in absolute terms: the frequency drops from roughly 2.8% to 1.8% of all words.


Extraction of Names

Dream diaries often include references to people or places by their names. It is possible to use named entity recognition functions from the nltk library in Python to automatically extract all the names mentioned in a diary. I used these to make some network visualisations. 

In the image below, names are represented by nodes and the links show which names have appeared together in a dream. This can give a sense of whether certain people or places play a central role or have a tendency to cluster together. The script produces a very basic network plot like this:

This particular plot shows only names that occur in at least 3 dreams and co-occur with their neighbours beyond a certain level (these thresholds can be adjusted). We see hubs formed around Merri's two siblings, Dora and Rudy, whose names frequently get mentioned alongside other characters and place names.
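The thresholding described above can be sketched with the standard library alone. This assumes the names have already been extracted, one list per dream. The function name, the default thresholds and the example data (including the place name "Paris") are made up for illustration and do not come from Merri's diary or from my script.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(dream_names, min_dreams=3, min_cooccurrence=2):
    """Return {(name_a, name_b): number of dreams in which both appear}.

    Names appearing in fewer than `min_dreams` dreams are dropped, as are
    pairs that co-occur in fewer than `min_cooccurrence` dreams.
    """
    # Count the number of dreams each name appears in (once per dream).
    name_counts = Counter()
    for names in dream_names:
        name_counts.update(set(names))
    kept = {name for name, n in name_counts.items() if n >= min_dreams}

    # Count co-occurrences among the remaining names.
    pair_counts = Counter()
    for names in dream_names:
        present = sorted(set(names) & kept)  # sorted so pairs are consistent
        pair_counts.update(combinations(present, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}


# Made-up data: one list of extracted names per dream.
dream_names = [
    ["Dora", "Rudy"],
    ["Dora", "Rudy", "Paris"],
    ["Dora", "Paris"],
    ["Rudy"],
    ["Dora", "Rudy"],
]
edges = cooccurrence_edges(dream_names)
print(edges)
```

Here "Paris" is dropped because it appears in only two dreams, leaving a single link between Dora and Rudy with a weight of 3. The resulting pairs and counts can then be handed to a graph-plotting library as weighted edges.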

There is a lot that can be improved in the visualisations and analyses. Nonetheless, I hope this post gives a sense of some of the dream insights that can be automated with Python.

I'd love to know if this project will be useful or interesting to some of you. If you have any feedback or suggestions please let me know!

GitHub repository: https://github.com/mpriestley/dream-analysis

References:
Bulkeley, K., & Graves, M. (2018). Using the LIWC program to study dreams. Dreaming, 28(1), 43.
Pennebaker, J.W., Boyd, R.L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Austin, TX: University of Texas at Austin.
Sean Rife's PsyLex functions for reading a LIWC dictionary in Python: https://github.com/seanrife/psyLex
DreamBank repository of dream reports: http://www.dreambank.net/
