Learning objectives

This guide will teach you some basic Python syntax and one of the ways to use Python to get and parse email data. You will also be using the pandas Python library. In addition to learning Python and pandas, you will learn how to perform some basic data analysis on email, providing you with some visualizations of that data.

Prerequisites

Before beginning this tutorial, you’ll need to download and install on your system:

Estimated time

Completing this how-to should take approximately 1 hour.

Steps

Learn or review some basic Python syntax

Before jumping right into the data analysis, you need to understand the tools used in this article. In this case, that means Python and the Python library, pandas. Python is an interpreted language, which means that instead of compiling a program to an executable file, Python interprets your code, line by line, at the time of execution. From a practical standpoint, it doesn’t usually matter too much whether a language is interpreted or compiled, but it’s important to know what you are working with. We’re first going to go over some basic Python syntax, just so we’re all on the same page, then we’ll take a look at pandas syntax before we use pandas for data analysis in the subsequent sections.

Basic Python syntax

It’s important to understand the data types and data structures used in a language before getting too far into learning the syntax. Integers, long integers, floats, and complex numbers are all available for use in Python. Strings in Python are similar to those in other scripting languages. The data structures used most in Python are lists, tuples, sets, and dictionaries. The code block in Listing 1 shows that you can create basic variables using just the = operator. The last line of code in Listing 1 (line 8) shows that strings can be sliced into subsets of characters, creating smaller strings.

Listing 1. Some Python basics

inter = 10
longer = 10L
float = 123.456
com = 40.1J
pie = "The number Pi"
print inter; print longer; print float; print com; print pie
print type(inter); print type(longer); print type(float); print type(com); print type(pie)
print pie[3:8]
>>>
10
10
123.456
40.1j
The number Pi
<type 'int'>
<type 'long'>
<type 'float'>
<type 'complex'>
<type 'str'>
 numb

Listing 2 shows the use of a list, a tuple, and a dictionary. Lists can contain elements of different types, and individual elements can be returned or changed using their numerical index (beginning with 0). Tuples can be handled in many of the same ways as a list, but their values cannot change after they have been assigned. Dictionaries are a collection of key-value pairs that can also be accessed in ways similar to that of a list or tuple.

Listing 2. A list, a tuple, and a dictionary

Lis = [inter,longer,"Element"]
print Lis[1]
Lis[1] = 6.0
print Lis[1]
Counts = (1,2,3,4,5,6,7,8,9)
print Counts[8]
Ages = {"Erik": 27,"Stewart": 23, "Nora": 45}
print Ages["Erik"]
Ages["Erik"] = "Pir"
print Ages["Erik"]
x = "Monty"
y = "Python"
print x+" "+y
>>>
10
6.0
9
27
Pir
Monty Python
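
As a small illustration of the difference between lists and tuples described above, the following sketch tries to change an element of each. The variable names are made up for this example, and the code is written for Python 3 (print is called as a function), unlike the Python 2 listings in this article.

```python
# A list can be changed in place; a tuple cannot.
shopping = ["eggs", "milk", "bread"]   # hypothetical list
shopping[1] = "tea"                    # allowed: lists are mutable

counts = (1, 2, 3)                     # hypothetical tuple
try:
    counts[1] = 20                     # not allowed: tuples are immutable
    changed = True
except TypeError as err:
    changed = False
    print("Tuples cannot be modified:", err)

print(shopping)
print(counts)
```

Attempting the assignment on the tuple raises a TypeError, which is why the tuple is unchanged afterward.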

So, how does this all get put together to create useful scripts? The Python interpreter, or parser, reads a Python script one line at a time, executing each line as it reads it. To Python, a line ends with the token NEWLINE (\n). If you want to write a single statement across multiple lines, you can end each line with a backslash (\); however, it is preferable to use an unclosed parenthesis instead. Sometimes code is easier to read when a statement is split across lines, so even a statement short enough to fit on one line might be broken up to improve readability. An example is shown in Listing 3.

Listing 3. Multiple lines of code

rent = 1000; utilities = 500; income = 3500; debt_payment = 100; food_and_transport = 500;
gig_income = 200; interest_income = 100
spending_money =  (income +
(interest_income + gig_income) -
(rent + utilities + debt_payment + food_and_transport))
print spending_money
>>>
1700
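
The two continuation styles mentioned before Listing 3 can be compared side by side. This is a minimal sketch with made-up variable names, written for Python 3; note that the explicit continuation character is the backslash (\).

```python
# Two ways to continue one statement over several lines.
income = 3500
rent = 1000
utilities = 500

# 1. A trailing backslash joins the next line to this one.
left_over_backslash = income - \
    rent - \
    utilities

# 2. An unclosed parenthesis does the same and is usually preferred.
left_over_parens = (income
                    - rent
                    - utilities)

print(left_over_backslash, left_over_parens)
```

Both forms produce the same value. The parenthesized form is less fragile because a stray space after a backslash is a syntax error.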

Listing 2 also showed that strings can be concatenated using the “+” operator. The “:” that follows a logical statement is explored later in this article. Listing 4 shows more basic syntax elements, including how comments in Python are created using a hash mark and how multiple statements can be executed on a single line using a semicolon.

Listing 4. Basic syntax

x = 45
y = 5
print x%y
print y%x
#This is how you combine statements on a single line
print x%y ;print y%x

Indentations are used in Python around blocks in the same way { } are used around blocks in other languages. This is important when defining functions and writing logical statements. The code in Listing 5 shows the use of indentation in the creation of functions and if-else statements.

Listing 5. Functions and if-else statements

def example_function( x ):
  if x == 0:
    print("It's zero")
  else:
    print("It's not zero")
example_function(0)
example_function(1)

The line after a colon has to be indented for the script to work. This indentation can be any number of spaces; the only rule is that the indentation is consistent within each block. The official style guide (PEP 8), however, recommends four spaces.

For-loops also use colons to define blocks. Listing 6 shows the for-loop in action.

Listing 6. The for-loop

def fib( pos ):
  x = 1
  y = 0
  for i in range(1,pos):
      z = x + y
      y = x
      x = z
  print(x)
fib(8)

The fib() function computes the Fibonacci number at the position supplied by the pos parameter. The for-loop iterates over the given range, and then the result is printed rather than returned. Apart from the colon and the focus on indentation, Python should look familiar to anyone who has used a language in the C/C++ family.
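
Because fib() in Listing 6 prints its result, the value cannot be reused by other code. The following variant is a sketch of the same loop that returns the number instead; the name fib_value is made up for this example, and the code is written for Python 3.

```python
def fib_value(pos):
    """Return (rather than print) the Fibonacci number at position pos."""
    x = 1
    y = 0
    for _ in range(1, pos):
        x, y = x + y, x   # advance one step in the sequence
    return x

# Because the value is returned, it can be stored or combined with others.
first_eight = [fib_value(n) for n in range(1, 9)]
print(first_eight)
```

The tuple assignment on the loop's only line replaces the temporary variable z used in Listing 6.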

Before moving on to pandas, here are just a few things to note related to style and Python. There is, of course, a litany of other conventions and expectations to follow when you write Python code, but that’s all in the style guide.

  • It is recommended that Python code should not contain unnecessary white space in brackets, in parentheses, or before and after colons.
  • Naming conventions are pretty simple.
    • Use short, all-lowercase names when naming packages.
    • For classes, use the CapWords convention.
    • Function names are expected to be lowercase with words separated with underscores to improve readability.
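
The naming conventions above might look like this in practice. All names here are hypothetical and chosen only to show the conventions; the code is written for Python 3.

```python
# Hypothetical names that follow the conventions listed above.

class EmailSummary:                      # class: CapWords
    def __init__(self, sender, length):
        self.sender = sender
        self.length = length


def words_from_characters(char_count, chars_per_word=6.1):  # function: lowercase_with_underscores
    """Rough word count, assuming about 6.1 characters per word including a space."""
    return char_count / chars_per_word


summary = EmailSummary("nora@example.com", 610)
print(summary.sender, round(words_from_characters(summary.length)))
```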

Python, like most other coding languages, relies on modules and libraries to perform certain advanced or more specific tasks that are not performed or not easily performed with the standard Python library. One such library, pandas, has many useful data structures and methods for performing data analysis.

pandas

The primary data structures used in pandas are DataFrames and Series. A DataFrame can be thought of as similar to a spreadsheet, and a Series is similar to a standard Python list (see Listing 7).

Listing 7. pandas.Series

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(5), index=[1,2,3,4,5])
print s1
Ages = {"Erik": 27,"Stewart": 23, "Nora": 45}
s2 = pd.Series(Ages,name="Dict Series")
print s2
print s2[1]
print s2["Erik"]
print s2.get("Sheila")
print s2 * s2
print s2[1] * 3

On line 3 in Listing 7, a Series is created by supplying five random numbers and five indexes. When a Series is constructed from an array, the length of the array and the length of the supplied index must be the same. The indexes do not have to be unique, however, and if indexes are not supplied, they are automatically created. You can also create a Series from a dictionary. Recreating the Ages dictionary, you can easily convert it into a Series, and in this case, the name attribute of the Series is set. Lines 8, 9, and 10 in Listing 7 show three different ways to pull individual elements out of a Series. Line 10 shows an attempt to pull out an element that doesn’t exist, so “None” is returned. Line 11 displays how a Series can use vector operations, like multiplication and addition.
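
The element lookups described above can be sketched as follows for Python 3 and a recent pandas release. One substitution is worth flagging: this example uses .iloc for the positional lookup, since recent pandas releases steer integer-position access on a label-indexed Series toward .iloc rather than plain square brackets. The variable names are made up for the example.

```python
import pandas as pd

ages = {"Erik": 27, "Stewart": 23, "Nora": 45}
s = pd.Series(ages, name="Dict Series")

by_label = s["Erik"]            # look up by index label
by_position = s.iloc[1]         # look up by position (second element)
missing = s.get("Sheila")       # label not present, so None is returned
fallback = s.get("Sheila", 0)   # a default can be supplied instead

doubled = s * 2                 # vectorized: applied to every element
print(by_label, by_position, missing, fallback)
print(doubled)
```

Using get() rather than square brackets avoids a KeyError when a label might be absent.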

Listing 8 explores pandas DataFrames. A pandas DataFrame is pretty much a collection of pandas Series, each with the same number of elements. It is also very similar to the data.frame data structure in R, if you are familiar with that. By making another dictionary, Weights, with two of the same keys as the Ages dictionary, you can combine them into a DataFrame by feeding the DataFrame function a list of the dictionaries. The index is set to Age and Weight, although explicitly setting the index is not necessary. The df DataFrame is currently in what is known as the wide format. It can be easily reshaped into the long format using the T (transpose) attribute.

Listing 8. pandas DataFrames

Weights = {"Erik": 175, "Stewart": 190}
dat = [Ages, Weights]
df = pd.DataFrame(dat,index = ["Age","Weight"])
print df
dfs = df.T
print dfs
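
One detail of Listing 8 is easy to miss: the Weights dictionary has no entry for Nora, so pandas fills that cell with a missing value (NaN). The sketch below, written for Python 3 with made-up variable names, makes that visible.

```python
import pandas as pd

ages = {"Erik": 27, "Stewart": 23, "Nora": 45}
weights = {"Erik": 175, "Stewart": 190}          # no entry for Nora

wide = pd.DataFrame([ages, weights], index=["Age", "Weight"])
long = wide.T                                    # transpose: people become rows

print(wide)
print(long)
print(long["Weight"].isna())                     # True only for Nora
```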

If the data is more complex, it might be necessary to use the stack and unstack functions. The use of the stack function to reshape a wide DataFrame into a long format is shown in Listing 9.

Listing 9. The stack function

import urllib2
import io
import matplotlib.pyplot as plt
s = urllib2.urlopen("https://vincentarelbundock.github.io/Rdatasets/csv/car/Cowles.csv").read()
df1=pd.read_csv(io.StringIO(s.decode('utf-8')))
print df1.head()
del df1["Unnamed: 0"]
print df1.head()
plt.scatter(df1.neuroticism, df1.extraversion)
plt.savefig('Scatter.png', bbox_inches='tight')
plt.show()
df3 = df1.stack()
print df3.head()
df2 = df1.pop("sex")
print df2.head()
df1["sex"] = "male"
print df1.head()
>>>
0  neuroticism         16
   extraversion        13
   sex             female
   volunteer           no
1  neuroticism          8
dtype: object
0    female
1      male
2      male
3    female
4      male
Name: sex, dtype: object
   neuroticism  extraversion volunteer   sex
0           16            13        no  male
1            8            14        no  male
2            5            16        no  male
3            8            20        no  male
4            9            19        no  male
   neuroticism  extraversion volunteer
0           16            13        no
1            8            14        no
2            5            16        no
3            8            20        no
4            9            19        no

Line 4 of Listing 9 sends an HTTP GET request to a URL that points to a CSV file with some data. The result of this request is decoded into a format that the pandas method read_csv() recognizes as a CSV. Usually the data is in a local file, and read_csv() is simply fed the file path. In this case, using an online CSV makes this document more portable. Line 7 shows how to remove a column from a DataFrame. In this case, the column held row numbers from the CSV file and is of no use to us. Using the matplotlib.pyplot library, we can create a simple scatter plot of the neuroticism scores and extraversion scores (see Figure 1). The scatter plot suggests there is no relationship between the two. Line 12 shows the use of the stack method, which changes the DataFrame from a wide format to a long format. The remaining lines show how to pop a column off of the DataFrame (line 14) and how to create a column in a DataFrame and assign it a single value (line 16).
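
To experiment with read_csv() and stack() without a network connection, a small CSV can be held in a string. The sketch below is written for Python 3 and reuses the first three pairs of scores from the output above; the variable names are made up. stack() moves the column labels into the row index, giving one value per row and column pair, and unstack() reverses the operation.

```python
import io
import pandas as pd

# The first few score pairs from the output above, as CSV text.
csv_text = """extraversion,neuroticism
13,16
14,8
16,5
"""

scores = pd.read_csv(io.StringIO(csv_text))   # parse CSV text held in memory
stacked = scores.stack()                      # one value per (row, column) pair
restored = stacked.unstack()                  # back to one column per variable

print(stacked)
print(restored)
```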

Figure 1. A scatterplot of scores

There is so much more to Python and pandas than has been covered so far. I strongly recommend digging deep into the documentation for both tools so you can greatly increase the number of tools and techniques in your data analysis toolbox.

Getting and parsing e-mail

There are many, many ways to get communication data using Python, and there are also many different forms of communication: texts, comments, video uploads, and so on. This section shows only one of the ways to use Python to get and parse email data.

First, we need to get some emails. You can do this with IMAP and the Python library imaplib. Any personal information has been removed from the script (see Listing 10), so you can either take all of this at my word, or you can use your own information.

  1. This example uses a Gmail account. For the script to work, IMAP needs to be enabled on your Gmail account.

  2. To enable IMAP, first open Gmail. Then click the settings button (settings icon) and select Settings.

  3. Choose the Forwarding and POP/IMAP tab.

  4. In the “IMAP Access” section, select Enable IMAP. Then click Save Changes. If you need more help, visit this Gmail help page.

  5. You also need to change some settings in your Google account. Navigate to your Google dashboard either by clicking on your account avatar in the upper right-hand corner of your screen and then clicking My Account or by navigating to https://myaccount.google.com.

  6. Then choose Sign-in & security, scroll down until you see the option Allow less secure apps, and turn the access on.

  7. After all of this is done, the script should be able to access your email using the imaplib Python library.

Essentially, you’ve now changed your Gmail settings so that when the script sends a request, with the proper credentials, Gmail will attempt to fulfill the request.

Listing 10. imaplib script

import imaplib
import email
import getpass
import pandas as pd
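username = "EMAIL ADDRESS"  # placeholder: replace with your email address
password = getpass.getpass("Password: ")  # prompt for the password instead of hard-coding it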
mail = imaplib.IMAP4_SSL('imap.gmail.com')#EMAIL SERVER
mail.login(username, password)

Now let’s discuss the script itself. First, the username and password are specified. Line 7 connects to the email server and stores a reference to that connection in the mail object. The final line of the code logs into the email server using the mail object. You’re now set to get some emails (see Listing 11).

Listing 11. Getting email

mail.select("inbox")
result, numbers = mail.uid('search', None, "ALL")
uids = numbers[0].split()
result, messages = mail.uid('fetch', ','.join(uids), '(BODY[])')
date_list = []
from_list = []
message_text = []
for _, message in messages[::2]:
  msg = email.message_from_string(message)
  if msg.is_multipart():
    t = []
    for p in msg.get_payload():
      t.append(p.get_payload(decode=True))
    message_text.append(t[0])
  else:
    message_text.append(msg.get_payload(decode=True))
  date_list.append(msg.get('date'))
  from_list.append(msg.get('from'))
# The remaining steps run once, after the loop has collected every message
date_list = pd.to_datetime(date_list)
print len(message_text)
print len(from_list)
df = pd.DataFrame(data={'Date':date_list,'Sender':from_list,'Message':message_text})
print df.head()
df.to_csv('inbox_email.csv',index=False)

First, a mailbox is selected, in this case, Inbox. Then a search is performed using the uid() method. When the search option is specified, this method returns a list of unique identifiers that match the search criteria. In this case, we are requesting all mail in the mailbox. The uid() method is then used again, this time with the fetch option, to get the email data. Next, the script iterates through each of the emails in the inbox. Each message is converted into an email object from the email library, which makes it easier to retrieve specific elements of each email. In this case, we are grabbing the sender, the date the email was received, and the message text, and appending each to a different list. After the loop, we convert the dates into a pandas datetime object and store the lists in the df DataFrame as three variables: Date, Sender, and Message. This DataFrame is then written to a CSV file so it can be shared or analyzed later.
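
The parsing step can be tried without a mail server. The sketch below, written for Python 3, builds a made-up multipart message with the standard email package, parses it back from text, and pulls out the same three pieces as Listing 11. The helper name first_text and the sample addresses are assumptions for illustration only.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def first_text(msg):
    """Return the decoded payload, using the first part of a multipart message."""
    if msg.is_multipart():
        return msg.get_payload()[0].get_payload(decode=True)
    return msg.get_payload(decode=True)


# Build a made-up multipart message so no mail server is needed.
outgoing = MIMEMultipart("alternative")
outgoing["From"] = "Nora <nora@example.com>"
outgoing["Date"] = "Wed, 01 Mar 2017 06:30:00 +0000"
outgoing.attach(MIMEText("Plain text body", "plain"))
outgoing.attach(MIMEText("<p>HTML body</p>", "html"))

# Parse the raw text back into a message object, as the loop in Listing 11 does.
msg = email.message_from_string(outgoing.as_string())
sender = msg.get("from")
date = msg.get("date")
body = first_text(msg)   # bytes, because decode=True was requested

print(sender, date, body)
```

Header lookups such as msg.get("from") are case-insensitive, which is why the lowercase names in Listing 11 work.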

Analyze and visualize data

As with retrieving data, the ways you can go about analyzing and visualizing data are pretty much endless. We are only going to scratch the surface of the ways email data can be analyzed. As always, the code begins (Listing 12) with importing the necessary libraries.

Listing 12. Importing required libraries

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from scipy import stats
from scipy.stats import kde
email_data = pd.read_csv('inbox_email.csv')
leng_list = []
for mssg in email_data['Message']:
 if isinstance(mssg, str):
   leng_list.append(len(mssg))
 else:
   leng_list.append(0)
email_data['MsgLen'] = leng_list
FMT = '%H:%M:%S'
email_data['Time'] = email_data['Date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime(FMT))
email_data['SinceMid'] = email_data['Time'].apply(lambda x: (datetime.strptime(x, FMT) - datetime.strptime("00:00:00", FMT)).seconds) / 60 / 60

On line 8 of Listing 12, the pandas method read_csv is used to read the CSV created in the previous step. This analysis will cover the length of inbox email messages, in characters, and the times of day they usually arrive. So on lines 9 through 15 we create a new variable in our DataFrame that represents the character length of each message. The Date variable also needs to be reduced to only its time, and that time then needs to be transformed into a numerical representation of the amount of time passed since midnight. Lines 17 and 18 perform this task, first extracting the time from the Date variable (line 17) and then converting that into hours past midnight (line 18).
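
The conversion to hours since midnight can also be written as a small standalone function, which makes the arithmetic easier to check. This is a sketch for Python 3 using only the standard library; the function name and sample timestamps are made up, and the timestamp format is assumed to match the one used in Listing 12.

```python
from datetime import datetime


def hours_since_midnight(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp string into fractional hours past midnight."""
    t = datetime.strptime(timestamp, fmt)
    return t.hour + t.minute / 60 + t.second / 3600


# Made-up timestamps in the format Listing 12 expects.
samples = ["2017-03-01 06:30:00", "2017-03-01 18:45:00"]
hours = [hours_since_midnight(s) for s in samples]
print(hours)
```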

Listing 13. Some statistics

print len(email_data['Sender'])
print email_data['Sender'].nunique()
print email_data['MsgLen'].max()
print email_data['MsgLen'].mean()
>>>
121
47
36378
7643.58677686

The code block above shows some basic statistics that are printed by lines 19 through 22. The total number of messages in the dataset is 121, and there are 47 unique senders, so some senders appear more than once. The maximum message length is 36,378 characters. In English, the average word length is 5.1 characters; if you add one character to account for spaces, this email was approximately 5,964 words in length, which is pretty long. The average email length was about 7,644 characters, which is approximately 1,253 words; this is closer to what you might expect.
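
The word-count estimates follow from dividing each character count by 6.1 (5.1 characters per word plus one space). A quick check in Python 3, using the figures from Listing 13:

```python
CHARS_PER_WORD = 5.1 + 1   # average word length plus one space

longest_chars = 36378
mean_chars = 7643.58677686

longest_words = longest_chars / CHARS_PER_WORD
mean_words = mean_chars / CHARS_PER_WORD

print(round(longest_words), round(mean_words))
```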

Listing 14. Creating the histogram

plt.hist(email_data['SinceMid'], weights=np.zeros_like(email_data['SinceMid']) +
1. / email_data['SinceMid'].size)
plt.xlabel('Hours Since Midnight')
plt.ylabel('Proportion of Total')
plt.title('Distribution of Emails Received Throughout The Day')
plt.axis([0,24,0,0.25])
plt.savefig('HSM.png', bbox_inches='tight')
plt.show()

The code in Listing 14 creates the histogram shown in Figure 2. The histogram displays the proportion of the total emails that are received at different times throughout the day. The emails cluster around noon and around what you might approximate as dinner time. This makes sense because these are times people are more likely to be checking their email.
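
The weights argument in Listing 14 is what turns raw counts into proportions: each message contributes 1/N to its bin, so the bar heights sum to 1. The sketch below shows the same idea with np.histogram, which bins data in the same way, so no plot window is needed. The arrival times are made up, and the code is written for Python 3.

```python
import numpy as np

# Made-up arrival times, in hours since midnight.
since_mid = np.array([0.5, 8.0, 11.5, 12.0, 12.5, 13.0, 18.0, 18.5, 19.0, 23.0])

# Same weighting trick as Listing 14: every observation counts 1/N.
weights = np.zeros_like(since_mid) + 1.0 / since_mid.size
proportions, edges = np.histogram(since_mid, bins=6, range=(0, 24), weights=weights)

print(proportions)        # share of messages in each 4-hour bin
print(proportions.sum())  # the shares add up to 1 (up to rounding)
```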

Figure 2. The histogram

Listing 15. Creating another histogram

plt.hist(email_data['MsgLen'], weights=np.zeros_like(email_data['MsgLen']) +
1. / email_data['MsgLen'].size)
plt.xlabel('Message Length (characters)')
plt.ylabel('Proportion of Total')
plt.title('Distribution of Message Lengths')
plt.savefig('MLen.png', bbox_inches='tight')
plt.show()

This code creates a similar histogram, shown in Figure 3, and this plot shows the distribution of message lengths.

Figure 3. Another histogram of email results

It makes sense that more than half of the messages are shorter than 5,000 characters, because such emails are likely less than 1,000 words long, which is consistent with my experience with spam email. Each variable is interesting by itself, but the next plot (Listing 16) relates the two variables to one another.

Listing 16. A new email plot

nbins=400
k = kde.gaussian_kde([email_data['MsgLen'],email_data['SinceMid']])
xi, yi = np.mgrid[email_data['MsgLen'].min():email_data['MsgLen'].max():nbins*1j, email_data['SinceMid'].min():email_data['SinceMid'].max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.colorbar()
plt.xlabel('Message Length (characters)')
plt.ylabel('Hours Since Midnight')
plt.title('2D Kernel Density Estimate')
plt.axis([email_data['MsgLen'].min(),email_data['MsgLen'].max(),email_data['SinceMid'].min(),email_data['SinceMid'].max()])
plt.tight_layout()
plt.savefig('KDplot.png', bbox_inches='tight')
plt.show()

The 2-dimensional kernel density plot in Figure 4 shows that there is a large portion of the messages that are received later in the day that are shorter and another fairly large portion that are received earlier in the day but are longer.

Figure 4. Our kernel density plot results

That plot suggests a negative correlation between the message length and hours since midnight. This can be tested using a linear model, which is performed in Listing 17.

Listing 17. A linear model example

slope, intercept, r_value, p_value, std_err = stats.linregress(email_data["SinceMid"],
email_data["MsgLen"])
print "slope:", slope
print "standard error:", std_err
print "P Value:", p_value
print "r-squared", r_value**2
>>>
slope: -705.380087862
standard error: 135.681074728
P Value: 8.4360195986e-07
r-squared 0.185085683793

A linear model estimates the line of best fit describing a relationship between two or more variables. A linear model essentially averages the relationship of the variables so that how the variables are related can be more easily expressed.

We will now estimate a linear model of the form: y = mx + b

or more specifically: Message Length = β0 + β1 * Hours Since Midnight.

The slope of the line, β1, is a statistic we can use to summarize the relationship between the two variables. Listing 17 shows the estimation of the linear model. The slope is approximately -705, suggesting that for each additional hour (as it gets later in the day) the length of the messages decreases by about 705 characters. This is the result that was expected, based on our kernel density plot. The result is statistically significant based on its very small p-value. For those who don’t remember from statistics class, a p-value is a measure of the probability that an outcome at least as extreme as the one observed would occur if it were due only to random chance. In this case, the p-value means that if the slope of the true relationship were equal to zero (no relationship between time of day and email length), the probability of estimating a slope this far from zero would be less than 0.000001. Such a small p-value is strong evidence of a correlation between message length and hours past midnight.

There is an obvious relationship, at least mathematically, between the time of day and length of a spam email message. However, the r-squared value is only about 0.19. In a linear model, the r-squared value is a measure of the proportion of the variation in the dependent variable that is explained by the linear model. So, this means that although there appears to be a bona fide relationship between the variables, the model only describes 19% of the variation in the message length variable. This means there is a whole bunch of what determines message length that this model doesn’t at all address.
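
To see where the slope and r-squared values come from, the ordinary least squares formulas for a single predictor can be worked by hand. The sketch below uses made-up data and plain Python 3 rather than stats.linregress, so the numbers differ from Listing 17; it is meant only to show the arithmetic behind the two statistics discussed above.

```python
# Made-up data: hours since midnight and message lengths in characters.
hours = [1, 5, 9, 13, 17, 21]
lengths = [9000, 8000, 5000, 6000, 2000, 3000]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(lengths) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, lengths))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in lengths)

slope = s_xy / s_xx                       # change in length per extra hour
intercept = mean_y - slope * mean_x       # predicted length at midnight
r_squared = s_xy ** 2 / (s_xx * s_yy)     # share of variation explained

print(slope, intercept, r_squared)
```

A negative slope together with an r-squared well below 1 is the same pattern reported for the real data: the trend is real, but much of the variation is left unexplained.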

Summary

Python is a great beginner’s language that can also be applied to complex problems. This is part of what makes Python a great tool for people to learn, whether they are new to coding or not, because you get a whole lot of bang for your buck. This article went over some of the basic Python syntax, and some basic syntax from the statistical Python library pandas. Retrieving emails to practice data analysis on was also explored. Emails were retrieved, and some basic data analysis and visualization techniques were demonstrated. I hope you found this informative!