Refactoring
Overview
Teaching: 10 min
Exercises: 10 minQuestions
When should I reorganize my code so it is more clear and readable for others?
How can I organize my code so that it is useable in other places?
Why do I almost always want to write my code as though it will be used somewhere else?
Objectives
Understand the value of refactoring code and use of functions.
Practice determining where code can be divided into smaller functions.
This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where are arguments are handled and where plots are generated?
It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much more clear. This process is called refactoring..
Create a Branch
Before we set to work, let’s make a new branch to work in.
$ git checkout -b refactor
Once this is complete, our program now looks something like this:
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
def parse_arguments():
"""
Parse the argument list and return a list
of filenames.
Input:
------
none
Returns:
--------
filenames: list of strings, list of files to plot
"""
# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
# why the program will not continue
print("Not enough arguments have been provided")
# how this can be corrected
print("Usage: python gdp_plots.py <filenames>")
print("Options:")
print("-a : plot all gdp data sets in current directory")
# check for -a flag in arguments
if "-a" in sys.argv:
filenames = glob.glob("*gdp*.csv")
else:
filenames = sys.argv[1:]
return filenames
def create_plot(filename):
"""
Creates a plot for the specified
data file.
Input:
------
filename: string, path to file to plot
Returns:
--------
none
"""
# read data into a pandas dataframe and transpose
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot the transposed data
ax = data.plot( title = filename )
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks( range(len(data.index)) )
ax.set_xticklabels( data.index, rotation = 45 )
# save the plot with a unique file name
save_name = filename.split('.')[0] + '.png'
plt.savefig(save_name)
def create_plots(filenames):
"""
Takes in a list of filenames to plot
and creates a plot for each file.
Input:
------
filenames: list of strings, list of files to plot
Returns:
--------
none
"""
for filename in filenames:
create_plot(filename)
def main():
"""
main function - does all the work
"""
# parse arguments
files_to_plot = parse_arguments()
#generate plots
create_plots(files_to_plot)
# call main
main()
The behavior of the program hasn’t changed, but it has been made more modular by separating the into different functions with their own purpose.
Why the extra function? Our function main has two main components to it
- parsing arguments handed to the program
- generating the desired plots
But we’ve defined three functions above the main
function: parse_arguments
, create_plot
, and create_plots
.
If the functions in our file reflected what main
directly we would only have the parse_arguments
and create_plots
functions. If we think about using these functions independently, however, a function which always takes in a list of
filenames isn’t very convenient to use on its own. By defining the create_plot
function, we have placed most of the
plot generation work there, while allowing for a very simple definition of the create_plots
function.
The importance of this design decision will be made clear in the next lesson.
Adding Docstrings
In an effort to create human-readable code. It is common to include a “docstring” or “Python Documentation String”, at the top of each function that clearly lay out three things: the purpose or objective of the function, a description of the inputs and their datatypes, and a description of what is returned and their data types. By including these things, debugging later on is much more efficient because now the developer knows the starting point (inputs), endpoints (returns), and what the intended change is (purpose).
Update the Repository
We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.
$ git add pandas_plots.py
$ git commit -m "Refactoring code."
Now that we’re satisfied with our refactor. We can merge this branch into our master branch.
$ git checkout master
$ git merge refactor
Branching and Refactoring
To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the
master
andrefactor
branches. Remember that the command for checking out a branch isgit checkout <branch_name>
.
Key Points
Refactoring makes code more modular, easier to read, and easier to understand.
Refactoring requires one to consdier future implications and generally enables others to use your code more easily.