Tags
In this series of posts, we teach programming concepts necessary for data analysis from the perspective of spreadsheets using pez, Zato Novo’s data analysis language. If you know Excel, then you already have the foundation to start coding!
Data structures form the backbone of any programming language (and software system), and for computer science students it can send a shiver down their spine. But data structures don’t have to be intimidating. By the end of this post, you’ll be able to work with them confidently and efficiently.
So what is a data structure? Simply put they are containers that hold data. A spreadsheet is actually a massive data structure that represents data as a grid. Spreadsheets are good for displaying all the gory details of a (tabular) dataset but are cumbersome when moving data around or creating custom functions to modify data. Programming languages, on the other hand, provide compact notation for working with data structures but it can be cumbersome to see all of the data.
Most programming languages come with “batteries included”, meaning once it’s installed you have everything you need to immediately play with it. What’s implied is that all sorts of data structures are provided out of the box, which is great for variety but difficult to pick up and remember. Pez likes to err on the side of simplicity, so there are two primary data structures: lists and data frames. We’ll explore both of these structures using an example of creating financial projections for a startup.
Forecasting MRR
To make the lessons concrete, we’ll use a business forecasting example. In a previous article I showed how to use Panoptez to calculate the MRR of Slack using a basic set of assumptions. For this article, we’ll forecast the MRR of my startup, Zato Novo, based on an even simpler set of assumptions. As with the previous article, we establish a baseline approach using a Google Sheets document. This spreadsheet has a handful of columns, starting with the forecast date, followed by a projected number of paying customers. For pedagogical purposes, I’m assuming a fixed subscriber growth rate of 5% per month, which annualizes to 80%. Then I take that user number and multiply it by the base monthly price of $25/user to get a monthly recurring revenue number. To keep things simple, I’m ignoring tiers, annual prepay, and churn. This spreadsheet will be examined throughout the article as we walk through various concepts.
Working with lists
Okay, now let’s see how to construct the same thing in pez. Lists are an ordered collection of items and can contain any type of data. In a spreadsheet, a range of cells is analogous to a list. When we say an ordered list, items in the list are guaranteed to be in the same order as you entered them. This is like a spreadsheet where the value in A4
always follows the one in A3
. In our revenue forecast example, each column is a list. It’s fine to treat each row as a list as well, although later we’ll see why it’s more convenient to think of lists as columns.
Let’s look at the first column that contains dates. In a spreadsheet we create this column by starting with an initial date. Next we define a formula that adds one month to create the next date (using [EDATE](http://www.aubrett.com/InformationTechnology/Productivity/GoogleDateFunctions.aspx) in Google Sheets). We then copy and paste this formula for each successive cell to create the whole range. Our final date range lives in the cells A2:A25
.
Notice that for each successive date, we are adding one month to the previous date. Hence, the second date adds 1 to the initial date, while the fourth date adds 3, and so on. In pez, we take advantage of this observation to create the dates more compactly. First, we create the initial date, which is simply the literal text 2016-01-01
. If you enter dates with this specific format, pez knows that it’s a date, just like in a spreadsheet. (The same is true of timestamps.)
Now let’s create an integer range that represents how many months the initial date needs to be added to create the complete date range. For this we use the range operator, ..
. For example, 0..23
creates 24 integers, from 0 to 23. The final step is to create the dates, which simply requires adding this date to the list of numbers.
!pez 2016-01-01 + 0..23
See how much simpler this is than copy and pasting a formula into a number of cells? In the spreadsheet, there is one other detail, which is that the column has a header. In pez, we just assign this expression to a variable, which we’ll call `month`. Here is what it looks like in our Panoptez-enabled Slack.
Literal list creation
We saw how easy integer ranges can be created in the previous section. What if you want to create a list that is not an integer range? In this case, a literal list can be created using bracket notation: [x1, x2, x3, ..., xn]
. With this syntax, each element is specified explicitly within square brackets. Using the date range above, the first four elements can be created as [2016-01-01, 2016-02-01, 2016-03-01, 2016-04-01]
. This approach is perfectly legal, but for efficiency, it’s often easier to think about using an expression to generate the appropriate range for you.
Element selection
So what can we do with this list? In a spreadsheet we can pull specific elements from a range and reference them in a separate cell using its coordinates. For example, January of 2017 is located at A14
. This approach is convenient, but what happens if we move this column somewhere else? Let’s say we add one column to the left of A
. Most of the time the spreadsheet automatically updates the cell references to reflect its new location. However, that means if we need to reference it anew, we need to know where it is in the spreadsheet! For complicated spreadsheets it can start to feel like a perverse Where’s Waldo exercise. Wouldn’t be nice if we could always reference the range using the same locations? In pez, our date range is called month
, so any time we access month[13]
we get the first day of 2017. That means no more missing references!
The operation using the name of the variable followed by brackets, x[y]
, is called indexing or subsetting. The number inside the brackets is called the index. In pez, the first element starts at an index of 1, while the last element is at length(x)
. There are other ways to index a list, but for now we’ll stick to the basics.
Compounding growth
Let’s move on to the second column, which contains a hypothetical user growth rate. Starting with an initial value of 100 users (hey, you gotta start somewhere), we assume a monthly growth rate of 5%. So growth is compounding monthly, meaning that each month is 1.05 times greater than the prior month. To model this in a spreadsheet, we again turn to a formula. This time the formula multiplies 1.05 to the previous value instead of adding a value.
In pez, there are a few ways to tackle this. One approach is to use the cumprod
function, which takes a list of numbers and computes the cumulative product of all the numbers in the list from the first element to the current element. For example, cumprod 1..4
yields [1, 2, 6, 24]
, which is equivalent to [1, 1*2, 1*2*3, 1*2*3*4]
. For the growth rate, we create a repeated list of 1.05 and apply cumprod
to it.
Calling functions is similar to calling functions in a spreadsheet, where the name of the function is followed by its arguments wrapped in parentheses. Pez supports a simpler syntax as well, which will be discussed in a future post.
You may have noticed that there’s one problem with this approach. While the spreadsheet starts at 100, our pez list starts at 105. We need to modify the list to do this. However, an even simpler approach takes advantage of how compounding works. Since the compounding rate is constant, each compounding term raises the power of the compounding. Month one is just 1, while month two is 1.05, month three is , and so on. Using what we’ve already learned, we can raise 1.05 to the sequence
0..23
, which produces all the powers for us!
Calculating the MRR
The last column to create is the monthly recurring revenue. The current assumption is $25/user/month, so we multiply each value in C2:C25
by 25.
In pez, the range C2:C25
corresponds to the variable customers
, so we multiply that by 25 and assign its result to a new variable mrr
.
Again, notice how simple it is to describe this operation.
Creating the data frame
The final step is to bring all these variables together into a single table. Data frames are organized by column, which is why we claimed that it’s best to think of lists as columns. Each variable we defined is simply a column in the table.
!pez revenue_forecast = { date=month, customers=customers, mrr=mrr }
The output table is just like the spreadsheet. To make the table easier to work with, it’s actually better to assign our dates to the index of the table. This reduces the number of columns and sets the index to the dates. We use a special `@index` key at the end of the table definition to specify the index.
!pez revenue_forecast = { customers=customers, mrr=mrr, @index=month }
This looks pretty good. However, notice that we had to create a whole bunch of variables to create this table. This pollutes your workspace, which makes it harder to find useful stuff in the future. It’s better to use a let expression to define temporary variables instead.
!pez revenue_forecast = let month = 2016-01-01 + 0..23 customers = 100 * 1.05^0..23 mrr = customers * 25 in { customers=customers, mrr=mrr, @index=month }
Now only the variable you care about is created in your workspace. All the others are deleted once the let expression is evaluated.
As a final goodie, here is a plot of the MRR based on the data we created.
Conclusion
Data structures are an important part of programming. In this article, we took your existing knowledge of Excel and showed how cell ranges are lists and tables are data frames. You also got a taste of let expressions and vectorization, which are two powerful features of pez.
Panoptez is a collaborative data analysis and visualization platform accessible via chat systems, like Slack. Request an invite to the beta or contact us for preferred access.