A Toronto Data Guy

Goodbye Tumblr

For anyone following me on RSS, this is just a quick note that I’m moving my blogging away from Tumblr and over to Medium. You can find me at:

https://medium.com/@zachaysan

Tumblr has been fun, and I really respect the founders, but Yahoo is a terrible company. Plus Medium’s design is amazing.

Geolibertarianism: a political system that will create the supply for an unconditional basic income

(Partially in response to: http://simulacrum.cc/2013/07/10/three-trends-that-push-us-towards-an-unconditional-basic-income/ found on Hacker News)
 
I am certain that robots and AI will replace individual humans’ ability to create items of economic benefit, which on the surface appears to be at odds with my libertarian political beliefs, but I assure you, it is not.

Geolibertarianism - The ethical side (feel free to skip this if you are not a libertarian)

Most libertarians that have thought about the basis of morality enough (Non-Aggression Principle) realize that at a certain level any agreement is supported by violence, even if done for a seemingly just cause.

While anarcho-capitalism seemed appealing at first, resolving situations justly seemed impossible. For example, imagine a scenario where someone, who looks pretty similar to you, runs around a corner and jumps into the shadows while seconds later someone else rounds the corner and tackles you. He holds you there “temporarily” so that a group of people can determine if you murdered a shopkeeper. Morally speaking the situation is predicated on violence. You are well within your “rights” to resist arrest and he is within his “rights” to imprison the highly suspected murderer (you, at least in his eyes).

The only thing that keeps you from violently withstanding him is your own choice to apply the golden rule, but morally speaking under libertarianism you are not required to always hold the golden rule, since libertarianism is defined in terms of negative liberty, while the golden rule could lead to cases where positive rights could be demanded.

Most libertarians won’t resort to violent means, provided that people have the freedom to move, trade, and speak freely. But just because they don’t doesn’t mean they believe their rights are not currently being violated. When asked to consider the following case, though, they have trouble:

Imagine you are born to a poor but honest family that rents a house on a small island. Resort developers buy up the majority of the land on the island and serve you an eviction notice. You find yourself faced with a choice: the resort owners are willing to give you room and board and a small amount of money, or you can try your luck at swimming to the next island over where they MIGHT let you through their borders.

The problem is that while comparative advantage normally allows all parties to win on voluntary transactions, when the relative value of the land goes up the effect of the transaction has you displaced by others that are gaining in efficiency. This pushes you out to further and less habitable areas of the country or planet. Simultaneously, while your forebears had the option of immigrating to a country like Canada in exchange for a 500 acre lot that they kept if they farmed it, you do not. You are essentially imprisoned within your country since no other country will let you in.

My ethical argument: In order for the free market to work, transactions must be voluntary. Voluntary transactions are impossible without the option to live, work, and breathe somewhere if you do not agree to the terms of the transaction. Adam and Eve (or the Queen’s ancestors) did not EARN the land, they CLAIMED the land and passed it on to their children, who enjoyed it for free or maybe later sold it to others. Most other things were a product of individual men’s minds or hands, but land originated from no one.

(End of section to skip) 

Here is my practical argument: The only tax that a government can reasonably levy in a future of encrypted communications, transactions, and trade is one on land. btilly on Hacker News almost got there, advocating the taxation of CO2, but that would have totally warped the market, and would have needed to be applied worldwide in order to be effective. I don’t argue that a pure land tax won’t warp the market; I think it will, but only in the same way that it warps the market when the government throws people that commit fraud into jail.

There are many other nice things about a land tax + Basic Income. The first is that there is no more of the horrible loss of privacy that normally comes with income tax (how much did you make, what did you do, where did you do it, who are you married to, who are your children, etc), capital gains tax, sales tax, financial transaction tax, import tax, or corporate tax. That means no silly container searches along the rail lines, highways, and seaports. No more joint filing vs single filing vs alternative minimum tax vs capital gains as a citizen. Or worrying about cyclical ups and downs increasing effective tax due to the graduated tax curve. No more corporation validation for overseas subsidiaries, no more Double Dutch Irish sandwiches. Or worrying that the world’s largest furniture sales company is actually a charity, exempt from most taxes even though their true donations are a rounding error.

How are you going to tax the companies of the future anyway? Say I create some AI that can design really awesome 3D effects in movies. Why should I keep my company in your country when I could just as easily move it to a tax haven in a matter of seconds over SSH? This isn’t far-fetched, this is coming.

The details on how to levy a land tax are broad, but they essentially boil down to the following:

1. Land is valued based on a geometric average of relative density, with maybe a couple other rules that municipalities compete with to implement locally.

2. Mining rights are valued/decided upon separately.

3. Each plot of land has a UUID.

4. The government has a publicly known wallet.

5. Any person wanting to stop the government from reclaiming some property (i.e., sending people with guns to move other people off of land; as I said above, this whole thing really is built upon violence) can send any amount of money to the wallet to top up the time, although the price is only guaranteed for a limited amount of time forward (say, 5 years).
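To make the bookkeeping in the list above concrete, here is a minimal sketch of a plot record. Everything named here (the `Plot` class, the `annual_tax` and `paid_until` fields, the `top_up` method) is hypothetical and only illustrates the idea of "a UUID per plot, and payments that buy time". It ignores valuation, mining rights, the wallet itself, and the limited price guarantee.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
import uuid


@dataclass
class Plot:
    """Hypothetical record for one plot of land (names are illustrative only)."""
    annual_tax: float          # assumed yearly land tax for this plot
    paid_until: date           # the government will not reclaim before this date
    plot_id: uuid.UUID = field(default_factory=uuid.uuid4)  # "each plot has a UUID"

    def top_up(self, amount: float) -> date:
        """Convert a payment of any size into extra days of paid-up time."""
        extra_days = int(amount / self.annual_tax * 365)
        self.paid_until += timedelta(days=extra_days)
        return self.paid_until

    def is_reclaimable(self, today: date) -> bool:
        """True once the paid-up time has run out."""
        return today > self.paid_until


# Anyone can pay for any plot, e.g. to keep a park open.
park = Plot(annual_tax=1000.0, paid_until=date(2020, 1, 1))
park.top_up(2000.0)  # two years' worth of tax
print(park.plot_id, park.paid_until)
```

Because the payer is never identified, the same record works whether the money comes from the occupant, a neighbor, or a historical society.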

This allows people to fund parks, or to keep open historical buildings whose tenants could not afford to remain in without subsidy, but it generally keeps money naturally flowing to the government anonymously, painlessly, and fairly.

Basic income distribution works similarly.

1. Entirely opt-in. If you don’t want the money, no worries.

2a. Either track people as is currently done (SIN/SSN) or

2b. Rely on a trust net with provably random checks of citizenship. Upon finding a scam, adjust the trust net.

3. Regulate the prices of the land tax through subregionalism (ie, more people move to places where there is a fairer balance of BI to land tax).

This way people can effectively opt-out of the AI/robotics economy and use their BI to rent a pre-modern plot of land where they can darn socks and plant turnips. It brings choice back to humanity. It also protects the economies of those that will almost certainly not be part of the AI Age (Zimbabwe, for example) from being permanently hampered (since voluntary trade and the catchup effect will both still be effective drivers of growth in the third world).

This is only a near term solution; we’ll ultimately need to colonize Venus or upload our consciousnesses to computers (Hi descendants :D wish I was there in the future!), but in the short term, without comparative advantage the poor cannot feed or shelter themselves; they are being forced into a fate worse than serfdom: undeserved, yet self-acknowledged failure.

Edit: Also, this is my last post on this blog. I will be launching a new one soon. If you would like me to email you when it is live, feel free to send me an email.

LinkedIn must have an awesome data team

When I first saw a message from LinkedIn titled Add skills like “Ruby” to make your profile easier to find in my inbox I let out a little chuckle. Cute. LinkedIn crawled my Github or maybe the text content of my LinkedIn page and wants me to make sure they got it right by adding a skill into their formal system.

I click on the email to see this.

They got every single one spot on. I could come up with simple possibilities for any of those except for Python.

I started coding full time in Python about 3 months ago and haven’t had time to open source anything. I haven’t tweeted about it. I haven’t posted about it on Hacker News. Nothing. All I’ve done is Google for “Python Object Inheritance” or “Networkx MultiDiGraph Methods” (best library ever, btw).

So here are some guesses as to how I think they did it:

  1. They got lucky. Though Ruby and Python have completely different core models* they both feel pretty similar to one another.
  2. They relied on a (possibly supervised) LDA-like model that essentially said “look, this guy is in startups, he is a data guy, and he knows Ruby, he really should know Python at this point”
  3. They watched who started following me recently on Github and noticed that there was a bump in people that were proficient in Python. Similarly for Twitter.
  4. They bought a portion of my search data from one of those pixel tracker sites that power search re-marketing.

The thing that will be very interesting to see is how they use the other aspects of what they know about me for their clients (recruiters, and possibly clandestine intelligence organizations). This would be a gold mine for them. They could, for example, say they know that I’m trained as a structural engineer, a data scientist, and a developer. An organization looking to develop software that simulates wind over 50 years on a free standing structure to develop more detailed failure scenarios and risk profiles would be desperate for me. The time it would take them to find a guy like me would be immense, so the value of LinkedIn is in closing information gaps, but unlike Google’s search, they do it in an area that the market is willing to pay for up front.

If this post happens along the desks of anyone working at LinkedIn’s data dept. please feel free to email me if you want to have a chat about how you guys did it. I trust you know where to find me.

*(Ruby is an Object Oriented language with functional aspects for pleasure, while Python is a functional programming language that bolted on an Object Oriented paradigm)

On Economics

I read this and was inspired to write this in semi-response.

One thing I have found is that economics has a heartbeat by the decade and that the number of factors is very high.

For example: I used to take Ireland as an example of a European country that deregulated and subsequently outperformed nearly everyone. I would show graphs, etc. to everyone I discussed politics with.

Then of course the crash happened. Things are not *simple*; they are very, very complex. Worse, growth happens over decades, and during that time population demographics change, technology changes, and relative resource pricing changes.

Heck, something that keeps my brain fairly active is the importance of intelligence distribution. What if countries with a lower average IQ but a higher average top 1% are better able to organize? Where does this work and where would this be a disadvantage?

What about economy diversification? The Canadian dollar has a roughly 0.1 correlation with the price of oil due to Alberta and NFLD, but we also have manufacturing, tech, media, and finance. Does this stabilize our dollar and allow further investment into any one of the hot sectors or does it needlessly put the out of favor group up against severe currency pressure?

What about subgroup ideology shifts? Does Obama being president enfranchise blacks, the highest crime/welfare subgroup? If he does and there is a 10% positive shift in “outlook” and a subsequent drop in crime and welfare burden 10 years from now, how well can that be discovered?

Ultimately I’m a Libertarian because of ethics, not economic expediency. But people want answers and often times “it is really complicated” isn’t the appropriate answer. There are so many inter-dependencies and factors that simple statements like “Sweden does it and it is amazing there!” are worse than useless. They allow people to cheat intellectually, as I once did with Ireland.

I can think of nothing more complicated and intractable than economics. If we do live in a simulation, I suspect our simulators are split testing economic policy.

Statistical Immortality

Presume that there is no God.

Presume that there are either(see update2);

Presume that the human mind can be uploaded to a computer either through;

Presume that there are sufficient bits and ops in the universe to simulate a human mind.*

From the point of view of a simulated consciousness, pausing the hosting computer for a time then resuming it will go completely unnoticed besides the change of state of what is external to the consciousness. For example, if a clock was placed in front of a webcam of a computer that held a simulated human consciousness and the computer was paused for one hour and then resumed the human consciousness would report only that the hour had incremented on the clock nearly instantly, not that he saw black for an hour. This concept will be called non-conscious-time-irrelevance.


From the point of view of a simulated consciousness, pausing the hosting computer then perfectly copying its state to a new hosting computer, destroying the original hosting computer and resuming the process on the new hosting computer will go completely unnoticed by the simulated entity. This concept will be called simulator-irrelevance.
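A toy illustration of simulator-irrelevance, under the big assumption that a "mind" is nothing more than deterministic state plus an update rule. The `SimulatedMind` class and its arithmetic are made up for the sketch; the only point is that a run that is paused, copied to a new host, and resumed produces the same internal history as a run that was never interrupted.

```python
import copy


class SimulatedMind:
    """Made-up stand-in for a mind: deterministic state plus an update rule."""

    def __init__(self):
        self.state = 0

    def step(self, observation: int) -> int:
        # Arbitrary deterministic update; any pure function of
        # (state, observation) would make the same point.
        self.state = (self.state * 31 + observation) % 1_000_003
        return self.state


observations = [3, 1, 4, 1, 5, 9]

# Control run: one host, never interrupted.
control = SimulatedMind()
control_trace = [control.step(o) for o in observations]

# Interrupted run: pause halfway, copy the state to a new host,
# destroy the original host, then resume on the copy.
original = SimulatedMind()
trace = [original.step(o) for o in observations[:3]]
replacement = copy.deepcopy(original)
del original
trace += [replacement.step(o) for o in observations[3:]]

# From the inside, the two histories are indistinguishable.
print(trace == control_trace)
```

The pause itself never shows up in the trace either, which is the non-conscious-time-irrelevance idea from the previous paragraph.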


A Turing Machine in a finite universe can only contain a finite set of logical expressions. Since a human mind can be simulated by a computer it follows that there is a finite set of logical expressions that comprise a human mind-state.


Since there are an infinite number of universes, an infinite number of dimensions, or an infinite number of non-identical universe “cycles”, there are an infinite number of particles and particle arrangements. 


Since there is no God, an infinite number of particle arrangements will not be artificially hindered from creating every possible organization and combination of these particles.


Since the simulated human mind-state is finite and every necessary particle combination that can happen will happen, the mind will continue to exist at some point in the past or future, and given non-conscious-time-irrelevance and simulator-irrelevance, it follows that from the point of view of the consciousness death is unattainable, even for a non-computer-simulated human consciousness.

* The argument can be made that, since human consciousness exists in this universe, there are enough bits for an appropriately constructed Turing Machine.

Updates: 

An early reader has responded that:

I think I see what you’re saying, but I think for me there’s still an unresolved conflict between what your proof would imply and what we experience. Basically, if what you say is true, switching Turing Machines wouldn’t happen just at death, but at all moments along your consciousness. Also, I don’t see a particular reason why you’d switch into a reality that’s exactly the same as the one you switched from, just from probability you should switch into a reality that’s different. Shouldn’t you notice a lot more external changes during switches?

Which is an excellent point, but in no way refutes the proof, which does not require that current observers need to have the expected path. I responded weakly with the following:

I guess it would come down to a very hard statistical problem. What is more likely, a reality based body or random bits in some extra-planar computer happening to come into alignment that would form “you”? On the one hand, when you die even if you have to wait 10^51 years eventually you will come into existence again, even if only for a second.

But explaining our experience is not necessary because our experience does not refute the proof.

I personally reject the proof because I don’t believe all the premises, but if the premises are revealed to be true, then I would accept that death is unattainable. Which is interesting because the lack of an omniscient, omnipresent God is a fundamental requirement of the proof, unless that God truthfully promised mental immortality or continually random universes.

If the premises are true I would suspect the reason one doesn’t seem to pop in and out of multiple realities would probably have to do with the computer simulation argument. Which would drastically increase the ratio of bits and ops that are organized towards intelligence in the universe as well as providing a higher ratio of predictable continued realities.

Update2: based on feedback I realized I needed to clarify what is meant by “infinite universes”. What I mean is universes or dimensions that are much like ours (same physical constants, particle sizes, etc), but where the initial conditions were slightly different resulting in different distribution of stars, planets, etc.

Update3: It turns out I’m not the first to have this sort of theory. People have brought up lazy immortality as something that makes the same argument from a different angle. Also, Permutation City was brought up as a book that encapsulated some of the ideas I presented.

Lots of great feedback from readers on this one. Feel free to reach out to me on Twitter or Gmail if you have anything else that you think would be interesting to add.

tech.is_in_bubble?(actual_data) # => false

Yes. Given my limited view, Color is probably overvalued. Yes, Groupon better roll out something new, and fast, or it will never give any decent ROI for its investors.

But enough with the anecdotal evidence. For each Color or Groupon there is a Google (also had high valuations) or a Dropbox (quite possibly the most awesome company on the planet).

The real question is whether a bubble is happening.

"An economic bubble is trade in high volumes at prices that are considerably at variance with intrinsic values"

source: Wikipedia

Right now interest rates available in Canada are south of 1% and from what I can glean online current “risk-free” interest rates in the United States are even less.

At less than 1% it would take over 70 years to double an investment. (I put aside inflation because A. it will only further my argument and B. I have a controversial view on inflation that I don’t want to get into.)
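The doubling-time claim is easy to check. A quick sketch, assuming a fixed rate compounded once a year and ignoring inflation as above (the function name is mine):

```python
import math


def years_to_double(annual_rate: float) -> float:
    """Years for money to double at a fixed rate compounded annually."""
    return math.log(2) / math.log(1 + annual_rate)


for rate in (0.01, 0.009, 0.005):
    print(f"{rate:.1%}: {years_to_double(rate):.1f} years")
```

At exactly 1% this comes out to roughly 69.7 years, so anything south of 1% pushes the doubling time past 70.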

Let’s look at the expected returns from startups. I was going to write a CrunchBase API client, but luckily a site already does this for me.

So first thing to notice: Total amount invested is barely changing, even despite plummeting interest rates. Second, and more importantly, total acquisition amount: $450 Billion. This isn’t including Facebook, or any other cool startups that haven’t been acquired, it doesn’t include companies that went IPO, it doesn’t include companies that have always stayed private, unfunded and quietly paying out dividends to their founders. 

But let’s work with $450 Billion just because we’re trying to really make the case.

60,000 companies on CrunchBase with a total of $450 Billion in acquisition spend. 

Imagine it was like buying a lottery ticket. If you invested at a valuation of $7.5mm per company right from the beginning (seed stage or Series A) you would expect to break even (well, actually you would probably have a 1X liquidation preference, so in reality it would be higher than that but let’s ignore that).
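The $7.5mm figure is just the total acquisition spend divided evenly across every company, as if each one were an identical lottery ticket. A sketch of that arithmetic, using only the two numbers quoted above (the function name is mine):

```python
TOTAL_ACQUISITION_USD = 450e9  # total acquisition spend quoted above
COMPANY_COUNT = 60_000         # companies on CrunchBase quoted above


def break_even_valuation(total_exits: float, companies: int) -> float:
    """Average exit value per company if every company is an equal ticket."""
    return total_exits / companies


average_exit = break_even_valuation(TOTAL_ACQUISITION_USD, COMPANY_COUNT)
print(f"${average_exit / 1e6:.1f}mm per company")
```

This treats every company as equally likely to be acquired and ignores liquidation preferences, dilution, and the time value of money, so it is a sanity check rather than a valuation model.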

Complaining about “sky high valuations” is crazy talk. Most of the companies coming out of the best incubator in the world are getting about $5mm, well below the $7.5mm figure, and that doesn’t even account for the fact that they are objectively better performers than a random sampling of companies.

The truth is this: Bubbles don’t exist without my aunt’s mutual fund getting involved or my next door neighbor getting told to mortgage his house to invest by his financial advisor. Web 1.0 was all about IPOing on the NASDAQ and fleecing the public with business models that disregarded profit. This time around things are different. At every stage of the process you see startups with business models. Guestlist, Github, FreshBooks, heck even Groupon, are making nontrivial money relative to the valuations and expected future growth.

People are getting online, and living online. Just as it wasn’t a bubble when we entered the automotive era and huge car companies were popping up everywhere, it isn’t a bubble when whole groups of people are spending more time online than watching TV or book reading or newspaper reading. 

Except pure* social.

Stay the %&#* away from social. Go find an online circuit board diagram company to invest in or something.

* to differentiate from something like Github which calls itself “social coding”.

Weave Silk, scripting, and vanity website headers

Weave Silk is amazing.

But I wanted more. I wanted to write my name in Weave Silk for a personal website I’m putting together. After trying to write it out manually and not being happy with the results (colors were never what I wanted them to be, my hand couldn’t stay steady for long enough, I would mistime the wind) I decided to look into the source code, even though I barely know how to write a “hello world” program in JavaScript. Obviously that failed nearly instantly. I’m sure I could go through it and do what I want to do, but that might take days or longer.

Enter: The Lowly AutoHotkey Script. (see below for a video of the script in action)

I use these as a last resort (usually in Excel) when deadlines are fast approaching. Compared with Ruby, or even VBA, AHK scripts are a source of constant surprise.

I’ve always needed to riddle my code with “sleep 10” just to get the most basic key presses to work reliably. The script tries to execute so fast that either my OS or the application I’m using doesn’t register the scripted key strokes.

Also, in certain cases, there is just no way of using a variable where you want to. Rather than setting variables as being accessible by prefixing them with an “@” or “@@” as in Ruby, you need to invoke the keyword “global” within the function you want to call the otherwise inaccessible variable. But in predefined functions it raises an error when you call “global” so you are out of luck. I’m sure that somebody somewhere knows how to do it, but I’ve googled around enough to give up trying to solve that problem.

Anyways, here is the video of the script at work!

http://www.youtube.com/watch?v=bUr_rZMY61A

If you’re struggling with an AHK script and want to give me a shout feel free to email me: zachaysan@gmail.com

Short emails, I’m the last to find out

As my startup has been getting nearer to launch, I’ve made an effort to reach out to people that I’ve helped or connected with over the past year. I wrote individual emails to 95 people, recalling when we spoke last, what they are working on, how my startup is going, etc. These were heartfelt, non-spammy reach-outs.

My first 30 were discouraging. Under half of the people I emailed got back to me. So I tried something new: a hard cap of 500 words, but under 100 was what I aimed for. 

Since the switch every person has replied.

(This blog post is 97 words, not including the title or this sentence.)

Fight

I just did it. I wrote my first “Hello, World!” program in my own little programming language. A tiny crest of a hill on my way up the mountain.

The first thing I wanted to do after I came back from a smoke was to throw on a movie. It’s 10 pm on Boxing Day, and it isn’t like I’m on the clock at a day job. But I think I’ve finally learned that stopping after a minor success is something to fight against. So I’m moving forward, for two reasons:

  1. Stopping means less stuff gets done. (duh)
  2. Because I’m in the zone, stubbing out what to do next will be much easier now than later, which means picking up where I left off later will be easier which means I’ll get more stuff done.

I’ve found that pushing past the initial highs into the next phase, whether it is coding or otherwise, means you get more done and life is better.

I haven’t fully decided on the name of my programming language, but in case you are wondering, it’s going to be a highly concurrent, data-analysis-geared language with influences from Erlang, Ruby, and Anic. Not only will every line of code try to execute at once, the code will be fun to write, like Ruby!

Back to coding :D

Paul Graham is right (using AVC’s data)

I don’t like getting into the whole “someone is wrong on the internet” thing, but I find the recent discussion that Paul Graham kicked off fascinating:

[P]ractically all the returns are concentrated in a few big successes. The expected value of a startup is the percentage chance it’s Google. 

He then goes on to say

Some super-angels seem to care about valuations. Several turned down YC-funded startups after Demo Day because their valuations were too high. This was not a problem for the startups; by definition a high valuation means enough investors were willing to accept it. But it was mysterious to me that the super-angels would quibble about valuations. Did they not understand that the big returns come from a few big successes, and that it therefore mattered far more which startups you picked than how much you paid for them?

This has got to piss off some people that invest in startups for a living. Especially coming from a guy that typically gets 6-7% of a company for at or under $30k. I’ll be analysing Fred Wilson’s response below, but first you should read it in its entirety here. (protip: pressing the middle mouse button opens a new tab, I’m still shocked by how few people know this)

On the surface it looks pretty reasonable. He took a 2004 fund, so enough time should have gone by. The first thing that struck me was that he only had two companies go bankrupt during this time. That is outstanding. Fred is clearly an expert investor and his insights are amazing. Along with Gabriel Weinberg’s (he also had a quibble with the Paul Graham post), Fred’s blog is one of the very few I follow outside of what is submitted to Hacker News.

But here it looks like he is wrong. Not only that, he proved Paul Graham’s point with his own data.

The first thing to remember about investing is that you don’t care about 10x returns or 2x returns. I would take doubling my money in a day over growing it tenfold over a lifetime, as would any other sensible investor. It is the compounding returns that matter.

So the first thing we need to do with Fred’s graph is convert it into a spreadsheet.

Whoa, lots going on there, so let me break down what we have done. I manually counted up the number of companies from the original chart on Fred’s blog post. I’ve assigned a value of 0.5 to the bankrupt companies because I know you at least get tax breaks when you lose money and that in certain circumstances companies can get sold for their on-book losses. It might be too high, could only be worth 0.1, but it just helps my argument for it to be 0.5, so I’m going to stick with that. (At least I’m honest.)

Next what we do is convert the 25x and similar returns into their yearly compounded return rates and bucket them into the nearest decile of percentage. Which leads us to…

Maybe not truly bimodal, but not bad for the sample size we are working with. Sure, if I put the value of the bankrupt ones down at 0.1 value they go to -30% per year, but really, Fred isn’t making money on the people that are only worth 1x six years later anyways. So really, his returns are bimodal(ish). 
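For anyone who wants to redo the conversion, here is a sketch of the two steps: turn a return multiple into a yearly compounded rate, then bucket it to the nearest 10%. The six-year holding period is my reading of "six years later" above, and the function names are mine, not anything from Fred's post.

```python
YEARS_HELD = 6  # assumption: a 2004 fund looked at roughly six years later


def annualized_return(multiple: float, years: float) -> float:
    """Yearly compounded rate that turns 1x into `multiple` over `years`."""
    return multiple ** (1 / years) - 1


def nearest_decile(rate: float) -> float:
    """Bucket a rate to the nearest 10 percentage points."""
    return round(rate * 10) / 10


for multiple in (0.1, 0.5, 1, 2, 10, 25):
    rate = annualized_return(multiple, YEARS_HELD)
    print(f"{multiple:>4}x -> {rate:+.1%} per year, bucket {nearest_decile(rate):+.0%}")
```

A bankrupt company valued at 0.5x lands in the -10% bucket and at 0.1x in the -30% bucket, which is where the -30% figure above comes from, while a 1x company sits at exactly 0% per year.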

Really it makes sense that returns are bimodal, especially in software. The cost of the next incremental sale is nearly zero once your product becomes commonplace, so it is natural for a whole host of startups to fail early on (high upfront costs, like developer salaries) and for a few to get into growth stage (highly optimized sales cycles, enough volume for split tests, a recognizable brand, marketplace trust, cheaper capital, CPAs below NPVs, leveraged coder hours, etc) and beyond. 

With a few exceptions, in software you either make it, or you don’t.

I will make one point though: based on some back of the napkin calculations of Google’s Series A investment size, market cap in the year Google went public, and what is openly available of what the founders of Google continued to own (20% each), I’ve estimated Google’s annualized returns from the time they took Series A funding to the time they went IPO to be somewhere between 125% and 200% (that’s annualized(!)). Which is clearly not what Fred is making on his stars.

This discrepancy is just fine. Obviously Paul Graham doesn’t think that to be a successful Angel you need to get a company that pulls in triple digit yearly gains. His point is that you don’t let the really good ones get away because they are asking twice as much as you were expecting; the Series A venture fund that worked with Google didn’t. Returns are bimodal (or quasi-kinda bimodal). One interesting observation is that Google was notorious for the founders having a large equity stake so late in the funding game. Just one point of data, but maybe good founders know not to give investors an unreasonably large amount of equity.

An ugly, but insightful, oil spill infographic

People have trouble visualizing large numbers of things. After looking at a terrible infographic on CNBC, I decided to take the ugly-but-works approach. Most people have a feeling for how large the Twin Towers were, so I decided to map the amount of oil in numbers of World Trade Center towers. It came out less than I expected, just over 40% of a single World Trade Center tower’s volume.

This comes back to something I think about often: when it comes to infographics there is often a fight between useful and pretty.

For extra understanding, click here to see a link showing the building footprint in red on a map that you can zoom in and out on. Really shows you the scale of the earth to how much oil was spilled.

Cool math expression with the numbers e, 5, and 0.5

I have discovered a pretty cool mathematical expression: 

-5*e^(5arccos((((5^(0.5))*0.5)+0.5)*0.5)*((-5/5)^0.5))

OR

Algebraic representation of the "five" formula

What is neat about this is that:

How does it work?

I relied on a few tricks while constructing it. 

  1. e^(pi*i) = -1
    an infamous formula
  2. 5arccos((((5^(0.5))*0.5)+0.5)*0.5) = pi
    Alex Williams discovered this little gem
  3. Balancing out where the square roots should go, so ((-5/5)^0.5) versus ((-5)^0.5) / (5^0.5).
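The tricks above can be checked numerically. A small sketch using Python's standard `math` and `cmath` modules (the variable names are mine); because of floating point rounding the result carries a tiny imaginary residue, so it is compared against 5 with a tolerance rather than exactly.

```python
import cmath
import math

# Trick 2: the inner expression is (1 + sqrt(5)) / 4, which is cos(pi / 5),
# so five times its arccos gives pi.
five_arccos = 5 * math.acos(((5 ** 0.5) * 0.5 + 0.5) * 0.5)

# Trick 3: (-5/5)^0.5 is the square root of -1, i.e. the imaginary unit.
imaginary_unit = cmath.sqrt(-5 / 5)

# Trick 1: e^(pi * i) = -1, so the whole expression is -5 * -1.
value = -5 * cmath.exp(five_arccos * imaginary_unit)

print(five_arccos, math.pi)
print(value)
print(abs(value - 5) < 1e-9)
```

So the whole expression, built only from e, 5, and 0.5, evaluates to 5.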

I find sharing how math tricks are made does take away the mystery sometimes, but maybe in the future I’ll post one that other people solve.  

Intelligence Quotient Visualized

This graph shows how sensitive IQ is to small increases at the upper end of the range and why I dislike this measurement system. 

Edit: I purposefully didn’t count the dots for you. If you want to know how many there are, use Wolfram Alpha to back calculate them.

An increasingly daunting IQ graph that shows just how unlikely it is that someone has an IQ of 150
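If you would rather not count dots, here is a rough sketch of how quickly rarity grows at the upper end. It assumes IQ is normally distributed with a mean of 100 and a standard deviation of 15; the post does not state which scale the graph uses, so treat both constants (and the function name) as assumptions.

```python
import math

MEAN = 100  # assumed mean of the IQ scale
SD = 15     # assumed standard deviation (some tests use 16 or 24)


def fraction_at_or_above(iq: float) -> float:
    """Upper tail of a normal distribution, via the complementary error function."""
    z = (iq - MEAN) / SD
    return 0.5 * math.erfc(z / math.sqrt(2))


for iq in (100, 115, 130, 140, 150, 160):
    print(f"IQ {iq}+: about 1 in {1 / fraction_at_or_above(iq):,.0f}")
```

Each extra ten points near the top multiplies the rarity several times over, which is the sensitivity the graph is getting at.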

True, False, & NULL/None/nil/Blank logic in MySQL, Python, Ruby, and Excel

(NOTE: This blog post is extremely old and may give the reader the false impression that I have no idea what I’m doing. I’m keeping it up because I’m not a coward.)

EDIT: see below for an explanation on the “nil and False => nil while nil or False => False”

Being a Data Guy (in the corporate world, a “Business/Market Intelligence Analyst/Manager”) has its advantages. There is always a new question to answer, a new system to optimize, and a new split test to run. Management makes a ton of money through discoveries found in the data, and that means that bonuses are not far behind. But one dark side of being a Data Guy is the sheer number of tools you need to use (quickly!) and the inconsistencies across them. Even if some of them are as cool as Ruby.

Allow me to introduce the True, False, and NULL problem. On the surface this doesn’t seem that hard. How many inconsistencies could there be across just 3 values (or lack thereof in the case of NULL) represented in 4 languages?

As it turns out: Many.

The old pros out there are already screaming, “hey! NULL in MySQL has nothing to do with nil in Ruby! This discussion is meaningless! Don’t you know anything? Go read some books by W. Richard Stev…”

Well, they have a point. NULL in MySQL is not the same as nil in Ruby, but since this message has been pounded into me the hard way I was hoping to save others the headaches that have befallen me. Grey beards, go back to kernel hacking for now, we’ll grab beer later.

The “Or” Logical Operator

A table of "or" logical operators

Ok, so far so good, we can see that regardless of the tool used “NULL or TRUE” returns “TRUE.” And it should, given that no matter what is on the other side of an “or” operator we already have a “TRUE” value. The “NULL or False” column shows an inconsistency between MySQL and the rest of the bunch. Why does MySQL return “NULL” when the rest of the group give “FALSE”? Pretty easy answer there is that in MySQL “NULL” basically means “unknown”.

MySQL dolphin saying "oh shit"

If that still doesn’t answer the question in your mind, imagine this: You are a door-to-door surveyor, asking people about their favorite politician. You get to a house to ask the owner whether they are going to vote for the Libertarian Candidate or the Constitution Party Candidate (there are many surveyors in dreamworld, you see) but you can’t tell if the person is male or female, say because they are behind a screen door. When you get to your trusty MySQL database you leave the value of NULL in the “is_a_man?” column because you cannot tell one way or another. That is what the MySQL guys and girls had in mind when they decided “NULL or FALSE” returns “NULL”. They basically said the meaning of “NULL” is “unknown”. So when we say “NULL or FALSE” we really can’t tell one way or the other. This is especially important in numeric type fields where we can’t just print the string “couldn’t tell the height in inches due to darkness.”

What about the rest of the bunch, with their None, nil, etc.? Why do they return “FALSE”? Because to them the lack of a value indicates exclusion from the logical operator. In other words, imagine you were a computer program interpreter. Whenever you saw the word “nil” (or otherwise) next to an “or” operator you basically said “screw it, forget about that guy, he hasn’t called me in months”. That would describe Ruby, Python, and Excel. They only look to the remaining side of the “or” operator.

What about the last column of the table? Well, MySQL stays consistent with the whole NULL = Unknown value situation, basically saying: “here I’m giving you a NULL, but that actually means ‘unknown’ because either of these two values could be a true”. Totally cool and consistent.

Here we also see Ruby and Python try to tell both sides of the “or” operator to go #*&^ itself, but are left with nothing so they say “got nothing” in their own way, nil and None, respectively.
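For anyone who wants to poke at this, here is a rough Python sketch of the two behaviours side by side. The function sql_or is a made-up helper that imitates MySQL’s three-valued “or”, with None standing in for NULL. It only handles True, False, and None, which is a simplification of what MySQL really accepts.

```python
def sql_or(a, b):
    """Imitate a three-valued OR where None means 'unknown' (SQL NULL)."""
    if a is True or b is True:
        return True    # one known TRUE settles it
    if a is None or b is None:
        return None    # an unknown could still turn out to be TRUE
    return False       # both sides are known FALSE

pairs = [(None, True), (None, False), (None, None)]
for a, b in pairs:
    # Left: Python's built-in `or`. Right: the NULL-as-unknown version.
    print(a, "or", b, "->", (a or b), "vs", sql_or(a, b))
```

The only row where the two disagree is the middle one, which is the inconsistency described above.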

Excel is different. And as we will come to learn, when in doubt Excel will be different. It basically says “Hey! How’s it going? So I’m trying to hide this concept of ‘NULL’ and ‘nil’ from you because you guys are mostly MBAs that have a hard enough time dealing with a blank cell, but you are making it really hard for me. So I’m just going to throw up the error ‘#VALUE!’ and you can call over the programmer paid one fourth as much to figure it out for you.”

This is a recurring theme with Excel. Excel doesn’t really have a NULL or even nil field. Excel has a “Blank” field, which it will happily take as input, but will almost never serve as output, so it gets creative. (As a side note, entering a single “’” into an Excel cell gets you an “empty non-blank field”, useful for when the default behavior isn’t what you want. This comes in handy everywhere, from conditional formatting to logical comparisons to type safety.)

The “And” Logical Operator

The "And" logical operator chart

Here is where things start to get batty. Let’s start with the “NULL and TRUE” column. MySQL handles this situation as we would expect, given that in MySQLese “NULL” just means “unknown”. Ruby and Python can’t tell the nil/None boys to stuff it this time, because there is a pesky “and” operator, which foiled some plans for world domination, but that’s a story for another day. They basically say “hey, you may actually want to take a look at this sarge”. Personally, if I were a programming language interpreter I would either say “nil or false => false” OR “nil and true => nil” not both. I’d make it easier on the girls dancing with me. Then again, 80% of my keyboard time is spent with MySQL, so maybe I’m just used to blonde haired girls.

Things were just starting to get boring until Microsoft stepped in. Ladies and gentlemen, Microsoft assumes that if you aren’t FALSE you may as well be TRUE in an “and” operator. Blank, the number 3, “luapnor” - it doesn’t matter. Like republicans in negative land, everyone is their friend until you specifically tell them you hate their guts. Just for fun, try this in cell “B1” in a brand new Excel spreadsheet: =IF(A1,"lol_true","fffuuu"). It should come out to “fffuuu”. Great. Now replace it with this: =IF(AND(A1,TRUE),"lol_true","fffuuu"). Now what does it come to? “lol_true”.

So why is this? It comes back to MS Excel trying to hide complexity from MBAs. Given an “and” operator, Excel will ignore any non-boolean values to make life easier for people just messing around with spreadsheets (these often can have huge holes in them that MBAs want to ignore). But when there is no “and”/“or” present, the if statement NEEDS to try to use the blank cell, which it does in the =IF(A1,"lol_true","fffuuu") Excel cell.

To the next column!

MySQL is still acting as it should. It treats “NULL” as “unknown” and predictably says “no matter what ‘NULL’ was supposed to be, I have one FALSE, so I can safely say ‘FALSE’.”

Excel will predictably say “F that NULL, I’m just going to ignore it. This whole thing is False.”

Which brings us to Python and Ruby. I don’t have words for why “nil or false” gives me false in Ruby (and Python), but “nil and false” gives me nil. Even if Ruby told nil to get out of town it would still be left with FALSE. I’m going to try to keep my faith in Ruby by saying this fifteen times slowly: “Ruby was made by intelligent people, there’s a logical explanation… Ruby was made by intelligent people…”

The last column makes total sense. NULL and NULL should be NULL. You have nothing to work off of! No idea about anything. True, False? “Is the answer to this question the same as if I asked you if true and false were the same thing?” Of course Excel wants to say NULL, but it can’t because people like Mitt Romney might throw a fit if they see a term so unfamiliar as NULL. So it just throws an error, which makes sense when you look at all the other behavior Excel has exhibited over the past couple of paragraphs.
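The same comparison for “and” can be sketched in Python. As before, sql_and is a hypothetical helper that imitates the NULL-as-unknown rule using None, and it only covers True, False, and None.

```python
def sql_and(a, b):
    """Imitate a three-valued AND where None means 'unknown' (SQL NULL)."""
    if a is False or b is False:
        return False   # one known FALSE settles it
    if a is None or b is None:
        return None    # an unknown could still turn out to be FALSE
    return True        # both sides are known TRUE

pairs = [(None, True), (None, False), (None, None)]
for a, b in pairs:
    # Left: Python's built-in `and`. Right: the NULL-as-unknown version.
    print(a, "and", b, "->", (a and b), "vs", sql_and(a, b))
```

Python hands back None in all three rows, while the three-valued version can still commit to False when one side is a known False.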

The “!=” Logical Operator

Using the not equal to operator

You know the drill at this point. “TRUE != NULL”: makes sense. NULL for MySQL ‘cause NULL is basically unknown; the rest say “hey, you know what, True really isn’t equal to NULL, that MySQL guy is on smack.”

Next Column: False != NULL. MySQL says “NULL could be anything. It could even be a boat! So we can’t tell if FALSE != NULL.” Ruby and Python ask us what MySQL is smoking and say, without a doubt, that FALSE isn’t nil/None.

Then comes Excel. Apparently, to Excel, an empty cell is not not equal to false. Which means it is equal to false. Except when you put it in an “or” operator or an “and” operator; then it isn’t false. Otherwise those things wouldn’t have thrown an error, they would have returned false, or when used with an “and” operator they would have acted like false, not true.


Er…

At this point I don’t even care anymore. Onto the last column: NULL != NULL. Makes sense all round. Even in Excel.
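Here is the “!=” behaviour as a quick Python sketch. The helper sql_not_equal is made up for illustration and assumes the simple rule that any comparison involving NULL (None here) comes back unknown.

```python
def sql_not_equal(a, b):
    """Imitate SQL's != where comparing against None (NULL) is 'unknown'."""
    if a is None or b is None:
        return None
    return a != b

pairs = [(True, None), (False, None), (None, None)]
for a, b in pairs:
    # Left: Python's built-in `!=`. Right: the NULL-as-unknown version.
    print(a, "!=", b, "->", (a != b), "vs", sql_not_equal(a, b))
```

Python always commits to an answer, while the NULL-as-unknown version gives back None in every row.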

The “Greater than” Logical Operator

One last image showing the True/False/NULL “greater than” table

I just included this one to make sure everyone knew that Python and Ruby couldn’t be counted on to return the same thing. Of course I’ve only chosen to look at some aspects of the NULL/TRUE/FALSE problem. I’m not even going to get started on the “not nil or nil” problem. Or PHP. Or Ruby’s difference between case equivalence and normal equivalence.
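A note for anyone retrying the comparison today. This is a sketch that assumes a Python 3 interpreter, where ordering None against a boolean raises a TypeError instead of returning a value (older Python 2 interpreters did return one). The helper name try_greater is made up for illustration.

```python
def try_greater(a, b):
    """Return a > b, or a description of the error if the types can't be ordered."""
    try:
        return a > b
    except TypeError as exc:
        return f"TypeError: {exc}"

for a, b in [(True, False), (True, None), (False, None), (None, None)]:
    print(a, ">", b, "->", try_greater(a, b))
```

Only the first pair gets a real answer on Python 3. The rest come back as errors, which is yet another way for the tools to disagree.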

I don’t really have an overarching conclusion in this blog post. No “wtf Micro$oft engineers are dolts!” message. I see each tool as fulfilling its role extremely well for its intended audience. There is a difference between NULL and nil, even if it makes it harder on multi-tool Data Guys, like myself. At least business-y people will feel more comfortable rocking a spreadsheet when their whole world is true-false-error, and I mean this sincerely. I must say, though, that I certainly do prefer MySQL’s consistent handling of NULL. Makes me wish that Ruby had its own type of unknown class. Also, don’t think that I don’t have love for MS Excel. Excel is amaaaaazing. By far MS’s best product. Understanding its quirks is just part of life.

If anyone wants to add to this list, I’ve made an open spreadsheet here. It is a Google spreadsheet, which I guess is kind of funny after all the MS Excel explaining I’ve had to do. Any questions for me? Send ‘em over to p.engineer@gmail.com.

Update from Pavpanchekha, who emailed me following my post (many thanks, I KNEW there was a reason):

Wanted to explain the strange inconsistencies you saw in your recent essay on Nil in Python/Ruby/Excel/MySQL.


You questioned the sanity of people who decided that nil and False was nil while nil or False was False
It all comes down to short-circuiting operators, that is, a specific optimization/feature that every programming language known to man has. (I stress programming, as MySQL and Excel really aren’t programming languages).
The basic idea is that an and statement, and an or statement, will only examine as many elements as they need to decide their value.
This is good; if you have, say:
cheap_function_that_is_usually_false() and big_computation()
you will almost always not compute big_computation(). It’s also a feature, not just an optimization. In python:
if len(s) > 1 and s[1] == "bob": do_stuff()
if and weren’t short-circuiting, this would cause errors, since it’d try to access s[1] even when len(s) == 1 (and thus there is no s[1]).
Now, how does the short-circuiting work? Well, it’s simple, really. For and: evaluate the first argument; if that’s false, return it, otherwise, evaluate and return the other argument.
So:
nil and False
Well, we evaluate the first. We get nil. Is nil truthy? No, because we want “if nil” to not do anything. So we return nil.
False and nil
Well, we evaluate the first. We get False. Is False truthy? No, so we return False.
And yes, this does mean that and and or are no longer commutative. But the benefits are great enough to justify it.
For or, the algorithm is similar. Evaluate the first. If that’s true (truthy), we’re done, so return it. Otherwise, return the other.
So:
nil or True
We evaluate the first. We get nil. Is nil truthy? No, so we return True
True or nil
We evaluate the first. We get True. Is True truthy? Yes, so we return it.
In this case, both orders were the same, but they’d be different if something truthy was used in place of nil: try True and 1 vs 1 and True
Hope this explains something!
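The algorithm in the email can be written out as a minimal Python sketch. The names short_and, short_or, and big_computation are hypothetical, and the second argument is passed as a zero-argument function so that it only runs when it is actually needed, which is what the real operators do for you.

```python
def short_and(first, second):
    """Return `first` if it is falsy, otherwise evaluate and return second()."""
    if not first:
        return first
    return second()

def short_or(first, second):
    """Return `first` if it is truthy, otherwise evaluate and return second()."""
    if first:
        return first
    return second()

calls = []

def big_computation():
    calls.append("ran")   # record that the expensive side was evaluated
    return True

print(short_and(None, lambda: False))            # first is falsy, so it is returned
print(short_and(False, big_computation), calls)  # big_computation never runs
print(short_or(None, lambda: True))
print(short_and(True, lambda: 1), short_and(1, lambda: True))
```

The last line shows the non-commutative part: the two calls hand back different objects (1 and True) even though both are truthy.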

I want privacy because I break the law

To the police and future employers:

  1. I don’t really do illegal things. I’m actually a pretty top notch guy.
  2. Some of the stories in the article may be embellished/fabricated. I do have to say that, right?

Bruce Schneier is one of the greatest minds of our time. Schneier on Security is a collection of some of his best essays from 2002 to 2008 and has really shaped my thinking towards privacy and security. (Also available, at a higher price but personally signed, from his own website).

Back in December he posted this in response to Eric Schmidt’s (CEO of Google) claim that:

If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place. If you really need that kind of privacy, the reality is that search engines — including Google — do retain this information for some time…

I’ve thought about Schneier’s response (that people want privacy for a whole host of reasons, like when we make love, sing in the shower, and do things that are totally legal at the time) for some time now and I have come to this conclusion:

Yes. I do want privacy for those reasons. I do not want people knowing when I search for “smelly foot rash” or, even worse, “why do women cheat on good men”. These are embarrassing or very emotionally painful subjects that I don’t want anyone to know about. Say there is only a 0.1% chance that in the next year Google’s servers have a search history leak (between all their sharing of data back and forth with the US government). If it does happen, my searches will forever be available for people to find. I’m always logged into my Gmail account, so my coworkers wouldn’t even need to know my IP. All they would have to do is search “[my email] google search history leak” or possibly just my full name.

But that isn’t everything. I want privacy because I break the law and I don’t want to be fined or thrown in prison. No, I’ve never done or dealt illegal drugs. No, I don’t jack cars or commuter bikes. But I do break the law. Probably every day. Some things are minor: 12 km/h over the limit, parking for 2 seconds to drop something off when the sign clearly says “parking after 8 pm only.” Some things are major: keyloggers and password dictionary attacks while the Grade 11 English teacher was out of the room.

(Sidenote: My friends and I were stupid in high school. We never got caught with our hackety, crackity shenanigans and I never changed my grades. But it was still stupid. I also understand the hilarity of blogging about privacy after installing keyloggers on high school computers and dict forcing teachers’ email passwords. At least I have the I-was-an-idiot-teenager excuse, unlike some major corporations.)

What about not even knowing about breaking the law? Let me ask you this: Have you ever committed a felony? Before you answer, have you read through and understood the millions of laws you must abide by? If not, your truest answer to the original question would be “I hope I haven’t committed a felony, and if I have, I hope nobody finds out because I don’t want to go to prison. I’m basically a good person and I don’t deserve to be financially ruined and separated from my family.”

Check out this lovely law that Prof. Duane shared with the class. (square brackets and bold emphasis mine, italics emphasis source)

It is unlawful for any person to import, export, transport, sell, receive, acquire, possess, or purchase any fish, wildlife, or plant taken, possessed, transported, or sold in violation of any Federal, State, foreign [!?], or Indian tribal law, treaty, or regulation.

Criminal penalties fall into two categories. For a felony offense, a maximum $250,000 fine per individual and $500,000 per organization, and/or up to 5 years imprisonment for each violation of the Act can be assessed. A misdemeanor offense carries a maximum $100,000 fine per individual and $200,000 per organization, and/or up to 1 year imprisonment.

Now I’m not an attorney, so I’m hoping I’m reading this wrong, but to me (and my completely limited knowledge of the law) this is a technically possible scenario:

You buy a lemon for Caesars at home with some friends. Unfortunately, last week Russia declared it illegal to possess lemons due to new Russian research that the rest of the world thinks is crazy. A Google search you made tipped off your local American authorities that you are breaking 16 USC 3370. Do Not Pass Go, Do Not Collect $200. Instead go to prison for 1 to 5 years after laying out up to $250k on a fine, unless you get an understanding judge.

Remember: Lack of knowledge of the law is NOT exemption from the law. You are required to know and follow all the laws in your country, province/state, county/region, municipality/city. The government never really tells you that this is physically impossible. I couldn’t possibly read laws as fast as legislatures or councils write them, let alone catch up on centuries of already written laws and judicial interpretation.

Getting back to knowingly breaking the law. My mom had surgery a couple years back and ran out of Tylenol 3s (T3s are basically acetaminophen with a small dosage of codeine and caffeine). Because I’ve had excruciatingly painful bi-yearly migraines since I hit adolescence I have an unlimited, legal supply of T3s. Personal use only, of course. But even though it was illegal, did I give my mom two or three T3s to keep her pain down until she could get her bottle refilled the next morning? You bet. I didn’t even blink. Was I trafficking narcotics (or whatever giving prescription drugs to other people is called)? You would have to ask a Canadian judge and jury that.

But luckily for me big brother doesn’t have a log of me giving my mom a couple pain killers.

This is why I want privacy. I break the law. Sometimes for good reasons, sometimes for stupid reasons. Now, I rarely knowingly break big laws, but I’m sure it has happened a couple of times. Have I ruined anyone’s life? No. Have I destroyed anyone’s wealth? No. Do I breach others’ privacy? Not since I was an idiot kid.

Then stop snooping. Leave me the hell alone. Maybe if I’m doing something online that I don’t want anyone to find out I should do it anyway, safe in the knowledge that I live in a free country and that my right to privacy is assured - unless I do something that gives the police enough evidence for a judge-signed warrant.