All you need is cache

Cache is all you need

What is a cache

More than a formal definition, I think the best way of thinking about a cache is as a result from an operation (data) that gets saved (cached) for future use.

The cached value should be identifiable by a reasonably small key, normally derived from the call name and the parameters, in some sort of hashed way.

A proper cache has the following three properties:

  1. The result is always replicable. The value can be scrapped without remorse.
  2. Obtaining the result from the cache is faster than generating it.
  3. The same result will be used more than once.

The first property implies that the cache is never the True Source of Data. A cache that’s the True Source of Data is not a cache, it’s a database; and it needs to be treated as such.

The second one implies that retrieving from the cache is useful. If getting the result from the cache is slower than (or only marginally faster than) getting it from the True Source of Data, the cache can (and should) be removed. A good candidate for caching is a slow I/O operation or a computationally expensive call. When in doubt, measure and compare.

The third property simply warns against storing values that will be used only once, so the cached value would never be read again. For example, big parts of online games are uncacheable because the data changes so often that it is read fewer times than it is written.

The simplest cache

The humblest cache is a well-known technique called memoization, which is simply to store in process memory the result of a call, and serve it from there on subsequent calls with the same parameters. For example:

NUMBER = 100
def leonardo(number):

    if number in (0, 1):
        return 1

    return leonardo(number - 1) + leonardo(number - 2) + 1

for i in range(NUMBER):
    print('leonardo[{}] = {}'.format(i, leonardo(i)))  

This terribly inefficient code will return the first 100 Leonardo numbers. But each number is calculated recursively from scratch, so by storing the results we can greatly speed things up. The key to store each result under is simply the number.

cache = {}

def leonardo(number):

    if number in (0, 1):
        return 1

    if number not in cache:
        result = leonardo(number - 1) + leonardo(number - 2) + 1
        cache[number] = result

    return cache[number]

for i in range(NUMBER):
    print('leonardo[{}] = {}'.format(i, leonardo(i)))

Normally, though, we’d like to limit the total size of the cache, to keep our program from running wild in memory. The following version restricts the size of the cache to only 10 elements, so we’ll need to delete values from the cache to allow new values to be cached:

def leonardo(number):

    if number in (0, 1):
        return 1

    if number not in cache:
        result = leonardo(number - 1) + leonardo(number - 2) + 1
        cache[number] = result

    ret_value = cache[number]

    while len(cache) > 10:
        # Maximum size allowed, 10 elements.
        # Evicting an arbitrary key is extremely naive,
        # but it's just an example
        key = next(iter(cache))
        del cache[key]

    return ret_value

Of course, in this example the cached values never change, which may not be the case elsewhere. There’s further discussion of this issue below.

Cache keys

Cache keys deserve a small note. They are not usually complicated, but the key point is that they need to be unique. A non-unique key, which may be produced by improper hashing, will produce cache collisions, returning the wrong data. Be sure that this doesn’t happen.
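As an illustration (a hedged sketch; the cache_key helper is invented for this example, not taken from any library), a key can be derived from the call name and its arguments, hashed to keep it small:

import hashlib

def cache_key(func_name, *args, **kwargs):
    # Build a stable representation of the call. repr() is a naive
    # choice: it assumes the arguments have stable, unambiguous reprs.
    raw = '{}:{!r}:{!r}'.format(func_name, args, sorted(kwargs.items()))
    # A strong hash keeps the key small and uniform; a weak hash
    # could produce the collisions warned about above.
    return hashlib.sha256(raw.encode()).hexdigest()

key = cache_key('leonardo', 25)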

Python support

Just for the sake of being useful: Python 3 includes support for a decorator to cache calls, functools.lru_cache, so the previous code can look like this.

from functools import lru_cache

@lru_cache(maxsize=10)
def leonardo(number):

    if number in (0, 1):
        return 1

    return leonardo(number - 1) + leonardo(number - 2) + 1

so you can use it instead of implementing your own.
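The decorator also exposes a couple of handy helpers, so you can check that the cache is actually earning its keep:

for i in range(NUMBER):
    print('leonardo[{}] = {}'.format(i, leonardo(i)))

print(leonardo.cache_info())  # hits, misses, maxsize and current size
leonardo.cache_clear()        # scrap the values without remorse (first property)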

The stereotypical web app cache

In the context of web apps, everyone normally thinks of memcached when they think of cache.

Memcached will, in this stereotypical usage, use some allocated memory to cache database results or full HTML pages, identified by an appropriate unique key, speeding up the whole operation. There are a lot of tools integrating it with web frameworks, and it can be clustered, increasing the total amount of memory and the reliability of the system.

In a production environment with more than one server, the cache can be shared among the different servers, so the generation of content only happens once in the whole cluster and the result can be read by every consumer. Just be sure to honour the first property, making it possible to obtain the value from the True Source of Data at any point, from any server.

This is a fantastic setup, and worth using in services. Memcached can also be replaced by other tools like Redis, but the general operation is similar.
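As a minimal sketch of the pattern (assuming a local memcached and the pymemcache client; the key scheme and the get_user_from_db helper are invented for the example):

from pymemcache.client.base import Client

client = Client(('localhost', 11211))

def get_user(user_id):
    key = 'user:{}'.format(user_id)
    cached = client.get(key)
    if cached is not None:
        return cached  # memcached returns bytes; real code would deserialise
    # Cache miss: fall back to the True Source of Data (hypothetical helper)
    value = get_user_from_db(user_id)
    client.set(key, value, expire=900)  # bound staleness to 15 minutes
    return value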

But there are more ways to cache!

Assuming a typical distributed deployment of a production web service, there are a lot of places where a cache can be introduced to speed things up.

The service described here will have one DB (or a cluster) that contains the True Source of Data, several servers with a web server channeling requests to several backend workers, and a load balancer on top of that, as the entry point of the service.


Typically, the farther away from the True Source of Data we introduce a cache, the less work we generate for the system and the more efficient the cache is.

Let’s describe the possible caches, from closest to the True Source of Data to farthest away.

Cache inside the DataBase

(other than the internal cache of the database itself)

Some values can be stored directly in the database, deriving them from the True Source of Data into a more manageable form.

A good example of this is periodic reports. If some data is produced during the day, and a report is generated every hour, that report can be stored in the database as well. Subsequent accesses will read the already-compiled report, which should be less expensive than crunching the numbers again.
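A hedged sketch of the idea with sqlite3 from the standard library (the table layout and the crunch_the_numbers helper are invented for the example):

import sqlite3

db = sqlite3.connect('service.db')
db.execute('CREATE TABLE IF NOT EXISTS reports (hour TEXT PRIMARY KEY, body TEXT)')

def hourly_report(hour):
    row = db.execute('SELECT body FROM reports WHERE hour = ?', (hour,)).fetchone()
    if row:
        return row[0]  # the already-compiled report, cheap to read
    body = crunch_the_numbers(hour)  # hypothetical expensive aggregation
    db.execute('INSERT INTO reports (hour, body) VALUES (?, ?)', (hour, body))
    db.commit()
    return body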


Another useful way of caching values is to use replication. This can be supported by databases, making it possible to read from different nodes at the same time, increasing throughput.

For example, using Master-Slave replication on MySQL, the True Source of Data is on the Master, but that information gets replicated to the slaves, which can be used to increase read throughput.

Here the third property of cache shows up, as this is only useful if we read the data more often than we write it. Write throughput is not increased.

Cache in the Application Level

The juiciest part of a service is normally at this level, and this is where the most alternatives are available.

From the raw results of database queries, to the completed HTML (or JSON, or any other format) resulting from the request, to any other meaningful intermediate result, this is where the application of caches can be most creative.

Memory caches can be set either internally per worker, per server, or externally for intermediate values.

  • Cache per worker. This is the fastest option, as the overhead is minimal: it is internal memory of the process serving the requests. But it is multiplied by the number of workers per box, and each worker needs to generate its values individually. No extra maintenance needs to be done, though. A minimal sketch is shown after this list.
  • External cache. An external service, like memcached. This shares the cache among the whole service, but access is limited by network delays. There are extra maintenance costs in setting up the external service.
  • Cache per server. Intermediate. Normally, setting up a cache service like memcached on each server. Local access is faster and shared among all workers on the same box, with the small overhead of using a protocol.
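A per-worker cache can be as simple as a module-level dictionary with a time-to-live. A minimal sketch, with all names invented for the example:

import time

_local_cache = {}  # lives in the memory of this worker process only

def cached_call(key, compute, ttl=300):
    # compute is a zero-argument callable that produces the value
    value, expires = _local_cache.get(key, (None, 0))
    if time.time() < expires:
        return value
    value = compute()
    _local_cache[key] = (value, time.time() + ttl)
    return value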

Another possibility worth noting in some cases is to cache on the hard drive, instead of in RAM. Reading from the local hard drive can be faster than accessing external services, in particular if the external service is very slow (like a connection to an external network) or if the data needs to be heavily processed before being used. Hard drive caches can also be helpful for high volumes of data that won’t fit in memory, or to reduce startup time, if starting a worker requires complex operations that produce a cacheable outcome.
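The same pattern works with files instead of a dictionary; a hedged sketch (the cache directory and naming scheme are arbitrary choices):

import hashlib
import json
import os

CACHE_DIR = '/tmp/app_cache'

def disk_cached(key, compute):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha256(key.encode()).hexdigest())
    if os.path.exists(path):
        with open(path) as stream:
            return json.load(stream)
    value = compute()  # zero-argument callable, as in the sketch above
    with open(path, 'w') as stream:
        json.dump(value, stream)
    return value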

Cache in the Web Server

Widely available web servers like Apache or Nginx have integrated caches. This is typically less flexible than application-layer caching, and needs to fit into common patterns, but it’s simple to set up and operate.

There’s also the possibility of returning an empty response with status code 304 Not Modified, indicating that the data hasn’t changed since the last time the client requested it. This can also be triggered from the application layer.
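For instance, a minimal sketch assuming Flask (the ETag value and the generate_report helper are placeholders for the example):

from flask import Flask, Response, request

app = Flask(__name__)
REPORT_ETAG = '"report-v42"'  # hypothetical version tag of the current data

@app.route('/report')
def report():
    # The client sends back the ETag it cached; if it still matches,
    # an empty 304 tells it to reuse its local copy.
    if request.headers.get('If-None-Match') == REPORT_ETAG:
        return Response(status=304)
    resp = Response(generate_report())  # hypothetical expensive call
    resp.headers['ETag'] = REPORT_ETAG
    return resp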

Static data should be, as much as possible, stored as files and returned directly from the web server, as web servers are optimised for that use case. This allows the strategy of storing responses as static files and serving them through the web server. This, in an offline fashion, is the strategy behind static website generators like Nikola or Jekyll.

For sites that deal with huge numbers of requests that should return the same data, like online newspapers or Wikipedia, a cache server like Varnish can be set up to cache them, and it may be able to act as a load balancer as well. This level of cache can store the data already compressed with gzip, for maximum performance.

Cache in the Client

Of course, the fastest request is the one that doesn’t happen, so any information that can be stored in the client to avoid making a call at all will greatly speed up an application. To achieve real responsiveness this needs to be taken seriously into account. This is a different issue than caching, but I translated an article a while ago about tips and tricks for improving user experience on web applications here.

The dreaded cache invalidation

The elephant in the room when talking about cache is “cache invalidation”. This can be an extremely difficult problem to solve in distributed environments, depending on the nature of the data.

The basic problem is very easy to describe: “What happens when the cache contains different data than the True Source of Data?”

Sometimes this won’t be a problem. In the first example, the cached Leonardo numbers just can’t be different from the True Source of Data. If the value is cached, it will be the correct value. The same would happen with prime numbers, a calendar for 2016, or last month’s report. If the cached data is static, happy days.

But most of the data that we’d like to cache is not really static. Good data candidates for being cached are values that rarely change. For example, your Facebook friends, or your schedule for today. This is something that will be relatively static, but it can change (a friend can be added, a meeting cancelled). What would happen then?

The most basic approach is to refresh the cache periodically, for example by deleting the cached value after a predetermined time. This is very straightforward and normally supported natively by cache tools, which allow storing a value with an expiration date. For example, assuming the user has a locally cached copy of their friends’ avatars, only ask again every 15 minutes. Sure, for up to 15 minutes a friend’s new avatar won’t be available and the old one will be displayed, but that’s probably not a big deal.

On the other hand, the position on a leaderboard for a competitive video game, or the result of a live match in the World Cup, is probably much more sensitive to such a delay.

Even worse, we’ve seen that some options involve having more than one cache (cache per server, or per worker; or redundant copies for reliability purposes). If two caches contain different data, the user may alternate between old and new data, which will be confusing at best and produce inconsistent results at worst.

This is a very real problem in applications working with eventually consistent databases (like the Master-Slave configuration mentioned above). If a single operation involves writing a value and then reading the same value, the read could return the old value, potentially creating inconsistent results or corrupting the data. Two very close operations modifying the same data from two users could also produce this effect.

Periodically refreshing the cache can also produce bad effects in a production environment, like all the refreshes synchronising to happen at the same time. This is typical in systems that refresh data for the day at exactly 00:00. At exactly that time all workers will try to refresh all the data at once, orchestrating a perfectly coordinated distributed attack against the True Source of Data. It is better to avoid perfectly round numbers and use some randomness instead, or to set expiry times relative to the last time the data was requested from the True Source of Data, avoiding synchronised access.
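The fix can be as small as adding jitter to the expiry time (the numbers here are arbitrary):

import random

BASE_TTL = 900  # base lifetime: 15 minutes

def ttl_with_jitter():
    # Each value gets a slightly different lifetime, so cached entries
    # don't all expire (and hammer the True Source of Data) at once.
    return BASE_TTL + random.randint(0, 120)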

This avalanche effect can also happen when the cache cluster changes (like adding or removing nodes, for example when one node fails). These operations can invalidate or make unavailable large numbers of cached entries, producing an avalanche of requests to the True Source of Data. There are techniques to mitigate this, like consistent hash rings, but they can be a nightmare if faced for the first time in production.

Manually invalidating the cache when the data changes in the True Source of Data is a valid strategy, but it needs to invalidate the value in all the caches, which is normally only feasible for external cache services. You simply can’t access the internal memory of a worker on a different server. Also, depending on the rate of invalidations per cached read, it can be counterproductive, as it produces an overhead of calls to the cache service. It also normally requires more development work, as it needs better knowledge of the data flow and of when a value in the cache is no longer valid. Sometimes that’s very subtle and not evident at all.
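With an external cache the invalidation itself is tiny; the hard part is knowing every key a write affects. A sketch (save_user_to_db is a hypothetical helper; client is a pymemcache client as in the earlier sketch):

def update_user(user_id, new_data):
    save_user_to_db(user_id, new_data)        # write the True Source of Data first
    client.delete('user:{}'.format(user_id))  # then drop the stale cached copy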

Conclusion

Caching is an incredibly powerful tool to improve performance in software systems. But it can also be a huge pain due to all those subtle issues.

So, some tips to deal with caches:

  • Understand the data and how it’s consumed by the user. A value that changes more often than it gets read is not a good cache candidate.
  • Ensure the system has a proper cache cycle. At the very least, understand how the cache flows and what the implications of a cache failure are.
  • There are a lot of ways and levels to cache. Use the most adequate to make caching efficient.
  • Cache invalidation can be very difficult. Sorry about that.

Gorgon: A simple task multiplier analysis tool (e.g. loadtesting)

Load testing is something very important in my job. I spend a decent amount of time checking how performant some systems are.

There are some good tools out there (I’ve used Tsung extensively, and ab is brilliant for small checks), but I found that it’s difficult to create flows, where you produce several requests in succession and the input of each call depends on the values returned by previous ones.

Also, load test tools are normally focused on HTTP requests, which is fine most of the time, but sometimes it’s limiting.

So, I got the idea of creating a small framework to take a Python function, replicate it N times and measure the outcome, without the hassle of dealing manually with processes, threads, or spreading it out on different machines.

The source code can be found on GitHub and it can be installed through PyPi. It is Python 3.4 and Python 2.7 compatible.

pip install gorgon
Gorgons were mythological monsters whose hair was made of snakes.

Gorgon

To use Gorgon, just define the function to be repeated. It should be a function with a single parameter that will receive a unique number. For example:

    
    def operation_http(number):
        # Imports inside your function
        # are required for cluster mode
        import requests
        # get_transaction_id_url, get_id_from, make_transaction,
        # process_result and OK are placeholders for your own flow
        result = requests.get(get_transaction_id_url)
        unique_id = get_id_from(result)
        result = requests.get(make_transaction(unique_id))
        if process_result(result) == OK:
            return 'SUCCESS'
        return 'FAIL'

There’s no need to limit the operation to HTTP requests or other I/O operations:

    def operation_hash(number):
        import hashlib
        # This is just an example of a 
        # computationally expensive task
        m = hashlib.sha512()
        for _ in range(4000):
            m.update('TEXT {}'.format(number).encode())
        digest = m.hexdigest()
        result = 'SUCCESS'
        if number % 5 == 0:
            result = 'FAIL'
        return result

Then, create a Gorgon with that operation and generate one or more runs. Each run will run the function num_operations times.

        from gorgon import Gorgon
        NUM_OPS = 4000
        test = Gorgon(operation_http)
        test.go(num_operations=NUM_OPS, num_processes=1, 
                num_threads=1)
        test.go(num_operations=NUM_OPS, num_processes=2, 
                num_threads=1)
        test.go(num_operations=NUM_OPS, num_processes=2, 
                num_threads=4)
        test.go(num_operations=NUM_OPS, num_processes=4, 
                num_threads=10)

You can get the results of the whole suite with small_report (simple aggregated results) or with html_report (graphs).

    Printing small_report result
    Total time:  31s  226ms
    Result      16000      512 ops/sec. Avg time:  725ms Max:  3s  621ms Min:   2ms
       200      16000      512 ops/sec. Avg time:  725ms Max:  3s  621ms Min:   2ms

Example of the graphs: just dump the result of html_report to an HTML file and take a look with a browser (it uses the Google Chart API).

Gorgon HTML report example

Cluster

By default, Gorgon uses the local computer to create all the tasks. To distribute the load even more and use several nodes, add machines to the cluster.

        NUM_OPS = 4000
        test = Gorgon(operation_http)
        test.add_to_cluster('node1', 'ssh_user', SSH_KEY)
        test.add_to_cluster('node2', 'ssh_user', SSH_KEY, 
                             python_interpreter='python3.3')
        ...
        # Run the test now as usual, over the cluster
        test.go(num_operations=NUM_OPS, num_processes=1, 
                num_threads=1)
        test.go(num_operations=NUM_OPS, num_processes=2, 
                num_threads=1)
        test.go(num_operations=NUM_OPS, num_processes=2, 
                num_threads=4)
        print(test.small_report())

Each of the nodes of the cluster should have Gorgon installed under the default Python interpreter, unless the python_interpreter parameter is set. Using the same Python interpreter on all the nodes and the controller is recommended.
The paramiko module is a dependency in cluster mode for the controller, but not for the nodes.

As a limitation, all the code to be tested needs to be contained in the operation function, including any imports for external modules. Remember to install all the dependencies for the code on the nodes.

Available on GitHub

The source code and more info can be found on GitHub and it can be installed through PyPi. So, if any of this sounds interesting, go there and feel free to use it! Or change it! Or make suggestions!

Happy loadtesting!

Leonardo numbers

I have my own set of numbers!

Because Fibonacci numbers are quite abused in programming, here is a similar concept: the Leonardo numbers.


L(0) = L(1) = 1

L(n) = L(n-2) + L(n-1) + 1

My first impulse is to describe them in a recursive way:

NUMBER = 20  # keep this small: the naive recursion is exponential

def leonardo(n):
    if n in (0, 1):
        return 1
    return leonardo(n - 2) + leonardo(n - 1) + 1 

for i in range(NUMBER):
    print('leonardo[{}] = {}'.format(i, leonardo(i)))

But this is not very efficient, as calculating each number recursively recalculates all the previous ones, over and over.

Here memoization works beautifully:


cache = {}

def leonardo(n):
    if n in (0, 1):
        return 1

    if n not in cache:
        result = leonardo(n - 1) + leonardo(n - 2) + 1
        cache[n] = result

    return cache[n]

for i in range(NUMBER):
    print('leonardo[{}] = {}'.format(i, leonardo(i)))

Keep in mind that it uses more memory, and that calculating the Nth element still requires calculating all the previous ones.

I saw this on Programming Praxis, and I really like the solution proposed by Graham in the comments, using a generator.

def leonardo_numbers():
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b + 1

The code is really clean.
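For instance, the first few values can be pulled out of the generator with itertools.islice:

from itertools import islice

print(list(islice(leonardo_numbers(), 10)))
# [1, 1, 3, 5, 9, 15, 25, 41, 67, 109]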

ffind v0.8 released

Good news everyone!

The new version of ffind (0.8) is available on GitHub and PyPi. This version includes performance improvements, a man page and fuzzy search support.

Enjoy!

Optimise Python with closures

This blog post by Dan Crosta is interesting. It talks about how it’s possible to optimise Python code for operations that get called multiple times by avoiding Object Orientation and using closures instead.

While the “closures” get the highlight, the main idea is a little more general: avoid repeating work that is not necessary for the operation.

The difference between the first proposed code, in OOP style,

class PageCategoryFilter(object):
    def __init__(self, config):
        self.mode = config["mode"]
        self.categories = config["categories"]

    def filter(self, bid_request):
        if self.mode == "whitelist":
            return bool(
                bid_request["categories"] & self.categories
            )
        else:
            return bool(
                self.categories and not
                bid_request["categories"] & self.categories
            )

and the last one

def make_page_category_filter(config):
    categories = config["categories"]
    mode = config["mode"]
    def page_category_filter(bid_request):
        if mode == "whitelist":
            return bool(bid_request["categories"] & categories)
        else:
            return bool(
                categories and not
                bid_request["categories"] & categories
            )
    return page_category_filter

The main differences are that neither the config dictionary nor the object’s attributes (which are also implemented as a dictionary) are accessed on each call. We create a direct reference to the values (categories and mode) instead of making the Python interpreter search the attributes of self over and over.

This generates a significant increase in performance, as described in the post (around 20%).

But why stop there? There is another clear win in terms of access, assuming that the filter doesn’t change: the mode, which we compare against “whitelist” or “blacklist” on each call. We can create a different closure depending on the mode value.

def make_page_category_filter2(config):
    categories = config["categories"]
    if config['mode'] == "whitelist":
        def whitelist_filter(bid_request):
            return bool(bid_request["categories"] & categories)
        return whitelist_filter
    else:
        def blacklist_filter(bid_request):
            return bool(
                categories and not
                bid_request["categories"] & categories
            )
        return blacklist_filter

There are a couple more details. The first one is to transform the config categories into a frozenset. Assuming that the config doesn’t change, a frozenset is more efficient than a regular mutable set. This is hinted at in the post, but maybe it didn’t make the final review (or was left out to simplify).

Also, we are calculating the intersection of two sets (operator &) only to reduce it to a bool. There is a set operation that gets the same answer without calculating the whole intersection: isdisjoint.

The same basic principle applies to the bool calculation for the blacklist filter. We can calculate it only once, as it’s there to short-circuit the result in case of an empty categories config.

def make_page_category_filter2(config):
    categories = frozenset(config["categories"])
    bool_cat = bool(categories)
    if config['mode'] == "whitelist":
        def whitelist_filter(bid_request):
            return not categories.isdisjoint(bid_request["categories"])
        return whitelist_filter
    else:
        def blacklist_filter(bid_request):
            return (bool_cat and categories.isdisjoint(bid_request["categories"]))
        return blacklist_filter

Even if all of this falls under micro-optimisation (which should be used with care, and only after a hot spot has been found), it actually makes a significant difference, reducing the time by around 35% compared with the closure implementation and by ~50% compared with the initial reference implementation.

All these elements are totally applicable to the OOP implementation, by the way. Python is quite flexible about assigning methods. No closures!

class PageCategoryFilter2(object):
    ''' Keep the interface of the object '''
    def __init__(self, config):
        self.mode = config["mode"]
        self.categories = frozenset(config["categories"])
        self.bool_cat = bool(self.categories)
        if self.mode == "whitelist":
            self.filter = self.filter_whitelist
        else:
            self.filter = self.filter_blacklist

    def filter_whitelist(self, bid_request):
        return not bid_request["categories"].isdisjoint(self.categories)

    def filter_blacklist(self, bid_request):
        return (self.bool_cat and
                bid_request["categories"].isdisjoint(self.categories))

Show me the time!

Here is the updated code, adding these implementations to the test.

The results in my desktop (2011 iMac 2.7GHz i5) are

        total time (sec)  time per iteration
class   9.59787607193     6.39858404795e-07
func    8.38110518456     5.58740345637e-07
closure 7.96493911743     5.30995941162e-07
class2  6.00997519493     4.00665012995e-07
closur2 5.09431600571     3.39621067047e-07

The new class performs better than the initial closure! The optimised closure still comes out on top, saving a big chunk compared with the slowest implementation. The PyPy results are all very close to each other, and it speeds the code up 10x, which is an amazing feat.

Of course, a word of caution: the configuration is assumed not to change for a filter, which I think is reasonable.

Happy optimising!

Some characteristics of the best developers I worked with

I had a conversation last November at PyConEs, where I mentioned that I am working with truly brilliant people at DemonWare, and someone asked me: “Do you have problems agreeing on what to do? Normally great developers have problems reaching consensus in tech discussions”. My answer was something like: “Well, in my experience, truly awesome developers know when to have a strong argument, and they are usually OK reaching an agreement in a reasonable time”.

So, as a sort of follow-up, I wanted to summarise the characteristics that I’ve seen in the best developers I’ve been lucky to work with. This is not a list of “what’s my ideal developer”, but more a reflection on the common traits I’ve seen in my experience…

  • Awesome developers are obviously smart, but that’s not typically shown as bursts of brilliance, solving really difficult issues with “aha!” moments. In my experience, genius ideas are rarely required or expressed (though they surely happen once in a blue moon). Instead, great developers are consistently smart. They present solutions to problems that are reasonable all the time. They find and fix typical bugs with ease. They struggle with very difficult problems, but are able to deal with them. They are able to quickly present something that will make you say “Actually, that’s a nice point. Why didn’t I think of this?”. They do not typically present something ingenious and never heard of, but deliver perfectly fine working ideas over and over, one day after another. Their code is not full of mind-blowing concepts, but it is logical, clean and easy to follow most of the time (and when it’s not, there is a good reason). They are able to remove complexity and simplify stuff, to a degree that it almost looks easy (but it’s not).
Brilliant people in real life do not come up with insanely great ideas out of nowhere
  • They keep a lot of relevant information in their minds. They are able to relate something under discussion with something that happened three months ago. They seem to have the extraordinary ability of pulling out of the hat some obscure piece of knowledge that is applicable to the current problem.
  • While they have a passion for coding, it is not the only thing in their lives. They have hobbies and interests, and they don’t usually go home on weekends to keep working on open source all day, though they may occasionally do so.
  • They love to do things “the right way”, but even more than that, they love to make things work. This means that they will use tools they consider inferior to achieve something if that’s the best or most convenient way. They’ll complain and will try to change it, but delivering will be more important than being right. They have strong opinions about which language/framework/way of doing stuff is best, be it Python, Ruby, Haskell, PostgreSQL, Riak or COBOL, but that won’t stop them knowing when it’s important to just stop arguing and do it.
  • They are humble. They are confident most of the time, but far from arrogant. My impression is that they don’t think they are as awesome as they truly are. They want to learn from everyone else, and ask when they have questions. They also catch new ideas very fast. They are also friendly and nice.
  • Communication is among their best skills. They are very good communicators, especially, but not only, about tech issues. They may be a little socially awkward sometimes (though this is not as common as the stereotypes portray), but when they have the motivation to express some idea, they’ll do it very clearly.
  • In some of the truly remarkable cases, they’ll be able to fulfil different roles when needed. I mean different roles in the broadest sense, basically being able to be what’s needed at that particular moment. Sometimes they’ll have to be leaders, sometimes they’ll be OK being led. They’ll know when a joke is the proper thing and when to remain formal. They’ll be the person that helps you with a difficult technical question, or the one that tells you “you’re tired, just go home, and tomorrow will be another day”.
  • And they’ll have a great sense of humour. I know that almost everyone thinks that they have a good sense of humour. That’s not totally true.

Again, this is a sort of personal collection of traits based on my experience and on what I consider the best developers I’ve been honoured to work with. Any ideas?

My concerns with Bitcoin as a currency

Today I retweeted this brilliant tweet:

So, to start the year, I’ve decided to share some of my thoughts on the Bitcoin issue, and some of the problems I see. As I am not an economist, I’m not going to go into the deflation / long-term scenario. From what I know, that’s very bad, but as that can lead to a deep economic conversation, one I don’t really want to get into as I lack the required knowledge, I’m going to concede the point. Let’s imagine that Bitcoin, from the macroeconomic point of view, is absolutely sound. Even in that case, my impression is that it is not very safe from the user’s point of view. These are “social problems”, more than “tech problems”.

(I am also going to assume that it is cryptographically sound, as I don’t have any reason to think it’s not.)

One of the main problems the system has is that you are entirely on your own to keep your bitcoins / wallets safe. I guess some people don’t perceive this as a “real problem”, but as someone that can be considered tech-savvy, the prospect of a virus, a hardware problem or a missing password that could make my money disappear forever is really worrying. Even a common problem like transferring money from a dead person (unfortunately, everyone gets to that point) can be impossible if not planned in advance. A Bitcoin wallet (which can be reduced to a private key, a sequence of bits that should be secret) associated with all your bitcoins can be gone or inaccessible in seconds. Accidental deletion, hardware problems, a malicious virus… Yes, there are countermeasures to this, like backups (if you’re reading this and you don’t have a backup in place, PLEASE DO), but the sad truth is that most people out there do not make regular backups.

Gone in 10 minutes

The single most important quality of any currency is trust. I trust that, if I have money in dollars or euros, it is not going to be vaporised for a stupid reason like a failing hard drive. All you need is some horror stories of people losing all their savings in Bitcoin because there is a virus out there, and non-tech-savvy people will be scared, losing trust in the currency.

Of course, this scenario can be avoided by an intelligent move. Hey, I don’t keep my euros with me in cash because of these problems. I put them in the bank! Awesome. I can move all my bitcoins to a bank, and interact with my money in the usual way, like credit cards, getting some from time to time from the ATM (in this case, a virtual online ATM). But, in that case, what’s the point of Bitcoin? If I rely on a bank, I am using the currency exactly as I use dollars, euros or pounds sterling (and the banks will charge accordingly). It could have some small benefits, like getting the money out of the bank to transfer it to someone else in an easier fashion than with a traditional currency (especially for small amounts), but I doubt it will be different enough or advantageous enough to justify using Bitcoin instead of regular currencies for most people.

I must say that Casascius coins are gorgeous

Another insidious problem I can see is privacy. Bitcoin is pseudonymous, meaning that all the transactions are public, but there is no association between a wallet and a person. I don’t see that as reassuring, as getting to know that wallet A belongs to person B is definitely not an extremely difficult operation. If Bitcoin were popular, there would be a lot of transactions, and most people would use a couple of wallets at most, for convenience. If you need to send goods to someone, for example, it won’t be that difficult to associate the wallet that pays for the goods with the person receiving them. Again, this can be obscured, and some people will use complex schemes to hide who they are, but in a typical operation, I’d say that most people wouldn’t care too much about it, just as they don’t care much at the moment with a credit card.

OK, so you manage to learn that person B is behind wallet A. Now you can track all the activity of wallet A (because it is public) and use it for whatever you want. For a lot of wallets it will be simply obvious what they are (known shops), so, for example, that will be a great way of doing “directed marketing”. Amazon could know that you have a contract with Vodafone that looks like a mobile contract. Now you’ll get “directed information” about all the million offers that Amazon has on mobile products. Great, now you have more spam in your inbox. The data mining implications are incredible.

Of course, any purchase that you don’t necessarily want to share with the world can be exposed. And it’s there, publicly available, forever. If you move to a different wallet and move your bitcoins around, hey, that’s registered too, so you can’t hide unless you transfer all the money out of the system and then exchange it back into new wallet(s) that, this time, hopefully won’t be discovered. Plus all the inconveniences of doing so, of course.

Of course, there are ways of dealing with this. Using a lot of wallets, circulating the money among them (and hoping this is safe enough, as there could be advanced methods of detecting common usage patterns). Being aware of what information is being shared. But, seriously, are we expecting everyone that just wants to use a currency for common operations to take on all that overhead and knowledge? I think that’s asking too much.

As the objective of a currency is to be used as a means of payment, to be exchanged often, I think that these problems stand in the way of considering Bitcoin a currency replacement that can get real traction in the world. The potential risks are quite big, and not well understood by a lot of people at the moment. Of course, these problems are currently less important than the fact that Bitcoin is used as an investment / speculation product, making the exchange rate so volatile that using Bitcoin as a currency is unviable. But assuming that Bitcoin can leave this state behind, I still see these issues in the way of it becoming a viable currency.

I am not an expert in this subject, so if I am mistaken at some point, let me know. Comments welcome :-P