Emptiness Blogging

Friday, March 24, 2023

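Some good numbers

08: a year, a month, a day, a time

The three Olympic questions asked a hundred years ago... were answered a hundred years later, at 8 pm on August 8, 2008.

51, 21, 28... add them up and you get exactly one hundred!

100 is a number that stands for completeness.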

Saturday, August 06, 2016

Design first

Since I joined the new company, the oracle has always been saying "design first!".

At first, I did not really understand why proposing an implementation would get that message in return. Today, I read it again in a Slack conversation. As an observer, a saying flashed through my mind: "Design is problem setting, Planning is problem solving". Then, when I linked that to the use cases the oracle keeps asking for, something clicked.

I'm really lucky to have joined this company, even though I don't interact much with the oracle.

Thursday, March 03, 2016

Things I learnt in the past 5 years

“It’s hard to teach a new dog old tricks.” - From many annual letters to shareholders written by the Oracle of Omaha

In the past 5 years, I worked with a team that kept 2 product lines alive while 5 others died. It was my pleasure to work with a team that let me learn and strengthen certain tricks. Old tricks take years to learn and are easy to forget, so this post is a reminder to my future self.

On risk taking

“Over the years, a number of very smart people have learned the hard way that a long stream of impressive numbers multiplied by a single zero always equals zero.” - Buffett

A startup can try to solve a problem in a really new way only if the environment allows it. In the Valley, the legal and financial systems allow a startup to take a risky bet until the company gets enough eyeballs and/or money. This does not apply to HK. We don’t have courts that understand technology, laws that create (or at least try to create) a level playing field, money that supports risky bets, etc. So, being conservative sounds really legitimate here. No matter how much success you have achieved in the past, a single misstep could put you out of business. This also applies to engineering, where new technology sometimes does not work as advertised. Traditional technology with an active ecosystem means that others have already taken the battle scars for you.

On decision making

Being data driven is great, but only when you can tell signal from noise. Sometimes, a B2B SaaS can use data to drive decisions, when the data looks consistent. Often, the sample size is too small to even make a reasonable guess. Leaders should have strong opinions, weakly held. When we don’t have the data, we had better admit that it is just a guess. (By the way, the infrastructure behind data collection and data analysis is really hard to get right. I built it 3 times and never came close.)

On product design

Minimum Viable Product means minimum, viable, and throwaway. Before you have even come up with the right question for the audience, how could you come up with an answer? This translates to “doing the right things is more important than doing things right”. Build tests to validate the question. MVP also applies to engineering. A leaky abstraction is bad, but not starting one at all is worse once you find yourself repeating something more than twice. That abstraction leads you to your ultimate question. And premature optimization just wastes your time. Let it run until it hits the wall.

On culture

“We shape our buildings and afterwards our buildings shape us.” - Churchill

Values should be deep-rooted while culture is ever changing. Culture evolves around the values set by the team. Don’t be afraid to start with something small or vague. Once set, stick with it and shape it. Remember, making something up doesn’t mean doing it. People do make mistakes sometimes, and toxicity spreads at the speed of light, so do not delay the fix. Eventually, the team will be shaped by it. Engineering also needs culture and values to scale. We shape our codebase and afterwards our codebase shapes us. We are eager to find something that can be repeated, nicely.

On multi-tasking

Some tasks can be run in parallel and some cannot. Most of the time, the best way is not to multi-task. Take a pause and schedule the tasks with whatever tool you are comfortable with. Delegation enables true multi-tasking. In engineering, data processing is the same: multi-processing sometimes does not work as expected.

On ownership

Ownership implies dedication. No one can own everything since bandwidth is limited. In the short term, sole ownership is cost efficient. In the long term, it is a trap! Knowledge transfer is a daily routine that cannot be avoided. Owning one thing makes you feel you are doing well; owning multiple things makes you feel you are getting nothing done. How could we take care of many things with the same level of dedication as one? Transferring ownership regains dedication.

On communication

Communication is king, both written and verbal. Verbal is fast but lossy; written is persistent. They are not mutually exclusive and they complement each other. In engineering, written is favored: commit logs, documentation, source comments, discussion on issues. These help with scaling and understanding. Ignoring it, or not improving it, is foolish.

On tooling

“Less is more.”

Having fewer things to care about gives you focus. A managed service like Heroku is always a good place to start. Don’t try to build a custom PaaS when that is not your business. Still, the eagerness to reinvent a wheel should be respected, since it gives you a better understanding. Your business should already take 80% of your attention; the other 20% is better not spent on extra trouble.

On meeting

There are many types of meetings and they all need an agenda. A meeting is better used for discussion than for brainstorming, which can be done alone. Stick to the agenda and get to action items fast. And treat synchronous meetings as a limited resource.

On tradeoff

“There are many ways to Rome.”

Most decisions have an opportunity cost: taking one path may lose the upsides of another. What really matters is whether the choice achieves the goal. Making the tradeoff explicit provides better understanding, and can also gain support from the team.

Last but not least

Stay hungry. Stay foolish. Self-explanatory.

Thanks @anthonycyl for the edit.

Friday, October 02, 2015

Rate limiting Shopify API using Cuttle

While developing a Shopify app, we have to respect the API rate limit set by Shopify. Typically, we can call sleep() to pause between API calls. This simple method works great until multiple processes make API calls concurrently.
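As a rough illustration of the sleep() approach, here is a minimal sketch of a per-process throttle. The `Throttle` class and its `clock` / `sleep` parameters are hypothetical names made up for this example, not part of any Shopify library.

```python
import time


class Throttle(object):
    """Spaces out calls so that at most `rate` of them start per second.

    The state lives in this process only, which is exactly why the
    approach stops working once several processes call the API.
    """

    def __init__(self, rate, clock=time.time, sleep=time.sleep):
        self.interval = 1.0 / rate   # minimum gap between two calls
        self.clock = clock           # injectable for testing
        self.sleep = sleep           # injectable for testing
        self.next_allowed = 0.0      # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `throttle.wait()` before each API call keeps a single process under the limit. Two processes with their own `Throttle` objects would together send twice as many calls, which is the problem the rest of this post deals with.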

There are quite a number of ways to solve the problem.

1. Serialize all API calls into a single process, though not all business logic can work in this way.
2. Host an RPC server / use a task queue to make API calls. The RPC server / queue manager has to rate limit the API calls. [http://product.reverb.com/2015/03/07/shopify-rate-limits-sidekiq-and-you/]
3. Centralize all API calls with an HTTP proxy where the proxy performs rate limiting.

Personally, I think the RPC server / task queue option is quite heavyweight since it requires:

* An RPC / task framework, and
* An RPC server / task queue, and
* A rate limit system built around the RPC server / task queue.

In contrast, the HTTP proxy option only requires an HTTP proxy server plus an HTTP client. And, HTTP is well supported in many programming languages and systems. It sounds like a great starting point.

(BTW, HTTP can be considered as the underlying protocol of an RPC system.)

With the HTTP proxy option, there are quite a few options to get started.

1. Use Nginx reverse proxy to wrap the API, use its limit module to perform simple rate limit or write a Lua/JS plugin for more sophisticated control. [http://codetunes.com/2011/outbound-api-rate-limits-the-nginx-way/]
2. Use Squid forward proxy to perform simple rate limit by client info (e.g. IP address).

At first glance, the Nginx reverse proxy option looks superior since it allows sophisticated rate limit control. However, this approach requires calling the Shopify API through an Nginx-wrapped URL. Otherwise, we have to modify the DNS/host configuration to route the traffic.

Personally, I am not comfortable with modifying the URL of the Shopify API since that may prevent a smooth upgrade of the Shopify API client in the future. As for the DNS option, would I have to modify the DNS config every time a new Shopify store installs the app?

(We could also route all traffic to the default virtual host of Nginx and use a Lua/JS plugin for host routing. This requires neither URL wrapping nor DNS configuration. Though, I personally think this is kind of abusing Nginx.)

So, a reverse proxy may not be a good way to go. Let's come to the forward proxy option. In this case, we do not need to do anything to the URL of the Shopify API; we just let the traffic go through the proxy by configuring the HTTP client. A forward proxy with rate limit control sounds like a good way to go.

Here, we come to Cuttle proxy. [http://github.com/mrkschan/cuttle]

Cuttle proxy is an HTTP forward proxy designed solely for rate limiting outbound traffic, using goroutines. It provides a set of rate limit controls for different scenarios. For the Shopify API, we can use the following Cuttle settings to perform rate limiting.

addr: :3128
zones:
  - host: "*.myshopify.com"
    shared: false
    control: rps
    rate: 2
  - host: "*"
    shared: true
    control: noop

Then, set the HTTP proxy of the Shopify API client like below to route API calls through Cuttle.

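# apiclient.py
import json

import shopify

# API_KEY, PASSWORD and SHOPIFY_DOMAIN are placeholders for the app
# credentials and the store domain; define them before running.
shop_url = 'https://{}:{}@{}/admin'.format(API_KEY, PASSWORD, SHOPIFY_DOMAIN)
shopify.ShopifyResource.set_site(shop_url)

print(json.dumps(shopify.Shop.current().to_dict()))

# Run
HTTPS_PROXY=127.0.0.1:3128 python apiclient.py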

As long as all API clients are configured to use Cuttle, API calls will be rate limited at 2 requests per second per Shopify store. So, the rate limit allowance would rarely be exhausted.

Note: It is up to you to set the rate of API calls in Cuttle; 3 requests per second per store would be another great option. At that rate, you will receive an HTTP 429 from Shopify after roughly 120 continuous API calls to the same store over 40 seconds.
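The 120 calls / 40 seconds figure can be checked with a small back-of-the-envelope model. The sketch below assumes a leaky bucket that holds 40 calls and drains at 2 calls per second; those two numbers are assumptions chosen to be consistent with the figures above, and `calls_before_429` is a hypothetical helper, not something provided by Shopify or Cuttle.

```python
def calls_before_429(call_rate, bucket_size=40, leak_rate=2):
    """Estimate how long a steady call rate lasts before the bucket is full.

    Coarse model: time advances one second at a time, the bucket gains
    `call_rate` calls and drains `leak_rate` calls each second.
    Returns (calls, seconds), or None when the bucket never fills up.
    """
    if call_rate <= leak_rate:
        return None  # the bucket drains at least as fast as it fills
    level = 0
    calls = 0
    seconds = 0
    while level < bucket_size:
        level += call_rate - leak_rate  # net growth during this second
        calls += call_rate
        seconds += 1
    return calls, seconds


print(calls_before_429(2))  # None: 2 calls per second is sustainable
print(calls_before_429(3))  # (120, 40)
```

Under these assumptions, 3 requests per second gains one slot per second on the bucket, so it fills after about 40 seconds, while 2 requests per second can go on indefinitely.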

Note: Cuttle forwards API calls on a first-come, first-served basis. If the concurrency level of API calls to the same Shopify store is high, some API calls will wait for a significant amount of time instead of immediately receiving an HTTP 429 from Shopify. Remember to set a reasonable HTTP timeout in that case.

(FYI, the Shopify API rate limit favors a high concurrency level only for a short duration. If you really need that burst in your case, Cuttle would not be a good option.)

Friday, May 15, 2015

Scope finding in a source file

This post discusses an issue I met when building a text editor plugin that finds the class/function scope that the current line in the editor belongs to (http://atom.io/packages/ctags-status). The problem can be broken into two parts: (i) Given a set of possibly overlapping ranges on a one-dimensional axis, find the ranges that cover a given point. (ii) Given a set of overlapping ranges, get the topmost range, where ranges are stacked in ascending order of their starting points (the later a range starts, the higher it sits in the stack). Note that this is not a hard problem. This post documents how I encountered it and worked on it.

So, here is the story.

When I built the early version of the plugin, I wanted to ship it as soon as possible and see if anyone would download it (the Atom editor does not expose plugin usage data to its author yet, so the only number I have is downloads). Thus, not much thought went into it in those days.

The early implementation modeled each scope as a range with a start and an end line. Finding the scope that the current line belongs to then becomes a range search problem. Ranges overlap when scopes are nested. In that case, the start and end lines of the inner scope are always enclosed by those of the outer scopes. So, I can sort all scopes by their start line in ascending order, and the innermost scope on the current line is the last one in the sequence whose line range encloses the current line. This is O(N log N) preprocessing + O(N) lookup. I was happy with it.
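To make the sort-then-scan idea concrete, here is a minimal sketch. It is written in Python for illustration only (the plugin itself is not written in Python), and the `(name, start, end)` tuple layout, the function name and the sample scopes are all made up for this example.

```python
def find_scope_linear(sorted_scopes, line):
    """Return the innermost scope name covering `line`, or None.

    `sorted_scopes` is a list of (name, start, end) tuples that is
    already sorted by start line in ascending order.
    """
    found = None
    for name, start, end in sorted_scopes:
        if start > line:
            break  # every remaining scope starts after the line
        if line <= end:
            found = name  # a later match is nested deeper, so it wins
    return found


# O(N log N) preprocessing: sort by start line.
scopes = sorted([
    ('main', 22, 30),
    ('Foo', 1, 20),
    ('Foo.baz', 10, 18),
    ('Foo.bar', 3, 8),
], key=lambda scope: scope[1])

print(find_scope_linear(scopes, 5))   # Foo.bar
print(find_scope_linear(scopes, 9))   # Foo
print(find_scope_linear(scopes, 21))  # None
```

Each lookup walks the sorted list from the top, so the cost per cursor move grows with the number of scopes in the file.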

So far so good?

The issue did not surface until I used the plugin to browse a long source file that has dozens of functions (yup, shouldn't the file be split for readability?). When I kept moving the cursor down for a while, its movement was no longer smooth. The cause was that the plugin needs to find the scope on each cursor line change. When I fired up the profiler, I found that 300 - 400 ms were spent on scope finding when there were dozens of continuous cursor line changes. I was not sure whether the plugin was really the cause of the UX problem, but it was the one that took most of the processing time. So, time for optimization!

Since this is a range search problem, KD tree, segment tree, and interval tree quickly came to my mind. There are several factors to consider in picking a solution: (i) availability of an existing implementation (I don't like reinventing without enhancement), (ii) speed of insert / delete / update (when a source file is edited, there is a high chance that scopes are moved), and (iii) lookup speed, of course. While I was still deciding which search tree best fits the issue, I asked myself a question: why not simply hash the scope(s) on each line? A simple hash with a stack in each bucket is a good fit because:

(i) I just need a JavaScript object (hash) and arrays (stacks) to build it.
(ii) A typical source file has fewer than a thousand lines and a dozen scopes. The worst case is having thousands of pointers (lines * scopes) referring to a dozen strings (scope names). That should not take a lot of space.
(iii) A file edit can introduce quite a lot of scope movements (e.g. inserting a new line at the top of the file pushes all scopes down). Maintaining a data structure via insert / update / delete is like rebuilding it in the worst case. Building a big hash takes O(NL), number of scopes * number of lines in the file (which is several thousand iterations). The hash building process is offline and I don't expect it to take long, so I am happy with that.
(iv) O(1) lookup, the best that I can get.

As a result, the plugin uses a hash for scope finding.
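A minimal sketch of the per-line hash could look like the following. Again, this is Python for illustration rather than the plugin's actual code, and `build_line_index` / `find_scope` are hypothetical names; a dict plays the role of the JavaScript object and a list plays the role of the stack.

```python
def build_line_index(scopes):
    """Map every covered line to a stack of scope names (innermost last).

    `scopes` is a list of (name, start, end) tuples. Building the index
    costs O(N * L): number of scopes times the lines each one spans.
    """
    index = {}
    # Push outer scopes first so that the innermost one ends up on top.
    for name, start, end in sorted(scopes, key=lambda scope: scope[1]):
        for line in range(start, end + 1):
            index.setdefault(line, []).append(name)
    return index


def find_scope(index, line):
    """O(1) lookup: the top of the stack is the innermost scope."""
    stack = index.get(line)
    return stack[-1] if stack else None


index = build_line_index([
    ('Foo', 1, 20),
    ('Foo.bar', 3, 8),
    ('Foo.baz', 10, 18),
    ('main', 22, 30),
])

print(find_scope(index, 5))   # Foo.bar
print(find_scope(index, 21))  # None
```

After an edit the whole index is simply rebuilt, which matches point (iii) above: the cost moves from every cursor move to the occasional rebuild.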

Sunday, February 23, 2014

Python descriptor, Django CharField with encryption

This post is part of the pyfun series, in which I try to *log* some of the features that I think make Python fun :)

One of the most recent topics in my reading list is Python descriptor.

An object attribute with “binding behavior”, one whose attribute access has been overridden by methods in the descriptor protocol. Those methods are __get__(), __set__(), and __delete__(). If any of those methods are defined for an object, it is said to be a descriptor. - http://docs.python.org/2/howto/descriptor.html

When I finished the howto on python.org, I did not really understand what a descriptor is, and my read-later list kept expanding with related articles until I came across this post. (If you don't know what a Python descriptor is, I recommend reading that post first since I'm not here to re-post the details in my poor English.)

The purpose of this post is to extend the recommended reading with another example use of a Python descriptor: an encryption/decryption wrapper around a Django `CharField`.

One of the major purposes of implementing a Python descriptor is to provide a getter and a setter for attributes. In some traditional programming languages, we have to implement/generate a set of getters and setters to protect read/write access to attributes. Or, we can use a generic attribute class that has the protection, but then attribute access looks like `object.attribute.get()` and `object.attribute.set(xxx)`. A Python descriptor solves both of these problems.

To encrypt/decrypt a `CharField`, the obvious way is to override its `get()`/`set()` functions. We can simply do so by extending `CharField`, just like this snippet. However, I would like to demonstrate the use of a Python descriptor (yep, I'm abusing it here).

First, we need the descriptor with encryption and decryption. The cipher we use here is a simple 32-byte XOR without padding (which is simply useless in most cases).

class EncryptedAttr(object):
    '''Descriptor that encrypts content on write and decrypts it on read'''
    def __init__(self, attr, secret_key):
        self.attr = attr
        self.key = secret_key

    def encrypt(self, v):
        '''A simple XOR cipher'''
        return ''.join(chr(ord(a) ^ ord(b)) for (a, b) in zip(self.key, v))

    def decrypt(self, v):
        '''A simple XOR cipher (XOR with the same key reverses itself)'''
        return ''.join(chr(ord(a) ^ ord(b)) for (a, b) in zip(self.key, v))

    def __get__(self, obj, klass):
        '''Get `attr` from owner, and decrypt it'''
        cipher_text = getattr(obj, self.attr, None)
        if not cipher_text:
            return ''

        return self.decrypt(cipher_text)

    def __set__(self, obj, value):
        '''Encrypt value, and set to owner via `attr`'''
        if not value:
            setattr(obj, self.attr, '')
            return

        cipher_text = self.encrypt(value)
        setattr(obj, self.attr, cipher_text)

The descriptor requires a Django model attribute name and a secret key in its constructor. The attribute name is used to look up the wrapped attribute of the Django model in its `__get__()` and `__set__()` functions. To use it, we just assign it as an attribute to the model class.

from django.db import models


# EncryptedAttr is the descriptor defined above
class Secret(models.Model):
    wrapped = models.CharField(max_length=32)
    content = EncryptedAttr('wrapped', 'This is the 32-bytes secret key.')


# Let's make a secret
payload = 'The secret must be 32-bytes long'  # Because we use a 32-bytes XOR
s = Secret()
s.content = payload

s.wrapped
>>> '32-bytes blah blah blah blah ...'

s.content
>>> 'The secret must be 32-bytes long'

In this example, the CharField `wrapped` attribute is not expected to be accessed directly. When we assign plain text to `content`, the plain text is encrypted and stored to `wrapped`. The `content` attribute does not hold anything at all. On the other hand, when we read from `content` attribute, it actually decrypts the cipher text from `wrapped`.
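The same mechanics can be tried without Django. The following is a minimal, self-contained sketch; the `XorAttr` and `Note` names are made up for this example, and the XOR "cipher" is as useless for real security here as it is above.

```python
class XorAttr(object):
    """Data descriptor that stores an XOR-ed copy of the value in `attr`."""

    def __init__(self, attr, key):
        self.attr = attr
        self.key = key

    def _xor(self, text):
        # XOR is its own inverse, so one helper both encrypts and decrypts.
        # zip() stops at the shorter input, so text longer than the key
        # would be truncated.
        return ''.join(chr(ord(k) ^ ord(c)) for k, c in zip(self.key, text))

    def __get__(self, obj, klass=None):
        if obj is None:
            return self  # accessed on the class, not on an instance
        return self._xor(getattr(obj, self.attr, ''))

    def __set__(self, obj, value):
        setattr(obj, self.attr, self._xor(value))


class Note(object):
    wrapped = ''                            # stands in for the CharField
    content = XorAttr('wrapped', 'k' * 32)  # hypothetical 32-character key


note = Note()
note.content = 'hello'

print(repr(note.wrapped))        # scrambled text, not 'hello'
print(note.content)              # hello
print('content' in vars(note))   # False: nothing is stored under `content`
```

Because the descriptor defines `__set__`, assigning to `note.content` never creates an instance attribute named `content`; only `wrapped` holds data.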

You may get the sample Django project to play around at https://github.com/mrkschan/encrypted-field.

Thursday, October 17, 2013

Partial function call

This post is part of the pyfun series, in which I try to *log* some of the features that I think make Python fun :)

Again, I was reading the Scala tutorial and found that it has built-in support for partial function calls (see http://twitter.github.io/scala_school/basics.html#functions). This reminded me that Python also has functools.partial(), which can be used to create function shortcuts.

Let's see this example of Django.

import functools

from django.db import models


# Let's have a Coupon that can either be fixed amount discount or percentage off
# But, we don't want to have model inheritance and table join to get the data
class Coupon(models.Model):
    code = models.CharField(max_length=8)
    # Django expects choices as (value, label) pairs
    type = models.CharField(max_length=11, choices=[('fixedamount', 'Fixed amount'), ('percentage', 'Percentage')])
    amount = models.DecimalField(max_digits=8, decimal_places=2)
    currency = models.CharField(max_length=3, choices=[('USD', 'USD'), ('CAD', 'CAD')], default='')


# To create a Coupon based on certain conditions, you can have this
kwargs = {'code': code}

if condition_a:
    kwargs.update({'type': 'fixedamount', 'currency': 'USD'})
elif condition_b:
    kwargs.update({'type': 'percentage'})

kwargs.update({'amount': x if condition_c else y})
coupon = Coupon.objects.create(**kwargs)


# Or with functools.partial()
FixedamountCoupon = functools.partial(Coupon.objects.create, type='fixedamount')
PercentageCoupon = functools.partial(Coupon.objects.create, type='percentage')

if condition_a:
    coupon = functools.partial(FixedamountCoupon, code=code, currency='USD')
elif condition_b:
    coupon = functools.partial(PercentageCoupon, code=code)

coupon = functools.partial(coupon, amount=x if condition_c else y)
coupon = coupon()

Yes, we just made shortcuts for two types of Coupon using functools.partial(), creating "sub-classes" of Coupon. Furthermore, if the underlying function accepts positional arguments, we can also pre-fill those arguments.
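Here is a minimal sketch of both cases that runs without Django. The `make_coupon` and `power` functions are hypothetical stand-ins made up for this example; only `functools.partial()` itself comes from the standard library.

```python
import functools


def make_coupon(code, type, amount, currency=''):
    """Hypothetical stand-in for Coupon.objects.create()."""
    return {'code': code, 'type': type, 'amount': amount, 'currency': currency}


# Pre-fill keyword arguments, like the Coupon shortcuts above.
fixed_usd_coupon = functools.partial(make_coupon, type='fixedamount', currency='USD')
coupon = fixed_usd_coupon('SAVE5', amount=5)
print(coupon['type'], coupon['currency'])  # fixedamount USD


def power(base, exponent):
    return base ** exponent


# Pre-fill a positional argument: `base` is fixed to 2.
two_to_the = functools.partial(power, 2)
print(two_to_the(10))  # 1024
```

Positional arguments given to partial() are placed before the ones supplied at call time, so only the leading parameters can be fixed this way.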

Clean code FTW.