Python Hijinks: Parsing OTF File

published-date: 26 Feb 2023 12:19 +0700
categories: you-tried python-hijinks
tags: python


A few days ago I stumbled upon the documentation for the OTF font file structure on learn.microsoft. For some reason it reminded me of a tsoding daily stream on Parsing Java Bytecode with Python (highly recommended!), which led me to this post.

So for the past few days I’ve been learning byte parsing using only Python’s built-in libraries, with this weekend as my deadline. Suffice it to say, it’s still far from usable. But the output of a parsed table is (at least for me) always satisfying to read.
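As a rough illustration of what byte parsing with only the standard library looks like, here is a minimal sketch that reads the table directory at the start of an OpenType file with `struct`. The field layout follows the OpenType spec (big-endian, a 12-byte header followed by 16-byte table records). The function name `parse_table_directory` and the dictionary keys are made up for this example and are not taken from the notebook.

```python
import struct


def parse_table_directory(data: bytes) -> dict:
    """Parse the OpenType table directory from the start of a font file."""
    # Header: uint32 sfntVersion, then four uint16 fields, all big-endian.
    sfnt_version, num_tables, search_range, entry_selector, range_shift = \
        struct.unpack_from('>IHHHH', data, 0)

    records = []
    offset = 12  # the header is 12 bytes long
    for _ in range(num_tables):
        # Each table record: 4-byte tag, uint32 checksum, offset and length.
        tag, checksum, table_offset, length = struct.unpack_from('>4sIII', data, offset)
        records.append({
            'tag': tag.decode('ascii'),
            'checksum': checksum,
            'offset': table_offset,
            'length': length,
        })
        offset += 16  # each record is 16 bytes long

    return {'sfntVersion': sfnt_version, 'numTables': num_tables, 'records': records}


# A hand-built directory with one fake 'cmap' record, just to exercise the parser.
sample = struct.pack('>IHHHH', 0x4F54544F, 1, 16, 0, 0)  # 0x4F54544F is 'OTTO'
sample += struct.pack('>4sIII', b'cmap', 0, 28, 4)
directory = parse_table_directory(sample)
```

With a real font you would pass the bytes read from the file instead of `sample`.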

You can view the source code here (colab). I tried my best to keep the notebook as concise as possible while showing how I brainstormed the steps required for this task.

Feel free to copy and modify the colab notebook I linked above. The source code is under Unlicense.

There are a few things to note:

Choosing a font from fonts.google

Inside the notebook, the font file is downloaded directly from fonts.google. In theory you can parse any OpenType font available on the internet. But if you want to download the font the way the snippet does, you’ll have to replace the fpath font filename and the furl query argument ?family=... to match your target font from fonts.google.com.
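A minimal sketch of that replacement step, using only `urllib`. The base URL below is an assumption based on the `?family=...` form mentioned above, so copy the real one from your browser as described in the next paragraphs. The helper names `build_font_url` and `download_font` and the example filename are hypothetical; only `fpath` and `furl` come from the notebook.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Assumed base URL; replace it with the one your browser actually shows.
BASE_URL = 'https://fonts.google.com/download'


def build_font_url(family: str) -> str:
    """Build the download URL with the ?family=... query argument."""
    return BASE_URL + '?' + urlencode({'family': family})


def download_font(furl: str, fpath: str) -> str:
    """Save whatever the URL returns to fpath and return the local path."""
    local_path, _headers = urlretrieve(furl, fpath)
    return local_path


furl = build_font_url('Open Sans')
fpath = 'downloaded_font'  # hypothetical filename; match it to your font
# download_font(furl, fpath)  # left commented out because it needs network access
```

What the server sends back (a single font file or an archive) depends on the URL you copied, so check the saved file before parsing it.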

To obtain the URL or its query argument, start downloading a font and immediately press Escape to get the download URL. If the URL does not appear in the address bar, Developer Tools can be used to capture the GET request made by the page.

Here’s a step-by-step guide.

fonts-google-downloading-a-font

Once you’ve selected the family and clicked the download all button, you’ll be redirected to a blank screen. Once you’re there, hit Esc before your browser closes the tab.

fonts-google-blank

If your browser’s address bar is empty or only shows about:blank, fret not! Hit F12 (assuming you’re using Firefox) and navigate through the Developer Tools to the Network tab.

developer-tools

Which would look like the following figure.

developer-tools-network-tab

Now you need to reload the page and hit Escape again to see the request that went through that page.

developer-tools-get-request

And now you can copy-paste the URL into the flink variable in the notebook. Don’t forget to change the fpath as well.

Limitations and the over-engineered side of the notebook

This colab notebook was only meant as an exercise. The main focus was implementing the required tables from the spec documentation. Many tables are required, and they are fairly complex to parse. Take a look at the cmap table and you’ll see that barely half of the spec is implemented.
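To give a sense of where the cmap complexity comes from: the table starts with a small header followed by encoding records, and each record points at a subtable whose format number decides how the rest is parsed. Below is a minimal sketch of only that first layer, with the field layout taken from the OpenType spec. The function name and dictionary keys are illustrative and do not mirror the notebook.

```python
import struct


def parse_cmap_header(table: bytes) -> dict:
    """Parse the cmap header and its encoding records (not the subtables)."""
    # Header: uint16 version, uint16 numTables.
    version, num_tables = struct.unpack_from('>HH', table, 0)

    records = []
    for i in range(num_tables):
        # Each encoding record is 8 bytes: platformID, encodingID, subtable offset.
        platform_id, encoding_id, subtable_offset = \
            struct.unpack_from('>HHI', table, 4 + 8 * i)
        # The first uint16 of every subtable is its format number.
        (subtable_format,) = struct.unpack_from('>H', table, subtable_offset)
        records.append({
            'platformID': platform_id,
            'encodingID': encoding_id,
            'offset': subtable_offset,
            'format': subtable_format,
        })

    return {'version': version, 'numTables': num_tables, 'records': records}


# Hand-built bytes: header, one encoding record, and the start of a subtable.
sample = struct.pack('>HH', 0, 1)        # version 0, one encoding record
sample += struct.pack('>HHI', 3, 1, 12)  # platform 3, encoding 1, subtable at byte 12
sample += struct.pack('>H', 4)           # the subtable begins with format 4
cmap = parse_cmap_header(sample)
```

Every subtable format has its own layout after that first field, which is where most of the remaining work sits.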

There is also some redundant functionality in the notebook: I foresaw that I’d need to track total_parsed_size in bytes to make sure every TableRecord length is fully consumed. But this turned out to be a hassle.
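One way such bookkeeping could be kept out of the individual parsers is a small cursor object that counts bytes as they are read. This is only a sketch of the idea, not how the notebook does it: `ByteReader` and its `read` method are hypothetical, and only the name `total_parsed_size` is borrowed from the notebook.

```python
import struct


class ByteReader:
    """Tiny cursor over a bytes object that counts how much has been read."""

    def __init__(self, data: bytes, start: int = 0):
        self.data = data
        self.start = start
        self.pos = start

    def read(self, fmt: str) -> tuple:
        """Unpack one big-endian struct at the cursor and advance past it."""
        fmt = '>' + fmt
        values = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return values

    @property
    def total_parsed_size(self) -> int:
        """Number of bytes consumed since the reader was created."""
        return self.pos - self.start


# Pretend this is a table whose TableRecord says length == 8.
table_length = 8
reader = ByteReader(struct.pack('>HHI', 1, 2, 3))
major, minor = reader.read('HH')
(value,) = reader.read('I')
fully_parsed = reader.total_parsed_size == table_length
```

Comparing `total_parsed_size` against the record length after each table is then a single check instead of manual arithmetic in every parser.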

The parameters of the table-parsing functions also differ from one to the next, which initially caused a bit of confusion about what goes where and why.

The thing I over-engineered the most was the Container class, which I built so I could assign attributes dynamically, though it is not a standard way to store parsed values as a structured object. One reason I used it to hold all the parsed values is the custom __repr__ method, which lets me pass any instance straight to print() to see the whole structure.
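For comparison, a more conventional standard-library route for structured values is a dataclass, which also gives a readable `repr` for free. This is a sketch of the alternative, not something the notebook uses, and the field names are chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableRecord:
    """One entry of the table directory, with fields fixed up front."""
    tag: str
    checksum: int
    offset: int
    length: int


record = TableRecord(tag='cmap', checksum=0, offset=28, length=4)
text = repr(record)
# text is "TableRecord(tag='cmap', checksum=0, offset=28, length=4)"
```

The trade-off is that every field has to be declared in advance, whereas a dynamic container accepts whatever keys the parser throws at it.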

It also acts like a linked list, so (initially) it had the ability to traverse between table nodes. But this, again, turned out to be a huge hassle, since the parent node is only assigned at the end of the parsing process.
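The timing problem can be shown with a stripped-down stand-in. `Node` and `parse_child` below are hypothetical and much simpler than the real class: the point is only that a child gets built first and attached afterwards, so it has no parent to look at while it is being parsed.

```python
class Node:
    """Stripped-down stand-in for a container that tracks its parent."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def put(self, child):
        self.children.append(child)
        child.parent = self  # the link only exists from this point on


def parse_child():
    child = Node('TableRecord')
    # While the child is being filled in, it has no parent to look at yet.
    seen_during_parse = child.parent
    return child, seen_during_parse


root = Node('TableDirectory')
child, seen_during_parse = parse_child()
root.put(child)  # only now does child.parent point at root
```

Anything a child needs from its parent during parsing therefore has to be passed in explicitly as an argument.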

But I like it, so here’s the whole snippet of the Container class.

from collections import OrderedDict

# padstr() is a helper defined elsewhere in the notebook.
class Container:
  def __init__(self, ID='Unassigned'):
    self.ID = ID
    self._dict = OrderedDict()
    self._parentNode = None
    self._reserved = list(dir(self)) + ['_reserved']

  def put(self, key, value):
    self._dict[key] = value 
    if isinstance(value, self.__class__):
      value._parentNode = self
    elif isinstance(value, (list, tuple)):
      for i in value:
        if isinstance(i, self.__class__):
          i._parentNode = self

  def get_parent_node(self):
    return self._parentNode

  def __repr__(self):
    s = ''
    s += f'=v= {self.ID} =v=\n'

    l = max(map(len, self._dict.keys()), default=1)
    # scare the hoes
    for k in self._dict.keys():
      s += padstr(k, l)
      s += ' : '
      
      item = self._dict.get(k)
      if item is None:
        s += 'null'

      # raw bytes stored under a '_data...' key are summarized, not dumped
      elif k.startswith('_data') and isinstance(item, bytes):
        s += '[Blob]'

      elif isinstance(item, self.__class__):
        s += f'\n  '
        s += f'\n  '.join(str(item).split('\n'))
      
      elif isinstance(item, (list, tuple)):
        if not item:
          # print empty sequences explicitly so the line still ends with '\n'
          s += '[]'

        elif isinstance(item[0], self.__class__):
          s += f'\n╤' + '═' * (l + 3)
          _item_l = list(item)
          _item_r = []

          if len(item) > 40:
            _item_l = item[:15]
            _item_r = item[-15:]

          for subitem in _item_l:
            s += f'\n├───╢ '
            s += f'\n│     '.join(str(subitem).split('\n'))

          if _item_r:
            s += f'\n┴'
            s += f'\n░' + ' ' * (l + 3) + f'... (truncated {len(item)-30} entries)'
            s += f'\n┬'
            s += f'\n│'
            for subitem in _item_r:
              s += f'\n├───╢ '
              s += f'\n│     '.join(str(subitem).split('\n'))

        else:
          _temp_item = list(item)
          if len(_temp_item) > 20:
            _temp_item = str(_temp_item[:10])[:-1] + ', ... , ' + str(_temp_item[-10:])[1:]
          s += str(_temp_item)
      else:
        s += str(item)
      
      s += '\n'
    s += f'=^= {self.ID} =^=\n'
    return s

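  def __getattr__(self, __name: str):
    # Only called when normal lookup fails; read _dict via __dict__ so a
    # missing _dict (e.g. before __init__ runs) cannot recurse forever.
    _dict = self.__dict__.get('_dict', {})
    if __name in _dict:
      return _dict[__name]
    raise AttributeError(__name)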

wack.