Python Hijinks: Parsing OTF
published-date: 26 Feb 2023 12:19 +0700
categories: [you-tried] [python-hijinks]
tags: [python]
A few days ago I stumbled upon the documentation for the OTF font file structure on learn.microsoft. For some reason I remembered watching a tsoding daily stream on Parsing Java Bytecode with Python (highly recommended!), which led me to this post.
So for the past few days I’ve been learning byte parsing using only Python's built-in libraries, with this weekend as my deadline. Suffice to say, it’s still far from usable. But the output of a parsed table is (at least for me) always satisfying to read.
You can view the source code here (colab). I tried my best to keep the notebook as concise as possible while showing how I brainstormed the steps required for this task.
Feel free to copy and modify the colab notebook I linked above. The source code is released under the Unlicense.
There are a few things to note:
Inside the notebook, the font file is downloaded directly from fonts.google. In theory you can parse any OpenType font that is available on the internet. But if you want to download the font the way the snippet does, you’ll have to replace the fpath font filename and the furl query argument ?family=... to match your target font from fonts.google.com.
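Swapping the family in a download URL can be done with the standard library alone. What follows is only a rough sketch, not the notebook's code: with_family and download are made-up helper names, and the example URL is a placeholder rather than the real download endpoint. Only fpath, furl and the family query argument come from the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlretrieve


def with_family(url: str, family: str) -> str:
    """Return `url` with its `family` query argument set to `family`."""
    parts = urlsplit(url)
    # Keep every other query argument and drop any existing `family`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "family"
    ]
    query.append(("family", family))
    return urlunsplit(parts._replace(query=urlencode(query)))


def download(furl: str, fpath: str) -> str:
    """Save whatever `furl` points at to `fpath` and return the saved path."""
    saved_path, _headers = urlretrieve(furl, fpath)
    return saved_path


# Placeholder URL (assumption), NOT the real download endpoint.
furl = with_family("https://example.invalid/download?family=Old+Font", "Noto Sans")
print(furl)
```

Calling download(furl, fpath) with a real URL would then write the response to disk. The sketch never calls it, so it runs without network access.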
To obtain the URL or its query argument, you can start downloading a font and immediately press Escape to get the download URL. If the URL does not appear in the address bar, Developer Tools can be used to find the GET request made by the page.
Here’s a step-by-step guide.
Once you’ve selected the family and clicked the download all button, you’ll be redirected to a blank screen. Once you’re there, hit esc before your browser closes the tab.
If your browser's address bar is empty or only shows about:blank, fret not! Hit F12 (assuming you’re using Firefox) and navigate through the Developer Tools to the Network tab, which should look like the following figure.
Now you’ll need to reload once more and hit Escape to see the request that went through that page.
And now you can copy-paste the URL into the flink variable in the notebook. Don’t forget to change the fpath as well.
This colab notebook was only an exercise. The main focus was implementing the required tables from the spec documentation. Many of the required tables are fairly complex to parse. Take a look at the cmap table and you’ll see that barely half of the spec is implemented.
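Every table is reached through the table directory at the start of the file, so that is the first thing to parse. Here is a minimal sketch using only the struct module. It is not the notebook's implementation: parse_table_directory and the namedtuple layout are my own naming, and the demo bytes are synthetic, not a valid font. The field order (sfntVersion, numTables, searchRange, entrySelector, rangeShift, then one tableTag/checksum/offset/length record per table, all big-endian) follows the OpenType spec.

```python
import struct
from collections import namedtuple

TableRecord = namedtuple("TableRecord", "tag checksum offset length")

HEADER_FMT = ">IHHHH"  # sfntVersion, numTables, searchRange, entrySelector, rangeShift
RECORD_FMT = ">4sIII"  # tableTag, checksum, offset, length


def parse_table_directory(data: bytes):
    """Parse the table directory at the start of an OpenType font."""
    sfnt_version, num_tables, *_rest = struct.unpack_from(HEADER_FMT, data, 0)
    records = []
    pos = struct.calcsize(HEADER_FMT)  # header is 12 bytes
    for _ in range(num_tables):
        tag, checksum, offset, length = struct.unpack_from(RECORD_FMT, data, pos)
        records.append(TableRecord(tag.decode("ascii"), checksum, offset, length))
        pos += struct.calcsize(RECORD_FMT)  # each record is 16 bytes
    return sfnt_version, records


# Synthetic two-table directory, for demonstration only (not a valid font).
demo = struct.pack(HEADER_FMT, 0x00010000, 2, 32, 1, 0)
demo += struct.pack(RECORD_FMT, b"cmap", 0, 44, 4)
demo += struct.pack(RECORD_FMT, b"head", 0, 48, 4)

version, records = parse_table_directory(demo)
for record in records:
    print(record)
```

With a real font you would pass in the bytes read from fpath, then jump to each record's offset to parse that table.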
There is also some redundant functionality in the notebook: I foresaw that I’d need to track total_parsed_size in bytes to make sure every TableRecord length was fulfilled. But this turned out to be a hassle.
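One way to make that bookkeeping less painful is to let a small cursor object count the bytes instead of threading a total through every function. This is a hypothetical sketch, not what the notebook does: ByteReader, read and consumed are invented names, and the "table" here is a made-up layout rather than a real OpenType table.

```python
import struct


class ByteReader:
    """Hypothetical cursor over a bytes object that counts what it consumes."""

    def __init__(self, data: bytes, offset: int = 0):
        self.data = data
        self.start = offset
        self.pos = offset

    def read(self, fmt: str):
        """Unpack the big-endian fields described by `fmt` and advance the cursor."""
        fmt = ">" + fmt
        values = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return values

    @property
    def consumed(self) -> int:
        """Number of bytes read since the starting offset."""
        return self.pos - self.start


# Made-up table: a uint16 version, a uint16 count, then `count` uint16 values.
table_offset, table_length = 4, 8
blob = b"\xff" * 4 + struct.pack(">HHHH", 1, 2, 10, 20)

reader = ByteReader(blob, table_offset)
version, count = reader.read("HH")
values = reader.read(f"{count}H")
print(version, count, values, reader.consumed == table_length)
```

Comparing reader.consumed against the TableRecord length only makes sense for tables that are read front to back. Tables that jump around through internal offsets would need a different check.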
The function parameters for parsing tables also differ from one to another, which initially caused a bit of confusion about what goes where and why.
The thing I over-engineered the most was the Container class, which was built so I could assign attributes dynamically, but it is not a standard way to store parsed values as a structured object. One reason I used it to hold all parsed values is its custom __repr__ method, which lets me put any instance straight into print() to see the whole structure.
It also acts like a linked list, so (initially) it had the ability to traverse between table nodes. But this, again, turned out to be a huge hassle, since the parent node only gets assigned at the end of the parsing process.
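The two ideas, attribute-style access and parent links, can be shown in isolation with a much smaller class. Node below is a trimmed-down, hypothetical stand-in written for this illustration, and path() is an invented helper. It also shows why the link arrives late: a child has no parent until something puts it somewhere.

```python
class Node:
    """Trimmed-down, hypothetical stand-in for the Container idea."""

    def __init__(self, ID="Unassigned"):
        self.ID = ID
        self._dict = {}
        self._parentNode = None

    def put(self, key, value):
        self._dict[key] = value
        # The child only learns about its parent here, at put() time.
        if isinstance(value, Node):
            value._parentNode = self

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # Go through __dict__ so a half-initialised instance cannot recurse.
        stored = self.__dict__.get("_dict", {})
        if name in stored:
            return stored[name]
        raise AttributeError(name)

    def path(self):
        """Return the IDs from the root down to this node."""
        ids, node = [], self
        while node is not None:
            ids.append(node.ID)
            node = node._parentNode
        return ids[::-1]


font = Node("font")
head = Node("head")
head.put("unitsPerEm", 1000)  # demo value

print(head.path())           # ['head'], parent not linked yet
font.put("head", head)       # linking happens only now
print(head.path())           # ['font', 'head']
print(font.head.unitsPerEm)  # 1000
```

Since a table is usually parsed in full before it is put into its parent, nothing inside the table can walk upward while it is still being parsed.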
But I like it. So here’s the whole snippet of the Container class.
from collections import OrderedDict

# NOTE: padstr(k, l) is a helper defined elsewhere in the notebook (not shown
# here); presumably it pads the key k to width l.


class Container:
    def __init__(self, ID='Unassigned'):
        self.ID = ID
        self._dict = OrderedDict()
        self._parentNode = None
        self._reserved = list(dir(self)) + ['_reserved']

    def put(self, key, value):
        self._dict[key] = value
        # link Container children (direct, or inside a list/tuple) back to this node
        if isinstance(value, self.__class__):
            value._parentNode = self
        elif isinstance(value, (list, tuple)):
            for i in value:
                if isinstance(i, self.__class__):
                    i._parentNode = self

    def get_parent_node(self):
        return self._parentNode

    def __repr__(self):
        s = ''
        s += f'=v= {self.ID} =v=\n'
        l = max(map(len, self._dict.keys()), default=1)
        # scare the hoes
        for k in self._dict.keys():
            s += padstr(k, l)
            s += ' : '
            item = self._dict.get(k)
            if item is None:
                s += 'null'
            # was `type(item) == 'bytes'`, which is never true
            elif k.startswith('_data') and isinstance(item, bytes):
                s += '[Blob]'
            elif isinstance(item, self.__class__):
                # nested Container: print it on its own indented lines
                s += f'\n '
                s += f'\n '.join(str(item).split('\n'))
            elif isinstance(item, (list, tuple)):
                if not item:
                    # was `continue`, which skipped the trailing newline below
                    s += str(item)
                elif isinstance(item[0], self.__class__):
                    # list of Containers: keep the first and last 15 if it is long
                    s += f'\n╤' + '═' * (l + 3)
                    _item_l = list(item)
                    _item_r = []
                    if len(item) > 40:
                        _item_l = item[:15]
                        _item_r = item[-15:]
                    for subitem in _item_l:
                        s += f'\n├───╢ '
                        s += f'\n│ '.join(str(subitem).split('\n'))
                    if _item_r:
                        s += f'\n┴'
                        s += f'\n░' + ' ' * (l + 3) + f'... (truncated {len(item)-30} entries)'
                        s += f'\n┬'
                        s += f'\n│'
                        for subitem in _item_r:
                            s += f'\n├───╢ '
                            s += f'\n│ '.join(str(subitem).split('\n'))
                else:
                    # list of plain values: keep the first and last 10 if it is long
                    _temp_item = list(item)
                    if len(_temp_item) > 20:
                        _temp_item = str(_temp_item[:10])[:-1] + ', ... , ' + str(_temp_item[-10:])[1:]
                    s += str(_temp_item)
            else:
                s += str(item)
            s += '\n'
        s += f'=^= {self.ID} =^=\n'
        return s

    def __getattr__(self, __name: str):
        # only reached when normal lookup fails: fall back to the stored values
        if __name in self._dict:
            return self._dict.get(__name)
        return object.__getattribute__(self, __name)
wack.
Built with Hugo | previoip (c) 2025