I'm working on a data-driven event generation platform for software engineers

Subscribe to TinyDevCRM development here.

Discussion on Hacker News


I'm taking the covers off an effort I've been making over the past two weeks. I call it TinyDevCRM. Friends and acquaintances also call it “tinydev”. It's a new software project I'm building to address some of the data and platform issues I see blocking me from creating more software tools. I'm hoping this tool at minimum lays the foundation for my personal infrastructure for the next 50 years, and at most could even be forked, extended, and monetized to change my current BATNA.

Data is eating the world

I've wanted to build a “something” in software for quite a while now.

The summer of my junior year of college, I talked to a few hundred people about the prospects of an Industrial Internet of Things product to provide analytics for manufacturing tools. Turns out, you have to learn how to code before people will trust a college student with machines that cost millions of dollars. Interfacing with 30-year-old proprietary standards to read data out over networks or to local disk also poses an interesting logistical challenge.

After I graduated from college, I worked with a friend at the time on a tool to help plan out travel itineraries and share them amongst friends. I called it Traveltile. It turned out okay-ish, but I didn't know what I didn't know, and I thought working in industry would help with that. That, and reading a lot of books. Also, the platform only becomes useful if there are network effects, or existing data to reference. Without network effects, it just looks pretty.

During my first year working in industry, I wanted to explore whether you could algorithmically crawl earnings reports filed with the SEC on EDGAR to determine which companies make sense to invest in from a value investing standpoint (as opposed to a day trader's standpoint). I called it ShortRoad. I got a pretty lukewarm response, and I personally only invested in a handful of blue chip companies that in a bull market delivered okay returns anyways, so that died. Also, clean data is really expensive, and effective NLP over PDFs (especially older PDFs or OCR scans) is a difficult challenge for one newbie programmer to handle.

The common thread across all of these failed efforts, and the biggest lesson I learned over these years, is that software is only valuable when it makes data easy to obtain, easy to manage, and easy to make useful. Without access to the underlying data, and without a value proposition tying views into that data to business-critical workflows, all of my software ideas quite simply went into the dustbin of history.

Marc Andreessen says “software is eating the world”. My spin on that is “data is eating the world”.

Data in today's world of software

Ben Thompson of Stratechery published an interesting article about the state of the software industry, where he claims that the current tech incumbents (Facebook, Google, Amazon, etc.), due to their control over vast data lakes and their ability to take in more, could possess the staying power of the “Big Three” automobile manufacturers. In this way of thinking, clients are ephemeral, and data lakes are eternal.

I think there are a lot of great things about keeping data in one place. For example, data integrity issues become much more urgent, and get fixed, or get fixed faster. Things like fsyncgate matter a lot more to larger companies than smaller ones, since they manage more data, more users, more money, and more liability. Large companies also take data security far more seriously. I can't, and likely never will be able to, execute on at least two thirds of this list on data center security, like maintaining my own network security infrastructure or providing 24/7 security guards to watch my data.

Unfortunately, there is (always) another side of the coin of vast data lakes:

  • Businesses generate and retain value through platforms, not protocols. Data is one of the most effective business moats. Many companies lock down data to, among other reasons, retain a business advantage, and avoid misuse of data by third parties. Even data exports, which may be required by law, aren't useful if the data relies on the application to be accessible or interpretable. Hence, the formation of walled gardens, which fundamentally result in an unequal relationship between consumer and producer, like the situation Adam Conover describes in this tweet. The larger the data lake, the more unequal the relationship.

  • Data oftentimes isn't a first-class citizen during the software development cycle. You know how API-driven companies have cleaner and more usable APIs than companies that expose a product and only have internal APIs? I think something similar occurs with data lakes and data swamps. I've heard a good number of stories of data scientists turning into ML infrastructure engineers due to data engineering bottlenecks choking the amount of effective data science work that can take place. It takes a lot of hard work to keep a data lake from turning into a data swamp, and the job gets much harder when the physical data isn't part of the discussion. The significance of this situation scales with the size of the data lake, especially considering the amount of tooling and infrastructure necessary to address bugs and issues of entirely new dimensions.

  • The higher you rise, the harder you fall. Smaller data lakes are less attractive targets for bad actors, and I don't think it's possible to never make a mistake that can then be exploited. It's not a rebuttal of Kerckhoffs's principle, but if you controlled your own data, your problems would at least be your own fault, which helps retain a sense of agency and control.

Whoever controls the data holds the keys to their temple. I wish I owned mine.

An enlightening conversation

A few months ago, I had a conversation with a former director of engineering at a company I regard highly for their engineering prowess. When I asked what his favorite internal tool was, he responded with a tool that backs up their internal data lake architecture. In short, it uses AWS Redshift, generates materialized views, and upon a refresh of said views, fires off AWS Lambda events to send emails, push notifications, and “do things”. It's probably a good deal more sophisticated than that, but I immediately became enamored with the basic idea, and I couldn't get it out of my head.

Everyone has the same source of truth. One source of writes means a unitary flow of concrete data, which may reduce concurrency conflicts. The database has context over its own views, which means it can do scheduling or optimization to increase query performance. Finally, lambdas help tie the service to third-party integrations through a common channel. What a wonderful tool!

Then I thought, I could do all this with PostgreSQL. PostgreSQL packages features like CREATE MATERIALIZED VIEW and CREATE TRIGGER out of the box. PostgreSQL also maintains a rich extensions ecosystem, and a quick search for automated job schedulers revealed pg_cron, which lets you schedule cron jobs using SQL. I'll admit, when I realized all the pieces were there, I got pretty excited.

For example, I want to build a tool to help me keep in touch with my friends. My calendars are (still) all over the place, and I'm not sure how merging .ics works. But I can model the basic workflow in PostgreSQL.

First, take a list of friends:

-- Something like this
CREATE TABLE friends (
    "firstName" VARCHAR(256),
    "lastName" VARCHAR(256),
    "email" VARCHAR(256),
    "birthday" DATE
);
COPY friends FROM '/tmp/friends.csv' WITH (FORMAT csv);

Then, create a materialized view on this data to match friends whose birthdays are today:

-- Something like this
CREATE MATERIALIZED VIEW bdays_today AS
SELECT * FROM friends
WHERE date_part('month', birthday) = date_part('month', current_date)
  AND date_part('day', birthday) = date_part('day', current_date);

Then, refresh that materialized view once every day at say 12:00AM:

SELECT cron.schedule(
    '0 0 * * *',
    'REFRESH MATERIALIZED VIEW bdays_today'
);

-- Assuming table of refresh events
SELECT cron.schedule(
    '0 0 * * *',
    'INSERT INTO refreshes(view_name, status, time) VALUES (''bdays_today'', ''new'', NOW())'
);

And upon refresh, send an event to notify any third parties interested:

CREATE OR REPLACE FUNCTION bday_notify()
    RETURNS trigger AS
$$
BEGIN
    -- Replace NEW.view_name with some more useful payload
    PERFORM pg_notify('events_channel', NEW.view_name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Triggers can't attach to materialized views, so attach to the
-- concrete refreshes table that tracks refresh events instead
CREATE TRIGGER bdays_refresh
    AFTER INSERT OR UPDATE OF status
    ON refreshes
    FOR EACH ROW
EXECUTE PROCEDURE bday_notify();

TinyDevCRM simply takes this basic PostgreSQL workflow, makes it generalizable, and exposes it all via both a GUI and an API.

First steps

  1. I put up an introductory landing page, which got 60 early adopter signups.

  2. After getting some constructive feedback from the good folks on Pioneer.app and YC Startup School, I conducted a product discovery roadmap study to distill my value proposition, which I updated my landing page with. I probably could improve the landing page, and I tried to revamp it entirely, but I decided to change the content and add a navbar to the existing page instead, since I'm one guy and this is less a company at the moment and much more a software tool.

  3. I created a Dockerized proof of concept, available here. Here's a screenshot of the “front-end”, which has a series of buttons you can click on to execute the workflow:

    [Screenshot: proof-of-concept front-end]

What's next?

My strategy for TinyDevCRM is:

  • Get a minimum viable product out the door for a 0.1 release by the end of next week. I'd like to keep that timebox, especially since I finished reading through “The Haskell Book” the first week of January, the first week of February was focused on creating the proof of concept and the product discovery roadmap, and most of this week went into updating the landing page. My original schedule was to finish “The Haskell Book” by the middle of January, and I assumed my full-stack skills hadn't deteriorated so much over the past two years.

  • After 0.1, continue to build in features as 0.x releases until I get to “feature parity”, as TinyDevCRM should almost exclusively communicate with third-party services using protocols. This will involve a good amount of wireframing and product development work, which I'm doing with Balsamiq and Basecamp.

  • After the final 0.x release, I'll lock everything down, which may include rewrites of certain portions of the stack to something that's rock-solid stable. By everything, I mean vendoring all source dependencies and building at commit-level granularity. If I stick with Python, that means building my own version of OpenSSL for my own version of Python. This step could be quite slow (I really want to compile to a single binary, which precludes using Python, which would then necessitate a complete backend rewrite), and may involve multiple 1.x releases.

  • If there's continued public interest in taking this further (such as integrating enterprise features like LDAP/ActiveDirectory or relaxing constraints like assuming deployments are AWS-only and going cloud-native), I can fork the project and see whether I can get a bit of funding and put together a team to take it to the next level.

I'm also looking to publish weekly updates on this blog, as a public, auditable log of product development, and share them with Bytes by Ying subscribers and TinyDevCRM subscribers. I'll share release notes with TinyDevCRM subscribers.

Software things I've learned so far

Mostly to prove to myself that I've learned new things while working on this.

  • Flask forces you to set app.config['UPLOAD_FOLDER'] and save uploads to disk for HTTP POST file uploads to work properly. This makes sense to me, since REST as I currently understand it is a stateless protocol (as opposed to SOAP, which as I understand it uses actor models and messages and isn't completely stateless). I honestly thought I would be able to use a stateful cursor and load data row by row into the database, and avoid eagerly loading blobs.

    This presents the concern that on certain deployment targets (e.g. AWS EC2 t3.nano), I might run out of disk space and crash the entire system trying to upload files of certain sizes. This might mean I need to have client-side validation around file size – which still isn't a guarantee since under heavy write load by concurrent users, I could still crash.

  • PostgreSQL CSV upload assumes an existing PostgreSQL table. As a data engineer, I really don't like CSVs. Ideally, a CSV is a well-structured document with proper escaping, quoting, delimiters, the same number of columns for every row, a single null value representation, RFC 4180 compliance with its ABNF grammar, etc. Then you worry about problems like “there's no byte-level type system definition?” or “what do you mean you can't assume a byte offset for each row?” or “how do we stripe a generated or actual CSV header on chunked data for parallelizing ingest?” or smart things like that.

    Instead, a CSV is any file that has a “.csv” suffix. There is no limitation to how crummy CSV data can become and still work. Failing fast isn't really a thing for large datasets, especially without validated intermediate state (like MapReduce generates per stage). Ideally, I would use something like Apache Parquet for ingesting large datasets. But Excel and Google Sheets and other tools all have universal CSV support and probably no support for Parquet.

    It's okay. This isn't middleware, this is a tool for people, and people like their flexibility. I think I'll just have to add a form for the user to create a PostgreSQL table to their liking, and tie that to the workflow for CSV ingest.
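    A rough sketch of that flow in SQL, where the contacts table, its columns, and the CSV options are my assumptions:

    -- Something like this; a user-defined table created via a form...
    CREATE TABLE contacts (
        "firstName" TEXT,
        "lastName" TEXT,
        "email" TEXT,
        "birthday" DATE
    );

    -- ...tied to an explicit CSV ingest. Spelling out header, delimiter,
    -- and NULL handling catches at least some of the crummier CSVs early.
    COPY contacts FROM '/tmp/contacts.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER ',', NULL '');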

  • Access control will likely become tricky. I was talking to one senior data engineer a few months ago who mentioned how PostgreSQL permissioning wasn't automated and reproducible. If confirmed, this will almost certainly complicate deployments. How would database-level permissioning work? I really want everybody to access a cursor to get their data out at any time, but I don't want people to have direct write access to the database for tables and views that aren't theirs.

    For the MVP, I can manage API endpoint access using HTTP methods only (e.g. allow GET only), and use a single database user.
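    Here's a minimal sketch of what database-level permissioning might look like after the MVP, assuming one schema per user; the role and schema names are hypothetical:

    -- Hypothetical per-user role and schema
    CREATE ROLE alice LOGIN PASSWORD 'change-me';
    CREATE SCHEMA alice_data AUTHORIZATION alice;

    -- A read-only role for pulling data out via a cursor at any time
    CREATE ROLE alice_reader NOLOGIN;
    GRANT USAGE ON SCHEMA alice_data TO alice_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA alice_data TO alice_reader;

    -- Keep roles from writing into schemas they don't own
    REVOKE CREATE ON SCHEMA public FROM PUBLIC;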

  • You can't attach a trigger to a materialized view. You have to have a concrete table to fire a statement or row-level trigger. I created an events table, but that means additional disk usage, and another table to manage (which may grow very quickly depending on how many events are managed).

    I'll need to VACUUM old events, or flush them to a log file, which I'll have to rotate and flush to blob storage. I do wish this feature existed, but building dependencies between extensions is one quick way to destroy an ecosystem, so I also understand why it doesn't exist.
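    For now, a concrete events table plus a scheduled cleanup job could look something like this; the 30-day retention window is an assumption:

    -- Reusing the refreshes table from the sketch above as the events table
    CREATE TABLE IF NOT EXISTS refreshes (
        view_name TEXT,
        status TEXT,
        time TIMESTAMPTZ DEFAULT NOW()
    );

    -- Purge events older than 30 days once a day, instead of letting the
    -- table grow unbounded
    SELECT cron.schedule(
        '30 0 * * *',
        'DELETE FROM refreshes WHERE time < NOW() - INTERVAL ''30 days'''
    );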

  • cron has per-minute jobs available. At first, I thought I wanted something like AWS Lambda for TinyDevCRM, which charges by the event and can get expensive really quickly. Therefore, I promised people one refresh per day.

    If I use cron though, it's just CPU time, which I'm paying for on EC2's (or DigitalOcean's or Linode's) pricing schedule. I can go per-minute and it shouldn't linearly scale my costs per cron job.

    Events are so important to TinyDevCRM (I see them as taking the human out of the loop for automated processes) that I want to manage my own cron (and I'd bet that Lambda is built on top of cron anyways).
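    Going per-minute is just a different crontab expression on the same pg_cron call; a sketch, reusing the materialized view from earlier:

    -- Refresh every minute instead of once a day
    SELECT cron.schedule(
        '* * * * *',
        'REFRESH MATERIALIZED VIEW bdays_today'
    );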

  • cron is tricky to work with. Sometimes, cron just silently stops working, and you don't know whether it's actually working unless you wait out real minutes (there's no way to simulate time). In my case it had something to do with pg_hba.conf. This means log management for PostgreSQL is important, which means getting PostgreSQL logs out of systemd's journal or wherever they end up. It also means rotating and flushing logs is even more important.
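    One sanity check that at least doesn't require waiting on the clock is confirming the job is registered and marked active in pg_cron's metadata table (a sketch; it won't catch connection failures like the pg_hba.conf issue above):

    -- Jobs that silently stop firing may still show up here as active
    SELECT jobid, schedule, command, active FROM cron.job;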

  • Citus Data only publishes a build of pg_cron for PostgreSQL 11.x, and only in package archives for some versions of Ubuntu. I'm using the latest PostgreSQL (12.x), and I'm on Ubuntu 19.10 (though I'm using Ubuntu 18.04 for Dockerized development). I'm already forced to build at least one dependency from source and target a different location than the pre-built version. I don't want to take on build overhead at this point in time; hopefully I can automate this in a CI/CD pipeline, have it cache the build automatically, and write a simple bash script to build a first-order-approximation version.

  • pub/sub uses one channel, and subscribers can disambiguate between different events. I thought I might need to keep open a socket per user, and multiplex between them somehow. Nope, I just need to ensure each event has a unique identifier of some sort for the subscriber to pick it up. That's a huge relief to me from a resource utilization standpoint.

    I do want to ensure events are secure, and that users who aren't authorized to receive a payload can't access its contents. End-to-end encryption is a later concern, however; there's no point in encrypting events if nobody's there to use them.
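    A sketch of the single-channel pattern, where the payload shape (JSON carrying a user id and view name) is my assumption:

    -- Subscriber side: one shared channel per connection
    LISTEN events_channel;

    -- Publisher side: the payload carries enough for subscribers to
    -- filter out their own events
    SELECT pg_notify(
        'events_channel',
        json_build_object('user_id', 42, 'view', 'bdays_today')::text
    );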

  • AWS RDS doesn't allow PostgreSQL extensions it hasn't vetted. I think they at least support PostGIS, but they don't support pg_cron. So I can't use AWS RDS to deploy PostgreSQL with pg_cron. That means I need to deploy PostgreSQL on EC2 and manage my own PostgreSQL backups, among other concerns. It's fine, because TinyDevCRM is meant to deploy anywhere and not rely on a specific cloud vendor, but this may make things more complicated than necessary.
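    Self-managing pg_cron on EC2 roughly comes down to two extra steps beyond a stock PostgreSQL install; the database name below is a placeholder:

    -- In postgresql.conf (requires a server restart):
    --   shared_preload_libraries = 'pg_cron'
    --   cron.database_name = 'tinydevcrm'

    -- Then, in that database:
    CREATE EXTENSION pg_cron;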

There are other concerns, some of which I wrote down in backend.py, but these are some of the top concerns I have as of now.

Questions for myself and the community

If you have answers to these questions, I'd love to hear from you! My email is me@yingw787.com.

  • Is there a way to tell Flask to upload files to AWS S3 or blob storage, and is there a way to configure this dynamically at runtime?

  • Is there a reproducible access control infrastructure framework at the database level for PostgreSQL?

  • What's best practice for log rotation and flushing to blob storage? Is there a best practice?

Subscribe to my mailing list