CME
CloudWatch-event-based async data processing that handled 100 GB of data a night,
with some individual files upwards of 30 GB (which introduced unique challenges
due to the sheer size of the files). The solution I implemented for processing
such large files was:
S3 put event -> Lambda -> Fargate container that shells out and pipes the AWS CLI
into sed to drastically reduce the raw file size -> S3 put event -> Lambda uses
the requests library to post the S3 location and schema to a third-party
normalization tool -> S3 put event -> Lambda submits the normalized data location
and schema to Druid for indexing
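A file in the 30 GB range is too large to stage inside a Lambda, which is why the size-reduction step lives in Fargate and streams the object through the shell rather than downloading it. A minimal sketch of that step, assuming hypothetical bucket names and a placeholder sed expression (the real filter depended on the raw file format):

```python
# A minimal sketch of the Fargate step, run inside the container. Bucket names
# and the sed expression are placeholders; the real filter depended on the raw
# file format. The point is that nothing here ever holds the 30 GB file whole.
import subprocess
import sys

def shrink_file(src_bucket: str, key: str, dest_bucket: str) -> None:
    """Stream s3://src_bucket/key through sed and back to s3://dest_bucket/key."""
    # Equivalent shell: aws s3 cp s3://src/key - | sed -E '<filter>' | aws s3 cp - s3://dest/key
    download = subprocess.Popen(
        ["aws", "s3", "cp", f"s3://{src_bucket}/{key}", "-"],
        stdout=subprocess.PIPE,
    )
    shrink = subprocess.Popen(
        ["sed", "-E", "s/,[^,]*$//"],   # placeholder: strip the last column of each row
        stdin=download.stdout,
        stdout=subprocess.PIPE,
    )
    upload = subprocess.Popen(
        ["aws", "s3", "cp", "-", f"s3://{dest_bucket}/{key}"],
        stdin=shrink.stdout,
    )
    # Let upstream processes receive SIGPIPE if a downstream one exits early.
    download.stdout.close()
    shrink.stdout.close()
    for proc in (upload, shrink, download):
        if proc.wait() != 0:
            sys.exit(f"pipeline step failed: {proc.args[0]}")

if __name__ == "__main__":
    shrink_file(*sys.argv[1:4])
```

Because the AWS CLI, sed, and the upload all work on the pipe a chunk at a time, memory and disk usage stay flat regardless of file size.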
Google BigQuery
Worked directly with the head of product; architected and built a fleet of
CloudWatch-scheduled Lambdas (all with CI/CD stacks and a full SDLC) that would
intermittently query Google BigQuery to ingest Ethereum and Bitcoin chain data.
This required learning how BigQuery partitions its data in order to optimize the
queries and balance cost against latency. Data was then transformed in the Lambda
using pandas over several steps (raw -> flattened -> normalized) and written to
S3 in a Hive-style path to be ingested by internal tools (Athena, Druid, Kafka).
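A minimal sketch of one such scheduled Lambda, assuming the public bigquery-public-data.crypto_ethereum dataset and a hypothetical bucket name; the real jobs had their own queries, flattening steps, and schedules:

```python
# A minimal sketch of one scheduled ingestion Lambda. The public
# bigquery-public-data.crypto_ethereum dataset and the bucket name are
# assumptions; the real jobs had their own queries, schemas, and schedules.
import datetime as dt
import io

import boto3
from google.cloud import bigquery

BUCKET = "example-chain-data"   # hypothetical

QUERY = """
SELECT block_number, block_timestamp, from_address, to_address, value
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp >= @start AND block_timestamp < @end
"""

def handler(event, context):
    # The table is partitioned on block_timestamp, so constraining it bounds
    # the bytes scanned (i.e. the cost) of every run.
    end = dt.datetime.now(dt.timezone.utc).replace(minute=0, second=0, microsecond=0)
    start = end - dt.timedelta(hours=1)

    client = bigquery.Client()
    job = client.query(
        QUERY,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("start", "TIMESTAMP", start),
                bigquery.ScalarQueryParameter("end", "TIMESTAMP", end),
            ]
        ),
    )

    raw = job.to_dataframe()                                  # raw
    flat = raw.copy()                                         # flattened (nested fields, when present)
    flat["value_eth"] = flat["value"].astype(float) / 1e18    # normalized units

    # Hive-style path so Athena/Druid can partition-prune on dt and hour.
    key = f"ethereum/transactions/dt={start:%Y-%m-%d}/hour={start:%H}/part-0000.csv"
    buf = io.StringIO()
    flat.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
```

Constraining the query to the partition column is what keeps the bytes scanned, and therefore the BigQuery bill, predictable for each run.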
Node scraper
Using "serverless framework" architected and built a fleet of cloudwatch
scheduled lambdas (all with CI/CD stacks and SDLC) that provided the Research
Analysts (who could write some python but were not full fledge developers) with
a platform where they could query a given ethereum node for arbitrary data,
perform their arbitrary transformations and then would accept a well formatted
CSV. The platform would then take the CSV and write it to S3 in a hive style
path to then be ingested be internal tools (athena, druid, kafka)
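A minimal sketch of the platform half of the scraper, assuming web3.py for node access and hypothetical bucket / node-URL values; in practice the analyst transforms were separate, dynamically loaded jobs rather than the inline example here:

```python
# A minimal sketch of the scraper platform. The bucket, node URL, and the
# example analyst job are all placeholders; real analyst transforms were
# separate modules loaded per schedule.
import csv
import datetime as dt
import io

import boto3
from web3 import Web3

BUCKET = "example-research-data"              # hypothetical
NODE_URL = "http://eth-node.internal:8545"    # hypothetical

def example_analyst_job() -> str:
    """Stand-in for an analyst-written transform: query the node for arbitrary
    data and return a well-formatted CSV string with a header row."""
    w3 = Web3(Web3.HTTPProvider(NODE_URL))
    block = w3.eth.get_block("latest")
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["block_number", "timestamp", "gas_used"])
    writer.writerow([block["number"], block["timestamp"], block["gasUsed"]])
    return buf.getvalue()

def handler(event, context):
    csv_text = example_analyst_job()

    # Light validation before accepting the CSV: every row matches the header width.
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or any(len(r) != len(rows[0]) for r in rows[1:]):
        raise ValueError("analyst job returned a malformed CSV")

    # Hive-style path so Athena/Druid can partition on dt and hour.
    now = dt.datetime.now(dt.timezone.utc)
    key = f"research/example_job/dt={now:%Y-%m-%d}/hour={now:%H}/part-0000.csv"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=csv_text)
```

From an analyst's point of view, a job is just a function that talks to the node and returns CSV text; the platform owns validation, pathing, and the S3 write.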
Celery
Built out a Celery/RabbitMQ system tied into the Django admin for running
scheduled tasks.
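A minimal sketch of the wiring, assuming django-celery-beat (a common way to expose schedules in the Django admin) and placeholder project, broker, and task names:

```python
# A minimal sketch of the Celery wiring, assuming RabbitMQ as the broker and
# django-celery-beat for admin-editable schedules; project and task names are
# placeholders.
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # hypothetical project

app = Celery(
    "myproject",
    broker="amqp://guest:guest@localhost:5672//",   # RabbitMQ
)
app.config_from_object("django.conf:settings", namespace="CELERY")

# With django-celery-beat installed, beat reads crontab/interval schedules from
# the database, so they can be created and edited in the Django admin.
app.conf.beat_scheduler = "django_celery_beat.schedulers:DatabaseScheduler"
app.autodiscover_tasks()

@app.task
def nightly_report():
    """Placeholder scheduled task; real tasks did the actual work."""
    print("running nightly report")
```

With the DatabaseScheduler in place, `celery -A myproject beat` pulls its schedule from the database, so non-developers can add or pause tasks from the admin without a deploy.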