From Millions to Billions

(geocod.io)

54 points | by mjwhansen 5 days ago

5 comments

  • mperham 1 minute ago
    Seems weird not to use Redis as the buffering layer + a minutely cron job; that's a lot simpler than installing Kafka + Vector.
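
    For illustration, a minimal sketch of the buffering approach mperham describes, assuming redis-py, ClickHouse's HTTP interface on localhost:8123, and a hypothetical `events` table; a production version would also need retry handling for failed flushes:

      # Hot path pushes events onto a Redis list; a minutely cron job drains
      # the list and batch-inserts into ClickHouse over its HTTP interface.
      import json
      import redis
      import requests

      r = redis.Redis()
      BUFFER_KEY = "events:buffer"

      def record_event(event: dict) -> None:
          # O(1) append; no ClickHouse round-trip on the request path.
          r.rpush(BUFFER_KEY, json.dumps(event))

      def flush_buffer() -> None:
          # MULTI/EXEC pipeline: read and clear the list atomically.
          pipe = r.pipeline()
          pipe.lrange(BUFFER_KEY, 0, -1)
          pipe.delete(BUFFER_KEY)
          rows, _ = pipe.execute()
          if not rows:
              return
          # One batched INSERT instead of thousands of tiny ones.
          resp = requests.post(
              "http://localhost:8123/",
              params={"query": "INSERT INTO events FORMAT JSONEachRow"},
              data=b"\n".join(rows),
          )
          resp.raise_for_status()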
  • nasretdinov 48 minutes ago
    BTW you could've used e.g. kittenhouse (my fork: https://github.com/YuriyNasretdinov/kittenhouse), or just a simpler buffer table with two layers and a larger aggregation period than in the example.

    Alternatively, you could've used the async insert functionality built into ClickHouse (https://clickhouse.com/docs/optimize/asynchronous-inserts). All of these solutions are operationally simpler than Kafka + Vector, although obviously it's all tradeoffs.
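
    As a rough sketch of the async-insert route: `async_insert` and `wait_for_async_insert` are real ClickHouse settings per the docs above, while the HTTP endpoint and `events` table here are assumptions:

      # Each request sends its own small INSERT; with async_insert the server
      # buffers and batches rows itself, so no Kafka/Vector/cron layer is needed.
      import json
      import requests

      def insert_event(event: dict) -> None:
          resp = requests.post(
              "http://localhost:8123/",
              params={
                  "query": "INSERT INTO events FORMAT JSONEachRow",
                  "async_insert": "1",           # buffer on the server side
                  "wait_for_async_insert": "0",  # ack before the batch is flushed
              },
              data=json.dumps(event),
          )
          resp.raise_for_status()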

    • devmor 45 minutes ago
      Frankly, a lot of simpler options came to mind while reading through this.

      But I imagine the write-up glosses over myriad future concerns and doesn't fully convey the pressure and stress of trying to solve such a high-scale problem.

      Ultimately, going with a somewhat more complex solution, one that involves additional architecture but has been tried and tested by a third party you trust, can sometimes be the more fitting end result. Assurance often weighs more than simplicity, I think.

      • nasretdinov 37 minutes ago
        While kittenhouse is, unfortunately, abandonware (even though you can still use it and it works), you can't say the same about e.g. async inserts in ClickHouse: they're a very simple and robust solution to exactly the problem that PHP (and some other languages') backends often face when trying to use ClickHouse.
  • rozenmd 1 hour ago
    Great write-up!

    I had a similar project back in August, when I realised my Postgres DB's performance was blocking me from implementing a feature users commonly ask for (querying out to 30 days of historical uptime data).

    I was already blown away by the performance (200ms to query what Postgres was doing in 500-600ms), but then I realised I hadn't put an index on the ClickHouse table. Now the query returns in 50-70ms, and that includes network time.
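
    In ClickHouse terms, the "index" is typically the MergeTree sorting key (ORDER BY), which lets a 30-day range scan skip most of the table. A minimal sketch of that idea, with hypothetical table and column names:

      # The (monitor_id, ts) sorting key means the 30-day window query below
      # reads only the matching granules instead of scanning the whole table.
      import requests

      CH = "http://localhost:8123/"

      requests.post(CH, data="""
          CREATE TABLE IF NOT EXISTS uptime_checks (
              monitor_id UInt64,
              ts         DateTime,
              ok         UInt8
          ) ENGINE = MergeTree
          ORDER BY (monitor_id, ts)
      """).raise_for_status()

      print(requests.post(CH, data="""
          SELECT toDate(ts) AS day, avg(ok) AS uptime
          FROM uptime_checks
          WHERE monitor_id = 42 AND ts >= now() - INTERVAL 30 DAY
          GROUP BY day ORDER BY day
      """).text)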

  • tlaverdure 30 minutes ago
    Thanks for sharing. I really enjoyed the breakdown, and great to see small tech companies helping each other out!
  • frenchmajesty 36 minutes ago
    Thanks for sharing! I enjoyed reading this.