Git @programming.dev canpolat @programming.dev 1 yr. ago

We Put Half a Million files in One git Repository, Here's What We Learned - Canva Engineering Blog

www.canva.dev

Using a monorepo causes a lot of performance challenges for git. Here's how we solve them at Canva.

TechNews @radiation.party irradiated @radiation.party

BOT

1 yr. ago

[HN] We Put Half a Million Files in One Git Repository, Here's What We Learned (2022)

www.canva.dev/blog/engineering/we-put-half-a-million-files-in-one-git-repository-heres-what-we-learned/

7 comments

I don't get why they have so many generated files checked in. Changing that seems like a no-brainer. If the files can be generated, then just gitignore them and call it a day.
They talk about checking in generated files, but they also talk about using Bazel as the build system.

They're holding it wrong.

Just define a BUILD target to generate the files. Don't check them in. Any other target that depends on the generated files can depend on the target that generated them rather than depending on the files directly.

My guess is that they haven't fully embraced Bazel, so there must be parts of the CI/CD that are not defined as Bazel targets that also need these files...
- They’re holding it wrong.
  
  That's a naive take. These are not random autogenerated files. These are translation files. Even in the smoothest-running build systems and CI/CD pipelines, these can and often do go wrong, because there is still an important human factor in generating translations. A regression hitting localization data means your whole system can become unusable for a whole portion of your userbase, with no good way to detect, track, or even monitor the problem in your apps.
  
  Checking these files into the build system is the only reliable way to track changes in translation and accessibility data, and pinpoint regressions.
  
  Source: I've worked for a company that had an internal translation service which, by design, required no human interaction and was only supposed to be integrated as a post-build step, and that system failed often and catastrophically. The only surefire way of tracking the mess it made was to commit those files and track changes per commit as part of pull requests.
- The creator of Bazel, Google, also checks in their generated translation files. They don't generate them on the fly. They use a caching FUSE filesystem on top of Perforce to make it efficient. Teams that use git within Google are encouraged to use many of the same tactics mentioned in this article.
  
  Google hasn't used Perforce in a loooong time.
  
  They do use plenty of FUSE filesystems though. And also plenty of home-grown non-POSIX filesystems, one of which is specifically for accessing Blaze (Bazel) generated files so that they do not need to be checked in.
  
  They even have infrastructure to see diffs of generated files during code review.
  
  (I'm not sure how translation files are handled specifically)
This honestly feels like it's presented by someone with Stockholm syndrome. What major advantage is there over having multiple, more manageable repos? From the blog, it sounds like it just brings extra challenges and more complicated onboarding.
- Versioning and version dependencies are more manageable.
  
  idk why they aren't using git clone --filter to clone only part of the repo and/or git sparse-checkout, or at least running git status . while in the subdir where you're doing your work. What's the point of running git status on the whole thing if you're only working in one dir?
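
For anyone who hasn't tried those commands, here's a rough sketch of what that workflow could look like. It builds a tiny throwaway repo so it can run anywhere; the directory names (services/editor, services/billing) are made up for illustration and have nothing to do with Canva's actual layout. It assumes a reasonably recent git (2.25 or later for sparse-checkout and clone --sparse).

```shell
#!/bin/sh
set -eu

# Hypothetical demo layout: all paths below are invented for illustration.
work=$(mktemp -d)
src="$work/monorepo"
clone="$work/clone"

# Build a tiny stand-in "monorepo" with two top-level areas.
git init -q "$src"
mkdir -p "$src/services/editor" "$src/services/billing"
echo "editor code" > "$src/services/editor/main.txt"
echo "billing code" > "$src/services/billing/main.txt"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# A local repo has to opt in to serving partial clones and on-demand objects.
git -C "$src" config uploadpack.allowFilter true
git -C "$src" config uploadpack.allowAnySHA1InWant true

# Partial clone: defer downloading file contents (blobs) until they are needed,
# and start with a sparse working tree instead of a full checkout.
git clone -q --filter=blob:none --sparse "file://$src" "$clone"

# Materialize only the directory being worked on.
git -C "$clone" sparse-checkout set services/editor

# Limit status to the current directory rather than the whole tree.
cd "$clone/services/editor"
git status --short .
```

After this runs, the clone's working tree contains services/editor but not services/billing, and the status call only reports on the current directory. Whether this is enough at half a million files is a separate question; it's just the baseline the comment above is pointing at.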
Yeah, it feels like git is being crammed into the role of some kind of deployed prod tool, and they're too afraid to move away from it.
Sounds like the kinda stupid and time consuming that Mozilla is into