Thanks for the comment and questions! Very wise.
Let me break down what harvest does:
1) Detects the OS and its package backend (e.g. dpkg vs rpm)
2) Detects what packages are installed
3) For each package, it tries to detect files in /etc/ that have been modified from the defaults shipped with the package. It doesn't walk the whole filesystem; it looks straight in /etc for these.
4) It detects running/enabled services and timers via systemd. For each of these, it looks for the unit files, any 'drop-in' files, environment variable files, etc., as well as the executable it runs, and tries to map those systemd services to the packages it learned about earlier. That way, those 'packages' (the future Ansible roles) can also be associated with 'handlers' in Ansible, to restart the services if/when the configs change.
5) Aside from known packages already learned, it optimistically tries to capture extra system configuration in /etc that is common for config management. This is stuff like crons, logrotate configs, networking settings (as you noted!), hosts files, etc.
6) It also looks for other snowflake stuff in /etc not associated with packages/services or other typical system config, and will put these into an etc_custom role
7) Likewise, it looks in /usr/local for stuff, on the assumption that this is an area that custom apps/configs might've been placed in. These go into a usr_local_custom role.
8) It captures non-system user accounts, their group memberships and their .ssh/authorized_keys
9) Takes into account anything the user set with --exclude-path or --include-path. Anything extra that is included goes into an 'extra_paths' role. The location could be anywhere, e.g. something in /opt, /srv, whatever you want.
10) Writes the state.json and captures the artifacts
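To make step 3 a bit more concrete, here is a rough sketch of the general idea of comparing on-disk files against the checksums a package shipped with. This is only an illustration, not Enroll's actual code: the `find_modified` function and the `shipped` mapping are hypothetical names, and I'm assuming MD5 checksums have already been read from the package backend.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def find_modified(shipped: dict) -> list:
    """Return paths whose on-disk checksum differs from the shipped one.

    `shipped` is a hypothetical mapping of path -> checksum, as it might be
    recorded by the package backend. Missing files and symlinks are skipped.
    """
    modified = []
    for name, expected in sorted(shipped.items()):
        path = Path(name)
        if path.is_symlink() or not path.is_file():
            continue
        if md5_of(path) != expected:
            modified.append(name)
    return modified
```

Only the files that come back from a check like this need to be captured; anything still matching the shipped default can simply be reinstalled from the package.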
So yes, you're right - it does capture stuff that many people might want to exclude if they are going to use the manifests to build other machines from that harvest (as opposed to just rebuild the same machine itself).
But you can use --exclude-path /etc/network and so on to skip the bits you don't want. You can also always comment roles out of the generated playbook.yml, or delete certain roles, once you've run 'enroll manifest'.
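As a rough illustration of what a path exclusion does, here is a minimal sketch that treats each excluded path as a directory prefix. The `is_excluded` helper is hypothetical, and plain prefix matching is my simplifying assumption here; it is not a statement of the exact matching rules behind --exclude-path.

```python
from pathlib import PurePosixPath


def is_excluded(path: str, excluded: list) -> bool:
    """Return True if `path` is one of the excluded paths or lives under one.

    Matching is done on whole path components, so excluding /etc/network
    does not accidentally exclude /etc/networks.
    """
    parts = PurePosixPath(path).parts
    for entry in excluded:
        prefix = PurePosixPath(entry).parts
        if parts[:len(prefix)] == prefix:
            return True
    return False
```

With this sketch, `is_excluded("/etc/network/interfaces", ["/etc/network"])` is true, while `/etc/networks` is left alone.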
It doesn't have any knowledge of Ansible Galaxy roles/modules etc. It generates all the roles itself. I admit, many of the existing roles out there are a lot more pleasant/easier to read (especially Jeff Geerling's). I still use those myself day to day. A lot of configs out there are also good candidates for being Jinja templates with abstracted vars for separate hosts. Enroll does use my companion tool JinjaTurtle if it's installed, but JinjaTurtle only recognises certain types of files (.ini style, .json, .xml, .yaml, .toml, but not special ones like Nginx or Apache conf files which have their own special syntax).
I consider Enroll to be a good 'quick, grab it all, so I can sleep at night' method, perhaps best for DR purposes.
In terms of safety measures: it doesn't traverse symlinks, and it has an 'IgnorePolicy' that makes it ignore most binary files (except GPG binary keys used with apt) - though if you specify certain paths with --include-path and use --dangerous, it will skip its own policy. See https://git.mig5.net/mig5/enroll/src/branch/main/enroll/igno... .
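For a feel of what that kind of policy looks like, here is a simplified sketch. It is not the real IgnorePolicy (see the link above for that): the `should_skip` and `looks_binary` names, the NUL-byte heuristic, and the choice to keep skipping symlinks even in dangerous mode are all assumptions made for illustration.

```python
from pathlib import Path


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with path.open("rb") as fh:
        return b"\0" in fh.read(sample_size)


def should_skip(path: Path, dangerous: bool = False) -> bool:
    """Decide whether a candidate file should be left out of the harvest.

    Symlinks and non-regular files are always skipped in this sketch.
    Binary-looking files are skipped unless `dangerous` is set.
    """
    if path.is_symlink() or not path.is_file():
        return True
    if dangerous:
        return False
    return looks_binary(path)
```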
It will skip files that are too large, and it also currently has a hardcoded cap on the number of files that it will harvest (4000 for /etc/ and /usr/local/etc and /usr/local/bin, and 500 per 'role'), to avoid 'bomb' situations.
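The capping idea can be sketched like this. The 4000 figure is the one mentioned above; the size limit value and the `capped` helper are hypothetical, chosen only to show how a size filter and a file-count cap combine.

```python
from typing import Iterable, Iterator, Tuple

MAX_FILES = 4000        # cap mentioned above for /etc, /usr/local/etc, /usr/local/bin
MAX_BYTES = 256 * 1024  # hypothetical per-file size limit, for illustration only


def capped(
    files: Iterable[Tuple[str, int]],
    max_files: int = MAX_FILES,
    max_bytes: int = MAX_BYTES,
) -> Iterator[str]:
    """Yield file names from (name, size) pairs, skipping oversized files
    and stopping once `max_files` names have been yielded."""
    taken = 0
    for name, size in files:
        if taken >= max_files:
            break
        if size > max_bytes:
            continue
        taken += 1
        yield name
```

Because the loop stops as soon as the cap is hit, a directory stuffed with thousands of files costs no more than the cap allows.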
I think your caution is very warranted and wise, and I highly recommend always using --check with Ansible when/if you get to applying the playbook!
Thanks again.