What you are hearing is sporadic DTMF tones (aka "Touch Tone", the sound made when dialing on a land-line phone keypad). When VoIP was a new thing, I played around with it quite a bit. I eventually ported my long-time land-line number to a Vonage number and used that for a couple of years until I dropped it and went cell only. Stray tones like this were an issue with VoIP (and I assume still are) because of the way DTMF tones are sent.
When you use an IVR service (press 1 for this, press 2 for that, etc.) on a traditional line, the tones are sent "in-band", meaning they can be heard on the line. The other end of the line receives them as part of the voice call and needs to interpret them properly. VoIP, however, sends them out-of-band as data. The voice adapter (the box that connects the phone to the IP network) has to "hear" these in-band tones and convert them to data that is sent over the connection. The receiving end of the VoIP call takes that data and regenerates the in-band tones before passing the call back to the PSTN (the "regular" telephone network) so IVR equipment can hear and use them.
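To make "sent as data" a little more concrete, here is a minimal Python sketch of one common way a key press is carried over VoIP: the 4-byte "telephone-event" payload defined in RFC 4733 (formerly RFC 2833). This is an assumption for illustration only. I don't know which mechanism any particular adapter or provider used, and the function names are made up for this example.

```python
import struct

# RFC 4733 event codes: 0-9 are the digits, then *, #, A-D.
EVENT_CODES = {str(d): d for d in range(10)}
EVENT_CODES.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})
KEY_FOR_CODE = {code: key for key, code in EVENT_CODES.items()}


def encode_dtmf_event(key, duration, volume=10, end=False):
    """Pack one key press into a 4-byte telephone-event payload.

    Layout: event (8 bits) | E bit, R bit, volume (6 bits) | duration (16 bits).
    duration is in RTP timestamp units (8000 per second for narrowband audio).
    volume is the tone level in -dBm0, so 10 means -10 dBm0.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0) | volume  # the reserved R bit stays 0
    return struct.pack("!BBH", EVENT_CODES[key], flags, duration)


def decode_dtmf_event(payload):
    """Unpack a payload into (key, end_of_event, volume, duration)."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return KEY_FOR_CODE[event], bool(flags & 0x80), flags & 0x3F, duration


if __name__ == "__main__":
    packet = encode_dtmf_event("5", duration=800, end=True)
    print(packet.hex(), decode_dtmf_event(packet))
```

The point is that the far end never receives the audio of the tone, only a small record saying "key 5, this loud, this long", and it synthesizes a fresh tone from that record.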
The problem happened when certain sounds, or frequencies of human voices, most often female voices because of the pitch, would be heard by the voice adapter as a DTMF tone. The beep you would hear during the conversation was the echo of that tone.
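Here is a rough Python sketch of how a detector can be fooled. It uses the Goertzel algorithm, a standard way to measure energy at a handful of specific frequencies, and a made-up decision rule with made-up thresholds; real adapters use more elaborate checks. The "voice-like" signal is purely synthetic (three sine waves), standing in for speech that happens to have energy near one DTMF frequency pair.

```python
import math

RATE = 8000                      # samples per second, typical for telephony
LOW = [697, 770, 852, 941]       # DTMF row frequencies (Hz)
HIGH = [1209, 1336, 1477, 1633]  # DTMF column frequencies (Hz)
KEYS = ["123A", "456B", "789C", "*0#D"]


def goertzel_power(samples, freq):
    """Squared magnitude of the signal's component at freq (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / RATE)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def mix(components, n=400):
    """Sum of sine waves given as (frequency, amplitude); n=400 is 50 ms."""
    return [sum(a * math.sin(2.0 * math.pi * f * t / RATE) for f, a in components)
            for t in range(n)]


def detect_key(samples, min_fraction=0.9):
    """Return the DTMF key if one row/column pair dominates, else None.

    min_fraction is a hypothetical strictness knob: the share of the block's
    total energy that must sit in the strongest row + column frequencies.
    """
    energy = sum(x * x for x in samples)
    if energy < 1e-9:
        return None  # silence
    scale = 2.0 / (len(samples) * energy)  # Goertzel power -> energy share
    low = [goertzel_power(samples, f) * scale for f in LOW]
    high = [goertzel_power(samples, f) * scale for f in HIGH]
    r = max(range(4), key=low.__getitem__)
    c = max(range(4), key=high.__getitem__)
    # Both tones must be present, and together they must dominate the block.
    if min(low[r], high[c]) < 0.2 or low[r] + high[c] < min_fraction:
        return None
    return KEYS[r][c]


if __name__ == "__main__":
    real_tone = mix([(770, 0.5), (1336, 0.5)])               # a clean "5"
    voice_like = mix([(770, 0.4), (1336, 0.4), (250, 0.3)])  # synthetic stand-in
    print(detect_key(real_tone))                     # strict detector, clean tone
    print(detect_key(voice_like, min_fraction=0.6))  # lax detector
    print(detect_key(voice_like, min_fraction=0.9))  # strict detector
```

With the lax setting the voice-like block is reported as a key press, while the strict setting rejects it because too much of the energy sits outside the DTMF pair. That trade-off (strict enough to ignore voices, lax enough to catch real key presses on a noisy line) is presumably where the older equipment fell short.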
This happens occasionally on (digital) cellular as well, for the same reasons. Equipment is much better at dealing with this now, and it's much less common. When I used VoIP, I'd hear 1 or 2 tones on nearly every call. I would still hear them occasionally on a cell call, but with nowhere near the frequency. I still heard them occasionally on AT&T LTE. I've yet to hear one on Project Fi, but I haven't had that many phone conversations since I joined.