
This technical paper has three components. First, my most recent advances towards solving one of the most famous, multi-century-old conjectures in number theory: one that kids in elementary school can understand, yet is incredibly difficult to prove. At its core, it is about the spectacular quantum dynamics of the digit sum function.
Then, I present an infinite dataset that has all the patterns you or AI can imagine, and much more, ranging from obvious to undetectable. More specifically, it is an infinite number of infinite datasets, all in tabular format, with various degrees of auto- and cross-correlations (short and long range) to test, enhance, and benchmark AI algorithms including LLMs. It is based on the physics of the digit sum function and linked to the aforementioned conjecture. This one-of-a-kind synthetic data is useful in contexts such as fraud detection or cybersecurity.
Finally, it comes with very efficient Python code to generate the data, involving gigantic numbers and high-precision arithmetic.
Summary
The universal dataset is an invaluable system to test, enhance, or benchmark pattern detection algorithms for fraud detection, cybersecurity, and other applications. The methodology relies on string auto-convolutions to discover deep insights about the digit sum function, offering a new perspective towards solving a famous multi-century-old conjecture: are the binary digits of e evenly distributed? In this paper, I discuss the results obtained so far, both empirical and formally proved, including several new ones. I also discuss the dataset, its relevance to modern AI as a fundamental testing system, the incredibly rich and diversified set of patterns that it boasts, as well as connections to large language models (LLMs), quantum dynamics, synthetic data, and cryptography. I also provide very efficient, fast Python code to produce the data, dealing with numbers larger than (2^n + 1)^(2^n), with n larger than 10^6.
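To make the link between these large powers and the binary digits of e concrete, here is a minimal sketch. It is not the author's code. It assumes that one "self-convolution" step can be read as squaring an integer, starting from 2^n + 1, so that step k holds (2^n + 1)^(2^k). The function names (`squared_rows`, `leading_bits`, `e_bits`, `matching_prefix`) are hypothetical, and n is kept tiny so the example runs instantly with plain Python integers.

```python
from fractions import Fraction
from math import factorial

def squared_rows(n):
    """Yield (k, x_k) with x_0 = 2**n + 1 and x_k = x_{k-1}**2, for k = 0..n.

    Assumption: squaring stands in for the "self-convolution" step.
    """
    x = (1 << n) + 1
    yield 0, x
    for k in range(1, n + 1):
        x *= x  # x_k = (2**n + 1) ** (2**k)
        yield k, x

def leading_bits(x, p):
    """First p binary digits of the positive integer x, as a string."""
    return bin(x)[2:2 + p]

def e_bits(p, terms=60):
    """First p binary digits of e (p >= 2), from the series sum of 1/k!.

    With 60 terms the truncation error is far below 2**-250,
    so this is only meant for p up to a couple of hundred bits.
    """
    e_approx = sum(Fraction(1, factorial(k)) for k in range(terms))
    # e lies in [2, 4), so it has 2 integer bits and p - 2 fractional bits.
    return bin(int(e_approx * 2 ** (p - 2)))[2:]

def matching_prefix(a, b):
    """Number of leading characters on which strings a and b agree."""
    count = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        count += 1
    return count

if __name__ == "__main__":
    n = 12
    target = e_bits(24)
    for k, x in squared_rows(n):
        if k in (0, n // 2, n):
            head = leading_bits(x, 24)
            print(k, head, matching_prefix(head, target))
```

At step n the integer equals 2^(n 2^n) times (1 + 2^-n)^(2^n), which differs from e by roughly e / 2^(n+1). The leading bits therefore agree with those of e only approximately: a borrow in the binary expansion can make the matching prefix a little shorter than n bits.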
The dataset
While the paper has a good chunk of very interesting material dedicated to the number theory conjecture in question, well beyond PhD level yet made accessible to first-year college students, here I summarize the dataset. In the end, this is what most of my readers are interested in, as it offers many practical applications.

Each table consists of a number of rows. Each row contains 2n bits of information (n can be as large as you want). The patterns and correlations, trivial at the beginning, become harder and harder to detect later in the data. At row n, the first n bits match the binary digits of e or related transcendental numbers. Beyond that row, there are no more patterns. Some features:
- Rows can be interpreted as strings, and there is also a time series component. These strings can be split into words, either short to emulate categorical features, or long for numerical features, to mimic enterprise datasets.
- The structure in the dataset allows you to test clustering algorithms: the various strings can be clustered. The colors in the featured image represent such clusters. Towards the end, the number of clusters increases dramatically, with the structure becoming more and more fuzzy.
- The dataset also allows you to test predictive algorithms, in particular predicting the next strings based on historical data (the previous strings), with application to training large language models (LLMs).
- It can be used as a generic, very versatile type of synthetic data, or to create synthetic data. The digit sum function plays the role of a response, summary, or aggregate feature.
- The iterated self-convolution used to create the data, or its inverse (the iterated square root of a string), is useful to design efficient, fast pseudo-random number generators (PRNGs) linked to pattern-free transcendental numbers with infinite period (such as e), and thus with much better randomness properties than classical congruential generators.
- The connection to dynamical systems and quantum dynamics can be exploited for simulations, modeling purposes, and agent-based modeling.
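As an illustration of the first and fourth features above, the sketch below splits a row (a bit string) into fixed-width words and attaches the binary digit sum to each word as an aggregate feature. The helper names and the word width are assumptions made for this example, not the layout used in the paper. The sample row is simply the first 18 binary digits of e.

```python
def split_words(bits, width):
    """Split a bit string into consecutive words of `width` bits.

    A trailing partial word, if any, is dropped.
    """
    usable = len(bits) - len(bits) % width
    return [bits[i:i + width] for i in range(0, usable, width)]

def digit_sum(bits):
    """Binary digit sum: the number of 1s in the bit string."""
    return bits.count("1")

def to_features(bits, width):
    """Turn a row into (word value, word digit sum) pairs.

    Short words behave like categorical codes, long words like
    numerical features; the digit sum acts as a summary column.
    """
    return [(int(word, 2), digit_sum(word)) for word in split_words(bits, width)]

if __name__ == "__main__":
    row = "101011011111100001"  # first 18 binary digits of e
    for value, ones in to_features(row, 4):
        print(value, ones)
    print("digit sum of the whole row:", digit_sum(row))
```

Choosing a small width (say 2 to 4 bits) gives a handful of distinct values per column, while a large width (say 32 bits) gives columns that look numerical.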
Test your own pattern detection algorithm on the universal dataset and see how far it can go. If it detects patterns beyond row n, these are false positives, and you need to address this issue on your side. It's also a great sandbox to benchmark pattern detection systems.
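A very simple detector to start with is a lag agreement test: count how often a bit equals the bit `lag` positions later, and compare that count with what fair, independent bits would give. The sketch below is only an illustrative baseline, not a method from the paper. The function names and the threshold of 3 standard deviations are assumptions.

```python
from math import sqrt

def lag_agreement_z(bits, lag=1):
    """z-score of the number of positions i where bits[i] == bits[i + lag].

    For fair, independent bits the count has mean pairs / 2 and
    variance pairs / 4, so the z-score is roughly standard normal.
    """
    pairs = len(bits) - lag
    if pairs <= 0:
        raise ValueError("bit string too short for this lag")
    agree = sum(bits[i] == bits[i + lag] for i in range(pairs))
    return (agree - pairs / 2) / sqrt(pairs / 4)

def flagged_lags(bits, max_lag=8, threshold=3.0):
    """Lags whose agreement count is suspiciously far from random.

    The threshold is an arbitrary choice for this sketch.
    """
    top = min(max_lag, len(bits) - 1)
    return [lag for lag in range(1, top + 1)
            if abs(lag_agreement_z(bits, lag)) > threshold]

if __name__ == "__main__":
    periodic = "01" * 50  # an obvious pattern
    print(flagged_lags(periodic, max_lag=4))
```

Any lag flagged on rows where no pattern should remain counts as a false positive for this detector. Since several lags are tested at once, an occasional flag is expected by chance, so the threshold should grow with the number of lags and rows examined.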
For customization based on your enterprise needs, help with data generation, interpretation, sample size, simulations, feature generation, and any other questions about building your own enterprise version to address your priorities, contact the author.
Python code, dataset, and technical paper
The PDF with many illustrations is available for free as paper 53, here. It also features fast Python code (with link to GitHub) to deal with gigantic numbers. The underlying theory is explained in detail, with several modern references. The blue links in the PDF are clickable once you download the document from GitHub and view it in any browser, but may not be clickable in the GitHub "view mode". I hope GitHub fixes this issue in the future!
To not miss future articles, subscribe to my AI newsletter, here.
About the Author

Vincent Granville is a pioneering GenAI scientist, co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.