Shortest String Length Matching a Regexp

Shortest String Length Matching a Regexp

Created: <2025-10-30 Thu>

Abstract

In one of my project I need to get the minimal possible regex length in order to sort them out from smaller to larger. So, I had to find a way to get such value. Less trivial then expected

Content

There are different approaches to this:

Add Length Manually 🐵
What I did till now because I couldn't figure it out and I'm lazy
Find a Library That Does It
I couldn't find one. Quite surprising!! Maybe I'm getting old and I can't find stuff online anymore. Plus, if I was a good developer/engineer, I would have started my own library to solve the problem once and for all, but as I said, I'm lazy. If you know about any library that does it out of the box, text me, send an email or whatever.
Brute Force It (probabilistic to be Fancy)
It might get the job done if you find a way to create attempts that matches the given regexp. Not ideal at all, but an improvement compared to the manual way.

You can imagine I'm not here to tell you to add them manually 😅

I recently got back to this and found this scalacheck regex lib that allows you to generate strings based on a regex, so I had a revelation: What if I could use this to solve my problem?

What I did at this point was to:

However, I couldn't shake it off my mind, I had to experiment. Try what I could do with what I have! And I'm here to show you. Sorry in advance for the bad code, it's not meant to be beautiful, but to test it's possible and who it run.

  #!/usr/bin/env scala-cli
//> using dep io.github.wolfendale::scalacheck-gen-regexp:1.1.0
//> using dep org.scalacheck::scalacheck:1.19.0

import wolfendale.scalacheck.regexp.RegexpGen
import org.scalacheck.Gen
import scala.util.matching.Regex
import org.scalacheck.rng.Seed
import java.time.Instant
import java.time.Duration


// Initial (bad) idea:
//   gonna generate 100 samples from a regex and then see if I hit the minimal

// Generate 100 sample for the regex.
def generate100(regex: Regex): Gen[List[String]] =
  Gen.listOfN(100, RegexpGen.from(regex.toString))

// Given a regex, I generate 100 samples and get the minimum length
// If I can't I explode
def minimumRegexLength(regex: Regex): Int =
  generate100(regex).sample.get.minBy(_.length).length

val inputExpected: Map[Regex, Int] = Map(
  "(posto|carico) i video".r -> 13,
  "(restiamo|teniamo) in contatto".r -> 19,
  "(attraverso i|nei) commenti".r -> 12,
  "cocod[eè]".r -> 6,
  "\\bmisc\\b".r -> 4,
  "\\bm[i]+[a]+[o]+\\b".r -> 4,
  "\\bsgrida[tr]".r -> 7,
  "\\brimprover".r -> 9,
  "(baci )?perugina".r -> 8,
  "velit[aà]".r -> 6,
  "ho perso (di nuovo )?qualcosa".r -> 17,
  "tutti (quanti )?mi criticheranno".r -> 22,
  "ti voglio (tanto )?bene".r -> 14,
  "vi voglio (tanto )*bene".r -> 14,
  "pu[oò] capitare".r -> 12,
  "dopo (che ho )?mangiato".r -> 13,
  "\\bops\\b".r -> 3,
  "mi sto sentendo (bene|in compagnia)".r -> 20,
  "(25|venticinque) (milioni|mila euro)".r -> 10
)

inputExpected.forall { case (input, expected) =>
  val result = minimumRegexLength(input)
  if (result != expected) {
    println(
      s"[RegexMinimumLength] Failed test for $input, expected $expected, got $result"
    )
    false
  } else true
}

// FAILED as expected. Output
// Got [RegexMinimumLength] Failed test for \bm[i]+[a]+[o]+\b, expected 4, got 46

// Gonna ask AI what are his suggestions.
// Together we come out with an idea:

// - Start small in size
// - Try to generate the output
// - If fail, increase the size and retry
// - Exit when you got something

def canGenerateSize(
  regex: Regex,
  targetSize: Int,
  trials: Int = 100
): Boolean = {
  val params = Gen.Parameters.default.withSize(targetSize)
  (1 to trials).exists { _ =>
    RegexpGen.from(regex.toString).apply(params, Seed.random())
      .exists(v => if (v.length == targetSize) {
        //println(s"[RegexMinimumLength] find a match for $regex: $v")
        true
      } else false
      )
  }
}

def minimalSize[T](
  regex: Regex,
  maxProbe: Int = 50
): Option[Int] = {
  val start: Instant = Instant.now()
  val result = (0 to maxProbe).find { size =>
    canGenerateSize(regex, size)
  }
  println(s"[RegexMinimumLength] generation took: ${Duration.between(start, Instant.now).toMillis}")
  result
}

inputExpected.forall { case (input, expected) =>
  val result = minimalSize(input)
  if (result != Some(expected)) {
    println(
      s"[RegexMinimumLength] Failed test for $input, expected $expected, got $result"
    )
    false
  } else true
}

// THIS WORKS!!! And generally returns pretty fast ~ 25 milliseconds
// It's probabilistic tho so it might be wrong, especially if the number
// of attempt are low or the regexp is particularly complex

Conclusions

I hope this could be useful to someone out there. At least this should be a better start then when I did approach the problem the first time. (and decided to give up)

EDIT: The solution works fine for simple regexp. Long/complex regexp will take more time. So for those you might be better off counting the length manually till someone delivers a lib that exposes this functionality. I learned this by seeing a degrade in my project performance 😅

Share Buttons