Shortest String Length Matching a Regexp
Shortest String Length Matching a Regexp
Created:
Abstract
In one of my project I need to get the minimal possible regex length in order to sort them out from smaller to larger. So, I had to find a way to get such value. Less trivial then expected
Content
There are different approaches to this:
- Add Length Manually 🐵
- What I did till now because I couldn't figure it out and I'm lazy
- Find a Library That Does It
- I couldn't find one. Quite surprising!! Maybe I'm getting old and I can't find stuff online anymore. Plus, if I was a good developer/engineer, I would have started my own library to solve the problem once and for all, but as I said, I'm lazy. If you know about any library that does it out of the box, text me, send an email or whatever.
- Brute Force It (probabilistic to be Fancy)
- It might get the job done if you find a way to create attempts that matches the given regexp. Not ideal at all, but an improvement compared to the manual way.
You can imagine I'm not here to tell you to add them manually 😅
I recently got back to this and found this scalacheck regex lib that allows you to generate strings based on a regex, so I had a revelation: What if I could use this to solve my problem?
What I did at this point was to:
- Add an Issue to my project, so I remember that this problem needs to be solved or improved.
- Add an issue to the scalacheck regex library, to ask if they know how to do it based on their internal.
However, I couldn't shake it off my mind, I had to experiment. Try what I could do with what I have! And I'm here to show you. Sorry in advance for the bad code, it's not meant to be beautiful, but to test it's possible and who it run.
#!/usr/bin/env scala-cli //> using dep io.github.wolfendale::scalacheck-gen-regexp:1.1.0 //> using dep org.scalacheck::scalacheck:1.19.0 import wolfendale.scalacheck.regexp.RegexpGen import org.scalacheck.Gen import scala.util.matching.Regex import org.scalacheck.rng.Seed import java.time.Instant import java.time.Duration // Initial (bad) idea: // gonna generate 100 samples from a regex and then see if I hit the minimal // Generate 100 sample for the regex. def generate100(regex: Regex): Gen[List[String]] = Gen.listOfN(100, RegexpGen.from(regex.toString)) // Given a regex, I generate 100 samples and get the minimum length // If I can't I explode def minimumRegexLength(regex: Regex): Int = generate100(regex).sample.get.minBy(_.length).length val inputExpected: Map[Regex, Int] = Map( "(posto|carico) i video".r -> 13, "(restiamo|teniamo) in contatto".r -> 19, "(attraverso i|nei) commenti".r -> 12, "cocod[eè]".r -> 6, "\\bmisc\\b".r -> 4, "\\bm[i]+[a]+[o]+\\b".r -> 4, "\\bsgrida[tr]".r -> 7, "\\brimprover".r -> 9, "(baci )?perugina".r -> 8, "velit[aà]".r -> 6, "ho perso (di nuovo )?qualcosa".r -> 17, "tutti (quanti )?mi criticheranno".r -> 22, "ti voglio (tanto )?bene".r -> 14, "vi voglio (tanto )*bene".r -> 14, "pu[oò] capitare".r -> 12, "dopo (che ho )?mangiato".r -> 13, "\\bops\\b".r -> 3, "mi sto sentendo (bene|in compagnia)".r -> 20, "(25|venticinque) (milioni|mila euro)".r -> 10 ) inputExpected.forall { case (input, expected) => val result = minimumRegexLength(input) if (result != expected) { println( s"[RegexMinimumLength] Failed test for $input, expected $expected, got $result" ) false } else true } // FAILED as expected. Output // Got [RegexMinimumLength] Failed test for \bm[i]+[a]+[o]+\b, expected 4, got 46 // Gonna ask AI what are his suggestions. // Together we come out with an idea: // - Start small in size // - Try to generate the output // - If fail, increase the size and retry // - Exit when you got something def canGenerateSize( regex: Regex, targetSize: Int, trials: Int = 100 ): Boolean = { val params = Gen.Parameters.default.withSize(targetSize) (1 to trials).exists { _ => RegexpGen.from(regex.toString).apply(params, Seed.random()) .exists(v => if (v.length == targetSize) { //println(s"[RegexMinimumLength] find a match for $regex: $v") true } else false ) } } def minimalSize[T]( regex: Regex, maxProbe: Int = 50 ): Option[Int] = { val start: Instant = Instant.now() val result = (0 to maxProbe).find { size => canGenerateSize(regex, size) } println(s"[RegexMinimumLength] generation took: ${Duration.between(start, Instant.now).toMillis}") result } inputExpected.forall { case (input, expected) => val result = minimalSize(input) if (result != Some(expected)) { println( s"[RegexMinimumLength] Failed test for $input, expected $expected, got $result" ) false } else true } // THIS WORKS!!! And generally returns pretty fast ~ 25 milliseconds // It's probabilistic tho so it might be wrong, especially if the number // of attempt are low or the regexp is particularly complex
Conclusions
I hope this could be useful to someone out there. At least this should be a better start then when I did approach the problem the first time. (and decided to give up)
EDIT: The solution works fine for simple regexp. Long/complex regexp will take more time. So for those you might be better off counting the length manually till someone delivers a lib that exposes this functionality. I learned this by seeing a degrade in my project performance 😅