Search-safe number compression algorithm (feat. Python)

Here is something that came up.

A particular value parsed from certain data had to be stored in Elasticsearch and retrieved through search, but that value was a number more than 300 characters long.
So I tried to implement the search with fuzzy or more_like_this queries, but the results were poor because the value was too long. Since the value would only be used for search and not rewritten anyway, I thought it would be much better to compress it before storing it.

I simply left 0 to 9 as they are and took an ASCII-based approach, compressing the numbers from 10 upward using letters or other characters. People who have used search engines such as Google may know that search engines treat certain characters specially (such as ", space, ', etc.), so I created a search-safe compression algorithm that avoids those special characters.

import random


# Search-safe characters; each one stands for a two-digit number from 10 upward.
chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'
decoder = {}
encoder = {}
number = 10
for char in chars:
    str_number = str(number)
    decoder[char] = str_number
    encoder[str_number] = char
    number += 1


def create_random_number_string(length: int):
    return ''.join(str(random.randrange(0, 10)) for _ in range(length))


def compress(number_string: str):
    # Replace '10', '11', ... in order; leftover single digits stay as they are.
    for number in encoder:
        number_string = number_string.replace(number, encoder[number])
    return number_string


def extract(compressed: str):
    # Expand every character back into its two-digit number.
    for char in decoder:
        compressed = compressed.replace(char, decoder[char])
    return compressed


for _ in range(5):
    random_length = random.randrange(150, 200)
    test_number_string = create_random_number_string(random_length)
    compressed = compress(test_number_string)
    extracted = extract(compressed)
    # Fraction of the original length saved by compression.
    compressed_rate = ((len(test_number_string) - len(compressed)) / len(test_number_string))
    print(len(test_number_string), ':', test_number_string)
    print(len(compressed), ':', compressed)
    print(len(extracted), ':', extracted)
    print('%0.2f%s' % (compressed_rate * 100, '%'))
    print("=======================================")
    print()


Output for five test cases:

173 : 38217891529214746234295363014799197479060934296524329944079434354710188378778646041457672365438358060229029172126710046843892063571710775285494291604692579406512763592800742
118 : C2h89ft2e7Kn4t5Aue799j7L90Y9yt65o3t94E794yzLai8B87786K04eV67n6SCz80Ym90th2c67a0K84C9k6z7ha775s5N4tg0K9p79E65c76z9s007G
173 : 38217891529214746234295363014799197479060934296524329944079434354710188378778646041457672365438358060229029172126710046843892063571710775285494291604692579406512763592800742
31.79%
=======================================

175 : 7639716822342277816032528043792097211529552642763043314807290564221988329254460181232184798255115898904013545953569374090168012001578589355676046306782315697282643001276497663
119 : 76D7g8mym778g03ps04B9k972b5tTq4r6u4xe807t0U4mj883tpIYicwiL98p5bW9890Ed5J95z69BE90g80c00f78W9zU7YKu678nf697sq4u0c76N7663
175 : 7639716822342277816032528043792097211529552642763043314807290564221988329254460181232184798255115898904013545953569374090168012001578589355676046306782315697282643001276497663
32.00%
=======================================

169 : 7493139163694318425430833685052034163810084088304567519918460284909110852590873676525955716306253186813604736222677625163963843917395318561832048231972429826794656963474
110 : 7N3d9gA9Hi4p4u8x68O5kygCa08E88uJ675j9iK0sN091a85p9087A765p9T7gu6p3i68dYLAmq776pgD6C4DhDRiUi3kMnj7ot8q79KU96y74
169 : 7493139163694318425430833685052034163810084088304567519918460284909110852590873676525955716306253186813604736222677625163963843917395318561832048231972429826794656963474
34.91%
=======================================

164 : 12026339925028632674921994888837232115038090276115065754862563107795134822572088600680031246095798421381627915938027997533115145310302719886551641616281934706968564
111 : c0qx99p0s63q7N2j9M888Bn2bOC090r6bO6V5M6p63a7795dMmVk88Y068003cK09V98Gd8gr9f9C0r9975xb5eRaurj886Tg4ggsjy706968U4
164 : 12026339925028632674921994888837232115038090276115065754862563107795134822572088600680031246095798421381627915938027997533115145310302719886551641616281934706968564
32.32%
=======================================

176 : 73946801840514332235134053456152477400743521396323428813867397457955677468949406618792480382695298743841357684557186163646921605264305170043119348188501288987043495676184196146
121 : 7DK80iE5exmzdE5yUfo77E074z2d963n4s8d867D7J79T677K89NE66i79o80Cq95t874C4dV68JVi6gAK92g05q4u5h00Hb9y8i8Oc8898704y9U76i4j6e6
176 : 73946801840514332235134053456152477400743521396323428813867397457955677468949406618792480382695298743841357684557186163646921605264305170043119348188501288987043495676184196146
31.25%
=======================================

You can see a compression rate of slightly more than 30% on average.

Across 10,000 test cases the average compression rate was 32.35 percent, so it seems to be reasonably useful. When used on a particular platform, it is easy to increase the compression rate by adding that platform's search-safe characters to chars.
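To get a feel for how much an extended alphabet helps, here is a minimal sketch that measures the average rate for two alphabets on the same random inputs. The helper names (`build_encoder`, `average_rate`), the `UNSAFE` set, and the extra characters `@#` are assumptions made for illustration; whether `@` or `#` is actually search-safe depends on your platform.

```python
import random

BASE_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'
# Assumption: characters rejected here. Digits would collide with the literal
# digits 0-9; the quote and space characters are the examples named in the post.
UNSAFE = set('0123456789 "\'')


def build_encoder(chars: str) -> dict:
    """Map '10', '11', ... to successive characters of `chars`."""
    bad = sorted(set(chars) & UNSAFE)
    if bad:
        raise ValueError(f'unsafe characters in alphabet: {bad}')
    if len(set(chars)) != len(chars):
        raise ValueError('alphabet contains duplicate characters')
    if len(chars) > 90:
        raise ValueError('only 90 two-digit numbers (10..99) are available')
    return {str(10 + i): ch for i, ch in enumerate(chars)}


def compress(number_string: str, encoder: dict) -> str:
    # Same ordered replacement as in the post: '10' first, then '11', ...
    for number, ch in encoder.items():
        number_string = number_string.replace(number, ch)
    return number_string


def average_rate(chars: str, trials: int = 1000, seed: int = 0) -> float:
    """Average fraction of length saved over `trials` random digit strings."""
    rng = random.Random(seed)  # fixed seed so both alphabets see the same inputs
    encoder = build_encoder(chars)
    total = 0.0
    for _ in range(trials):
        length = rng.randrange(150, 200)
        digits = ''.join(str(rng.randrange(0, 10)) for _ in range(length))
        total += (len(digits) - len(compress(digits, encoder))) / len(digits)
    return total / trials


print('base     : %0.2f%%' % (average_rate(BASE_CHARS) * 100))
print('extended : %0.2f%%' % (average_rate(BASE_CHARS + '@#') * 100))
```

Because the extra characters are appended to the end of the alphabet, the extended encoder does everything the base one does and then a few more replacements, so on identical inputs it never produces a longer result.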

Of course, let's not make the mistake of adding multi-byte characters such as Korean ones, which take more than one byte each and can cancel out the saving.
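A quick way to see why is to compare byte counts rather than character counts. The sketch below assumes the value is stored as UTF-8; `utf8_saving` is a hypothetical helper name, not part of the code above.

```python
def utf8_saving(ch: str) -> int:
    """Bytes saved (negative means bytes lost) when one character
    replaces a two-digit pair, assuming UTF-8 storage."""
    pair_bytes = len('10'.encode('utf-8'))  # two ASCII digits are 2 bytes
    return pair_bytes - len(ch.encode('utf-8'))


for ch in ('G', '_', '가'):
    print(repr(ch), 'saves', utf8_saving(ch), 'byte(s) per replaced pair')
```

An ASCII character saves one byte per replaced pair, while a Hangul syllable such as 가 occupies three bytes in UTF-8, so the string gets shorter in characters but longer in bytes.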
