Be Wary Of - A simple detector that flags anything in English text that you should be wary of.

It can tell you which words are inappropriate, are spelled wrong, or might not exist.

In the input field you can put multiple words, sentences, or paragraphs. It will automatically clean the data.

It's the fastest and most advanced detector (that I know of) on Scratch. Words that are already cached are skipped, so the slow checks run on fewer words than the query contains.

If you want to test the speed, go to: https://randomwordgenerator.com/paragraph.php
You can use it to generate around 50 paragraphs to test with.

When you send a valid query to the project, it will display a bunch of debug variables:
Words Checked - number of words checked so far
% of Words Checked - percentage of the total words checked so far
Total Words - number of words in your query
Time Per Word - the time it takes to process one word (this varies). If it says 0, that word was found in the transposition table and took no measurable time.
Time Left - the project's real-time estimate of how many seconds are left in the entire calculation.
Time/Prediction - Time Left is an average of earlier calculations. This slider controls how long the project keeps gathering data before updating the Time Left display. A lower value gives a more accurate but more erratic estimate. Default - 1.5

How it works - We first split the input into words. These words are then hashed (a hash is impractical to "decode" without a database) and checked against our list of bad words. This list is stored as SHA256 hashes so that curious people can't see the words by accident or on purpose. Then the word is sent to Google Translate to be translated into English, and we store the result in a variable. You would assume that since the text is already in English, it would come back unchanged. Instead, Google returns nothing if it's a bad word. We can use this in combination with the SHA256 check to get more accurate results.
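The hashed bad word check described above can be sketched in Python. This is only an illustration, not the project's actual Scratch code: the names sha256_hex, BLOCKED_HASHES, is_blocked and looks_bad are made up for this sketch, the two mild placeholder words stand in for the real list, and the translation is passed in as a plain string instead of being fetched from Google.

```python
import hashlib
import string
from typing import List


def sha256_hex(word: str) -> str:
    """Hash a cleaned-up word (punctuation stripped, lowercased) with SHA-256."""
    cleaned = word.strip(string.punctuation).lower()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


# Hypothetical blocklist. A real project would paste in precomputed digests
# only; the placeholder words appear here just to build the demo set.
BLOCKED_HASHES = {sha256_hex(w) for w in ("darn", "heck")}


def is_blocked(word: str) -> bool:
    """True if the word's hash appears in the hashed blocklist."""
    return sha256_hex(word) in BLOCKED_HASHES


def looks_bad(word: str, translation: str) -> bool:
    """Combine both signals: a hash match, or an empty translation result.

    `translation` is whatever the translate step returned for this word;
    it is supplied by the caller, so no network access happens here.
    """
    return is_blocked(word) or translation == ""


def flag_blocked(text: str) -> List[str]:
    """Return the words from the input whose hashes are on the blocklist."""
    return [w for w in text.split() if is_blocked(w)]
```

Because only digests are compared, the checker never needs the readable list at run time, which matches the reason given for hashing the list.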
Then, if it isn't a bad word, we take the stored Google Translate result and check it. If Google doesn't know the word, it returns a word different from the one we sent it. We then check whether the word is in our word list (check the notes and credits section for more information); if it isn't, we can be reasonably sure that it doesn't exist. Due to Scratch limitations, many super common words return an error. We use a transposition table and a common word list to get around this (more information on this later). If a word passes all of these tests (SHA256 + Translate bad word check + Translate spelling check + word lookup), then we add it to the transposition table.

(It's important to note that this isn't a transposition table by definition, but it is so similar that that is what I call it.) A transposition table (IN THIS CASE) is a cache of words that have previously passed the calculations. A word that has passed all of the tests mentioned above is automatically added to this cache. Before we run the calculations on the next word, we look it up in the transposition table. If it is there, we know that the word will pass again, and we skip the calculations. If not, we do the calculations like normal, and should the word pass, we add it to the table. This saves the time spent contacting Google and computing SHA256 hashes. If you notice the time per word suddenly jump to 0, that means the word was found via transposition lookup and took so little time that we couldn't measure it.

Transposition tables have major cons. Every word still has to be fully checked once before it can be added to the table, which takes time, and if only a couple of words are put in, we can't build a table big enough to save time. Also, remember the problem from earlier: because of Scratch limitations, some super common words return an error. All of these problems can be mostly solved by starting off with a pre-generated table. Words that don't fail the tests are still added, but we start with more words.
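The cache described above (including the pre-generated starting table) can be sketched like this. It is a minimal Python illustration under stated assumptions, not the Scratch implementation: make_checker, slow_check and seed_words are hypothetical names, and slow_check stands in for the full chain of tests (SHA256 + Translate checks + word lookup).

```python
from typing import Callable, Iterable


def make_checker(slow_check: Callable[[str], bool],
                 seed_words: Iterable[str] = ()) -> Callable[[str], bool]:
    """Wrap an expensive word check with a cache of words that passed.

    seed_words plays the role of the pre-generated table of common words.
    """
    passed_words = {w.lower() for w in seed_words}

    def check(word: str) -> bool:
        key = word.lower()
        if key in passed_words:
            # Cache hit: the word passed before, so skip the slow tests.
            return True
        passed = slow_check(key)
        if passed:
            # Only words that pass are remembered; failures are re-checked.
            passed_words.add(key)
        return passed

    return check


# Usage with a stand-in for the slow tests (hypothetical rule: any
# alphabetic word passes). The seeded word "the" never reaches slow_check.
check = make_checker(lambda w: w.isalpha(), seed_words=["the", "and"])
check("the")
check("detector")
```

Keeping the cache as a set makes each lookup roughly constant time, which is why a cached word shows a time per word of 0.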
Starting with a pre-generated table allows some super common words to get greenlit without ever having to be searched for in the word list. It also allows small queries to be processed faster, since we start off with a filled transposition table. The words we start off with are about 2,000 of the most common words in English. I pasted this list into the input and removed the bad words. So while increasing accuracy, we were able to reduce the time per word (and thus the overall time of the calculations).

The last thing we do behind the scenes is detect numbers (in number form, such as 22) and alert the user. These numbers don't exist in our databases and would return a long error. To detect them, we add 0 to the word. If it's a number, adding 0 does not change the value, and we can detect that. If it isn't a number, the result is 0, which is not equal to the text we put in.
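The "add 0" trick relies on Scratch treating non-numeric text as 0 in arithmetic. Python does not do this on its own, so the sketch below imitates that behavior with small helper functions; to_number, add_zero, loosely_equal and is_number_token are all names invented for this illustration, and the comparison rule is a simplified assumption about how a loose comparison behaves.

```python
import math
from typing import Optional


def to_number(text: str) -> Optional[float]:
    """Parse text as a finite number, or return None if it is not one."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def add_zero(text: str) -> float:
    """Imitate a loosely typed 'text + 0': non-numeric text counts as 0."""
    parsed = to_number(text)
    return (parsed if parsed is not None else 0.0) + 0


def loosely_equal(number: float, text: str) -> bool:
    """Compare numerically when the text is numeric, otherwise as strings."""
    parsed = to_number(text)
    if parsed is not None:
        return parsed == number
    return str(number) == text


def is_number_token(text: str) -> bool:
    """A token is a number if adding 0 leaves it equal to the original text."""
    return loosely_equal(add_zero(text), text)
```

With these helpers, a token such as 22 survives the round trip and is reported as a number, while an ordinary word turns into 0 and no longer matches its own text.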
Resources used:
SHA256 encoder - https://scratch.mit.edu/projects/611788242/
2000 most common words list - https://www.talkenglish.com/vocabulary/top-2000-vocabulary.aspx
Word list - top 200,000 words from English Wikipedia - https://github.com/IlyaSemenov/wikipedia-word-frequency
Bad word list - from better profanity (obviously not going to link that)