satyakimitra committed on
Commit
d49310b
·
1 Parent(s): 7146b67

BLOGPOST.md added

Files changed (1)
  1. docs/BLOGPOST.md +22 -21
docs/BLOGPOST.md CHANGED
@@ -44,8 +44,8 @@ Rather than betting everything on one metric, we designed a system that analyzes
44
 
45
  **The mathematics**: Perplexity is calculated as the exponential of the average negative log-probability of each word given its context:
46
 
47
- ```
48
- Perplexity = exp(-1/N × Σ log P(wᵢ | context))
49
  ```
50
 
51
  where N is the number of tokens, and P(wᵢ | context) is the probability the model assigns to word i given the preceding words.
@@ -60,8 +60,8 @@ where N is the number of tokens, and P(wᵢ | context) is the probability the mo
60
 
61
  **The mathematics**: We use Shannon entropy across the token distribution:
62
 
63
- ```
64
- H(X) = -Σ p(xᵢ) × log₂ p(xᵢ)
65
  ```
66
 
67
  where p(xᵢ) is the probability of token i appearing in the text.
@@ -77,16 +77,22 @@ where p(xᵢ) is the probability of token i appearing in the text.
77
  **The mathematics**: We calculate two complementary metrics:
78
 
79
  **Burstiness** measures the relationship between variability and central tendency:
 
 
80
  ```
81
- Burstiness = (σ - μ) / (σ + μ)
82
- ```
 
83
 
84
  **Uniformity** captures how consistent sentence lengths are:
 
 
85
  ```
86
- Uniformity = 1 - (σ / μ)
87
- ```
88
 
89
- where μ is mean sentence length and σ is standard deviation.
 
 
 
90
 
91
  **Why it matters**: Human writing exhibits natural "burstiness"—some short, punchy sentences followed by longer, complex ones. This creates rhythm and emphasis. AI writing tends toward consistent medium-length sentences, creating an almost metronome-like uniformity.
92
 
@@ -98,8 +104,8 @@ where μ is mean sentence length and σ is standard deviation.
98
 
99
  **The mathematics**: Using sentence embeddings, we calculate cosine similarity between adjacent sentences:
100
 
101
- ```
102
- Coherence = 1/n × Σ cos(eᵢ, eᵢ₊₁)
103
  ```
104
 
105
  where eᵢ represents the embedding vector for sentence i.
@@ -124,8 +130,8 @@ where eᵢ represents the embedding vector for sentence i.
124
 
125
  **The mathematics**: We generate multiple perturbed versions and measure deviation:
126
 
127
- ```
128
- Stability = 1/n × Σ |log P(x) - log P(x_perturbed_j)|
129
  ```
130
 
131
  **The insight**: This metric is based on cutting-edge research (DetectGPT). AI-generated text exhibits characteristic "curvature" in probability space. Because it originated from a model's probability distribution, small changes cause predictable shifts in likelihood. Human text behaves differently—it wasn't generated from this distribution, so perturbations show different patterns.
@@ -294,17 +300,17 @@ For production deployments, we pre-bake models into Docker images to avoid cold-
294
 
295
  While the technology is fascinating, a system is only valuable if it solves real problems for real users. The market validation is compelling:
296
 
297
- **Education sector** ($12B market):
298
  - Universities need academic integrity tools that are defensible in appeals
299
  - False accusations destroy student trust—accuracy matters more than speed
300
  - Need for integration with learning management systems (Canvas, Blackboard, Moodle)
301
 
302
- **Hiring platforms** ($5B market):
303
  - Resume screening at scale requires automated first-pass filtering
304
  - Cover letter authenticity affects candidate quality downstream
305
  - Integration with applicant tracking systems (Greenhouse, Lever, Workday)
306
 
307
- **Content publishing** ($3B market):
308
  - Publishers drowning in AI-generated submissions
309
  - SEO platforms fighting content farms
310
  - Media credibility depends on content authenticity
@@ -390,8 +396,3 @@ As AI writing tools become ubiquitous, the question isn't "Can we detect them?"
390
  **Version 1.0.0 | October 2025**
391
 
392
  ---
393
-
394
- ## Author:
395
- Satyaki Mitra — Data Scientist
396
-
397
- ---
 
44
 
45
  **The mathematics**: Perplexity is calculated as the exponential of the average negative log-probability of each word given its context:
46
 
47
+ ```math
48
+ \text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid \text{context})\right)
49
  ```
50
 
51
  where N is the number of tokens, and P(wᵢ | context) is the probability the model assigns to word i given the preceding words.
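As an illustration of the perplexity formula in this hunk, here is a minimal sketch that assumes the per-token natural-log probabilities have already been obtained from some language model. The function name `perplexity` and the example probabilities are hypothetical and not taken from the repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities log P(w_i | context)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-probability over the N tokens, then exponentiate.
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical input: a model that assigns probability 0.25 to every token.
uniform_log_probs = [math.log(0.25)] * 4
print(perplexity(uniform_log_probs))
```

With every token assigned probability 0.25, the result is approximately 4: the model is as uncertain as a uniform choice among four options.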
 
60
 
61
  **The mathematics**: We use Shannon entropy across the token distribution:
62
 
63
+ ```math
64
+ H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)
65
  ```
66
 
67
  where p(xᵢ) is the probability of token i appearing in the text.
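The entropy formula can be sketched as follows, assuming p(x_i) is estimated from raw token counts in the text itself. That estimate, the function name `shannon_entropy`, and the pre-tokenized input are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy in bits of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # p(x_i) is the relative frequency of each distinct token.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical token list: two tokens, each appearing half of the time.
print(shannon_entropy(["a", "b", "a", "b"]))
```

Two equally frequent tokens give exactly one bit of entropy, while a text that repeats a single token gives zero.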
 
77
  **The mathematics**: We calculate two complementary metrics:
78
 
79
  **Burstiness** measures the relationship between variability and central tendency:
80
+ ```math
81
+ \text{Burstiness} = \frac{\sigma - \mu}{\sigma + \mu}
82
  ```
83
+ where:
84
+ - μ = mean sentence length
85
+ - σ = standard deviation of sentence length
86
 
87
  **Uniformity** captures how consistent sentence lengths are:
88
+ ```math
89
+ \text{Uniformity} = 1 - \frac{\sigma}{\mu}
90
  ```
 
 
91
 
92
+ where:
93
+ - μ = mean sentence length
94
+ - σ = standard deviation of sentence length
95
+
96
 
97
  **Why it matters**: Human writing exhibits natural "burstiness"—some short, punchy sentences followed by longer, complex ones. This creates rhythm and emphasis. AI writing tends toward consistent medium-length sentences, creating an almost metronome-like uniformity.
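Both sentence-length metrics can be sketched in a few lines. This example assumes sentences are split on `.`, `!` and `?`, lengths are counted in whitespace-separated words, and σ is the population standard deviation. None of these choices is stated in the post, and the helper names are hypothetical.

```python
import re
import statistics


def sentence_lengths(text):
    """Word counts per sentence, using a naive punctuation-based split."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness_and_uniformity(lengths):
    """Return (burstiness, uniformity) for a non-empty list of lengths."""
    mu = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)  # population std dev (assumption)
    burstiness = (sigma - mu) / (sigma + mu)
    uniformity = 1 - sigma / mu
    return burstiness, uniformity


lengths = sentence_lengths("Short one. This is a much longer sentence here!")
print(lengths, burstiness_and_uniformity(lengths))
```

Perfectly even sentence lengths give σ = 0, so burstiness falls to -1 and uniformity rises to 1. More varied lengths raise burstiness and lower uniformity.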
98
 
 
104
 
105
  **The mathematics**: Using sentence embeddings, we calculate cosine similarity between adjacent sentences:
106
 
107
+ ```math
108
+ \text{Coherence} = \frac{1}{n-1} \sum_{i=1}^{n-1} \cos(e_i, e_{i+1})
109
  ```
110
 
111
  where eᵢ represents the embedding vector for sentence i.
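A dependency-free sketch of the adjacent-sentence similarity is shown here. The embeddings are hypothetical toy vectors; a real pipeline would obtain them from a sentence-embedding model. The sketch divides by the number of adjacent pairs, which is an assumption about the intended normalization.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coherence(embeddings):
    """Mean cosine similarity over adjacent sentence embeddings."""
    pairs = list(zip(embeddings, embeddings[1:]))
    if not pairs:
        raise ValueError("need at least two sentence embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 2-D embeddings: the first two sentences agree, the third is orthogonal.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(coherence(toy_embeddings))
```

Here the two adjacent pairs score 1 and 0, so the mean is 0.5.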
 
130
 
131
  **The mathematics**: We generate multiple perturbed versions and measure deviation:
132
 
133
+ ```math
134
+ \text{Stability} = \frac{1}{n} \sum_{j=1}^{n} \left| \log P(x) - \log P(x_{\text{perturbed},j}) \right|
135
  ```
136
 
137
  **The insight**: This metric is based on cutting-edge research (DetectGPT). AI-generated text exhibits characteristic "curvature" in probability space. Because it originated from a model's probability distribution, small changes cause predictable shifts in likelihood. Human text behaves differently—it wasn't generated from this distribution, so perturbations show different patterns.
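The stability formula in this hunk can be sketched as follows, assuming a scoring function that returns log P(x) for a text and a list of perturbed variants produced elsewhere. Both `log_prob` and `toy_log_prob` are hypothetical stand-ins for a real language-model scorer and are not part of the repository.

```python
def stability(log_prob, text, perturbed_texts):
    """Mean absolute change in log-probability across perturbed variants."""
    if not perturbed_texts:
        raise ValueError("need at least one perturbed text")
    base = log_prob(text)
    deviations = [abs(base - log_prob(p)) for p in perturbed_texts]
    return sum(deviations) / len(deviations)


def toy_log_prob(text):
    """Hypothetical scorer: each word costs 0.5 nats (not a real model)."""
    return -0.5 * len(text.split())


print(stability(toy_log_prob, "a b c d", ["a b c", "a b c d e"]))
```

Passing the scorer in as a callable keeps the metric independent of any particular model, so the same function works with whatever scoring backend is available.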
 
300
 
301
  While the technology is fascinating, a system is only valuable if it solves real problems for real users. The market validation is compelling:
302
 
303
+ **Education sector**:
304
  - Universities need academic integrity tools that are defensible in appeals
305
  - False accusations destroy student trust—accuracy matters more than speed
306
  - Need for integration with learning management systems (Canvas, Blackboard, Moodle)
307
 
308
+ **Hiring platforms**:
309
  - Resume screening at scale requires automated first-pass filtering
310
  - Cover letter authenticity affects candidate quality downstream
311
  - Integration with applicant tracking systems (Greenhouse, Lever, Workday)
312
 
313
+ **Content publishing**:
314
  - Publishers drowning in AI-generated submissions
315
  - SEO platforms fighting content farms
316
  - Media credibility depends on content authenticity
 
396
  **Version 1.0.0 | October 2025**
397
 
398
  ---