The invention relates, in an embodiment, to a computer-implemented method
for automatic charset detection, which includes detecting an encoding
scheme of a target document. The method includes training, using a
plurality of text document samples, to obtain a set of machine learning
models. Training includes using a SIM (Similarity Algorithm) to generate
the set of machine learning models from feature vectors obtained from the
plurality of text document samples. The method also includes applying the
set of machine learning models against a set of target document feature
vectors converted from the target document to detect the encoding scheme.
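
The train-then-apply flow described above can be illustrated with a minimal sketch. This passage does not specify the internals of SIM or the features used, so the sketch makes its own assumptions: byte-bigram counts serve as feature vectors, each "model" is the summed vector for one encoding, and cosine similarity stands in for the similarity measure. The function names (feature_vector, train, detect) and the sample texts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def feature_vector(data: bytes) -> Counter:
    """Convert a byte string to byte-bigram counts (assumed feature choice)."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict) -> dict:
    """Build one model per encoding by summing its samples' feature vectors.

    samples maps an encoding name to a list of byte strings in that encoding.
    """
    models = {}
    for encoding, documents in samples.items():
        total = Counter()
        for document in documents:
            total.update(feature_vector(document))
        models[encoding] = total
    return models


def detect(models: dict, target: bytes) -> str:
    """Return the encoding whose model is most similar to the target."""
    target_vector = feature_vector(target)
    return max(models, key=lambda enc: cosine(models[enc], target_vector))


# Hypothetical training data: the same sample texts in three encodings.
texts = ["こんにちは世界", "日本語のテキスト", "文字コードの検出"]
samples = {
    enc: [text.encode(enc) for text in texts]
    for enc in ("utf-8", "shift_jis", "euc_jp")
}
models = train(samples)
print(detect(models, "日本語の文字".encode("shift_jis")))
```

A practical system would train on far more samples per encoding and would likely apply feature selection before comparison; the sketch shows only the overall shape of training and application.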