I wrote a pre-processing script in Python to clean up my data before giving it to FastText. I was running it on Windows in PowerShell, and instead of writing file-output code I decided to redirect the script's output to a file (via the > character).
Anyway, my nearest neighbor queries were returning only single characters, and the word counts reported during training were completely wrong. I searched the internet for a solution but didn't find anyone else having this issue, so I want to document it so others can find it in the future.
I noticed that if I opened the result in Sublime, copied the text to a new tab, and saved it, my issue went away. This led me to believe the issue was with my file and not FastText. The strange thing was that both IntelliJ and Sublime reported the file's encoding as UTF-8. My co-worker told me to try Notepad++ (I hadn't used it in years), so I downloaded it and checked the encoding, and it showed the file was not UTF-8 (I can't remember off the top of my head exactly what the encoding ended up being, but it was definitely not UTF-8). It bothered me a little that I had to go through two other tools before I found the correct encoding, but that's life for you.
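If you don't want to rely on an editor's guess, you can check the file's first bytes for a byte order mark (BOM) yourself. The snippet below is a minimal sketch, not a full encoding detector: the function name sniff_bom is made up for this post, and it only recognizes files that start with a BOM. Windows PowerShell (5.1 and earlier) writes UTF-16 LE with a BOM by default when you redirect with >, which would be consistent with what I saw, though I didn't record the exact encoding at the time.

```python
import codecs

# BOMs to check. The UTF-32 LE BOM starts with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 entries must come first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(path):
    """Return the encoding implied by the file's BOM, or None if it has no BOM.

    Hypothetical helper for illustration; a result of None does not prove
    the file is UTF-8, only that it does not start with a known BOM.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

Calling sniff_bom on the redirected output file and getting anything other than None or "utf-8-sig" is a strong hint that the file is not what FastText expects.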
The Fix
You can fix the PowerShell encoding using this command:
$PSDefaultParameterValues = @{'Out-File:Encoding' = 'utf8'}
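Now redirecting with > in PowerShell will use UTF-8 (you can also change it to whatever encoding you want). The setting only lasts for the current session, so it has to be run every time PowerShell is started, unless you add the line to your PowerShell profile script (the file at $PROFILE).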
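Another way around the problem is to skip shell redirection entirely and have the Python script write its own output file with an explicit encoding. This is a minimal sketch under my own assumptions: write_clean_lines is a hypothetical helper name, and it assumes the cleaned data is available as an iterable of strings, one per line.

```python
def write_clean_lines(lines, path):
    """Write cleaned lines to path as UTF-8 (no BOM) with Unix line endings.

    Hypothetical helper for illustration. encoding="utf-8" does not emit a
    BOM, and newline="\n" stops Python from translating "\n" to "\r\n" on
    Windows.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            # Drop any trailing line break so each line ends with exactly one "\n".
            f.write(line.rstrip("\r\n") + "\n")
```

With this approach the output encoding no longer depends on which shell, or which PowerShell version, runs the script.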