
Changing source files encoding and some fun with PowerShell


One day I was asked to help create a PDF document containing all the source code of one of our libraries. That odd task was needed for patenting our product. It could certainly be done manually, but that would be humiliating for a dev; obviously it can be easily automated in many ways. It turned out to be a fun adventure.

First of all, some context: the library is a .NET solution with almost 3000 C# files. It's a little bit big :).

The first issue I encountered was the fact that the solution contains files in different encodings. All comments were written in Russian, so some files were in an ASCII-based single-byte encoding (windows-1251) while the others were in UTF-8. It's a mess. So the first idea was to normalize all files to UTF-8, despite the fact that the comments themselves are not needed for this task. Besides comments, the sources contain strings with national letters, so encoding matters anyway. By "important" I mean that the encoding must be known in order to read and process a file correctly.

In this post I'll talk about converting the encoding, and in the next one about generating Word files.

There are three approaches to detecting encoding:

- Use the byte order mark (BOM). It's a naive way to tell Unicode from ASCII, and in practice it doesn't work well, as it's common practice not to have a BOM in UTF-8 files (a sketch of this check is below).
- Use some platform API. For example, on Windows we have the MLang COM component (mlang.dll).
- Try to detect the encoding by heuristics on our own.
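For illustration only, here is roughly what the BOM check could look like in PowerShell. The Test-Utf8Bom function name is mine, it's not part of the final script:

# Checks whether a file starts with the UTF-8 BOM bytes EF BB BF
function Test-Utf8Bom($filePath)
{
    # Read only the first three bytes of the file as a single array
    [byte[]]$bytes = Get-Content -Encoding Byte -TotalCount 3 -ReadCount 0 -Path $filePath
    if ($null -eq $bytes -or $bytes.Length -lt 3) { return $false }
    return ($bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}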

I found a nice lib/tool wrapper for MLang here - http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text . But I wasn't keen to use COM.

Next I found this C# port of Mozilla Universal Charset Detector - https://github.com/errepi/ude . It worked pretty well for me. Not perfectly: in some cases it couldn't tell windows-1251 content from mac-Cyrillic. But my goal wasn't to detect an arbitrary encoding, only to differentiate UTF-8 from ASCII, so it was good enough.

It should be understood that detecting an encoding is, in any case, a non-deterministic task. We can only guess based on content; it's never 100% guaranteed.

I decided to use PowerShell to process the files. Below you'll find a simple script that processes all files matching a mask in a folder, recursively. The script detects the encoding and saves the file back in UTF-8.

All my efforts ended up in a script I published on GitHub - https://github.com/evil-shrike/SourceFilesProcessor . Below is a simplified sketch of it.
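The full script on GitHub is more elaborate; what follows is only a rough sketch of the basic flow. The *.cs mask, the exact charset names compared against, and the windows-1251 fallback are my assumptions for illustration; each piece is explained below.

# Rough sketch of the overall flow (not the exact published script)
$rootPath = "."
Get-ChildItem -Path $rootPath -Filter *.cs -Recurse | ForEach-Object {
    $filePath = $_.FullName
    $charset = GetFileEncoding $filePath                  # helper shown below
    if ($charset -and $charset -ne "UTF-8" -and $charset -ne "ASCII") {
        # Assume non-Unicode files are windows-1251 (see the note below)
        [byte[]]$bytes = Get-Content $filePath -Encoding Byte -ReadCount 0
        $text = [System.Text.Encoding]::GetEncoding(1251).GetString($bytes)
        [System.IO.File]::WriteAllText($filePath, $text)  # re-save as UTF-8 without BOM
    }
}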

Let's go through the script.

First we need to load the library's assembly. We use the Add-Type cmdlet for this. As the current directory can differ from the directory where the script is located, we use an absolute path to the library:

$scriptPath = Split-Path -Path $MyInvocation.MyCommand.Definition -Parent
Add-Type -Path $scriptPath\Ude.dll

After the assembly is loaded we can use types from it. Usage of the UDE lib is wrapped in a GetFileEncoding function:

function GetFileEncoding($filePath)
{
    # Read the content of the file as bytes
    [byte[]]$bytes = Get-Content -Encoding Byte -Path $filePath
    # Create an instance of Ude.CharsetDetector
    $cdet = New-Object -TypeName Ude.CharsetDetector
    # Pass the bytes to UDE to detect the encoding
    $cdet.Feed($bytes, 0, $bytes.Length)
    $cdet.DataEnd()
    # Get the result - the name of the encoding
    return $cdet.Charset
}

Unfortunately the lib doesn't return a System.Text.Encoding instance, just a string with the charset name.
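If you do need an actual System.Text.Encoding, the returned name can usually be passed to GetEncoding. A tiny sketch; not every name UDE can produce is guaranteed to be accepted by .NET:

# Map the charset name reported by UDE to a System.Text.Encoding (may throw for exotic names)
$charsetName = GetFileEncoding $filePath
if ($charsetName) {
    $encoding = [System.Text.Encoding]::GetEncoding($charsetName)
    Write-Host "$filePath -> $($encoding.WebName)"
}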

Next, having detected the file's encoding, we can read it correctly as text. We're interested only in cases where the encoding is neither UTF-8 (already good) nor 7-bit ASCII (no national letters, also good).

As you can see, I hard-coded the codepage here, assuming that a non-Unicode file must be in windows-1251. If you have files in different ASCII codepages, it will be harder to detect the correct encoding. The UDE library returns the name of the encoding with the highest probability (it's called "Confidence" inside the lib), but internally it keeps results (probabilities) for all tested encodings.

$text = Get-Content $filePath -Encoding Byte -ReadCount 0
$text = [System.Text.Encoding]::GetEncoding(1251).GetString($text)
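As noted above, if your files might be in several single-byte codepages, it's worth at least checking how sure the detector is before trusting its answer. A sketch, assuming the Confidence property exposed by the UDE port (the 0.7 threshold is arbitrary):

# Warn about low-confidence guesses instead of silently converting with the wrong codepage
$cdet = New-Object -TypeName Ude.CharsetDetector
$cdet.Feed($bytes, 0, $bytes.Length)
$cdet.DataEnd()
if ($cdet.Charset -and $cdet.Confidence -lt 0.7) {
    Write-Warning "Detected $($cdet.Charset) with low confidence $($cdet.Confidence) for $filePath"
}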

Now we have the content of the file, and all that's left is to save it back in the correct encoding.

[System.IO.File]::WriteAllText($filePath, $text)

What could be easier in PowerShell, right? But why do I use .NET's File.WriteAllText?

There are different ways to save a file in PS:

Set-Content -Path "$filePath" -Value $text -Encoding utf8 -Force
Out-File -FilePath "$filePath" -InputObject $text -Encoding utf8 -Force

It turned out that both methods insert a BOM into the saved file, and it's unavoidable. I didn't want to pollute my files with BOMs. That's why I use File.WriteAllText - it just saves in UTF-8 without a BOM.
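If you prefer being explicit rather than relying on the overload's default, you can pass a UTF8Encoding created without a BOM; this is plain .NET, nothing specific to the script:

# Explicitly request UTF-8 without a byte order mark
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllText($filePath, $text, $utf8NoBom)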

That's all on the task of changing file encodings. But there are some gotchas in PowerShell that I ran into while coding. The following sections share a few small tricks and tips.

Resolve-RelativePath

For some reason, neither PowerShell nor .NET has an easy method to get a relative path. Sad but true.

So here's such a method.

In C# it'd be:

string ResolveRelativePath(string path, string fromPath)
{
    return Uri.UnescapeDataString(
        new Uri(fromPath)
            .MakeRelativeUri(new Uri(path))
            .ToString()
            .Replace('/', Path.DirectorySeparatorChar));
}

It returns the path of the folder/file path relative to the folder fromPath.
If path = "c:\temp\folder\file.ext" and fromPath = "c:\temp\" (the trailing separator is needed so that Uri treats it as a directory), the result will be "folder\file.ext".
In PowerShell:

function Resolve-RelativePath($path, $fromPath)
{
    $path = Resolve-Path $path
    $fromPath = Resolve-Path $fromPath
    $fromUri = New-Object -TypeName System.Uri -ArgumentList "$fromPath\"
    $pathUri = New-Object -TypeName System.Uri -ArgumentList $path
    return [System.Uri]::UnescapeDataString(
        $fromUri.MakeRelativeUri($pathUri).ToString().Replace('/', [System.IO.Path]::DirectorySeparatorChar))
}
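A quick usage example (the paths are made up; note that Resolve-Path requires them to actually exist on disk):

# Prints "folder\file.ext"
Resolve-RelativePath "c:\temp\folder\file.ext" "c:\temp"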

Reading file as text

The Get-Content cmdlet returns an array of strings by default. Can you believe it?

If we just need the content of a file as a single string, we must specify the -Raw option:

$text = Get-Content $filePath -Encoding UTF8 -Raw

Process files

Let's suppose we need to update the header in all files. A file header is the block of comments before the first line of code.

So we need to remove the old header first, then add a new one.

To remove the header, we split the text into lines (the -split operator) and filter them with where.

# Remove all comment lines from the beginning of the file
$bHeader = $true
$lines = $text -split "\r\n" | where {
    if (!$bHeader) {
        # We're past the header - keep everything
        return $true
    }
    if ($_ -match "^\s*$") {
        # Blank line inside the header - drop it
        return $false
    }
    if ($_ -notmatch "^//") {
        # First non-comment line - the header is over
        $bHeader = $false
        return $true
    }
    # A comment line of the old header - drop it
    return $false
}
# Add new header
$lines = @($headerText) + $lines
$text = [System.String]::Join([System.Environment]::NewLine, $lines)

Having an array of lines ($lines), we need to insert an item at the beginning. Adding an item to an array is not very obvious in PS.

Someone might think of $lines.Add($item), but it won't work as arrays are fixed-length. So usually people use the += operator, which creates a new array:

$lines += $headerText

But we need to insert, not append:

$lines = @($headerText ) + $lines
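If you end up adding or inserting many lines in a loop, a growable .NET list avoids re-creating the array on every operation. A small sketch, not taken from the original script:

# A List[string] grows in place instead of copying the whole array on each +=
$list = New-Object 'System.Collections.Generic.List[string]'
$list.AddRange([string[]]$lines)
$list.Insert(0, $headerText)      # insert at the beginning
$text = [System.String]::Join([System.Environment]::NewLine, $list)

It's a small thing, but it saves a lot of array copying on big files.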
