I know it's 2023, but you can't get away from processing files. In a world of Events, APIs and Sockets, files still exist as a medium for moving data around. And a very common one at that. Lately I've found myself dealing with Apache Parquet format files. More specifically, I often end up dealing with them coming out of AWS S3. If you use the AWS DMS product for replication at all, you will find that the Parquet format is a great way to deal with your data, as it's designed for efficient storage and retrieval. There aren't too many options for parsing a Parquet file with Golang, but I've found a library I really enjoy, and the article below describes how to make the best use of it.
As always, here is the link to the GitHub repository if you want to skip ahead
What is Apache Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc.
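To make "column-oriented" a bit more concrete, here is a rough Go sketch that pivots a few rows into one slice per field. It only illustrates the idea and is not how Parquet actually encodes data on disk. The User type and the toColumns helper are made up for the example.

```go
package main

import "fmt"

// User is a hypothetical row-oriented record used only for this illustration.
type User struct {
	ID   int32
	Name string
	Role string
}

// UserColumns holds the same data column by column: one slice per field.
type UserColumns struct {
	IDs   []int32
	Names []string
	Roles []string
}

// toColumns pivots row-oriented records into a column-oriented layout.
func toColumns(rows []User) UserColumns {
	cols := UserColumns{
		IDs:   make([]int32, 0, len(rows)),
		Names: make([]string, 0, len(rows)),
		Roles: make([]string, 0, len(rows)),
	}
	for _, r := range rows {
		cols.IDs = append(cols.IDs, r.ID)
		cols.Names = append(cols.Names, r.Name)
		cols.Roles = append(cols.Roles, r.Role)
	}
	return cols
}

func main() {
	rows := []User{
		{ID: 1, Name: "Ada", Role: "admin"},
		{ID: 2, Name: "Linus", Role: "admin"},
		{ID: 3, Name: "Grace", Role: "viewer"},
	}
	cols := toColumns(rows)
	// Reading a single field for every record only touches one slice,
	// and similar values sit next to each other.
	fmt.Println(cols.Roles)
}
```

Keeping the values of one field together is what lets a columnar format compress well and skip columns a query does not need.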
Downloading the Parquet File
For working with S3, I really like the Golang library called s3manager. Here is the SDK documentation. What I like about it is that it is a higher-level abstraction on top of the normal S3 library. For instance, to download a file from a bucket, you simply do something like this
downloader := s3manager.NewDownloader(sess)
_, err = downloader.DownloadWithContext(ctx, file,
	&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
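The downloader writes the object into whatever you pass as the "file" parameter of DownloadWithContext. That parameter is an io.WriterAt, typically an *os.File you have created at the path where you want the download to land, rather than a plain string path.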
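Preparing that destination file can be sketched with the standard library alone. This is a sketch and not the code from the repository: targetName and createTarget are hypothetical helpers, and the key is an example value. The point is that *os.File satisfies io.WriterAt, which lets a downloader write chunks at arbitrary offsets.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path"
	"path/filepath"
)

// Compile-time check: *os.File satisfies io.WriterAt.
var _ io.WriterAt = (*os.File)(nil)

// targetName derives a local file name from the last element of an object key.
// Object keys use "/" as a separator, so the "path" package is used here.
func targetName(key string) string {
	return path.Base(key)
}

// createTarget creates (or truncates) the local file the download is written to.
func createTarget(dir, key string) (*os.File, error) {
	return os.Create(filepath.Join(dir, targetName(key)))
}

func main() {
	dir, err := os.MkdirTemp("", "parquet-download")
	if err != nil {
		fmt.Println("error creating temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	f, err := createTarget(dir, "exports/2023/users.parquet")
	if err != nil {
		fmt.Println("error creating target file:", err)
		return
	}
	defer f.Close()

	// WriteAt is the method a downloader would call for each chunk.
	if _, err := f.WriteAt([]byte("demo"), 0); err != nil {
		fmt.Println("error writing:", err)
		return
	}
	fmt.Println("download target:", f.Name())
}
```

The name returned by f.Name() is the string you would later hand to the parser.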
Parsing File with Golang
Parsing an Apache Parquet file with Golang will seem super familiar if you have used other interface-based unmarshalling, like DynamoDB or JSON. For similarities with DDB, you can see how to do this in the referenced article
The parse function looks like this
func ParseFile(fileName string) ([]ParquetUser, error) {
	fr, err := floor.NewFileReader(fileName)
	if err != nil {
		return nil, err
	}
	// release the underlying file handle when done
	defer fr.Close()

	var fileContent []ParquetUser
	for fr.Next() {
		rec := &ParquetUser{}
		if err := fr.Scan(rec); err != nil {
			// continue along if it's just a malformed row
			if errors.Is(err, ErrIllegalRow) {
				continue
			}
			return nil, err
		}
		fileContent = append(fileContent, *rec)
	}
	return fileContent, nil
}
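One detail of that loop deserves a closer look: errors.Is only matches when the error coming back wraps the sentinel. Below is a minimal stdlib-only sketch of the pattern. It assumes ErrIllegalRow is a package-level sentinel error (its definition is not shown in this article), and parseRow is a hypothetical stand-in for Scan.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrIllegalRow is assumed to be a package-level sentinel error.
var ErrIllegalRow = errors.New("illegal row")

// parseRow is a hypothetical stand-in for Scan. It rejects empty input and
// wraps the sentinel with %w so errors.Is can still find it.
func parseRow(raw string) (string, error) {
	if raw == "" {
		return "", fmt.Errorf("row has no data: %w", ErrIllegalRow)
	}
	return raw, nil
}

// parseAll keeps good rows, skips malformed ones, and aborts on anything else.
func parseAll(raws []string) ([]string, error) {
	var out []string
	for _, raw := range raws {
		rec, err := parseRow(raw)
		if err != nil {
			if errors.Is(err, ErrIllegalRow) {
				continue
			}
			return nil, err
		}
		out = append(out, rec)
	}
	return out, nil
}

func main() {
	rows, err := parseAll([]string{"alice", "", "bob"})
	fmt.Println(rows, err) // prints: [alice bob] <nil>
}
```

An error built with a plain errors.New would not match the sentinel here, so a row-level error has to wrap ErrIllegalRow (or be that exact value) for the skip branch to fire.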
Notice that ParseFile opens a FileReader from the parquet-go library.
From there, I create a slice for holding the output of what's being unmarshalled.
Then we loop and scan. For each call to Scan, the UnmarshalParquet method that implements the parquet-go unmarshalling interface is called. That method looks like this
func (r *ParquetUser) UnmarshalParquet(obj interfaces.UnmarshalObject) error {
	id, err := obj.GetField("id").Int32()
	if err != nil {
		return errors.New("error unmarshalling row on field (id)")
	}
	firstName, err := obj.GetField("firstName").ByteArray()
	if err != nil {
		return errors.New("error unmarshalling row on field (firstName)")
	}
	lastName, err := obj.GetField("lastName").ByteArray()
	if err != nil {
		return errors.New("error unmarshalling row on field (lastName)")
	}
	role, err := obj.GetField("role").ByteArray()
	if err != nil {
		return errors.New("error unmarshalling row on field (role)")
	}
	// note this is a time.Time but comes across as an Int64
	lastUpdated, err := obj.GetField("lastUpdated").Int64()
	if err != nil {
		return errors.New("error unmarshalling row on field (lastUpdated)")
	}
	// time.UnixMicro cannot fail, so no error check is needed here
	parsed := time.UnixMicro(lastUpdated)

	r.Id = int(id)
	r.FirstName = string(firstName)
	r.LastName = string(lastName)
	r.Role = string(role)
	r.LastUpdated = parsed
	return nil
}
There's really not too much going on up there outside of fetching fields and then putting them into the struct's fields. The one main "gotcha" to point out is that the LastUpdated field is a time.Time. The parquet-go library treats time as an Int64. Note this line for converting what comes out of the library into a time.Time
parsed := time.UnixMicro(lastUpdated)
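To see the conversion in isolation, here is a small stdlib-only sketch. It assumes the column stores microseconds since the Unix epoch, as the line above does. Parquet timestamps can also be stored as milliseconds or nanoseconds, so it is worth checking the schema of your own file. The helper names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// microsToTime converts microseconds since the Unix epoch into a UTC time.Time.
func microsToTime(us int64) time.Time {
	return time.UnixMicro(us).UTC()
}

// millisToTime is the equivalent for a column that stores milliseconds.
func millisToTime(ms int64) time.Time {
	return time.UnixMilli(ms).UTC()
}

func main() {
	const us int64 = 1672531200000000 // microseconds since the epoch

	fmt.Println(microsToTime(us).Format(time.RFC3339)) // prints: 2023-01-01T00:00:00Z

	// The same instant expressed in milliseconds converts to the same time.
	fmt.Println(millisToTime(us / 1000).Equal(microsToTime(us))) // prints: true
}
```

If the parsed dates land thousands of years in the future or collapse to 1970, the unit is the first thing to check.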
Running the Program
From there, it's just a matter of putting it all together. Here's the body of main
func main() {
	file, err := DownloadFile(context.TODO(), sess, bucket, key)
	if err != nil {
		log.WithFields(log.Fields{
			"err": err,
		}).Error("error downloading the file")
		// nothing to parse or delete if the download failed
		return
	}

	contents, err := ParseFile(file)
	if err != nil {
		log.WithFields(log.Fields{
			"err": err,
		}).Error("error parsing the file")
	}

	err = DeleteFile(file)
	if err != nil {
		log.WithFields(log.Fields{
			"err": err,
		}).Error("error deleting the file")
	}

	for _, c := range contents {
		log.WithFields(log.Fields{
			"file": c,
		}).Debug("printing the file")
	}
}
In a nutshell …
- Download the file
- Parse the file
- Delete the file
- Loop and print output
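One way to sketch that flow so the local file is always removed, even when parsing fails, is with defer. This is a stdlib-only sketch and not the code from the repository: run, tempDownload, and exists are hypothetical names, and the download and parse steps are passed in as functions standing in for DownloadFile and ParseFile.

```go
package main

import (
	"fmt"
	"os"
)

// run downloads a file, parses it, and removes the local copy afterwards.
func run(download func() (string, error), parse func(string) ([]string, error)) (records []string, err error) {
	file, err := download()
	if err != nil {
		return nil, fmt.Errorf("downloading the file: %w", err)
	}
	// Remove the local copy whether or not parsing succeeds.
	defer func() {
		if rmErr := os.Remove(file); rmErr != nil && err == nil {
			err = fmt.Errorf("deleting the file: %w", rmErr)
		}
	}()

	records, err = parse(file)
	if err != nil {
		return nil, fmt.Errorf("parsing the file: %w", err)
	}
	return records, nil
}

// tempDownload fakes a download by creating an empty temporary file.
func tempDownload() (string, error) {
	f, err := os.CreateTemp("", "users-*.parquet")
	if err != nil {
		return "", err
	}
	return f.Name(), f.Close()
}

// exists reports whether a file is present at path.
func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	parse := func(string) ([]string, error) { return []string{"row-1", "row-2"}, nil }
	records, err := run(tempDownload, parse)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range records {
		fmt.Println(r)
	}
}
```

Returning wrapped errors from run keeps the logging decision in one place, in the caller.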
Helpful Tips
- I'm using VSCode a lot more these days and I'm sort of weaning myself off of Goland. So you'll find a launch.json file in the .vscode directory. There you can set the environment variables you need to run the program
- Viewing Parquet files is really a pain, I've found. There are few tools that I've liked. Online viewers get in the way of my workflow. BUT I found this VSCode plugin to be FANTASTIC. Here is the link to the marketplace
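On the Go side, reading those environment variables might look roughly like the sketch below. The variable names S3_BUCKET and S3_KEY are assumptions made up for this example, and the repository may well use different ones. The lookup function is injected so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// config holds the settings the program needs. The fields are hypothetical.
type config struct {
	Bucket string
	Key    string
}

// loadConfig reads required settings through lookup, which has the same
// shape as os.LookupEnv.
func loadConfig(lookup func(string) (string, bool)) (config, error) {
	required := func(name string) (string, error) {
		v, ok := lookup(name)
		if !ok || v == "" {
			return "", fmt.Errorf("environment variable %s is not set", name)
		}
		return v, nil
	}

	bucket, err := required("S3_BUCKET") // assumed name
	if err != nil {
		return config{}, err
	}
	key, err := required("S3_KEY") // assumed name
	if err != nil {
		return config{}, err
	}
	return config{Bucket: bucket, Key: key}, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("bucket=%s key=%s\n", cfg.Bucket, cfg.Key)
}
```

Failing fast on a missing variable gives a clearer message than a download error further down the line.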
Wrapping Up
Hopefully you found this helpful. Like I mentioned in the beginning, files aren't going away as a data medium. Apache Parquet is a great one when you deal with larger datasets, and it'll be one of the options you can choose as the output when replicating with DMS.
I continue to just love Golang's simplicity and performance as well as the development experience. The parquet-go library has a few quirks but overall, it gets a 5-star rating from me.