June 2019

A new strategic market: we’ve arrived in Sweden!

Xpand IT is a Portuguese company supported by Portuguese investment, and it is extraordinary how quickly we have expanded within Portugal. At the end of 2018, the company recorded growth of 45% and revenue of around 15 million euros, which earned Xpand IT a place in the Financial Times’ 2019 ranking (FT1000: Europe’s Fastest Growing Companies). Xpand IT was one of just three Portuguese technology companies to be featured in this ranking.

However, Xpand IT always seeks to grow further. We want to share our expertise with all four corners of the world and deliver a little bit of our culture to all our customers. It is true that Xpand IT’s international involvement has been increasing substantially, with 46.5% of our revenue coming from international customers at the end of last year.

This growth has been supported by two main focal points: exploring strategic markets such as Germany and the United Kingdom (where we now have a branch and an office), and strongly leveraging our own products. Xray and Xporter, both part of the Atlassian ecosystem, are used by more than 5,000 customers in more than 90 countries! And new products are expected this year, in both artificial intelligence (Digital Xperience) and business intelligence.

This year, Xpand IT’s internationalisation strategy is to invest in new strategic markets in Europe, namely the Nordic countries. Sweden will be our first focus, but the goal is to expand our initiatives to the rest of them: Norway, Denmark and Finland.

There are already various commercial initiatives in this market, and we can count on support from partners such as Microsoft, Hitachi Vantara and Cloudera, all already well-established in countries like Sweden. Moreover, cultural barriers and different time zones do not have a significant impact, which makes this strategy an attractive investment prospect for 2019.

In the words of Paulo Lopes, CEO & Senior Partner at Xpand IT: “We are extremely proud of the growth the company has experienced in recent years and expect this success to keep on going. Xpand IT has been undergoing its internationalisation process for a few years now. However, we are presently entering a 2nd phase, where we will actively invest in new markets where we know that our technological expertise paired with a unique team and unique culture can definitely make a difference. We believe that Sweden makes the right starting point for investment in the Nordic market. Soon we will be able to give you even more good news about this project!…”

Ana Lamelas

Zwoox – Simplify your Data Ingestion

Zwoox is a data ingestion tool, developed by Xpand IT, that facilitates data imports and structuring into a Hadoop cluster.

This tool is highly scalable, thanks to its complete integration with Cloudera Enterprise Data Hub, and takes full advantage of several different Hadoop technologies, such as Spark, HBase and Kafka. Zwoox eliminates the need to code data pipelines ‘by hand’, regardless of the data source.

One of Zwoox’s biggest advantages is its capability to accelerate data ingestion, offering numerous options for data import and allowing real-time replication of RDBMS DML into Hadoop data structures.

Despite the number of different tools that allow data import for Hadoop clusters, only Zwoox is capable of executing the import in an accessible, efficient and highly scalable manner, maintaining data in HDFS (with Hive tables) or Kudu.

Some of the possibilities offered by Zwoox:

  • Automation and partitioning in HDFS;
  • Translation of data types;
  • Full or delta upload;
  • Audit charts (with full history) without impacting performance;
  • Derivation of new columns with pre-defined functions or “pluggable” code;
  • Operational integration with Cloudera Manager.

This tool is available on Cloudera Solutions Center and will be available soon on Xpand IT’s website. Meanwhile, you can access our informative document. If you’d like to learn more about Zwoox or data ingestion, please contact us.

Ana Lamelas

Biometric technology for recognition

Nowadays it is more essential than ever to ensure that users feel safe when using a service or a mobile app, or when registering on a website. The user’s priority is to know that their data is properly protected. Consequently, biometric technology for recognition plays an increasingly crucial role as one of the safest and most efficient ways to authenticate user access to mobile devices, personal email accounts and even online bank accounts.

Biometrics has become one of the fastest, safest and most efficient ways to protect individuals, not only because it is already required to authenticate each person as a citizen of a country – fingerprints are among the data collected and stored for legal purposes and documents – but also because it is the most casual (and reliable) way to protect our cellphones. The advantages of using biometric technology for recognition are efficiency, precision, convenience and scalability.

In IT, biometrics is primarily associated with verifying identity through a person’s physical or behavioral features – fingerprints, facial recognition, voice recognition and even retina/iris recognition. We are referring to technologies that measure and analyze features of the human body as a way to allow or deny access.

But how does this identification work in the backend? The software recognises specific points in the presented data as starting points. These starting points are then processed and transported to a database which, in turn, uses an algorithm that converts the information into a numeric value. This value, derived from what the scanner detected, is compared with the user’s registered biometric entry, and the user’s authentication is approved or denied, depending on whether there is a match or not.

The process of recognition can be carried out in two ways: comparing one value to many others (identification) or comparing one value to a single other value (verification). One-to-many recognition happens when a user’s sample is submitted to a system and compared to the samples of other individuals, while one-to-one authentication works with only one user, comparing the provided data to previously submitted data – as with our mobile devices.
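
The two modes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the biometric "value" is represented here as a short feature vector, the comparison uses cosine similarity, and the threshold, the `verify` and `identify` helpers and the sample templates are all hypothetical. Real biometric systems use far more elaborate feature extraction and matching.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(sample, enrolled_template, threshold=0.95):
    """One-to-one: compare a sample with a single enrolled template."""
    return cosine_similarity(sample, enrolled_template) >= threshold


def identify(sample, database, threshold=0.95):
    """One-to-many: return the best-matching user, or None if nobody matches."""
    best_user, best_score = None, threshold
    for user, template in database.items():
        score = cosine_similarity(sample, template)
        if score >= best_score:
            best_user, best_score = user, score
    return best_user


# Hypothetical enrolled templates and a freshly scanned sample.
database = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
scanned = [0.88, 0.12, 0.31]

print(verify(scanned, database["alice"]))  # one-to-one check against alice
print(identify(scanned, database))         # one-to-many search
```

The threshold is the design lever here: raising it reduces false acceptances at the cost of rejecting more genuine users.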

There are countless biometric readings, these being some of the most common:

  1. Fingerprinting: one of the most used and most economical biometric technologies for recognition, since it has a significant degree of accuracy. In this type of verification, various points of a finger are analysed, such as endings and unique arches. Examples: apps from Médis, MBWay or Revolut;
  2. Facial recognition: uses a facial image of the user, composed of various identification points on the face, making it possible to determine, for example, the distance between the eyes and the nose, as well as the bone structure and the lines of each facial feature. This reading has some percentage of failure, depending on whether the user has a beard or is wearing sunglasses. Example: Apple’s Face ID;
  3. Voice recognition: recognition is carried out from an analysis of an individual’s vocal patterns, combining physical and behavioral factors. However, it is not the most reliable method of recognition. Examples: Siri, from Apple, or Alexa, from Amazon;
  4. Retina/iris recognition: the least used method; it works by storing lines and geometric patterns (in the case of the iris) or the pattern of blood vessels in the eye (in the case of the retina). Reliability is very high, but so are the costs, which makes this method of recognition less often used. Read this article on identity recognition in the banking industry;
  5. Writing style (behavioural biometrics): lastly, a way to authenticate a user through their writing, for example a signature, since the pressure on the paper, the speed of the writing and the movements in the air are very difficult to copy. This is one of the oldest authentication tools, used mainly in the banking industry. Read the article on Read API, Microsoft Azure.

Ana Lamelas

Using Salesforce with Pentaho Data Integration

Pentaho Data Integration is the tool of the trade to move data between systems, and it doesn’t have to be just a business intelligence process. We can actually use it as an agile tool for point-to-point integration between systems. PDI has its own Salesforce input step which makes it a good candidate for integration.

What is Salesforce?

Salesforce is a cloud solution for customer relationship management (CRM). As a next generation multi-tenant Platform as a Service (PaaS), its unique infrastructure enables you to focus your efforts where they are most essential: creating microservices that can be leveraged in innovative applications and really speeding up the CRM development process.

Salesforce is the right platform to give you a complete 360º view of your customers and their interactions with your brand, whether these happen via your email campaigns, call centres, social networks, or a simple phone call. Marketing automation is, for example, just one of the many great things Salesforce brings to you in an all-in-one platform.

How do we use PDI to connect to Salesforce?

For this access we need all our Salesforce connection details: the username, password and the SOAP web service URL. PDI has to be compatible with the SOAP API version that you use. For example:

  PDI version   SOAP API version number
  2.0   1.0
  3.8   20.0
  4.2   21.0
  6.0   24.0
  7.0   37.0
  8.2   40.0


Nevertheless, even if Salesforce gives us a new version of the API we can still use the old API perfectly well. Just be careful, because if you’ve created new modules inside the platform, the new API won’t have these customisations, and so you’ll need to use the Salesforce Object Query Language (SOQL) to get the data. But don’t worry, we’ll explain it all in the next section.

How do we write queries with SOQL?

The SOQL syntax is quite similar to SQL syntax, but with a few differences:

  1. SOQL does not recognise special characters such as * or ; so we have to list every field that we want to get from Salesforce, and we cannot add a ; at the end of the query.
  2. We cannot use comments in a query; SOQL does not recognise these either.
  3. To create joins we need to know a few things:
    • For native modules that we need to link to (direct relationship), we add an ‘s’ to the end of the name. For example: get all Orders, with and without Products (OrderItem module).
    • For custom modules from which we need to get data through another module (direct relationship), we add ‘__r’ to the end of the name. For example: filter OrderItems by the Product_Skins__c field inside the Product2 module.
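
As a rough illustration of these rules, the sketch below assembles two SOQL strings in Python. The `build_soql` helper is hypothetical, and so are the selected fields, the filter value and the `Product__r` relationship name (a custom lookup is assumed; the actual relationship names depend on how the org's schema is configured).

```python
def build_soql(sobject, fields, where=None):
    """Build a SOQL SELECT with an explicit field list and no trailing ';'."""
    if not fields or any(field.strip() == "*" for field in fields):
        raise ValueError("SOQL has no SELECT *: list every field explicitly")
    query = f"SELECT {', '.join(fields)} FROM {sobject}"
    if where:
        query += f" WHERE {where}"
    if ";" in query:
        raise ValueError("SOQL queries must not contain ';'")
    return query


# Parent-to-child join: the child relationship uses the plural name (OrderItems),
# so orders are returned whether or not they have items.
orders_with_items = build_soql(
    "Order",
    ["Id", "OrderNumber", "(SELECT Id, Quantity FROM OrderItems)"],
)

# Child-to-parent join through a custom relationship: the '__r' suffix lets us
# reach a field on the related record (relationship name assumed).
items_by_skin = build_soql(
    "OrderItem",
    ["Id", "Quantity"],
    where="Product__r.Product_Skins__c = 'Blue'",
)

print(orders_with_items)
print(items_by_skin)
```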

How do we extract data from Salesforce with PDI?

We can use the Salesforce input step inside PDI to get data from Salesforce using SOQL; just be aware you can only use up to 20,000 characters to create the query.

  • Connection parameters:
    • Salesforce web service URL: <url of Salesforce Platform>/services/Soap/u/<number of API Soap updated>
    • Username: the username used to access the platform (e.g. myname@pentaho.com)
    • Password: password + token (the company provides the token, which we append to the password in kettle.properties), e.g. PASSWORDTOKEN
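
The connection values above can be assembled as in the following sketch. The helper names, the instance URL and the credentials are hypothetical placeholders; only the URL pattern and the password-plus-token concatenation come from the description above.

```python
def soap_url(instance_url, api_version):
    """Build the SOAP web service URL from the instance URL and API version."""
    return f"{instance_url.rstrip('/')}/services/Soap/u/{api_version}"


def login_password(password, security_token):
    """Salesforce expects the security token appended directly to the password."""
    return password + security_token


# Hypothetical values, for illustration only.
url = soap_url("https://example.my.salesforce.com/", "40.0")
secret = login_password("PASSWORD", "TOKEN")

print(url)
print(secret)
```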

Settings parameters:

    • Specify query: when this option is not active (as we can see in the image below), we only need to choose the module (the table containing the records that we need to access).

In the next tab (Content) we have the following options:

  • If we want to get all records from Salesforce (that is, deleted records as well as inserted and updated ones), we need to tick Query All Records and choose one of the following options:
    • All (new/updated records and deleted records);
    • Update (only inserted and updated records);
    • Delete (only deleted records).
  • If we untick Query All Records, we only get inserted/updated records.

How does PDI know if records are new/updates or deletes?

Salesforce has native fields that are very useful for controlling the process. However, we cannot see these fields in the layout or in the schema builder in SF; we can only see the data associated with them if we use SOQL or PDI to access them.

  • CreatedById and CreatedDate show the user who created the record and the date and time when it was created.
  • LastModifiedDate and LastModifiedById show the date and time of the last change and the user who modified the record. We can use these fields to get the data updated in SF.
  • Id (the Salesforce Id), present in the URL as a string of 18 characters (Java config.), identifies the record.
  • There is one more native field, IsDeleted, of type Boolean, which shows whether the record was removed (IsDeleted = true) or not (IsDeleted = false).
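
A minimal sketch of how these fields can drive an incremental extraction: the hypothetical `delta_query` helper below builds a SOQL string that selects only records changed since a given moment. The object and field selection are illustrative assumptions; note that, as described earlier, deleted rows are only returned when Query All Records is ticked in the step.

```python
from datetime import datetime, timezone


def soql_datetime(moment):
    """Format a timezone-aware datetime as an unquoted SOQL dateTime literal."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def delta_query(sobject, fields, since, deleted_only=False):
    """Select records modified after `since`, including the audit fields."""
    selected = ["Id"] + list(fields) + ["LastModifiedDate", "IsDeleted"]
    conditions = [f"LastModifiedDate > {soql_datetime(since)}"]
    if deleted_only:
        conditions.append("IsDeleted = true")
    return (
        f"SELECT {', '.join(selected)} FROM {sobject} "
        f"WHERE {' AND '.join(conditions)}"
    )


# Hypothetical last successful run of the integration.
last_run = datetime(2019, 5, 1, tzinfo=timezone.utc)

print(delta_query("Account", ["Name"], last_run))
print(delta_query("Account", ["Name"], last_run, deleted_only=True))
```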

In the Additional fields section we have three options:

  • Time out is useful in asynchronous systems because we can configure the interval, in milliseconds, before the step times out;
  • Use Compression is useful to get more performance from the process: when you tick it, all calls to the API are sent compressed in gzip format;
  • Limit configures the maximum number of records to retrieve from the query.

Inside the last tab, we can see all the fields from the query defined in the first tab. Without SOQL we get all the module fields. With SOQL we get all the fields listed in the SELECT clause.

And for these cases, we need to make the changes manually.
For more details:

Fields of type base64 hold images or PDFs stored in SF.

If we need to send images (.jpeg) or PDFs (.pdf) directly to SF, we load this type of field by using Java to convert the binary files to base64.

For example, to send a PDF file to SF:
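
A minimal sketch of that conversion, written here in Python rather than the Java the article refers to, purely to show the idea. The helper names and the sample bytes are hypothetical; in a PDI transformation the equivalent logic would live in a Java-based step.

```python
import base64
from pathlib import Path


def to_base64(data: bytes) -> str:
    """Encode raw bytes as a base64 text string."""
    return base64.b64encode(data).decode("ascii")


def file_to_base64(path) -> str:
    """Read a binary file (e.g. a .pdf or .jpeg) and encode its content."""
    return to_base64(Path(path).read_bytes())


# Stand-in for the content of a real PDF file.
pdf_bytes = b"%PDF-1.4\n% hypothetical file content\n"
encoded = to_base64(pdf_bytes)

print(encoded)
```

The resulting string is what would be mapped to the base64 field when the record is sent to SF.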

How do we load data into Salesforce with PDI?

We can send data to Salesforce from other databases or from Salesforce itself.

The connection options are the same as those described for Salesforce Input.
In the Settings options we have new parameters:

  • Rollback all Changes on error – if any error occurs, nothing is integrated into SF;
  • Batch Size – we can set a fixed number of records to be integrated into SF together (in the same batch);
  • In Output Fields Label, we add the name of the field that will receive the Salesforce Id of each integrated record.

In the Fields options, we need to define the field mapping.

  • In Module Field, we put the API name of the field in SF that will receive the new data;
  • In Stream Field, we put the name of the stream field that will be integrated into the respective field in SF;
  • Use External id = N for all fields updated inside the respective module;
  • Use External id = Y for all records that need updating but are not present in the current module, only in another module.

Delete records inside Salesforce

We delete records from Salesforce with the Salesforce Delete step. We need to specify the key field from the Table Input that references the key in Salesforce (the Salesforce Id).

Update Salesforce records

If we only want to update records in SF, we need to use the Salesforce Update step.
In the Fields (Key included) option, we need to add the key of the records for the specific module.

Upsert data to Salesforce

If we want to insert and update records in SF in the same batch, we need to use the Salesforce Upsert step.
The Upsert Comparison Field parameter is used to match the incoming data with the records in SF.
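
The matching performed through the comparison field can be pictured with a small in-memory sketch. This is not how PDI or Salesforce implement it; the `upsert` helper, the `External_Code__c` field and the sample rows are all hypothetical.

```python
def upsert(existing, incoming, key):
    """Update rows whose key already exists, insert the others.

    Returns a tuple (inserted, updated) with the number of rows in each case.
    """
    index = {row[key]: row for row in existing}
    inserted = updated = 0
    for row in incoming:
        match = index.get(row[key])
        if match is not None:
            match.update(row)          # same key: update in place
            updated += 1
        else:
            new_row = dict(row)        # unknown key: insert a copy
            existing.append(new_row)
            index[row[key]] = new_row
            inserted += 1
    return inserted, updated


# Hypothetical target records and incoming batch.
accounts = [{"External_Code__c": "A1", "Name": "Acme"}]
batch = [
    {"External_Code__c": "A1", "Name": "Acme Ltd"},
    {"External_Code__c": "B2", "Name": "Globex"},
]

result = upsert(accounts, batch, key="External_Code__c")
print(result)
print(accounts)
```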

Fátima Miranda

Meetup Data Science Hands-on by Lisbon Kaggle: hot topics

Data Science Hands-on: “Predicting movies’ worldwide revenue”

On May 4th, a day known worldwide as Star Wars Day (“May the fourth“), approximately 40 Data Science fans seized the occasion to learn more about this subject by practising and sharing at yet another Lisbon Kaggle Meetup. The “Data Science Hands-on” Meetup took place at Instituto Superior Técnico (IST Campus) and was dedicated precisely to cinema:

  • the problem addressed consisted of predicting movies’ revenue before their premiere!

This event was also sponsored by Xpand IT, in collaboration with Hackerschool Lisboa, a group of IST students interested in technology who also evangelise the practice of learning by doing.

First off, the event started with a presentation by Xpand IT’s own Ricardo Pires, who introduced the company and their units focused on data treatment and exploration. Participants received a sample of how these problems fit in a real-world context. Shortly after, professor Rui Henriques, who teaches Data Science at IST, explained his perspective on how to approach a Data Science problem, providing some tips related to the meetup’s challenge.

The data for this challenge is well suited to learning and gives an idea of a realistic problem, as it is semi-structured and demands a great amount of effort to process.

An estimated 80% of Data Scientists’ daily work revolves around data treatment.

(Source: Forbes)

After the two presentations, participants started to unravel the mysteries hidden within the data. They verified, for example, a generalised increase in revenue over the years. They also noticed that American movies had higher revenue than all the rest.

Tackling the challenge

In the first part, participants modelled the problem with the simpler columns:

  • budget
  • popularity
  • runtime
  • date

By doing so, they tried to obtain the first predictions for the movies’ revenue. In the image below, which represents Spearman’s rank correlation coefficient, we can verify that the budget and popularity columns are the most correlated with revenue.
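
For readers unfamiliar with the coefficient, here is a minimal pure-Python sketch of Spearman's rank correlation: it is the Pearson correlation computed on the ranks of the values. The budget and revenue figures below are made up for illustration and are not the meetup's data.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation (undefined when either input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


# Made-up figures (in millions), for illustration only.
budget = [1, 5, 10, 50, 200]
revenue = [2, 30, 25, 400, 900]

print(round(spearman(budget, revenue), 3))
```

Because it works on ranks, the coefficient captures any monotonic relationship and is less sensitive to extreme budgets than a plain linear correlation.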

During the second phase, contestants tackled the semi-structured columns by applying the one-hot encoding technique to columns such as:

  • director
  • cast
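
A minimal sketch of one-hot encoding applied to a multi-valued column such as cast: each distinct value becomes its own 0/1 indicator column. The `one_hot` helper and the sample movies are hypothetical; because a movie has several cast members, a row can contain more than one 1.

```python
def one_hot(rows, column):
    """Turn a list-valued column into one 0/1 indicator per distinct value."""
    categories = sorted({value for row in rows for value in row[column]})
    encoded = []
    for row in rows:
        present = set(row[column])
        encoded.append(
            {f"{column}={category}": int(category in present) for category in categories}
        )
    return encoded


# Hypothetical sample, for illustration only.
movies = [
    {"title": "Movie A", "cast": ["Actor X", "Actor Y"]},
    {"title": "Movie B", "cast": ["Actor Y"]},
]

for movie, indicators in zip(movies, one_hot(movies, "cast")):
    print(movie["title"], indicators)
```

With many distinct directors and actors this produces a very wide, sparse table, so in practice it is common to keep only the most frequent values.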

Through this deeper analysis of the data, teams identified the movies that generated the most revenue (see table below).

Another relevant aspect to consider is that popularity is not always directly related to revenue. Such is the case with “Transformers: Dark of the Moon”, which is represented as less popular but has a high revenue nonetheless.

It is also interesting to observe the actors who generated more revenue on average:


At the end of the meetup, participants shared their implemented solutions:

  • The group with the best results applied Logistic Regression. Despite being a simple model, it can provide adequate results when the focus is data treatment.
  • Data treatment involved several techniques, such as detecting outliers (movies with a very discrepant budget) and replacing these values with the median.
  • Budget and revenue columns were transformed into their respective logarithm, in order to approximate them to a Gaussian distribution.
  • One of the advantages of using a simpler model is that these are also easier to explain to a business stakeholder.
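
The two data treatment steps mentioned above can be sketched as follows. The article does not say how outliers were detected, so the interquartile-range rule used here is an assumption, as is the choice of `log1p` (which tolerates zero budgets); the budget figures are made up.

```python
import math
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [value if low <= value <= high else median for value in values]


def log_transform(values):
    """Apply log(1 + x) to compress the long right tail of the distribution."""
    return [math.log1p(value) for value in values]


# Made-up budgets (in millions) with one very discrepant value.
budgets = [10, 12, 11, 13, 12, 11, 500]

cleaned = replace_outliers_with_median(budgets)
print(cleaned)
print([round(value, 2) for value in log_transform(cleaned)])
```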

The fourth of May was spent learning alongside the most wonderful people, enlightening in every way. In case you’re interested in Data Science, join the community and show up at our monthly events.

More information on the “Data Science Hands-on” Meetup.

Joana Pinto

Data Scientist, Xpand IT

Alexandre Gomes

Data Scientist, Xpand IT

Sara Godinho