Pandas – Random Sample of Columns

In this tutorial, we’ll cover how to get a random sample of columns of a pandas dataframe.

The pandas dataframe sample() function is generally used to sample rows from a dataframe. But you can also use it to sample columns by passing 1 or 'columns' to the axis parameter. The following is the syntax:

df_sub = df.sample(axis='columns')

Here, df is the dataframe from which you want to sample the columns. By default, the sample() function returns one item, in the above case, a random column. But you can specify the number of columns to sample using the n parameter. You can also sample based on a fraction instead of a count using the frac parameter.

Note: Pass a fixed value to the random_state parameter to get reproducible results.
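To illustrate the note on reproducibility, here is a minimal sketch. The dataframe df_demo and its column names are hypothetical, made up only for this illustration, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical dataframe, used only for this illustration
df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# Sampling twice with the same random_state should select the same columns
first = df_demo.sample(n=2, axis='columns', random_state=42)
second = df_demo.sample(n=2, axis='columns', random_state=42)

same_columns = list(first.columns) == list(second.columns)
print(same_columns)
```

If you leave out random_state, the selected columns can differ from one run to the next.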

First, let’s create a sample dataframe that we’ll use throughout this tutorial.

import pandas as pd

data = {
    'Name': ['Microsoft Corporation', 'Google, LLC', 'Tesla, Inc.',
             'Apple Inc.', 'Netflix, Inc.'],
    'Symbol': ['MSFT', 'GOOG', 'TSLA', 'AAPL', 'NFLX'],
    'Shares': [100, 50, 150, 200, 80]
}

df = pd.DataFrame(data)
df
[Image: snapshot of the sample stock portfolio dataframe with the columns Name, Symbol, and Shares]

Now, let’s look at some common use cases for sampling columns from a dataframe with the pandas dataframe sample() function, keeping the axis as 'columns' in each case.

By default, the pandas dataframe sample() function returns a single item, which in our case is a column. To sample more than one random column, pass the number of columns to the n parameter. See the example below.

df_sub = df.sample(n=2, axis='columns', random_state=2)
print(df_sub)

Output:

   Shares Symbol
0     100   MSFT
1      50   GOOG
2     150   TSLA
3     200   AAPL
4      80   NFLX

The returned dataframe has two random columns Shares and Symbol from the original dataframe df.
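If you only need to know which columns were picked, you can read them from the columns attribute of the result. The following is a small sketch on a hypothetical dataframe df_demo (not the portfolio data above). It also shows that sampling columns leaves the rows untouched.

```python
import pandas as pd

# Hypothetical dataframe, used only for this illustration
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df_sub = df_demo.sample(n=2, axis='columns', random_state=2)

# Names of the sampled columns, in the order they were drawn
picked = list(df_sub.columns)
print(picked)

# Every row is kept; only the set of columns changes
print(df_sub.shape)
```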

If you want to sample columns based on a fraction instead of a count, for example, two-thirds of all the columns, you can use the frac parameter.

df_sub = df.sample(frac=0.67, axis='columns', random_state=2)
print(df_sub)

Output:

   Shares Symbol
0     100   MSFT
1      50   GOOG
2     150   TSLA
3     200   AAPL
4      80   NFLX

In the above example, we sample 67%, that is, roughly two-thirds of the columns of the dataframe df, by passing the fraction 0.67 to the frac parameter. Since df has three columns, this results in two sampled columns.
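The sketch below checks how a fraction translates into a column count on a hypothetical three-column dataframe df_demo. With three columns, 0.67 * 3 = 2.01, which results in two columns. It also shows that n and frac are alternatives: passing both raises a ValueError.

```python
import pandas as pd

# Hypothetical dataframe, used only for this illustration
df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# 0.67 * 3 columns = 2.01, so two columns are sampled
df_frac = df_demo.sample(frac=0.67, axis='columns', random_state=0)
n_sampled = df_frac.shape[1]
print(n_sampled)

# n and frac cannot be combined; pandas raises a ValueError
try:
    df_demo.sample(n=2, frac=0.67, axis='columns')
    both_allowed = True
except ValueError:
    both_allowed = False
print(both_allowed)
```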

The pandas dataframe sample() function also lets you sample items with replacement, meaning that the same column can be sampled more than once. To enable sampling with replacement, pass replace=True to the sample() function.

df_sub = df.sample(n=3, replace=True, axis='columns', random_state=2)
print(df_sub)

Output:

                    Name Symbol                   Name
0  Microsoft Corporation   MSFT  Microsoft Corporation
1            Google, LLC   GOOG            Google, LLC
2            Tesla, Inc.   TSLA            Tesla, Inc.
3             Apple Inc.   AAPL             Apple Inc.
4          Netflix, Inc.   NFLX          Netflix, Inc.

In the above example, you can see that the column Name is sampled twice. This happened because here we’re sampling with replacement.
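Sampling with replacement is also what lets you ask for more columns than the dataframe has. Without replace=True, such a request raises a ValueError. The sketch below uses a hypothetical dataframe df_demo to show this, and then one possible way to spot and drop the repeated columns using columns.duplicated().

```python
import pandas as pd

# Hypothetical dataframe, used only for this illustration
df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Asking for 5 columns out of 3 without replacement fails
try:
    df_demo.sample(n=5, axis='columns')
    oversample_ok = True
except ValueError:
    oversample_ok = False
print(oversample_ok)

# With replacement, the same request works but must repeat some columns
df_rep = df_demo.sample(n=5, replace=True, axis='columns', random_state=0)
has_repeats = df_rep.columns.duplicated().any()
print(has_repeats)

# Keep only the first occurrence of each sampled column
unique_cols = df_rep.loc[:, ~df_rep.columns.duplicated()]
print(list(unique_cols.columns))
```

Keep in mind that a dataframe with repeated column names can be awkward to work with, since selecting such a name returns all the matching columns.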

For more on the pandas dataframe sample() function, refer to its official documentation.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.

