Replace NAs with their respective column means from very large text file

I have a large text file with 400k rows and 10k columns; all data values are numeric (0, 1, or 2) and the file size is 5-10 GB. There are a few missing values (NAs) in the file. I want to replace each NA with the mean of its column, i.e. an NA in column 'x' must be replaced by the mean of column 'x'. These are the steps I want to perform:




  1. Compute the mean of each column of the text file (excluding the header row and starting from the 7th column); a minimal awk sketch of this step is shown right after this list

  2. Replace the NAs in each column with that column's mean

  3. Write the modified data back out as a text file
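
For illustration, a rough single-pass GNU awk sketch of step 1 alone, assuming that fields are whitespace-separated, that the literal string NA marks a missing cell, and that data.txt is a placeholder path:

# per-column means from the 7th column onwards, skipping the header row and NA cells
awk 'NR > 1 {
         if (NF > maxnf) maxnf = NF
         for (i = 7; i <= NF; i++)
             if ($i != "NA") { sum[i] += $i; cnt[i]++ }
     }
     END {
         for (i = 7; i <= maxnf; i++)
             if (cnt[i]) printf "column %d mean = %.4f\n", i, sum[i] / cnt[i]
     }' data.txt

A single pass like this already avoids re-reading the file once per column.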


Data subset:



IID  FID  PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1

# total 400k rows and 10k columns


Desired Output:



# Assuming only 5 rows as given in the above example.
# Mean of column X1 = (0 + 1 + 2 + 2)/4 = 1.25
# Mean of column X2 = (2 + 2)/2 = 2
# Mean of column X3 = (0 + 0 + 1)/3 = 0.33
# Mean of column X4 = No NAs, so no replacements

# Replacing NAs with respective means:

IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 2 0 1
2346 2346 0 0 2 -9 1 2 0.33 1
1334 1334 0 0 2 -9 2 2 0 2
4566 4566 0 0 2 -9 2 2 0.33 0
4567 4567 0 0 1 -9 1.25 2 1 1


I tried this:



file="path/to/data.txt"

# get the total number of columns from the first line
number_cols=$(awk '{print NF; exit}' "$file")

for ((i=7; i<=number_cols; i++))
do
    echo "$i"
    # mean of column i, skipping the header row and NA cells; the column index
    # must be passed to awk with -v, since $i inside single quotes is not
    # expanded by the shell
    mean+=("$(awk -v col="$i" 'NR > 1 && $col != "NA" { total += $col; n++ } END { if (n) print total / n }' "$file")")
done

# array of column means
echo "${mean[@]}"

# find and replace (newstr must still be replaced by the respective column mean)
find "$file" -type f -exec sed -i 's/NA/newstr/g' {} \;


However, this code is incomplete, and the loop is very slow because it re-reads the whole file once for every column (roughly 10k passes over a 5-10 GB file). Is there a faster way to do this? I also tried it in Python and R, but those were too slow as well. I am open to getting this done in any programming language as long as it is fast. Can someone please help me write the script?
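
For reference, here is a minimal two-pass GNU awk sketch of the whole task under the same kind of assumptions (whitespace-separated fields, the literal string NA as the only missing-value marker, and data.txt / data_imputed.txt as placeholder file names). It reads the file twice in total rather than once per column: the first pass accumulates per-column sums and counts, the second pass rewrites each row with the NAs replaced.

awk '
    NR == FNR {                      # pass 1: first read of the file
        if (FNR > 1)                 # skip the header row
            for (i = 7; i <= NF; i++)
                if ($i != "NA") { sum[i] += $i; cnt[i]++ }
        next
    }
    {                                # pass 2: rewrite every row
        for (i = 7; i <= NF; i++)
            if ($i == "NA" && cnt[i]) $i = sum[i] / cnt[i]
        print                        # fields are re-joined with single spaces
    }
' data.txt data.txt > data_imputed.txt

Note that assigning to a field makes awk rebuild the line with single-space separators, and replaced values are printed with awk's default numeric format (e.g. 0.333333 rather than 0.33); both can be adjusted with OFS and OFMT or printf if the exact layout matters.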



Thanks

Tags: linux bash shell
edited Jan 18 at 21:54 by Poshi
asked Jan 18 at 18:57 by user3720537

  • Please add your desired output for that sample input to your question.

    – Cyrus
    Jan 18 at 19:14











  • Added the desired output, thanks

    – user3720537
    Jan 18 at 20:02











  • I would recommend using a file format that is designed for huge files, like HDF5. This would definitely help you get and set the right values in the right places.

    – MaxPowers
    Jan 18 at 21:10






  • (1) First, your approach reads the entire file once per column, which is very inefficient. Column-wise operations require frequent seeks across rows anyway. Try transposing the data in advance so you can access all entries of a given column as a single line; then you can hold the column data in memory all at once and access it quickly. (2) Do not use bash, awk, or other scripting languages: their array-indexing mechanisms are not efficient in terms of access time. I'd use C or C++, which store array elements at contiguous addresses, making element access very fast.

    – tshiono
    Jan 19 at 2:50











  • (3) Please reconsider the purpose and the requirements. Imputing the NA values from other values means you are creating artificial data anyway. I don't think you need to use the whole column for this purpose; I'd just use the value in the previous line or a few lines around it. Depending on the purpose, it might be better to discard the rows that contain NA values as a data-cleansing step. That decision may require a statistics background and an engineering sense of compromise.

    – tshiono
    Jan 19 at 2:50